<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://dataviz.chalkbeat.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://dataviz.chalkbeat.org/" rel="alternate" type="text/html" /><updated>2025-11-11T21:17:33+00:00</updated><id>https://dataviz.chalkbeat.org/feed.xml</id><title type="html">Chalkbeat Data Team</title><subtitle>Documentation, guides, and technical explainers from the Chalkbeat data team.</subtitle><entry><title type="html">Bugfixes and performance improvements</title><link href="https://dataviz.chalkbeat.org/2025/11/11/bug-fixes-and-performance-improvements.html" rel="alternate" type="text/html" title="Bugfixes and performance improvements" /><published>2025-11-11T00:00:00+00:00</published><updated>2025-11-11T00:00:00+00:00</updated><id>https://dataviz.chalkbeat.org/2025/11/11/bug-fixes-and-performance-improvements</id><content type="html" xml:base="https://dataviz.chalkbeat.org/2025/11/11/bug-fixes-and-performance-improvements.html"><![CDATA[<p>When we want to announce the <a href="https://projects.votebeat.org/2025/how-ranked-choice-voting-works-pet-mascot-election/">official Votebeat pet mascot results</a>, or walk readers through <a href="https://projects.chalkbeat.org/2024/interactive-map-chicago-school-board-districts/">Chicago’s new school board districts</a>, we turn to a project scaffold that provides powerful tools for static site generation and deployment. The core of this template, which we use at Civic News to produce our “widescreen” storytelling projects, is now more than ten years old. It has powered countless projects for me at three different newsrooms, been used by other outlets around the country, and survived several sea changes in the JavaScript ecosystem.</p>

<p>I am not of the opinion that software must be updated frequently and promptly. Sometimes it’s just finished! Many of the tools that we use at Civic News have barely changed since the 80s, and they continue to do the job very well. Indeed, given how often commercial software “upgrades” these days introduce AI, subscriptions, or other anti-features, I would often argue that it’s good to have stable codebases that are rarely, if ever, updated.</p>

<p>So I have been reluctant to deeply renovate the interactive template just for the pleasure of doing so. But <a href="https://arstechnica.com/security/2025/10/npm-flooded-with-malicious-packages-downloaded-more-than-86000-times/">recent attacks</a> on the NPM ecosystem, which powers the template’s underlying pipeline, have highlighted the value of reducing a project’s dependency tree. Its task runner, <a href="https://gruntjs.com">Grunt</a>, also predates a lot of JavaScript syntax features like async/await, and duplicates a lot of functionality that’s now simply built into the Node runtime.</p>

<p>To make a long story short, we have not completely rewritten the template. But we have migrated it to a new foundation, and done some cleanup on its legacy dependencies, in a way that we hope will set it up to be used for another ten years. You can find the code in our <a href="https://github.com/chalkbeat/create-interactive/">new git repository</a>, and we’ll be continuing to iterate on it there. But for people who are interested in the decisions we’ve made for this revision (and those that we’re still evaluating), read on.</p>

<h2 id="step-one-from-grunt-to-heist">Step one: from Grunt to Heist</h2>

<p>Grunt is a good tool by accident: originally designed when “plugins” were the hottest thing in JavaScript development, it was largely superseded in most development projects by Webpack when React swallowed the front-end culture whole. The interactive template never really used Grunt’s plugin library, except for the local dev server. But we did appreciate its <a href="https://en.wikipedia.org/wiki/Forth_(programming_language)">Forth-like</a> composition paradigm, in which smaller commands could be combined into larger tasks, and then those tasks could be run as a complete build pipeline.</p>

<p>Grunt hasn’t been substantially updated in a decade, and as mentioned above, it now has some quirks that have made it one of the clunkier bits of the template–nothing that required urgent replacement, but always a source of annoyance. For example, tasks are defined in Grunt as synchronous functions by default, and it doesn’t know how to handle functions marked with the async keyword, or to await the Promise objects they return. This leads to some <a href="https://github.com/Chalkbeat/interactive-template/blob/018f85f803ef170bad49bda715fab80739d31f86/root/tasks/publish.js#L104-L168">awkward wrapper code</a> in task files.</p>
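
<p>For readers who haven’t hit this in Grunt, the wrapper pattern looks roughly like the sketch below. <code class="language-plaintext highlighter-rouge">this.async()</code> is Grunt’s own mechanism for making the runner wait; the <code class="language-plaintext highlighter-rouge">asyncTask</code> helper name is invented for illustration and is not taken from the template.</p>

```javascript
// Hypothetical helper: adapts an async function to Grunt's callback-style tasks.
function asyncTask(fn) {
  return function (...args) {
    // this.async() tells Grunt to wait until the returned callback is invoked
    const done = this.async();
    fn.apply(this, args).then(
      () => done(),
      (err) => {
        console.error(err);
        done(false); // passing false marks the task as failed
      }
    );
  };
}
```

<p>A task would then be registered with something like <code class="language-plaintext highlighter-rouge">grunt.registerTask("publish", asyncTask(async function() { ... }))</code>, which is exactly the kind of ceremony a Promise-aware runner makes unnecessary.</p>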

<p>Grunt also included a lot of code to paper over old Node deficiencies. Its API covered recursive file copy (added in Node v16) and synchronous read/write (largely irrelevant after the async file system calls became stable in Node v11), and it offered argument parsing for command line flags (which we now have built-in via <code class="language-plaintext highlighter-rouge">util.parseArgs()</code>). None of this should really be necessary in 2025.</p>
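
<p>As a rough illustration of the built-in replacement, here is a minimal sketch of flag parsing with <code class="language-plaintext highlighter-rouge">parseArgs</code> from <code class="language-plaintext highlighter-rouge">node:util</code>. The flag names (<code class="language-plaintext highlighter-rouge">target</code>, <code class="language-plaintext highlighter-rouge">verbose</code>) and the <code class="language-plaintext highlighter-rouge">readFlags</code> helper are assumptions made up for the example, not the template’s real options. It is written with <code class="language-plaintext highlighter-rouge">require()</code> so it runs as a standalone script; inside an ES module the same function would come in through an <code class="language-plaintext highlighter-rouge">import</code> statement.</p>

```javascript
const { parseArgs } = require("node:util");

// Flag names here are invented for illustration, not the template's real options.
function readFlags(argv) {
  const { values, positionals } = parseArgs({
    args: argv,
    options: {
      target: { type: "string", default: "stage" },
      verbose: { type: "boolean", short: "v", default: false },
    },
    // Bare words (task names, in this sketch) are collected as positionals
    allowPositionals: true,
  });
  return { ...values, tasks: positionals };
}

console.log(readFlags(["build", "publish", "--target", "live", "-v"]));
```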

<p><a href="https://github.com/thomaswilburn/heist/">Heist</a> is essentially an update of Grunt’s task orchestration, formalizing the “context object” pattern while jettisoning everything else. With the requirements trimmed back to just the functionality that modern Node doesn’t provide itself–running tasks in order, and composing them into meta-tasks–Heist ends up being about 8KB of code, plus one dependency for file matching.</p>
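
<p>To give a sense of how little is left once everything else is stripped away, here is a toy version of that idea: named tasks that share a context object, and meta-tasks that run other tasks in order. To be clear, this is a sketch of the pattern and not Heist’s actual API; every function name in it (<code class="language-plaintext highlighter-rouge">defineTask</code>, <code class="language-plaintext highlighter-rouge">defineMeta</code>, <code class="language-plaintext highlighter-rouge">run</code>) is hypothetical.</p>

```javascript
// A toy task registry, NOT Heist's actual API: names and signatures are invented.
const tasks = new Map();

function defineTask(name, fn) {
  tasks.set(name, fn);
}

// A meta-task is just an ordered list of other task names.
function defineMeta(name, steps) {
  tasks.set(name, async (context) => {
    for (const step of steps) {
      await run(step, context);
    }
  });
}

async function run(name, context = {}) {
  const task = tasks.get(name);
  if (!task) throw new Error(`Unknown task: ${name}`);
  await task(context); // awaiting works for both sync and async task functions
  return context;
}

// Two small tasks that communicate through the shared context object
defineTask("load", async (context) => {
  context.rows = [1, 2, 3];
});
defineTask("sum", (context) => {
  context.total = context.rows.reduce((a, b) => a + b, 0);
});
defineMeta("build", ["load", "sum"]);
```

<p>Calling <code class="language-plaintext highlighter-rouge">run("build")</code> resolves with the context after both steps have finished, so later tasks can rely on whatever earlier ones left behind.</p>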

<p>Rewriting Grunt tasks to run in Heist was relatively easy (mostly consisting of adding <code class="language-plaintext highlighter-rouge">async</code> and replacing “grunt” with “heist”). Some of the I/O heavy operations are now easier to read, since they don’t need to be written callback-style. They’re also all standard ES modules, instead of CommonJS–module types being a constant source of training frustration for staffers newer to JavaScript.</p>

<h2 id="step-two-from-grunt-init-to-npm-init">Step two: from <code class="language-plaintext highlighter-rouge">grunt-init</code> to <code class="language-plaintext highlighter-rouge">npm init</code></h2>

<p>Even when Grunt was being actively maintained, its <code class="language-plaintext highlighter-rouge">grunt-init</code> setup utility was rarely used and practically deprecated. Nevertheless, it served us well for years, providing mechanisms for copying the template files to a new directory, updating them with specific values, and running post-install scripts. This kind of functionality was a standard feature of post-React frameworks, with “create-react-app” and its equivalents taking up the same role. Eventually, npm itself paved the cowpaths with a <a href="https://docs.npmjs.com/cli/v8/commands/npm-init">standard pathway</a> for initializers.</p>

<p>It makes sense to move our template over to that standard, even if it provides less functionality out of the box. But building a template via a language’s package manager also raises a serious question: how hard should it be to update the template?</p>

<p>See, by default, npm installs project scaffolding from the npm registry, same as any other library. Based on the potential attack surface for large businesses that depend on it, publishing changes to npm packages has (rightfully) gotten more restrictive over the last few years. But we are not a large business: we will set up a project maybe three times a year and only in a static build context, so we don’t have the same security requirements as the typical web app shop. And we want it to be accessible enough that interns and “lonely coders” in external newsrooms feel comfortable contributing.</p>

<p>Luckily, there is an escape hatch for npm: you can install packages (including scaffolding) from a git repo instead. We already use this for some of our smaller tools that didn’t seem worth deploying to npm proper, such as the <a href="https://github.com/thomaswilburn/cantrip">Cantrip</a> codename generator. For people who are used to running initializers from other frameworks, it may look a little odd, but I feel much more comfortable with how this integrates into a journalism context, and it also hews closer to the original grunt-init setup (which also involved cloning a repo, albeit to a local profile directory).</p>

<p>If anything, this part of the migration really emphasized the ways that the gravity well of big tech has distorted open source. <code class="language-plaintext highlighter-rouge">npm init</code>, like so many features, is designed primarily for the needs of corporations distributing tools to thousands or millions of developers, plus automated pipelines that might be installing and running that code hundreds of times a day. Especially as these companies have cozied up to power, not to mention breaking the open source covenant in order to train their AI tools on free code, it seems worthwhile to plan ahead for independent infrastructure that is responsive to journalists’ needs instead.</p>
<h2 id="step-three-from-less-to-css">Step three: from Less to CSS</h2>

<p>The last part of the update process was to clear out some of the browser-side accommodations that are no longer needed. For example, given the removal of Internet Explorer from support, we don’t really need to transpile JavaScript anymore with something like Babel–Safari still sets the upper bound on the syntax we can use, but <a href="https://bugs.webkit.org/show_bug.cgi?id=242740">more in terms of reliability</a> than raw support, and the ceiling is plenty high for our purposes. Removing Babel makes our Rollup bundle output a little cleaner, and it trims out a chunk of our Node modules folder.</p>

<p>On the other hand, you have something like Less, the CSS pre-processor that we’ve traditionally used for adding support for nesting, variables, and unit math. These are all features that CSS now supports natively, but they get overridden by the Less versions. Notably, we lose dynamic functions like <code class="language-plaintext highlighter-rouge">min()</code> and <code class="language-plaintext highlighter-rouge">max()</code> from CSS because the Less compiler outputs the final result at build time, and there’s no option to disable the way Less flattens nested styles. Working around these conflicts was increasingly annoying, and sometimes impossible.</p>

<p>So while I was rewriting all the tasks anyway, I swapped out Less for PostCSS, which still runs compilation but only for “future” CSS: features that have been standardized (and thus will not have syntax conflicts), but aren’t yet in all browsers. We also use it to combine files loaded via <code class="language-plaintext highlighter-rouge">@import</code> into a single CSS bundle, since that’s still a pain point for performance reasons.</p>

<p>In both JavaScript and CSS tooling, the eventual plan is to reach the point where we don’t need something like Rollup or PostCSS at all. We’re almost there for JavaScript, as native ES modules can handle everything except for loading external libraries. CSS imports are trickier, since they currently block rendering when placed in the page’s head and don’t support parallel downloads, but hope springs eternal.</p>

<h2 id="i-love-it-when-a-plan-comes-together">I love it when a plan comes together</h2>

<p>In practice, other than changing <code class="language-plaintext highlighter-rouge">grunt</code> to <code class="language-plaintext highlighter-rouge">heist</code> on the command line, the experience of using this version of the template is largely indistinguishable from the original. It wasn’t broken, and we didn’t fix it. Instead, the goal was to remove baggage that made it harder to train new users and might stymie outside contributors. Things like CommonJS modules and <code class="language-plaintext highlighter-rouge">require()</code>, which were stumbling blocks for people who hadn’t been around for 20 years of Node, are now gone. And there are now fewer non-standard transformations between the client-side code we write and what the browser sees.</p>

<p>Behind the scenes, the improvements are more noticeable. Compared to the pre-Heist version, the updated interactive template has five fewer top-level dependencies, and I think we may be able to knock off a few more. Its <code class="language-plaintext highlighter-rouge">node_modules</code> folder is about two-thirds the size, at 66MB, much of which is API libraries for talking to Google and AWS. It shaves a second or two off a full static build, making it roughly 25-30% faster, and more selective build commands see greater improvements due to reduced startup time in individual tasks.</p>

<p>Most importantly, I think this helps solidify the template’s foundation moving forward. It has always been a good tool–in my biased opinion, the best scaffolding I’ve ever used for static news builds. But these changes remove many of the caveats around its initial setup requirements and dated runtime, and let it take advantage of modern JavaScript syntax and practices.</p>

<p>Here’s to another ten years!</p>]]></content><author><name>Thomas Wilburn</name></author><summary type="html"><![CDATA[Introducing updates to our venerable interactive template]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://dataviz.chalkbeat.org/assets/images/2025-11-11-bugfixes/heist.jpg" /><media:content medium="image" url="https://dataviz.chalkbeat.org/assets/images/2025-11-11-bugfixes/heist.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">How we analyzed disparities in Arizona’s federal-only voter rolls</title><link href="https://dataviz.chalkbeat.org/2024/12/16/arizona.html" rel="alternate" type="text/html" title="How we analyzed disparities in Arizona’s federal-only voter rolls" /><published>2024-12-16T00:00:00+00:00</published><updated>2024-12-16T00:00:00+00:00</updated><id>https://dataviz.chalkbeat.org/2024/12/16/arizona</id><content type="html" xml:base="https://dataviz.chalkbeat.org/2024/12/16/arizona.html"><![CDATA[<p>Last year, Votebeat reporters documented how Arizona’s federal-only voter laws could be <a href="https://www.votebeat.org/arizona/2023/12/18/arizona-federal-only-voters-concentrated-college-campuses-proof-of-citizenship/">preventing college students from voting</a>. After publication, Secretary of State Adrian Fontes also suggested that the laws disproportionately target voters living on tribal lands, too.</p>

<p>With the November election in sight, we wanted to revisit the federal-only voters list and verify whether these disparities persisted. However, that task poses a number of technical problems, especially if we want to answer the question statewide.</p>

<p>Arizona administers elections at the county level. That means that certain types of voting data, such as precinct geographic boundaries, have to be obtained directly from the county governments, some of whom don’t make it publicly available. Even when data is available, it may be in dissimilar formats that require reprocessing before it can be combined across counties.</p>

<p>That means if we want to ask basic questions like, “Who’s likelier to end up on the federal-only voter list?” our options are limited — especially on a tight deadline, with just a few days between the election and the release of finalized voter rolls.</p>

<p>To conduct this analysis, we acquired digital map files (these are often referred to as part of a Geographic Information System, or GIS) from each county individually. Then, we standardized and combined those files with a graphical, open-source toolkit called QGIS.</p>

<p>To calculate disparities at the precinct level, we analyzed a few different combinations of data. The bulk of our analysis compared registered voters to our federal-only voter list and precinct maps. This gave us an estimate of how many eligible voters lived in each precinct. Then, we used a geospatial database called PostGIS to overlay the tribal census tracts, which don’t neatly overlap with precinct boundaries, with that registration data.</p>

<p>Spoiler: We did find that <a href="https://www.votebeat.org/arizona/2024/12/13/arizona-voter-citizenship-proof-law-shows-groups-struggling-to-provide-it/">voters living in precincts that contain tribal census tracts and college campuses were much likelier to be on the federal-only voter list</a>.</p>

<h2 id="step-one-lets-make-a-state-map">Step one: Let’s make a state map</h2>

<p>First, we contacted every county’s geospatial team if they had one, or their recorder’s office if they didn’t. About one-third just sent us GIS maps of their voting precincts, and waived the fees. (Thank you!) The others tried to charge us, or didn’t respond. We had to apply some <a href="https://dataviz.chalkbeat.org/2023/10/23/scraping-sans-selenium.html">creative methods</a> to obtain publicly available GIS files maintained by official county sources.</p>

<p><img src="/assets/images/2024-12-13-arizona/qgis.png" alt="Colorful image of Arizona's 15 counties on a mapping software." /></p>

<p>Files in hand, we ran them all through QGIS’s “<a href="https://www.qgistutorials.com/en/docs/3/handling_invalid_geometries.html">verify geometries</a>” check and fixed a few shape issues. In two cases, this meant manually fixing incorrect precinct boundaries whose shapes were deleted during the automated repair process. This is why it’s important to check geometry against geographic attributes at the end of any processing step.</p>

<h3 id="turning-precinct-parts-into-precincts-using-dissolve">Turning precinct parts into precincts using ‘dissolve’</h3>

<p>A common issue when using heterogeneous voting boundaries: Precincts may have smaller precinct parts. Some of our maps included precinct parts, and others didn’t. These sub-precinct boundaries were too granular for our analysis, and also not very useful since we didn’t have them for the whole state. So we “dissolved” them (a GIS term of art for combining multiple shapes into one) to generate full precinct shapes instead, based on the precinct attribute. If the parts were missing a separate precinct column, generally we were able to procedurally extract the precinct name or number from the precinct part name.</p>
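
<p>The “procedurally extract” step might look something like the sketch below. It assumes part names shaped like <code class="language-plaintext highlighter-rouge">0042.1</code>, which is a made-up convention for the example (each county’s naming differed), and the property names <code class="language-plaintext highlighter-rouge">precinct</code> and <code class="language-plaintext highlighter-rouge">part</code> are likewise hypothetical. It only works out which parts belong together; the geometric union itself is left to the GIS tool’s dissolve.</p>

```javascript
// Part names like "0042.1" are a made-up convention; each county's differed.
function precinctFromPart(partName) {
  const match = /^(\d+)[.\-_ ]/.exec(partName.trim());
  return match ? match[1] : partName.trim();
}

// Group part features under the precinct key that a dissolve would use.
function groupParts(features) {
  const groups = new Map();
  for (const feature of features) {
    // Prefer an explicit precinct column; fall back to parsing the part name
    const key =
      feature.properties.precinct ?? precinctFromPart(feature.properties.part);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(feature);
  }
  return groups;
}
```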

<p>At this point, we made more manual fixes with the vertex tool to clean up messy lines that might have resulted from the dissolve.</p>

<p><img src="/assets/images/2024-12-13-arizona/vertex.png" alt="Colorful close-up image of vertex dots on lines inside of a geographic shape on a mapping software." /></p>

<h3 id="combining-shapes-and-columns">Combining shapes and columns</h3>

<p>OK, so let’s say we now want to combine the shapes. Except … they’re all from different counties, which means every single one has a different attribute schema based on whatever the counties felt like naming their columns. We want one column for all of the county names, all of the precinct names, and any other associated data.</p>

<p>Bad news: QGIS, the graphical shapefile editor, really hates that.</p>

<p>Worse, it doesn’t really accommodate easy combination or editing, once you’ve imported a shapefile. Frequent QGIS users may have encountered the software’s supposed “renaming” features (or the lack thereof). Creating new columns for a shapefile or two, then merging them, isn’t too much of a burden. But unifying <em>15</em> shapefiles with entirely different attribute schemas? Ick.</p>

<p><a href="https://github.com/mbloch/mapshaper">Mapshaper</a> saves the day on this one.</p>

<p>You can run this handy utility locally on the command line or use Mapshaper’s online UI. Personally, I preferred the ‘Console’ view on <a href="https://mapshaper.org">mapshaper.org</a>, as it let me check that my schema made sense as I went along, and actually see the attribute field updates as I made them.</p>

<p>Mapshaper’s documentation can also be spotty, so the console gives you a chance to try things and make mistakes until they work.</p>

<p><img src="/assets/images/2024-12-13-arizona/mapshaper.png" alt="Screenshot of a series of incorrect console commands and an image of Arizona's voting precincts on Mapshaper." /></p>

<p>Make sure to load the full shapefile package, which should include <code class="language-plaintext highlighter-rouge">.shp</code> and an associated <code class="language-plaintext highlighter-rouge">.dbf</code> file that contains the attribute information, as well as the <code class="language-plaintext highlighter-rouge">.prj</code> data.</p>

<p>To start, reproject all of the shapefiles into your preferred coordinate system.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">-proj</span> wgs84
</code></pre></div></div>

<p>Renaming fields is easy:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">-rename-fields</span> <span class="nv">newname</span><span class="o">=</span>oldname
</code></pre></div></div>

<p>You can also rename multiple fields at once.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">-rename-fields</span> <span class="nv">PNum</span><span class="o">=</span>CODE,PShort<span class="o">=</span>Precinct,id<span class="o">=</span>OBJECTID_1
</code></pre></div></div>

<p>We also added a county text ID to keep track of the files.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>each <span class="nv">county</span><span class="o">=</span><span class="s1">'Yuma'</span>
</code></pre></div></div>

<p>Finally, we filtered down to the fields we’d like to keep.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">-filter-fields</span> <span class="nb">id</span>,PNum,PShort,PName,Shape_STAr,Shape_STLe,county
</code></pre></div></div>

<p>I also created a standard concatenation between county and precinct names, so that we had a unique ID for each precinct in the geospatial analysis.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>each <span class="nv">PName</span><span class="o">=</span><span class="nb">id</span>+<span class="s2">" "</span>+PNum
</code></pre></div></div>

<p>At some point, you’ll need a file that has all the same fields in a consistent order. But you don’t need to worry about creating <code class="language-plaintext highlighter-rouge">NULL</code> fields in mapshaper; the columns will be standardized when you merge the layers and export into a <code class="language-plaintext highlighter-rouge">.shp</code> format.</p>
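
<p>If it helps to see the whole rename, tag, and filter sequence as one transformation, here is a sketch of the same logic applied to a single feature’s properties in plain JavaScript. The Yuma rename map mirrors the rename command above; the <code class="language-plaintext highlighter-rouge">renames</code> and <code class="language-plaintext highlighter-rouge">keep</code> structures and the <code class="language-plaintext highlighter-rouge">normalize</code> function are assumptions for illustration, and every other county would need its own map.</p>

```javascript
// Per-county rename maps (new name -> old name). The Yuma entries mirror the
// rename command above; any other county would need its own map.
const renames = {
  Yuma: { PNum: "CODE", PShort: "Precinct", id: "OBJECTID_1" },
};
const keep = ["id", "PNum", "PShort", "county"];

function normalize(properties, county) {
  const renamed = { ...properties };
  for (const [newName, oldName] of Object.entries(renames[county] ?? {})) {
    if (oldName in renamed) {
      renamed[newName] = renamed[oldName];
      delete renamed[oldName];
    }
  }
  renamed.county = county;
  // Keep only the shared fields, in a fixed order
  return Object.fromEntries(
    keep
      .filter((field) => field in renamed)
      .map((field) => [field, renamed[field]])
  );
}
```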

<p>There <em>are</em> ways to use the Mapshaper console to merge your files and layers. I preferred to export each county as a separate <code class="language-plaintext highlighter-rouge">.geojson</code> and continue running Mapshaper locally across all the files, instead:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npx mapshaper <span class="nt">-i</span> geojson/<span class="k">*</span>.json combine-files snap <span class="nt">-o</span> arizona.json <span class="nv">format</span><span class="o">=</span>geojson combine-layers
</code></pre></div></div>

<p>This command instructs Mapshaper to grab all <code class="language-plaintext highlighter-rouge">.json</code> files in one folder, combine them, snap vertices that occupy the same geographic space, and then merge all layers into a single output file.</p>

<p>At this point, you can convert your <code class="language-plaintext highlighter-rouge">.json</code> into a shapefile using your preferred method. (I loaded the statewide file into the Mapshaper web console, exported it as a shapefile, and checked it over in QGIS.)</p>

<p>Note that in your <code class="language-plaintext highlighter-rouge">.json</code> file, your properties will not be in the same order, and any columns that don’t exist in a given county’s <code class="language-plaintext highlighter-rouge">.json</code> will not be created or populated during the merge.</p>
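
<p>If you do need every feature in the merged GeoJSON to carry the same keys before it leaves the <code class="language-plaintext highlighter-rouge">.json</code> stage, a small post-processing pass can backfill the gaps. This is a minimal sketch, assuming the file has already been parsed into an array of GeoJSON features; the <code class="language-plaintext highlighter-rouge">backfill</code> name is hypothetical.</p>

```javascript
// Give every feature the same set of property keys, using null for gaps.
function backfill(features) {
  const fields = new Set();
  for (const feature of features) {
    Object.keys(feature.properties).forEach((key) => fields.add(key));
  }
  // Sort once so every feature lists its properties in the same order
  const ordered = [...fields].sort();
  return features.map((feature) => ({
    ...feature,
    properties: Object.fromEntries(
      ordered.map((key) => [key, feature.properties[key] ?? null])
    ),
  }));
}
```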

<p>To illustrate this, I started with the tribal census tracts designated by the U.S. Census Bureau, <a href="https://www2.census.gov/geo/pdfs/partnerships/psap/G-610.pdf">which works with tribal officials to develop a map</a>. It seemed important to acknowledge the Nations’ actual boundaries as much as possible, instead of cutting them off at the edge of Arizona’s governmental boundaries. I attempted this process a few times to try to find the best way to illustrate the relationship between the Navajo Nation and the state of Arizona, in particular.</p>

<p><img src="/assets/images/2024-12-13-arizona/ttracts.png" alt="Screenshot of the Census tribal tracts that overlap with Arizona." /></p>

<p>Then I subtracted the land areas from a state map of Arizona and stitched the two geographies together.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npx mapshaper precincts_for_erase.shp <span class="nt">-erase</span> precincts_to_keep.shp <span class="nt">-o</span> out.shp
</code></pre></div></div>

<p>That left us with something that Datawrapper let me turn into the following map:</p>

<p><img src="/assets/images/2024-12-13-arizona/ttracts_map.png" alt="Screenshot of the state of Arizona overlaid with the Navajo Nation and other tribal areas and reservations." /></p>

<h2 id="step-two-lets-analyze">Step two: Let’s analyze</h2>

<p>Why’d we do all this, again?</p>

<p>Paying for the entire state voter roll was far out of our budget. We had the federal-only voters, but they only represented around 1,500 precincts, since a chunk of precincts had no federal-only voters at all. And we knew there were more than 1,700 precincts.</p>

<p>That meant two things:</p>

<ol>
  <li>We would’ve needed to do this entire analysis using Census Citizen Voting Age Population estimates for each precinct as the denominator.</li>
  <li>The best way to get a somewhat real list of all the precincts in the state initially seemed like it would involve pulling them out of map attribute data — if we wanted to avoid wrangling them out of poorly formatted PDFs or individual county registration records.</li>
</ol>

<p>Luckily, we narrowed our request to precinct-level data, instead of person-level data. That brought the cost down significantly. It also gave us a statewide list of precincts. This turned out to be a good thing to have, because the state’s PDF count of precincts, the state’s spreadsheet registration records, the county GIS files, and officials all gave us slightly different lists of precincts. The map was ultimately a good way to check our totals, even though we could have conducted a precinct-level analysis without it. We also used our Census estimates as a comparison point to check that our registration totals seemed reasonable, and vice-versa.</p>

<p>That just left us with one remaining question we couldn’t directly answer: Were voters living on tribal land more likely to be denied the ability to vote in local elections?</p>

<p>As I’ve mentioned, the major Native nations don’t adhere to U.S. state, county, or precinct boundaries. So, we used the tribal census tracts and calculated the geographic overlay of the precincts in PostGIS, a SQL-based geographic database tool.</p>

<p>We’ve used a similar method in a few other places, and it has served us well.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">drop</span> <span class="k">table</span> <span class="n">if</span> <span class="k">exists</span> <span class="n">ttract_fed_intersections</span> <span class="k">cascade</span><span class="p">;</span>

<span class="k">create</span> <span class="k">table</span> <span class="n">ttract_fed_intersections</span> <span class="k">as</span> <span class="p">(</span>
    <span class="k">SELECT</span>
    <span class="n">ttracts</span><span class="p">.</span><span class="n">geoid</span><span class="p">,</span>
    <span class="n">precincts</span><span class="p">.</span><span class="n">PJoin</span> <span class="k">as</span> <span class="n">pct_precinct</span><span class="p">,</span>
    <span class="n">Active</span> <span class="k">as</span> <span class="n">active_fed_only</span><span class="p">,</span>
    <span class="n">act_reg</span> <span class="k">as</span> <span class="n">active_registered</span><span class="p">,</span>
    <span class="n">ST_AREA</span><span class="p">(</span><span class="n">ST_INTERSECTION</span><span class="p">(</span><span class="n">ttracts</span><span class="p">.</span><span class="n">wkb_geometry</span><span class="p">,</span> <span class="n">precincts</span><span class="p">.</span><span class="n">wkb_geometry</span><span class="p">))</span> <span class="o">/</span> <span class="n">ST_AREA</span><span class="p">(</span><span class="n">ttracts</span><span class="p">.</span><span class="n">wkb_geometry</span><span class="p">)</span> <span class="k">AS</span> <span class="n">overlap</span>
    <span class="k">FROM</span>
    <span class="n">ttracts</span><span class="p">,</span>
    <span class="n">precincts</span>
    <span class="k">WHERE</span>
    <span class="n">ST_INTERSECTS</span><span class="p">(</span><span class="n">ttracts</span><span class="p">.</span><span class="n">wkb_geometry</span><span class="p">,</span> <span class="n">precincts</span><span class="p">.</span><span class="n">wkb_geometry</span><span class="p">)</span>   
<span class="p">);</span>
</code></pre></div></div>

<p>This took some time to walk through with quality assurance tests.</p>

<p>We made some assumptions here. For one, since the federal-only voter roll is so small, we treated any federal-only voter in a precinct that contains tribal land as a <em>potential</em> voter on tribal land.</p>

<p>On the flip side, total registered voters was much higher. So we assumed that registered voters were evenly distributed throughout the precinct. If half a precinct was on tribal land, half of the voters were assigned to tribal land, and half were not.</p>
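
<p>In code, that even-distribution assumption is a simple area weighting. The sketch below is illustrative rather than our production query: the field names (<code class="language-plaintext highlighter-rouge">registered</code>, <code class="language-plaintext highlighter-rouge">tribalOverlapArea</code>, <code class="language-plaintext highlighter-rouge">totalArea</code>) are hypothetical, and it weights by the share of the <em>precinct’s</em> area on tribal land, as the paragraph describes. That is a different ratio from the <code class="language-plaintext highlighter-rouge">overlap</code> column in the query above, which is expressed relative to the tract’s area.</p>

```javascript
// Areas can be in any unit as long as both use the same one.
function allocateVoters(precinct) {
  if (!(precinct.totalArea > 0)) {
    throw new Error("Precinct area must be positive");
  }
  // Share of the precinct's land that falls inside tribal census tracts
  const share = precinct.tribalOverlapArea / precinct.totalArea;
  const onTribalLand = precinct.registered * share;
  return {
    share,
    onTribalLand,
    offTribalLand: precinct.registered - onTribalLand,
  };
}
```

<p>A precinct with 1,000 registered voters and half its area on tribal land would therefore contribute 500 voters to each side of the comparison.</p>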

<p>We were worried that this would overestimate the federal-only voters living on tribal land. So we tried a few other methods, and found that one of the alternative methods vastly overestimated the number of total registered voters living on tribal land when we checked those numbers against information from the nations. With that exception, we found similar trends when we tried other methods, so we decided it was a fair approximation.</p>

<h2 id="step-three-spreadsheets-galore">Step three: Spreadsheets galore</h2>

<p>After obtaining six different versions of the various datasets, each of which had a new and fun limitation, we were ready to finish the analysis with a relatively straightforward comparison. We joined that data to the shapes based on precinct names, which were thankfully consistent across all of the state’s data, even though they were inconsistent in the maps. Then, we compared federal-only voters’ various behaviors — registration, active versus inactive, ballots cast — to the comparable statewide registered voters.</p>

<p>With the exception of the final vote tallies, which required a quick rollup in a data notebook tool, that analysis only required standard spreadsheet pivot tables.</p>

<p>We ran into a few issues, almost entirely by virtue of Arizona’s methods for producing voter records: Each spreadsheet they sent us was a live pull, which meant that if they sent us one group of voters a week before the second group of voters, the numbers weren’t exactly 1:1. Voters’ statuses on or off the list could change, or someone who was on the federal-only list could have verified their citizenship, and so on.</p>

<p>As a byproduct of that, as well as the high cost of acquiring all data on individual voters, we couldn’t obtain person-level data for everyone who cast a ballot. This meant that we didn’t have detailed information on which precinct 545 federal-only voters lived in.</p>

<p>The vast majority of our analysis looked at the voter <em>rolls</em> prior to election day, not at who actually showed up at the polls, so this wasn’t a huge issue. We ended up including them in overall numbers, like the rates of how many federal-only voters cast ballots, but excluding them from geographic analysis of how many voters showed up at the polls in specific precincts.</p>

<h2 id="step-four-retrospective">Step four: Retrospective</h2>

<p>If I had this to do over again, I likely would have moved the bulk of the initial pre-processing <em>out</em> of QGIS, since I essentially had to re-do most of it in a Python script that would execute all of the necessary Mapshaper commands on the command line.</p>

<p>Especially if there’s a delay in the reporting process, or an update to the data, an all-Mapshaper project structure is a vast improvement over a QGIS workflow. QGIS has a lot of nice quality-of-life features, but it just doesn’t measure up for documentation, replicability, or iterability. It’s also hard to revisit what you did in QGIS, since (like most graphical tools) its records of steps are not as accessible.</p>

<p>We knew this would be an asset that we reused for future analyses. Our reporters often want to ask questions like this, and we’d already done one story where <em>not</em> having the asset meant that we couldn’t use data to prove a few things we were pretty sure were true. So it’s a good long-term time investment to take the time to do it right, and to have something reusable.</p>]]></content><author><name>Kae Petrin</name></author><summary type="html"><![CDATA[Using unified shapefiles, voter registrations, and PostgreSQL to evaluate voters' likelihood of being blocked from casting a full ballot.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://dataviz.chalkbeat.org/assets/images/2024-12-13-arizona/header.png" /><media:content medium="image" url="https://dataviz.chalkbeat.org/assets/images/2024-12-13-arizona/header.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">State of the Art</title><link href="https://dataviz.chalkbeat.org/2024/06/21/state-of-the-art.html" rel="alternate" type="text/html" title="State of the Art" /><published>2024-06-21T00:00:00+00:00</published><updated>2024-06-21T00:00:00+00:00</updated><id>https://dataviz.chalkbeat.org/2024/06/21/state-of-the-art</id><content type="html" xml:base="https://dataviz.chalkbeat.org/2024/06/21/state-of-the-art.html"><![CDATA[<p>Starting in 2024, Chicago will have a school board with elected officers from districts in the city, as opposed to a board created by mayoral appointment. That’s a big shift. To help residents understand how the districts were set up, and how they related to the schools contained within, Chalkbeat published <a href="https://projects.chalkbeat.org/2024/interactive-map-chicago-school-board-districts/">this interactive map</a>, with reporting by Becky Vevea and development by yours truly.</p>

<p>Although the code that powers the visualization itself is not terribly novel–mostly just standard calls to the Leaflet geospatial library–it does have some interesting patterns behind the scenes. But my philosophy is that every project, no matter how prosaic, is a chance to learn something new. In this case, I wanted to build out some modern state management from scratch.</p>

<p>Outside of React, most front-end libraries on the web have converged on keeping state in observable objects, either supporting deep property access (<a href="https://vuejs.org/guide/essentials/reactivity-fundamentals.html">Vue’s “options” API</a>) or shallow values/references (<a href="https://preactjs.com/guide/v10/signals/">Preact’s signals</a> and Vue’s composition API). These libraries track values automatically, so that a change in one place will propagate up to the UI only when necessary. Compared to a traditional MVC setup, it’s a lot more dynamic, both in configuration and in composition.</p>

<p>Our story isn’t just a scrollytelling narrative, it also has a set of user-configurable filters at the end, so we need to be able to patch the map view arbitrarily as people flip through different options. This makes a Vue-style deep observable a good match for our use case. However, rather than importing a third-party dependency, I wanted to see what was possible just by using the primitives available in the browser.</p>

<h2 id="map-hosting-by-proxy">Map Hosting By Proxy</h2>

<p>An industrial-class observable object is a complicated affair, in part because it needs load-bearing ergonomics: Preact and Vue want you to be able to use these objects comfortably, in a wide variety of scenarios and across different levels of training in a large organization, without incurring performance penalties. They do a lot of optimization to clean up after the developer in pursuit of that goal.</p>

<p>By contrast, I am not working “at scale” the way these frameworks expect. I’m on a very small team on a project with a fairly limited scope. Our architecture doesn’t need to be kaiju-proof and we can tolerate a small number of API quirks. The <a href="https://github.com/Chalkbeat/chi_school_board/blob/29d8986b68aca4e431fd6fdfd9b6277df17f3598/src/js/state.js">reactive store</a> that fits our needs ended up being roughly 50 lines of code, which is short enough that you can keep the whole thing in your head.</p>

<p>To reach that size, the module relies on some built-in JavaScript features. It creates a <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Proxy">Proxy</a> wrapper to monitor for changes, recursively returning a new Proxy if the returned value is an object (meaning that it automatically monitors sub-properties as well). <a href="https://developer.mozilla.org/en-US/docs/Web/API/queueMicrotask">queueMicrotask</a> debounces changes until the end of the event loop, at which point the store <a href="https://developer.mozilla.org/en-US/docs/Web/API/EventTarget">dispatches an event</a> to listeners.</p>
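
<p>As a rough illustration of how those three primitives fit together, here is a minimal sketch. It is not the project’s actual module, which is linked above; the class name <code class="language-plaintext highlighter-rouge">SketchStore</code> and its <code class="language-plaintext highlighter-rouge">pending</code> flag are placeholders invented for this example, and it skips the edge cases a real store would handle.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// A minimal sketch, not the project's real store.
class SketchStore extends EventTarget {
  constructor(initial) {
    super();
    this.raw = initial;
    this.pending = false;
    this.data = this.wrap(initial);
  }

  // wrap an object in a Proxy; nested objects get wrapped when they are read
  wrap(target) {
    var store = this;
    return new Proxy(target, {
      get: function (obj, key) {
        var value = obj[key];
        if (value === null || typeof value != "object") return value;
        return store.wrap(value);
      },
      set: function (obj, key, value) {
        // identical values are ignored, so they never schedule an update
        if (obj[key] === value) return true;
        obj[key] = value;
        store.schedule();
        return true;
      }
    });
  }

  // collapse any number of synchronous changes into a single event
  schedule() {
    if (this.pending) return;
    this.pending = true;
    var store = this;
    queueMicrotask(function () {
      store.pending = false;
      var event = new Event("update");
      event.detail = store.raw;
      store.dispatchEvent(event);
    });
  }
}

var sketch = new SketchStore({ value: 12, nested: { name: "Thomas" } });
sketch.addEventListener("update", function (e) {
  console.log(e.detail);
});
sketch.data.value = 12; // same value, nothing is scheduled
sketch.data.nested.name = "Chalkbeat"; // schedules one update
sketch.data.value = "Hello"; // batched into that same update
</code></pre></div></div>

<p>One quirk of this sketch is that every read of a nested object builds a fresh Proxy, so two reads of the same sub-object are not identical to each other. That is the kind of trade-off a small project can tolerate and a general-purpose library cannot.</p>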

<p>The final interface is a little more effort than the mainstream equivalents, but it’s still pretty straightforward.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">var</span> <span class="nx">store</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">ReactiveStore</span><span class="p">({</span>
  <span class="na">value</span><span class="p">:</span> <span class="mi">12</span><span class="p">,</span>
  <span class="na">nested</span><span class="p">:</span> <span class="p">{</span>
    <span class="na">name</span><span class="p">:</span> <span class="dl">"</span><span class="s2">Thomas</span><span class="dl">"</span>
  <span class="p">}</span>
<span class="p">});</span>

<span class="c1">// add a change listener</span>
<span class="c1">// this is probably the clunkiest part</span>
<span class="nx">store</span><span class="p">.</span><span class="nx">addEventListener</span><span class="p">(</span><span class="dl">"</span><span class="s2">update</span><span class="dl">"</span><span class="p">,</span> <span class="nx">e</span> <span class="o">=&gt;</span> <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">e</span><span class="p">.</span><span class="nx">detail</span><span class="p">));</span>

<span class="c1">// only different values will trigger update events</span>
<span class="c1">// does nothing:</span>
<span class="nx">store</span><span class="p">.</span><span class="nx">data</span><span class="p">.</span><span class="nx">value</span> <span class="o">=</span> <span class="mi">12</span><span class="p">;</span>
<span class="c1">// triggers an update</span>
<span class="nx">store</span><span class="p">.</span><span class="nx">data</span><span class="p">.</span><span class="nx">value</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">Hello</span><span class="dl">"</span><span class="p">;</span>
<span class="c1">// logs: { value: "Hello", nested: { name: "Thomas" } }</span>

<span class="c1">// you can update any depth</span>
<span class="nx">store</span><span class="p">.</span><span class="nx">data</span><span class="p">.</span><span class="nx">nested</span><span class="p">.</span><span class="nx">name</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">Chalkbeat</span><span class="dl">"</span><span class="p">;</span>
<span class="c1">// logs: { value: "Hello", nested: { name: "Chalkbeat" } }</span>
<span class="c1">// calling object methods will also trigger updates</span>
<span class="nx">store</span><span class="p">.</span><span class="nx">data</span><span class="p">.</span><span class="nx">list</span> <span class="o">=</span> <span class="p">[</span><span class="dl">"</span><span class="s2">a</span><span class="dl">"</span><span class="p">];</span>
<span class="c1">// logs: { value: "Hello", nested: { name: "Chalkbeat" }, list: ["a"] }</span>
<span class="nx">store</span><span class="p">.</span><span class="nx">data</span><span class="p">.</span><span class="nx">list</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="dl">"</span><span class="s2">b</span><span class="dl">"</span><span class="p">);</span>
<span class="c1">// logs: { value: "Hello", nested: { name: "Chalkbeat" }, list: ["a", "b"] }</span>

<span class="c1">// it's possible to bypass the proxy as well</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">store</span><span class="p">.</span><span class="nx">raw</span><span class="p">.</span><span class="nx">nested</span><span class="p">.</span><span class="nx">name</span><span class="p">);</span> <span class="c1">// "Chalkbeat"</span>
<span class="c1">// this will not trigger an update</span>
<span class="nx">store</span><span class="p">.</span><span class="nx">raw</span><span class="p">.</span><span class="nx">nested</span><span class="p">.</span><span class="nx">name</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">Thomas</span><span class="dl">"</span><span class="p">;</span>
</code></pre></div></div>

<p>Not bad for less than a hundred lines of code.</p>

<p>In the actual story, a listener function watches this store and updates the Leaflet map when changes come in, usually in response to a scroll event. The story file (written in ArchieML) only needs to specify what’s different for a given view, which is merged over a default object before that is applied to the state. The map module exposes a function for doing that cleanly:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">export</span> <span class="kd">function</span> <span class="nx">mergeChanges</span><span class="p">(</span><span class="nx">patch</span><span class="p">)</span> <span class="p">{</span>
  <span class="nb">Object</span><span class="p">.</span><span class="nx">assign</span><span class="p">(</span><span class="nx">state</span><span class="p">.</span><span class="nx">data</span><span class="p">,</span> <span class="nx">STATE_DEFAULT</span><span class="p">,</span> <span class="nx">patch</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
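
<p>To make the merge order concrete, here is a small sketch using a plain object in place of the store. The keys and values are made up for illustration and are not the project’s real state. Because later sources win in <code class="language-plaintext highlighter-rouge">Object.assign</code>, every view starts from the defaults rather than from whatever the previous view left behind.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical defaults and current state, for illustration only
var DEFAULTS = { zoom: 10, district: "", showSchools: false };
var current = { zoom: 13, district: "4", showSchools: true };

// the patch only names what differs from the defaults for this view
Object.assign(current, DEFAULTS, { district: "7" });
console.log(current);
// logs: { zoom: 10, district: "7", showSchools: false }
</code></pre></div></div>

<p>Note that the merge is shallow: a nested object in the patch replaces the default wholesale instead of being merged key by key.</p>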

<p>Once our store was in place, it actually simplified the application in places I didn’t expect. For example, since we don’t need all the detailed enrollment and demographic data for schools at the start of the story, we load it asynchronously, merge it with other data, and then add it to the store object. Rendering methods can exit early if the data isn’t there yet, confident that the update event will trigger a re-render when it becomes available.</p>

<h2 id="cause-and-effect">Cause and Effect</h2>

<p>Of course, it’s easy to be a JavaScript primitivist if your needs are primitive. The argument for frameworks has always been that when your project gets bigger than “a simple to-do app,” you need more robust code for managing complexity. A common example of this is around derived values: if value C is computed from the inputs of A and B, and you change A, how does your program know that C should now be different?</p>

<p>Preact and Vue differ in the exact mechanics, but they both offer a way to set up a derived value with dependency tracking, so that the function knows its implicit inputs and will update in response. These generally run on a “pull” model: they only update when requested, and only recompute their value if their dependencies have changed. Vue also has watch functions that are a “push” model for creating side effects.</p>

<p>Our needs are simple by comparison to a lot of web apps, but we do still need derived values. The map filters at the end of the story are connected to the state object using <a href="https://github.com/Chalkbeat/chi_school_board/blob/29d8986b68aca4e431fd6fdfd9b6277df17f3598/src/js/state-bindings.js">a custom element</a> that creates a two-way binding between the data and the DOM. Mostly these are a 1:1 correspondence. But in some cases, we want a change in one place on the data object to trigger some other alterations. For example, if you change the selected district in the drop-down menu, it should clear the selected school so that we’re not showing unrelated geographic information.</p>

<p>I struggled with this for a little while, with my first implementation being an “update” listener that would check values and enforce these rules out-of-band. This felt awkward and likely to cause synchronization bugs later, as did a second attempt that added listeners to the custom elements themselves.</p>

<p>At some point, however, I realized that my data object could include getter/setter methods that largely mimicked Vue’s computed/watch properties. In a setter, I could cause side effects, and in the getter I could compute derived property values. Here’s an example for districts and schools:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">export</span> <span class="kd">var</span> <span class="nx">state</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">ReactiveStore</span><span class="p">({</span>
  <span class="p">...</span><span class="nx">STATE_DEFAULT</span><span class="p">,</span>
  <span class="c1">// setting the school ID actually sets the selectedSchool</span>
  <span class="kd">set</span> <span class="nx">school</span><span class="p">(</span><span class="nx">id</span><span class="p">)</span> <span class="p">{</span>
    <span class="kd">var</span> <span class="nx">school</span> <span class="o">=</span> <span class="k">this</span><span class="p">.</span><span class="nx">schools</span><span class="p">?.</span><span class="nx">find</span><span class="p">(</span><span class="nx">s</span> <span class="o">=&gt;</span> <span class="nx">s</span><span class="p">.</span><span class="nx">id</span> <span class="o">==</span> <span class="nx">id</span><span class="p">);</span>
    <span class="k">this</span><span class="p">.</span><span class="nx">district</span> <span class="o">=</span> <span class="nx">school</span><span class="p">?.</span><span class="nx">home_district</span> <span class="o">||</span> <span class="dl">""</span><span class="p">;</span>
    <span class="k">this</span><span class="p">.</span><span class="nx">selectedSchool</span> <span class="o">=</span> <span class="nx">school</span> <span class="o">||</span> <span class="kc">false</span><span class="p">;</span>
  <span class="p">},</span>
  <span class="kd">get</span> <span class="nx">school</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="k">this</span><span class="p">.</span><span class="nx">selectedSchool</span><span class="p">?.</span><span class="nx">id</span> <span class="o">||</span> <span class="dl">""</span><span class="p">;</span>
  <span class="p">},</span>
  <span class="c1">// setting the district number resets the school filter</span>
  <span class="kd">set</span> <span class="nx">district</span><span class="p">(</span><span class="nx">number</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="nx">number</span> <span class="o">!=</span> <span class="k">this</span><span class="p">.</span><span class="nx">_district</span><span class="p">)</span> <span class="p">{</span>
      <span class="k">this</span><span class="p">.</span><span class="nx">_district</span> <span class="o">=</span> <span class="nx">number</span><span class="p">;</span>
      <span class="k">this</span><span class="p">.</span><span class="nx">selectedSchool</span> <span class="o">=</span> <span class="kc">false</span><span class="p">;</span>
    <span class="p">}</span>
  <span class="p">},</span>
  <span class="kd">get</span> <span class="nx">district</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">return</span> <span class="k">this</span><span class="p">.</span><span class="nx">_district</span><span class="p">;</span>
  <span class="p">}</span>
<span class="p">});</span>
</code></pre></div></div>

<p>This feels better to me than my earlier approach, as it puts the rules inline with the data itself, instead of having an external function try to intercept changes after they’ve already been committed to the store. We could also memoize these, just as Vue and Preact do, if they were a performance issue. But since these generally operate in “UI time,” where changes have to come from a human interaction and rarely involve more than one cycle through the dependency chain, it proved unnecessary.</p>
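
<p>For the curious, memoizing a getter in this style might look something like the sketch below. It is not code from the project: the <code class="language-plaintext highlighter-rouge">filters</code> object, the <code class="language-plaintext highlighter-rouge">_cache</code> and <code class="language-plaintext highlighter-rouge">computeCount</code> properties, and the sample schools are all hypothetical.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// A sketch of a memoized getter on a plain object; all names are hypothetical.
var filters = {
  schools: [
    { id: 1, home_district: "3" },
    { id: 2, home_district: "7" },
    { id: 3, home_district: "7" }
  ],
  _district: "7",
  _cache: null,
  computeCount: 0, // only here to show how often the filter actually runs
  set district(number) {
    if (number != this._district) {
      this._district = number;
      this._cache = null; // invalidate the derived value
    }
  },
  get district() {
    return this._district;
  },
  // derived value: recomputed only after the district changes
  get schoolsInDistrict() {
    if (this._cache === null) {
      this.computeCount++;
      var district = this._district;
      this._cache = this.schools.filter(function (s) {
        return s.home_district == district;
      });
    }
    return this._cache;
  }
};

filters.schoolsInDistrict; // computes the list
filters.schoolsInDistrict; // served from the cache
filters.district = "3"; // clears the cache
console.log(filters.schoolsInDistrict.length, filters.computeCount); // 1 2
</code></pre></div></div>

<p>The catch is that the cache only knows about the dependencies you remember to invalidate by hand. Here, replacing the <code class="language-plaintext highlighter-rouge">schools</code> array would leave a stale result behind, which is exactly the bookkeeping that framework dependency tracking automates.</p>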

<p>More importantly, this doesn’t rely on any framework knowledge. You don’t have to be a Preact or Vue developer to see what it’s doing–you only have to understand JavaScript itself, which is a significantly more transferable skill.</p>

<h2 id="the-frame-works">The Frame Works</h2>

<p>Conventional wisdom in front-end development is that creating your own framework is a Turing tarpit: once the initial enthusiasm wears off, you’re left with additional burdens for maintenance, training, and hiring that you wouldn’t have if you simply used a well-known library. I don’t know if I agree with this completely, given the frantic pace of change in JavaScript’s “well-known library” culture. Our module is built around browser primitives that are stable and well-documented, like Proxy and events, or standard language features like getter functions. It’s not like I’m reinventing observability from scratch.</p>

<p>The particular environment of news development also gives us some advantages here that larger, longer-lived web apps don’t have. Our code is frozen and served as a static artifact, so we don’t need to really worry about upgrades or ongoing maintenance. And since we’re working with tools that live outside the framework ecosystem (such as D3 or Leaflet), we’d be spending a lot of time hitting the escape hatches in anything off-the-shelf anyway.</p>

<p>Ultimately, the value of creating our own solution is less about the space or efficiency savings, and more about retaining better awareness of what we’re importing, and why. It’s also a good opportunity to ask whether a pattern <em>actually works</em> or if it’s just something that tickles our lizard brains.</p>

<p>In this case, I think the reactive store pattern proves successful, if only because it dramatically simplifies the process of two-way binding between our filter controls, our scroll blocks, and the map state. Since the backing data just looks like a regular JavaScript object to any code that touches it, we don’t have to pass around references to read/write functions or individual signals, and we can contain any side effects from UI inside the store object.</p>

<p>Using a library for this module wouldn’t be wrong, per se. But I think Baldur Bjarnason puts it well when <a href="https://www.baldurbjarnason.com/2024/the-deskilling-of-web-dev-is-harming-us-all/">he writes</a> “Framework skills are <em>perishable</em>, but are easily just as complicated as the foundation layers of the web platform and it takes just as much – if not more – effort to keep them up to date.” If our needs are straightforward–and let’s be clear, most of us not working at a billion dollar tech company have pretty straightforward needs–maybe we can get by just fine with our own implementation, backed by web standards, instead of importing a “<a href="https://opensource.com/article/17/2/hidden-costs-free-software">free as in puppy</a>” library, no matter how minimal it seems at the time.</p>]]></content><author><name>Thomas Wilburn</name></author><summary type="html"><![CDATA[Scrollytelling powered by custom observables]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://dataviz.chalkbeat.org/assets/images/2024-06-21-state-of-the-art/header.jpg" /><media:content medium="image" url="https://dataviz.chalkbeat.org/assets/images/2024-06-21-state-of-the-art/header.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Data Crimes of 2023</title><link href="https://dataviz.chalkbeat.org/2023/12/19/data-crimes-2023.html" rel="alternate" type="text/html" title="Data Crimes of 2023" /><published>2023-12-19T00:00:00+00:00</published><updated>2023-12-19T00:00:00+00:00</updated><id>https://dataviz.chalkbeat.org/2023/12/19/data-crimes-2023</id><content type="html" xml:base="https://dataviz.chalkbeat.org/2023/12/19/data-crimes-2023.html"><![CDATA[<p>As data journalists, we’re used to patching together a workflow from whatever tools we can lay our hands on, but there’s a fine line between Macgyver and MacGruber, and sometimes we cross it. 
Since we think it’s a good thing to share not only our triumphs, but also our hall of shame, here are some of our most duct-taped solutions to reporting problems from 2023.</p>

<p>(Suspiciously, almost all of these are related to processing dashboards.)</p>

<h2 id="thomas-deleting-16000-empty-columns-in-excel">Thomas: Deleting 16,000 empty columns in Excel</h2>

<p>Most people don’t know that Excel supports a maximum of 16,384 columns (that’s 2^14, for my real binary fans). It’s hard to imagine a scenario in which you would ever need that many, which is why it took me by surprise when I ran a set of spreadsheet files through a CSV converter in order to merge them, and discovered that the file size had ballooned from ~1MB to half a gigabyte, the vast majority of which was trailing commas.</p>

<p>When I opened the files in Excel, they looked normal: columns A through Q had the data that I would expect, and there was nothing to the right of Q. However, when I moused over the header and tried to resize the last column, I discovered that what I thought was a normal header row at the top of the table actually extended all the way out to near-infinity (with labels like “Cell 13678”), and it had simply been hidden by setting the column width to zero past that point:</p>

<p><img src="/assets/images/2023-12-19-data-crimes-2023/image-0.png" alt="An Excel header that jumps from column Q to column XFD" />
<em>oh no</em></p>

<p>Excel can handle these files just fine, because it doesn’t care about sparse ranges, but any of the tools that I normally use for combining data (such as <code class="language-plaintext highlighter-rouge">openpyxl</code> or <code class="language-plaintext highlighter-rouge">xlsx2csv</code>) try to get the full width of the sheet, see these hidden header cells extending off into the distance, and work themselves into a frenzy adding empty commas to every line of actual data.</p>

<p>I tried a few processing tricks to handle these files, before finally turning to my tool of last resort: Visual Basic for Applications, the macro language that’s built into Excel itself. Making a new workbook that contained the filenames of all the input files, I then ran this subroutine in the script editor to open each one, delete the trailing headers, and re-save the file.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">Public</span> <span class="nc">Sub</span> <span class="nf">trimSheets</span><span class="o">()</span>
<span class="nc">Dim</span> <span class="n">i</span> <span class="nc">As</span> <span class="nc">Integer</span>
<span class="nc">Dim</span> <span class="n">wb</span> <span class="nc">As</span> <span class="nc">Workbook</span>
<span class="nc">Dim</span> <span class="n">file</span> <span class="nc">As</span> <span class="nc">String</span>
<span class="nc">Dim</span> <span class="n">sheet</span> <span class="nc">As</span> <span class="nc">Worksheet</span>
<span class="nc">For</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">1</span> <span class="nc">To</span> <span class="mi">13</span>
    <span class="n">file</span> <span class="o">=</span> <span class="nc">Cells</span><span class="o">(</span><span class="n">i</span><span class="o">,</span> <span class="mi">1</span><span class="o">).</span><span class="na">Value</span>
    <span class="nc">Set</span> <span class="n">wb</span> <span class="o">=</span> <span class="nc">Workbooks</span><span class="o">.</span><span class="na">Open</span><span class="o">(</span><span class="n">file</span><span class="o">)</span>
    <span class="nc">Set</span> <span class="n">sheet</span> <span class="o">=</span> <span class="n">wb</span><span class="o">.</span><span class="na">ActiveSheet</span>
    <span class="n">sheet</span><span class="o">.</span><span class="na">Range</span><span class="o">(</span><span class="s">"R1:XFD10"</span><span class="o">).</span><span class="na">Delete</span>
    <span class="n">wb</span><span class="o">.</span><span class="na">Save</span>
    <span class="n">wb</span><span class="o">.</span><span class="na">Close</span>
<span class="nc">Next</span> <span class="n">i</span>
<span class="nc">End</span> <span class="nc">Sub</span>
</code></pre></div></div>

<p>It’s a hideous, utterly unscalable monstrosity, but it did the trick. Feeding these new XLSX files to our standard tools produced CSV files that I could safely combine into a single dataset.</p>
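
<p>For comparison, a different approach from the macro above would be to trim the converted CSV after the fact, dropping every column past the last one that holds a value in any data row. The sketch below is a hypothetical illustration of that idea, not something from this project: it assumes simple CSV with no quoted commas, and it reads the whole file into memory, so it is untested against inputs as large as the ones described here.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Hypothetical post-processing step; assumes no quoted commas or newlines.
function trimEmptyColumns(csvText) {
  var rows = csvText.trim().split("\n").map(function (line) {
    return line.split(",");
  });
  // find the last column with a value in any data row (the header is skipped,
  // since the runaway header cells are what caused the problem)
  var width = 0;
  rows.slice(1).forEach(function (cells) {
    cells.forEach(function (cell, i) {
      if (cell.trim() !== "") width = Math.max(width, i + 1);
    });
  });
  return rows.map(function (cells) {
    return cells.slice(0, width).join(",");
  }).join("\n");
}

var bloated = "name,count,Cell 3,Cell 4\nAda,12,,\nBo,7,,";
console.log(trimEmptyColumns(bloated));
// name,count
// Ada,12
// Bo,7
</code></pre></div></div>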

<h2 id="kae-filtering-and-re-aggregating-state-report-cards">Kae: Filtering and re-aggregating state report cards</h2>

<p>Normally, data dashboards suck, but they have one thing going for them: all the data is formatted the same way over time, with the same column names/headers and consistent output.</p>

<p>Not so for <em>one</em> state. The download files boast 12 separate tabs — which change names from year-to-year — and 1,000+ columns per tab. Those also change names. Fun.</p>

<p>On the plus side, this problem led to a nifty piece of reusable code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dataset</span> <span class="o">=</span> <span class="p">[</span><span class="s">'General'</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">all_df</span> <span class="o">=</span> <span class="p">[]</span>

<span class="k">for</span> <span class="n">data</span> <span class="ow">in</span> <span class="n">dataset</span><span class="p">:</span>
   <span class="n">all_files</span> <span class="o">=</span> <span class="n">glob</span><span class="p">.</span><span class="n">glob</span><span class="p">(</span><span class="n">output_dir</span> <span class="o">+</span> <span class="s">'*'</span> <span class="o">+</span> <span class="n">data</span> <span class="o">+</span> <span class="s">'*.csv'</span><span class="p">,</span><span class="n">recursive</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
   <span class="k">print</span><span class="p">(</span><span class="n">all_files</span><span class="p">)</span>
   <span class="n">all_df</span> <span class="o">=</span> <span class="p">[]</span>
   <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">all_files</span><span class="p">:</span>
       <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s">','</span><span class="p">)</span>
       <span class="n">bad_columns</span> <span class="o">=</span> <span class="p">{</span><span class="s">"Student Enrollment - Total"</span><span class="p">:</span> <span class="s">"# Student Enrollment"</span><span class="p">,</span>
                      <span class="s">'# Grade 9 Total'</span> <span class="p">:</span> <span class="s">'# Grade 9'</span><span class="p">}</span>
       <span class="n">df</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="s">'Number of '</span><span class="p">,</span><span class="s">'# '</span><span class="p">,</span><span class="n">x</span><span class="p">),</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
       <span class="n">df</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="n">bad_columns</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
       <span class="n">all_df</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
   <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">all_df</span><span class="p">,</span> <span class="n">sort</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
   <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[(</span><span class="n">df</span><span class="p">.</span><span class="n">Type</span> <span class="o">==</span> <span class="s">'District'</span><span class="p">)</span> <span class="o">|</span> <span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">Type</span> <span class="o">==</span> <span class="s">'Statewide'</span><span class="p">)]</span>
   <span class="n">df</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">cleaned_dir</span> <span class="o">+</span> <span class="n">data</span> <span class="o">+</span> <span class="s">'_processed.csv'</span><span class="p">,</span><span class="n">sep</span><span class="o">=</span><span class="s">','</span><span class="p">)</span>
</code></pre></div></div>

<p>Sometimes, it only takes one run-through to pull all the relevant data.</p>

<p>This <em>particular</em> time, it took four separate versions of this code, because the state apparently stopped rolling up school-level data to the district level in some (but not all) years. I had to sum all the schools vertically, then group/sum all the relevant student group columns horizontally, then recreate the unique district ID from the school ID hash. Oh, and some of the 2018 data was just in a 2023 column with “(2018)” appended to the standard column name.</p>
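
<p>The vertical half of that rollup, summing school rows up to their district, can be sketched in a few lines. This is an illustration in plain JavaScript rather than the pandas used above, and the column names and numbers are invented; the real files have far more columns and far less consistency.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Made-up school rows; the real files have 1,000+ columns per tab.
var schools = [
  { districtId: "D1", enrolled: 300, grade9: 80 },
  { districtId: "D1", enrolled: 450, grade9: 120 },
  { districtId: "D2", enrolled: 200, grade9: 0 }
];

// sum every column except the grouping key, per district
function rollUp(rows, key) {
  var totals = {};
  rows.forEach(function (row) {
    if (!totals[row[key]]) totals[row[key]] = {};
    var total = totals[row[key]];
    Object.keys(row).forEach(function (column) {
      if (column == key) return;
      total[column] = (total[column] || 0) + row[column];
    });
  });
  return totals;
}

console.log(rollUp(schools, "districtId"));
// logs: { D1: { enrolled: 750, grade9: 200 }, D2: { enrolled: 200, grade9: 0 } }
</code></pre></div></div>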

<p>After all that, a particular school district disputed the state’s numbers, claiming they were off by as much as a factor of 10 in one school year. So I pulled out the calculator and checked the raw, pre-processing dataset. The state’s numbers may well be wrong! But thankfully our totals weren’t, which saved us from having to re-run the whole pipeline.</p>

<h2 id="thomas-extracting-aclu-data-through-the-dev-tools-debugger">Thomas: Extracting ACLU data through the dev tools debugger…</h2>

<p>If you’re writing about laws targeting trans students, there are a few datasets available, but none of them are particularly great. The easiest to cite in a journalism context is probably the ACLU’s “<a href="https://www.aclu.org/legislative-attacks-on-lgbtq-rights">Mapping Attacks on LGBTQ Rights</a>” (although, as always, it’s the table we want and not the map). They don’t make this downloadable in a machine-readable format, but it’s scrapable, right?</p>

<p>Well, not really. The page is built in Vue and actually assembled in the browser, with data being passed to the interactive sections via attributes:</p>

<p><img src="/assets/images/2023-12-19-data-crimes-2023/image-1.png" alt="Screenshot of line after line of JSON data" /></p>

<p>There are literally thousands of lines of JSON being jammed into the middle of the HTML here. Setting aside any critique of the architecture, this is pretty annoying.</p>

<p>If our goal was to have regular updates of this dataset, we’d need to write a scraper that loads this page, finds these Vue-specific attributes, converts all the HTML entities back into strings, and then hope that none of this implementation changes. That’s doable, and not even particularly difficult. But we really only needed this information once. So instead, I opened the browser dev tools, found a reference to <code class="language-plaintext highlighter-rouge">this.bills</code> in the JavaScript code, set a breakpoint, and just copied the value out of the console.</p>

<p><img src="/assets/images/2023-12-19-data-crimes-2023/image-2.png" alt="Image of the dev tools console" /></p>
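
<p>For comparison, the entity-decoding step of the scraper we didn’t write is short. This is only a sketch using Python’s standard library: the records are stand-ins, and it skips the part where you actually locate the attribute in the page.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import html
import json

# Stand-in records; the real attribute held thousands of lines of bill data.
records = [{"state": "XX", "bill": "Example Bill 1"}]

# Roughly what ends up inside the HTML attribute: JSON with its quotes
# turned into entities so it can sit inside a quoted attribute value.
attribute_value = html.escape(json.dumps(records))

# The scraper's job is the reverse: entities back to text, then text to data.
bills = json.loads(html.unescape(attribute_value))
</code></pre></div></div>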

<p>I don’t think the ACLU was being intentionally obfuscatory–they just built on the defaults that the framework provided–but perhaps this is a good reminder to us all that there’s not really any such thing as a “secure” JavaScript application if you’re willing to take a crowbar to it in the debugger.</p>

<h2 id="kae-tracking-schools-that-change-operators-and-districts-and-also-operators-and-districts">Kae: Tracking schools that change operators, and districts, and also operators and districts</h2>

<p>One state that we’ve covered likes to raise existential questions about the nature of schools and districts.</p>

<p>For example, what if a school was a school in a district? And then it was also its own district, for no apparent reason? But it’s also still in a public school district that same year?</p>

<p>And then what if a charter school operator took it over, so it <em>actually</em> became its own district? Also, for some reason, it was also still in the public school district? What if this school just disappeared from school-level reporting entirely for a year, then came back? What if, for some reason, the exact same school was <em>two</em> schools simultaneously in the same year, and one was a district and one was a school?</p>

<p>Oh, and what if some grades were also just a separate school, but only in the data, not in the actual school? What if one school was <em>just</em> a school, no district, and some other school was just a district, <em>not</em> a school? What if all of that was also different between Reading test and ELA/Math test reporting?</p>

<p>What then? Huh?</p>

<p>Normally, the “what if a school were a district, actually” problem is bad enough. But when we’re also trying to <em>legitimately</em> track schools over time for accountability purposes, it becomes a whole extra nightmare.</p>

<p>This is a common data question in a few different cities, where special “turnaround” programs shifted struggling district schools into the hands of charter school operators, or shifted them into specific district-run programs with similar purposes. From a data perspective, these are (usually) considered different schools. But from a student experience perspective, it’s often the same building and the same school.</p>

<p>So how do we capture both the incongruence and the continuity — and answer questions about the long-term outcomes of schools changing hands?</p>

<p>Basic questions like, “well, did the charter operator improve student outcomes?” become pretty gnarly to answer. That’s partially because charter schools often have fewer reporting requirements than district schools, and partially because it’s hard to tell what trends are attributable to a new operator, vs. other outside factors. (Like, I don’t know, a pandemic, remote learning, or a change in test format.)</p>

<p>But some states <em>really</em> want to make a problem out of this.</p>

<p>Anyway, the main solution to this is “lots and lots and lots of data crosswalks, utterly unreplicable copy-pasting, and by God you better hope you wrote all this down,” but I’ll let the project folder speak for itself.</p>

<p><img src="/assets/images/2023-12-19-data-crimes-2023/image-5.png" alt="Three folders and seven spreadsheets, with titles like OLD VERSION DON'T USE filtered" /></p>

<h2 id="thomas-unzipping-the-poor-mans-gzip">Thomas: Unzipping the poor man’s GZIP</h2>

<p>As Kae notes above, dashboards are the ruin of many a data journalist’s week. Over time, you develop a stable of tricks that you can use to extract what you need from them, within limits. So it’s almost a pleasure when, from time to time, you run into something puzzling and new.</p>

<p>My usual routine with a dashboard is to immediately look at the network requests it makes, and see if there’s a straightforward endpoint I can simply copy. To my surprise, one state was loading data from pairs of CSV files: one that contained only numbers, and the other being an “index” file with only one column of text strings:</p>

<p><img src="/assets/images/2023-12-19-data-crimes-2023/image-6.png" alt="Two text files both alike in incomprehensibility" /></p>

<p>It took a little while to figure out what was going on, but by searching for the values that we could see in the dashboard page, we eventually realized that any time there’s a whole number in the first file, it represents the value at that row number in the second. So if you see “0” in the data, that actually means “1.8%” pulled from the first row of the index file, “1” translates to “1.9%”, and so on down the list.</p>
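
<p>Once you know the trick, undoing it takes a few lines. Here’s a minimal sketch with made-up file contents (only the first two index values match the example above), assuming that anything that isn’t a whole number should pass through untouched:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import csv
import io

# Made-up stand-ins for the two files; the real ones were much larger.
INDEX_CSV = "1.8%\n1.9%\n2.4%\n"
DATA_CSV = "0,2\n1,0\n"

# Row number in the index file (counting from zero) maps to the text it stands for.
lookup = [row[0] for row in csv.reader(io.StringIO(INDEX_CSV))]

def decode(cell):
    # Whole numbers point at a row in the index file; anything else passes through.
    return lookup[int(cell)] if cell.isdigit() else cell

decoded = [[decode(cell) for cell in row] for row in csv.reader(io.StringIO(DATA_CSV))]
</code></pre></div></div>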

<p>The best I can imagine is that this approach is meant to be some sort of compression, like the <a href="https://en.wikipedia.org/wiki/Huffman_coding">Huffman coding</a> that’s used in ZIP and MP3, which uses a binary tree to give the most common values the shortest codes. But there’s no point to this, since browsers <em>already</em> handle GZIP compression for responses. It doesn’t really seem like it’s meant to hide anything. Maybe the developer just thought it was clever. And I’ll say this: while it’s still a data crime, it did make pulling data from this dashboard feel a little more fun, for once.</p>]]></content><author><name>Kae Petrin</name></author><summary type="html"><![CDATA[Here are the worst things we did to process education data this&nbsp;year.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://dataviz.chalkbeat.org/assets/images/2023-12-19-data-crimes-2023/header.jpg" /><media:content medium="image" url="https://dataviz.chalkbeat.org/assets/images/2023-12-19-data-crimes-2023/header.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Scraping Sans Selenium</title><link href="https://dataviz.chalkbeat.org/2023/10/23/scraping-sans-selenium.html" rel="alternate" type="text/html" title="Scraping Sans Selenium" /><published>2023-10-23T00:00:00+00:00</published><updated>2023-10-23T00:00:00+00:00</updated><id>https://dataviz.chalkbeat.org/2023/10/23/scraping-sans-selenium</id><content type="html" xml:base="https://dataviz.chalkbeat.org/2023/10/23/scraping-sans-selenium.html"><![CDATA[<p>This talk was originally presented at <a href="https://2023.srccon.org/schedule/#_session-scraping-dev-tools">SRCCON 2023</a>.</p>

<div style="position: relative; padding-bottom: 56.25%">
<iframe style="width: 100%; height: 100%; position: absolute;" src="https://www.youtube.com/embed/44fg-fzfxsY?si=0zTuhwtcJKhWDKQM" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen=""></iframe>
</div>

<p><a href="https://projects.chalkbeat.org/2023/uploads/srccon_scraping.mp4">Download video as MP4 (182MB)</a></p>

<p><a href="https://docs.google.com/presentation/d/1ESGy2vsd81wh0lX944Vv1ZNyGFi8Co7TEDMmoED08IE/edit?usp=sharing">View slides</a></p>]]></content><author><name>Thomas Wilburn</name></author><summary type="html"><![CDATA[Originally presented at SRCCON 2023]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://dataviz.chalkbeat.org/assets/images/2023-10-23-scraping-sans-selenium/header.jpg" /><media:content medium="image" url="https://dataviz.chalkbeat.org/assets/images/2023-10-23-scraping-sans-selenium/header.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Showing what’s not there</title><link href="https://dataviz.chalkbeat.org/2023/09/26/bad-data-viz-solutions.html" rel="alternate" type="text/html" title="Showing what’s not there" /><published>2023-09-26T00:00:00+00:00</published><updated>2023-09-26T00:00:00+00:00</updated><id>https://dataviz.chalkbeat.org/2023/09/26/bad-data-viz-solutions</id><content type="html" xml:base="https://dataviz.chalkbeat.org/2023/09/26/bad-data-viz-solutions.html"><![CDATA[<p>Municipal and state data is often imperfect. At its best, it’s riddled with suppression, missing values, or missing important categories or identifiers. At its worst, agencies export and maintain it incorrectly: it’s full of duplication errors, or an Excel formula got dropped or deleted, or there are 1,600 columns with unclear names and no documentation. That’s assuming you can even obtain it in a usable format, instead of trying to dredge meaning out of a PDF of slapped-together JPEGs.</p>

<p>These problems are only compounded the moment that you need to compare anything between cities or states, each with their own distinct problems.</p>

<p>Sometimes, analysis and text are sufficient and give you room for necessary disclaimers about data quality. Other times, even <a href="https://onlinejournalismblog.com/2023/04/18/what-is-dirty-data-and-how-do-i-clean-it-a-great-big-guide-for-data-journalists/">extensive cleaning</a>, <a href="https://source.opennews.org/articles/building-cleaner-smarter-spreadsheets/">clever restructuring</a>, or other <a href="https://qz.com/572338/the-quartz-guide-to-bad-data">more intensive troubleshooting</a> can’t change the fact that the data has some fundamental, underlying issues.</p>

<p>If you’re trying to then <em>visualize</em> this data, you run into more problems. A chart, even with a lengthy note, communicates certainty in a way that a caveat-filled paragraph doesn’t. Audiences tend to take an image at face value.</p>

<p>Here are a few ways that Chalkbeat navigates this.</p>

<h2 id="showing-non-comparable-data-on-the-same-topic">Showing non-comparable data on the same topic</h2>

<p>Okay, so you’ve got a bunch of data. It’s pretty interesting data! No one else has put it all together! Assembling it would be novel work that could shed light on an important issue! But there’s no standardized tracking, not every agency even has it, and the agencies that <em>do</em> have it define it all differently.</p>

<p>You can’t let audiences compare it. It’s not comparable.</p>

<p>Which means you can’t put it all on the same axis.</p>

<p>What next?</p>

<h4 id="dont-let-people-compare-the-data">Don’t let people compare the data</h4>

<p>One solution: <a href="https://co.chalkbeat.org/2022/7/12/23203732/denver-bilingual-education-tnli-school-closures-declining-enrollment">small multiples</a>. This is a great way to show the landscape and variability in data without encouraging people to draw conclusions that just don’t exist.</p>

<p>If you want to, you can also add visual flags, notes, and other cues to emphasize just <em>how different</em> the data is.</p>

<p>Here’s one way Chalkbeat used this method to collate teacher retention data, which has been a subject of much debate in the education world:</p>

<side-chain src="https://s3.amazonaws.com/chalkbeatgraphics/dailygraphics/nat-turnover-updates-20230526/index.html"></side-chain>

<p>You’ll notice that instead of laying out the line charts side-by-side, which we do more often when the data is at least <em>slightly</em> consistent but topically different, we require an interactive step to switch between datasets.</p>

<p>This graphic has a few key advantages:</p>
<ul>
  <li>The one-view-at-a-time format prevents audiences from comparing data that doesn’t use the same time frames, collection processes, and definitions.</li>
  <li>That extra “interaction” step cues audiences that they’re switching into something that <em>is</em> different, even if they don’t understand the full extent of how or why.</li>
  <li>Individual notes let us define and describe the data on a case-by-case basis.</li>
</ul>

<h4 id="show-how-and-why-it-cant-be-compared">Show how and why it can’t be compared</h4>

<p>But if the data is <em>so</em> not-comparable, or has additional collection and methodological issues, you might not even be able to do that.</p>

<p>In that case, it might be worth a graphic that highlights the uneven landscape — and the sheer impossibility of reporting all the data. You could highlight the different criteria for collection, states that don’t release data, school districts where a given survey <em>wasn’t</em> distributed, and so on.</p>

<p>We did that here:</p>

<side-chain src="https://datawrapper.dwcdn.net/OJ9n5/12/index.html"></side-chain>

<p>This graphic doesn’t even dig into the fact that there are half a dozen different definitions of “nonbinary student” in this data, so those states that <em>do</em> collect it may be categorizing different groups of students.</p>

<p>But it highlights the uneven landscape: a mix of released data, unreleased data, and non-collection at the state level. It also neatly sidesteps the issue that the actual data itself is pretty mediocre.</p>

<p>A similar concept is also applicable anywhere that measured reporting thresholds are different, or where municipal data doesn’t make it up to state or federal reporting <a href="https://www.themarshallproject.org/2023/07/13/fbi-crime-rates-data-gap-nibrs">due to noncompliant collection methods (or just plain shoddy reporting)</a>.</p>

<h2 id="showing-suppression-and-other-missing-data">Showing suppression and other missing data</h2>

<p>When there are systematic flaws in something we want to chart, the solution is often to go with a really limited visual that doesn’t require using the problem data. This often involves abandoning over-time trends and other common visual approaches.</p>

<p>However, if you get a bit creative, you can analyze and visualize the <em>quality</em> of the data instead of the content of the data. This is especially appropriate when the poor-quality data illuminates an accountability failure.</p>

<h4 id="show-what-we-do-have">Show what we do have</h4>

<p>The simplest solution, of course, is to work around the data.</p>

<p>When we wanted to visualize the tiny handful of students reported as nonbinary by state departments of education across the U.S., we ran into a few issues. Different timeframes, for one. But also, because the student populations were so small, the over-time trend data was tiny at the absolute level, and extremely subject to looking like it had huge surges at the proportion level (from, say, 12 to 38 kids — a 217% increase that accounts for less than 0.1% of a state’s K-12 population).</p>
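
<p>To make that arithmetic concrete, here is the same example worked through in a few lines of Python. The enrollment figure is a hypothetical round number for illustration, not any particular state’s count.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>first_year, latest_year = 12, 38
enrollment = 900_000  # hypothetical statewide K-12 enrollment, for illustration only

# The same change, measured two ways.
percent_change = (latest_year - first_year) / first_year * 100
share_of_students = latest_year / enrollment * 100

print(f"{percent_change:.0f}% increase")           # 217% increase
print(f"{share_of_students:.4f}% of students")     # 0.0042% of students
</code></pre></div></div>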

<p>As an extra fun problem, some school districts across the same states started letting students change their gender markers to nonbinary at different times — and others might not even have a way for students to report that data, even though the state has started collecting it. So the actual reporting entities change quickly year-over-year. On the flip side, other states added the option all at once at the state level, but students might not know it existed for a year or so.</p>

<p>This poses extra ethical issues due to widespread misinformation about increases in populations of trans youth. A misstep in visualization could enable bad-faith, out-of-context social media screenshots and contribute to viral misinformation.</p>

<p>So, we removed the trend data in favor of first-year and most-recent-year snapshots.</p>

<side-chain src="https://datawrapper.dwcdn.net/gC74C/4/index.html"></side-chain>

<p>This restriction controls for the major problems with the data, but still gives some idea of the current reporting landscape.</p>

<p>On the other hand, when dealing with student testing data for online schools in Colorado, it simply didn’t make sense to do that — the snapshot data <em>and</em> the trend data were both bad because many of the schools were so new, and eligible students weren’t taking the tests. That meant that even though a lot of the data <em>was</em> reported, it wasn’t really meaningful or representative.</p>

<p>Instead we just chose a metric that every school <em>did</em> have to report for all students: graduation rates.</p>

<side-chain src="https://datawrapper.dwcdn.net/N111N/6/index.html"></side-chain>

<p>This isn’t always an option, of course, but especially when you’re working with big state releases, there may be multiple datasets that answer the same question.</p>

<p>In our case, we just wanted to know, “how do the students at online schools perform compared to students at brick-and-mortar schools?” Any number of datasets could answer that question. We just went with the one that showed the most complete picture of student outcomes, instead of the more traditional test score metric.</p>

<h4 id="show-what-we-dont-have--and-how-prevalent-that-is">Show what we don’t have — and how prevalent that is</h4>

<p>If all else fails, sometimes it’s better to visualize the information that is missing — demonstrating concretely what’s there and what’s not.</p>

<p>On the Colorado online schools story above, we found that we weren’t the only people having problems evaluating the performance of the schools. The schools threw up a bunch of “could not evaluate due to incomplete data” flags for state education department analysts, too.</p>

<p>The state releases categorized performance information each year. This ended up serving as a nice shorthand, which we were able to use to show the full extent of how little insight <em>anyone</em> has into the performance of these schools. It also emphasized that this problem was unique to these schools.</p>

<side-chain src="https://datawrapper.dwcdn.net/xdf1h/11/index.html"></side-chain>

<p>But in other instances, no one is doing that calculation for you. Still — you can do it yourself.</p>

<p>In the case of <a href="https://datavizpublic.in.gov/views/IndianaHighSchoolGraduateEmploymentOutcomesDashboard/EducationMattersDashboard?%3AshowAppBanner=false&amp;%3Adisplay_count=n&amp;%3AshowVizHome=n&amp;%3Aorigin=viz_share_link&amp;%3AisGuestRedirectFromVizportal=y&amp;%3Aembed=y&amp;%3Atoolbar=n#1">this government dashboard in Indiana</a>, that’s what we ended up doing. The Indiana Department of Education started tracking median income of high school graduates in one of several attempts to understand student outcomes. In the dashboard, the data looks straightforward. And it’s presented as a comprehensive income tracking resource.</p>

<p>But if you check the data definitions, there’s a red flag: the median is based only on students who have “sustained employment,” which itself has three definitional requirements.</p>

<p>Okay, so maybe we can show general employment and sustained employment?</p>

<p>Except… how is this data collected? Well, it’s based on workers… employed only in Indiana… and whose employers participate in unemployment insurance. Which means that any graduates who move out-of-state aren’t tracked at all. And not even all workers in Indiana are tracked, on top of the “sustained” vs. more general employment definitions.</p>

<p>So instead, we FOIA’d the raw numbers, did some simple subtraction, then converted to portions to show exactly how incomplete the data is:</p>

<side-chain src="https://s3.amazonaws.com/chalkbeatgraphics/dailygraphics/in-wealth-breakouts-20220406/index.html"></side-chain>

<p>It really emphasizes how unrepresentative the median is — and doesn’t even require explaining all the complicated limitations of the dashboard.</p>

<h2 id="knowing-when-somethings-just-too-bad-to-work-with">Knowing when something’s just too bad to work with</h2>

<p>It’s of course also important to consider when data is <em>so</em> bad that you might take on liability by trying to use, fix, or interpret it. And it’s extra important to consider when <em>visualization</em> might lend credibility to trends that speak more to the randomness of bad data than any sort of reality.</p>

<p>Large margins of error caused by incomplete or suppressed data are an obvious one. In education data, for instance, I frequently check the suppressed totals against the unsuppressed totals, and make sure that the portion of unreported student data in any subselection of schools doesn’t exceed a reasonable threshold. If a third of relevant students’ data is omitted, it’s probably not a very useful metric.</p>
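
<p>That check is simple enough to script. The sketch below is one way to do it, with made-up enrollment figures and hypothetical group names; the function name is mine, not from any library.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def unreported_share(total_students, reported_students):
    # Portion of students whose results were suppressed or otherwise left out.
    return (total_students - reported_students) / total_students

# Made-up subselections of schools: (all enrolled students, students with reported data).
subselections = {
    "rural schools": (1200, 1100),
    "online schools": (900, 540),
}

for name, (total, reported) in subselections.items():
    print(name, f"{unreported_share(total, reported):.0%} unreported")
</code></pre></div></div>

<p>In this invented example, the online schools are missing data for 40% of their students, which is past the one-third mark where I’d start doubting the metric.</p>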

<p>But there are subtler signs, too.</p>

<p>If a data collection is new, year-over-year trend data might be <em>complete</em> but subject to the whimsies of inconsistent implementations, slow rollouts, and spotty instruction on how to collect and report the data, <a href="https://www.chalkbeat.org/2022/5/10/23063639/nonbinary-student-federal-civil-rights-data-collection">especially if the data deals with a small subpopulation</a>. If the data definitions changed for only one year — as we see <a href="https://detroit.chalkbeat.org/2022/11/7/23422689/school-attendance-detroit-michigan-students-chronic-absenteeism">frequently with COVID-19 data</a> — it may be reasonable to make that one year visually distinct, but if the data definitions change <em>every</em> year, or multiple years in a row, a visual trend is just noise.</p>

<p>Maybe you’ve assembled data from multiple states or regional agencies that all track <em>roughly</em> the same thing but use totally different definitions, or are staffed inconsistently for in-person data collection like inspections. In that case, the data might be usable on an individual basis, but any assembled dataset could imply comparisons that just aren’t there.</p>

<p>Deciding what’s <em>too bad</em> is often more of an editorial call than an exact science, but the core questions to ask yourself include:</p>

<ol>
  <li>Is so much data missing that any visualization distorts reality?</li>
  <li>Is it possible that the visual will overstate random noise, or accidentally create the impression of a trend that may or may not exist?</li>
  <li>If new agencies, geographies, domains, etc., are being introduced to the tracking, is there any way to normalize for it, e.g., with a per capita calculation or some other metric?</li>
  <li>How much would we need to explain this to an audience? Is it possible to explain it clearly and accurately, or would any data work be so full of caveats that it renders the important information incomprehensible?</li>
  <li>As a journalist, are you comfortable standing by and defending the judgment calls you made to interpret the data?</li>
</ol>

<h2 id="why-and-when-is-this-worth-all-the-time-and-effort">Why and when is this worth all the time and effort?</h2>

<p>It’s also worth considering whether the data’s low quality makes the story <em>extra</em> important.</p>

<p>Sometimes in newsrooms, if you pitch a data story with a million caveats attached, an editor’s response is: <em>We can’t run a story if the data’s no good.</em></p>

<p>It’s a reasonable first impulse, in some senses. But it’s also a major oversight from a journalistic perspective: What’s the point of a free press, if we’re not uncovering new information and shedding light on problems that the public has overlooked?</p>

<p>In an increasingly data-driven age, problems with data <em>are</em> problems for people.</p>

<p>For instance, though reporting on multiracial people by the U.S. Census and other data sources is <a href="https://source.opennews.org/articles/condensed-history-multiracial-identification/">notoriously inconsistent</a>, multiracial people are <a href="https://www.census.gov/library/stories/2023/06/nearly-a-third-reporting-two-or-more-races-under-18-in-2020.html">a growing portion of the U.S.</a> that shouldn’t be ignored. And oversights in data collection can have <a href="https://19thnews.org/2023/05/middle-eastern-north-african-americans-census-data/">significant impacts for individuals</a>, because they can affect research and funding. <a href="https://www.stlpr.org/law-order/2020-07-30/st-louis-police-use-of-force-data">Entirely missing data</a> also means that government oversight measures <a href="https://www.reuters.com/article/usa-pollution-airmonitors-specialreport-idUSKBN28B4RT">can’t be enforced</a>.</p>

<p>Part of the reason we encounter this so often at Chalkbeat is that we’re a <a href="https://www.chalkbeat.org/pages/about">mission-driven publication</a> that covers public education from an equity-focused lens. That means we’re <em>specifically</em> interested in showing what’s happening to students who have been historically underserved by their public schools. And that often means that our data work tries to look at students, schools, and issues that are omitted or overlooked not just by other journalists, but by the systems that track and create data on schools to start with.</p>

<p>So when there <em>is</em> data that’s flawed, but it’s on <a href="https://www.thenation.com/article/society/trans-solitary-confinement/">especially undercovered groups</a>, or on an issue <a href="https://www.washingtonpost.com/graphics/investigations/police-shootings-database/">that is complicated to track</a>, it can be well worth the time to figure out <em>some</em> way to report within the limitations of the existing data — as long as you keep in mind why it matters, and who it matters to.</p>

<script src="https://projects.chalkbeat.org/sidechain/loader.js"></script>]]></content><author><name>Kae Petrin</name></author><summary type="html"><![CDATA[How to responsibly visualize flawed data — and when to not]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://dataviz.chalkbeat.org/assets/images/2023-09-26-bad-data-viz-solutions/banner.jpg" /><media:content medium="image" url="https://dataviz.chalkbeat.org/assets/images/2023-09-26-bad-data-viz-solutions/banner.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Building an interactive graphic for teacher salaries</title><link href="https://dataviz.chalkbeat.org/2023/08/14/teacher-salaries.html" rel="alternate" type="text/html" title="Building an interactive graphic for teacher salaries" /><published>2023-08-14T00:00:00+00:00</published><updated>2023-08-14T00:00:00+00:00</updated><id>https://dataviz.chalkbeat.org/2023/08/14/teacher-salaries</id><content type="html" xml:base="https://dataviz.chalkbeat.org/2023/08/14/teacher-salaries.html"><![CDATA[<p>Coming into this summer, I knew I wanted to work with Chalkbeat’s <a href="https://github.com/chalkbeat/dailygraphics-next">dailygraphics rig</a>. I thought it would be a great opportunity to learn about building visualizations with code, since I’d primarily used tools like Tableau, Flourish, and Datawrapper in the past.</p>

<p>Building custom visualizations isn’t always a good investment compared to those kinds of out-of-the-box tools. For a lot of local stories, it makes sense to stick with Datawrapper and not reinvent the wheel. However, for stories where we had a lot of data series, like <a href="https://www.chalkbeat.org/2023/6/21/23767632/naep-math-reading-learning-loss-covid-long-term-trend">changes in average NAEP scores</a> for 13-year-olds by subject and racial or ethnic group, I worked in the rig to have more control over axes, colors, and small-multiple grouping.</p>

<p>One story in particular called for something more individualized. At the end of June, Memphis reporter Laura Testino asked the data team to help her visualize changes to <a href="https://tn.chalkbeat.org/2023/8/2/23817328/memphis-shelby-county-schools-new-teacher-salary-schedule-calculator">how Memphis Shelby County Schools pays its teachers</a>. The district moved from a 31-step payscale to an 18-step scale, with new brackets for teachers with an Ed.S. degree.</p>

<p>The goal was to have a graphic that was teacher-friendly and contained all the pertinent information. We decided to make an interactive where teachers could select their current salary range, and the graphic would return the new salary range. Rather than use one of our pre-existing dailygraphics templates, we started from scratch using JavaScript and <a href="https://lit.dev/docs/v1/lit-html/introduction/">lit-html</a>. The latter lets us create a strong link between JavaScript values and the structure or content of the page, where updating one changes the other.</p>

<h2 id="the-basics">The basics</h2>

<p>The first thing we did was set up a box for teachers to input their salary and indicate what kind of degree they had. Both inputs updated a state object with values for the numerical salary and pay ladder. A separate variable for Ed.S. degrees was controlled by a checkbox, and could be true or false. By using a <code class="language-plaintext highlighter-rouge">setState</code> function in the template code, changes would update the values and then re-render the template based on the new state.</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;label</span> <span class="na">for=</span><span class="s">"salary-input"</span><span class="nt">&gt;</span>Salary:<span class="nt">&lt;/label&gt;</span>
<span class="nt">&lt;input</span>
    <span class="na">id=</span><span class="s">"salary-input"</span>
    <span class="na">.value=</span><span class="s">${state.salary}</span>
    <span class="err">@</span><span class="na">input=</span><span class="s">${e</span> <span class="err">=</span><span class="nt">&gt;</span> setState("salary", Number(e.target.value))}
&gt;
<span class="nt">&lt;label</span> <span class="na">for=</span><span class="s">"degree-select"</span><span class="nt">&gt;</span>Degree:<span class="nt">&lt;/label&gt;</span>
<span class="nt">&lt;select</span>
  <span class="na">id=</span><span class="s">"degree-select"</span>
  <span class="na">.value=</span><span class="s">${state.oldDegree}</span>
  <span class="err">@</span><span class="na">input=</span><span class="s">${e</span> <span class="err">=</span><span class="nt">&gt;</span> setState("oldDegree", e.target.value)}
&gt;
    <span class="nt">&lt;option</span> <span class="na">value=</span><span class="s">"bach"</span><span class="nt">&gt;</span>B.A.<span class="nt">&lt;/option&gt;</span>
    <span class="nt">&lt;option</span> <span class="na">value=</span><span class="s">"mast"</span><span class="nt">&gt;</span>M.A.<span class="nt">&lt;/option&gt;</span>
    <span class="nt">&lt;option</span> <span class="na">value=</span><span class="s">"doc"</span><span class="nt">&gt;</span>Doctorate<span class="nt">&lt;/option&gt;</span>
<span class="nt">&lt;/select&gt;</span>
<span class="nt">&lt;input</span> 
  <span class="na">id=</span><span class="s">"eds-check"</span>
  <span class="na">type=</span><span class="s">"checkbox"</span>
  <span class="na">.checked=</span><span class="s">${state.eds}</span>
  <span class="err">@</span><span class="na">input=</span><span class="s">${e</span> <span class="err">=</span><span class="nt">&gt;</span> setState("eds", e.target.checked)}&gt;
<span class="nt">&lt;label</span> <span class="na">for=</span><span class="s">"eds-check"</span><span class="nt">&gt;</span>You have an Ed.S.?<span class="nt">&lt;/label&gt;</span>
</code></pre></div></div>
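<p>For readers who haven’t seen this pattern, here is a minimal sketch of what a <code class="language-plaintext highlighter-rouge">setState</code> helper along these lines might look like. The <code class="language-plaintext highlighter-rouge">render</code> stub and the starting values are assumptions for illustration; in the actual graphic, re-rendering is handled by lit-html and the project template.</p>

```javascript
// Hypothetical state object: the field names mirror the template above,
// but the starting values are assumptions.
const state = { salary: 0, oldDegree: "bach", eds: false };

// Stand-in for the real render step. In the graphic this would re-run the
// lit-html template; here it only counts calls so the flow is visible.
let renderCount = 0;
function render() {
  renderCount += 1;
}

// Update one key, then re-render from the new state.
function setState(key, value) {
  state[key] = value;
  render();
}

setState("salary", 52000);
setState("eds", true);
console.log(state, renderCount);
```

<p>Every input handler funnels through the same function, so there is only one place where state changes and only one place that triggers a re-render.</p>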

<p>The graphic takes the old salary, then displays the old salary band and new salary band. A <code class="language-plaintext highlighter-rouge">salaryMatch</code> function retrieves numbers from the data provided to us by Memphis Shelby County Schools, then assigns the salary to a range by finding the last step whose old-scale salary is less than or equal to the input. If the input is lower than every step, it falls back to the first row.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">salaryMatch</span><span class="p">()</span> <span class="p">{</span>
  <span class="kd">var</span> <span class="nx">oldDegree</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">old_</span><span class="dl">"</span> <span class="o">+</span> <span class="nx">state</span><span class="p">.</span><span class="nx">oldDegree</span><span class="p">;</span>
  <span class="kd">var</span> <span class="nx">newDegree</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">new_</span><span class="dl">"</span> <span class="o">+</span> <span class="nx">state</span><span class="p">.</span><span class="nx">oldDegree</span><span class="p">;</span>
  <span class="k">if</span> <span class="p">(</span><span class="nx">state</span><span class="p">.</span><span class="nx">eds</span> <span class="o">&amp;&amp;</span> <span class="nx">state</span><span class="p">.</span><span class="nx">oldDegree</span> <span class="o">!=</span> <span class="dl">"</span><span class="s2">doc</span><span class="dl">"</span><span class="p">){</span>
    <span class="nx">newDegree</span> <span class="o">=</span> <span class="dl">"</span><span class="s2">new_eds</span><span class="dl">"</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="kd">var</span> <span class="nx">first</span> <span class="o">=</span> <span class="nb">window</span><span class="p">.</span><span class="nx">DATA</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
  <span class="kd">var</span> <span class="nx">salary</span> <span class="o">=</span> <span class="nb">Number</span><span class="p">(</span><span class="nb">String</span><span class="p">(</span><span class="nx">state</span><span class="p">.</span><span class="nx">salary</span><span class="p">).</span><span class="nx">replace</span><span class="p">(</span><span class="sr">/</span><span class="se">[\$</span><span class="sr">,</span><span class="se">]</span><span class="sr">/g</span><span class="p">,</span> <span class="dl">""</span><span class="p">));</span>
  <span class="kd">let</span> <span class="nx">found</span> <span class="o">=</span> <span class="nb">window</span><span class="p">.</span><span class="nx">DATA</span><span class="p">.</span><span class="nx">findLast</span><span class="p">(</span><span class="nx">d</span> <span class="o">=&gt;</span> <span class="nx">d</span><span class="p">[</span><span class="nx">oldDegree</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="nx">salary</span><span class="p">)</span> <span class="o">||</span> <span class="nx">first</span><span class="p">;</span>
  <span class="kd">let</span> <span class="nx">previous</span> <span class="o">=</span> <span class="p">{</span>
    <span class="na">degree</span><span class="p">:</span> <span class="nx">oldDegree</span><span class="p">,</span>
    <span class="na">step</span><span class="p">:</span> <span class="nx">found</span><span class="p">.</span><span class="nx">old_step</span><span class="p">,</span>
    <span class="na">salary</span><span class="p">:</span> <span class="nx">found</span><span class="p">[</span><span class="nx">oldDegree</span><span class="p">]</span>
  <span class="p">};</span>
  <span class="kd">let</span> <span class="nx">current</span> <span class="o">=</span> <span class="p">{</span>
    <span class="na">degree</span><span class="p">:</span> <span class="nx">newDegree</span><span class="p">,</span>
    <span class="na">step</span><span class="p">:</span> <span class="nx">found</span><span class="p">.</span><span class="nx">new_step</span><span class="p">,</span>
    <span class="na">salary</span><span class="p">:</span> <span class="nx">found</span><span class="p">[</span><span class="nx">newDegree</span><span class="p">]</span>
  <span class="p">};</span>
  <span class="kd">let</span> <span class="nx">raise</span> <span class="o">=</span> <span class="nx">current</span><span class="p">.</span><span class="nx">salary</span> <span class="o">-</span> <span class="nx">previous</span><span class="p">.</span><span class="nx">salary</span><span class="p">;</span>
  <span class="k">return</span> <span class="p">{</span>
    <span class="nx">previous</span><span class="p">,</span>
    <span class="nx">current</span><span class="p">,</span>
    <span class="nx">raise</span><span class="p">,</span>
    <span class="na">row</span><span class="p">:</span> <span class="nx">found</span>
  <span class="p">};</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">oldDegree</code> and <code class="language-plaintext highlighter-rouge">newDegree</code> variables match the headers in the dataset, which have names like <code class="language-plaintext highlighter-rouge">old_bach</code> (aka old pay scale for teachers with bachelor’s degrees) and <code class="language-plaintext highlighter-rouge">new_bach</code> (new pay scale!). If someone with a bachelor’s or master’s degree indicates that they have an Ed.S., the Ed.S. salary overrides the salary ranges for those degrees, but it doesn’t override a doctorate.</p>
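<p>To make the matching concrete, here is a small, self-contained sketch of the same lookup. The rows and dollar amounts are made up for illustration, and <code class="language-plaintext highlighter-rouge">lookup</code> is a hypothetical helper rather than the code that runs in the graphic. Note that <code class="language-plaintext highlighter-rouge">findLast</code> needs a fairly recent browser or Node 18 and later.</p>

```javascript
// Made-up rows for illustration only; the real numbers come from the district.
const SAMPLE_ROWS = [
  { old_step: 0, new_step: 0, old_bach: 40000, new_bach: 45000, new_eds: 50000 },
  { old_step: 1, new_step: 1, old_bach: 42000, new_bach: 47000, new_eds: 52000 },
  { old_step: 2, new_step: 2, old_bach: 44000, new_bach: 49000, new_eds: 54000 }
];

// Hypothetical helper that mirrors the matching logic in salaryMatch.
function lookup(rows, salary, degree, hasEds) {
  const oldKey = "old_" + degree;
  // An Ed.S. overrides the new pay column unless the teacher has a doctorate.
  const newKey = hasEds && degree !== "doc" ? "new_eds" : "new_" + degree;
  // Last row whose old-scale salary is at or below the input; fall back to
  // the first row when the input is below every step.
  const row = rows.findLast(r => r[oldKey] <= salary) || rows[0];
  return {
    step: row.old_step,
    newColumn: newKey,
    raise: row[newKey] - row[oldKey]
  };
}

console.log(lookup(SAMPLE_ROWS, 43000, "bach", false)); // step 1, new_bach
console.log(lookup(SAMPLE_ROWS, 43000, "bach", true));  // step 1, new_eds
console.log(lookup(SAMPLE_ROWS, 30000, "bach", false)); // falls back to step 0
```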

<p>The output from this function is displayed to the user like this:</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;h3&gt;</span>Your result:<span class="nt">&lt;/h3&gt;</span>
<span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"result-grid"</span><span class="nt">&gt;</span>
  <span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"old"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;h4&gt;</span>${str("old_step")}: ${previous.step}<span class="nt">&lt;/h4&gt;</span>
    $${cash(previous.salary)}
  <span class="nt">&lt;/div&gt;</span>
  <span class="nt">&lt;span&gt;</span><span class="ni">&amp;raquo;</span><span class="nt">&lt;/span&gt;</span>
  <span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"raise"</span><span class="nt">&gt;</span>
    + $${cash(result.raise)}<span class="nt">&lt;br&gt;</span>
    (${(result.raise / previous.salary * 100).toFixed(1)}%)
  <span class="nt">&lt;/div&gt;</span>
  <span class="nt">&lt;span&gt;</span><span class="ni">&amp;raquo;</span><span class="nt">&lt;/span&gt;</span>
  <span class="nt">&lt;div</span> <span class="na">class=</span><span class="s">"new"</span><span class="nt">&gt;</span>
    <span class="nt">&lt;h4&gt;</span>${str("new_step")}: ${current.step}<span class="nt">&lt;/h4&gt;</span>
    $${cash(current.salary)}
  <span class="nt">&lt;/div&gt;</span>
<span class="nt">&lt;/div&gt;</span>
</code></pre></div></div>
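<p>The template relies on a <code class="language-plaintext highlighter-rouge">cash</code> helper for formatting, which isn’t shown in this post. A minimal sketch, assuming it wraps the browser’s built-in <code class="language-plaintext highlighter-rouge">Intl.NumberFormat</code>, might look like this (the names <code class="language-plaintext highlighter-rouge">dollars</code> and <code class="language-plaintext highlighter-rouge">percent</code> are likewise assumptions):</p>

```javascript
// Assumed formatter: whole dollars with thousands separators and no currency
// symbol, since the template writes the "$" itself.
const dollars = new Intl.NumberFormat("en-US", { maximumFractionDigits: 0 });
const cash = value => dollars.format(value);

// Percent change relative to the old salary, to one decimal place.
const percent = (raise, oldSalary) => (raise / oldSalary * 100).toFixed(1);

console.log("$" + cash(47000));    // $47,000
console.log(percent(5000, 42000)); // 11.9
```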

<p>So now, our hypothetical teacher knows what their new salary band is. But how much of a raise do they get? And how does that compare to other salary bands?</p>

<h2 id="iteration">Iteration!</h2>

<p>If you want to show each row of a dataset on a page, you have to iterate, or loop through the data. In lit-html, this is done with JavaScript’s array <code class="language-plaintext highlighter-rouge">map()</code> method, which returns one template per row.</p>

<p>In the early stages of this graphic, we iterated through the old salary ranges and displayed them as an ordered list in addition to the user’s specific salary band.</p>

<div class="language-html highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;ol&gt;</span>
  ${window.DATA.map(i =&gt; html`
    <span class="nt">&lt;li&gt;</span>
      B.A. $${dollars.format(i.old_bach)} |
      M.A. $${dollars.format(i.old_mast)} | 
      Doctorate $${dollars.format(i.old_doc)}
    <span class="nt">&lt;/li&gt;</span>`)}
<span class="nt">&lt;/ol&gt;</span>
</code></pre></div></div>

<p>This laid the groundwork for us to make a bar chart that showed the amount of raise for each salary band, calculated by subtracting the old salary band minimum from the new band minimum. Each row of the dataset gets a bar scaled so that the maximum raise takes up 100% of its grid container.</p>
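<p>The scaling step can be sketched in a few lines. This is an illustration under assumptions: the rows are made up, and <code class="language-plaintext highlighter-rouge">buildBands</code> is a hypothetical helper. The property names (<code class="language-plaintext highlighter-rouge">raise</code>, <code class="language-plaintext highlighter-rouge">per_change</code>, <code class="language-plaintext highlighter-rouge">current</code>) and the <code class="language-plaintext highlighter-rouge">max</code> variable are chosen to match the bar chart template in this post.</p>

```javascript
// Made-up rows for illustration; column names follow the old_/new_ pattern.
const rows = [
  { old_step: 0, old_bach: 40000, new_bach: 45000 },
  { old_step: 1, old_bach: 42000, new_bach: 48000 },
  { old_step: 2, old_bach: 44000, new_bach: 48000 }
];

// Hypothetical helper: one band per row, with the raise, the percent change,
// and a flag for the row that matched the reader's salary.
function buildBands(data, degree, matchedRow) {
  return data.map(row => {
    const oldPay = row["old_" + degree];
    const raise = row["new_" + degree] - oldPay;
    return {
      step: row.old_step,
      raise,
      per_change: (raise / oldPay * 100).toFixed(1),
      current: row === matchedRow
    };
  });
}

const bands = buildBands(rows, "bach", rows[1]);
// The largest raise defines 100% width; every other bar is a fraction of it.
const max = Math.max(...bands.map(b => b.raise));
const widths = bands.map(b => b.raise / max * 100);
console.log(max, widths);
```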

<p>We wanted the graphic to emphasize the old salary band that matched the user’s salary, highlighting the resulting raise. In code, this translates to <code class="language-plaintext highlighter-rouge">if-then</code> logic: if the user’s salary falls within a certain range, then the graphic emphasizes that range. <a href="https://lit.dev/docs/templates/conditionals/">Conditional ternary operators</a> in the lit-html template change the CSS class when the salary range matches.</p>

<p>Because we’re using a CSS grid to lay out our bar chart and its labels, we assign the “current” class in multiple places to bold text and highlight the bar. We also add a visually-hidden label to the user’s selected salary band for accessibility purposes, since bold styling isn’t announced by most screen readers.</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">$</span><span class="p">{</span><span class="nx">bands</span><span class="p">.</span><span class="nx">map</span><span class="p">(</span><span class="nx">b</span> <span class="o">=&gt;</span> <span class="p">{</span>
  <span class="kd">var</span> <span class="nx">highlight</span> <span class="o">=</span> <span class="nx">b</span><span class="p">.</span><span class="nx">current</span> <span class="p">?</span> <span class="dl">"</span><span class="s2">current</span><span class="dl">"</span> <span class="p">:</span> <span class="dl">"</span><span class="s2">not</span><span class="dl">"</span><span class="p">;</span>
  <span class="k">return</span> <span class="nx">html</span><span class="s2">`
  &lt;div class="step </span><span class="p">${</span><span class="nx">highlight</span><span class="p">}</span><span class="s2">"&gt;
    </span><span class="p">${</span><span class="nx">b</span><span class="p">.</span><span class="nx">step</span><span class="p">}</span><span class="s2">
    </span><span class="p">${</span><span class="nx">b</span><span class="p">.</span><span class="nx">current</span> <span class="p">?</span> <span class="nx">html</span><span class="s2">`&lt;span class="sr-only"&gt;(your step)&lt;/span&gt;`</span> <span class="p">:</span> <span class="dl">""</span><span class="p">}</span><span class="s2">
  &lt;/div&gt;
  &lt;div class="bar-container </span><span class="p">${</span><span class="nx">highlight</span><span class="p">}</span><span class="s2">"&gt;
    &lt;div class="bar"
      style="width: </span><span class="p">${</span><span class="nx">b</span><span class="p">.</span><span class="nx">raise</span> <span class="o">/</span> <span class="nx">max</span> <span class="o">*</span> <span class="mi">100</span><span class="p">}</span><span class="s2">%"&gt;
    &lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="amount </span><span class="p">${</span><span class="nx">highlight</span><span class="p">}</span><span class="s2">"&gt;$</span><span class="p">${</span><span class="nx">cash</span><span class="p">(</span><span class="nx">b</span><span class="p">.</span><span class="nx">raise</span><span class="p">)}</span><span class="s2">&lt;/div&gt;
  &lt;div class="percent </span><span class="p">${</span><span class="nx">highlight</span><span class="p">}</span><span class="s2">"&gt;</span><span class="p">${</span><span class="nx">b</span><span class="p">.</span><span class="nx">per_change</span><span class="p">}</span><span class="s2">%&lt;/div&gt;
`</span><span class="p">})}</span>
</code></pre></div></div>

<p>And that’s it!</p>

<h2 id="takeaways">Takeaways</h2>

<p>Although I had some familiarity with JavaScript before this, and I knew how important it was in web development, working on this graphic showed me how to actually develop an interactive graphic and understand the different moving parts.</p>

<p>Tools like Datawrapper let users look up information in a table or highlight specific areas of a graph, but they don’t make it easy to change what you’re looking at based on what a user wants, or to find something that isn’t explicitly included in the dataset–both of which are key to showing a specific salary within a range. So even though it’s more difficult to use, this is why we have the rig: to have a bit more control over how we present data to readers, and to keep me from blowing a gasket when I can’t get Datawrapper to cooperate.</p>

<p>Working with a reporter to understand the context for these changes and with senior data editor Thomas Wilburn made it possible to develop a graphic that was both visually appealing and useful to readers.</p>]]></content><author><name>Nadia Bey</name></author><summary type="html"><![CDATA[And why Chalkbeat maintains a code-first visualization toolkit]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://dataviz.chalkbeat.org/assets/images/2023-08-14-teacher-salaries/banner.jpg" /><media:content medium="image" url="https://dataviz.chalkbeat.org/assets/images/2023-08-14-teacher-salaries/banner.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Slouching toward Pandas</title><link href="https://dataviz.chalkbeat.org/2023/06/26/slouching-toward-pandas.html" rel="alternate" type="text/html" title="Slouching toward Pandas" /><published>2023-06-26T00:00:00+00:00</published><updated>2023-06-26T00:00:00+00:00</updated><id>https://dataviz.chalkbeat.org/2023/06/26/slouching-toward-pandas</id><content type="html" xml:base="https://dataviz.chalkbeat.org/2023/06/26/slouching-toward-pandas.html"><![CDATA[<p>Why use spreadsheets for newsroom data? They may have been our best option in the Lotus 1-2-3 days, but in 2023 data journalists have many more choices. Notebook tools like Jupyter or Observable offer expressive syntax that can fall back to a real programming language, they’re visually appealing, and they’re often extremely powerful. As a result, they’re now the default for many teams.</p>

<p>Spreadsheets have not historically had most of these advantages, but they do have an almost-subterranean skill floor. I hate to see a reporter manually add numbers from two cells together with a calculator and then type the result into a third cell. But they <em>can</em> do that if they want to contribute to my analysis, whereas most of them are never going to use a Pandas dataframe. If they want to learn to do some basic cell calculations, that’s not so hard. Google Sheets are also deeply collaborative and shareable in a way that most notebooks are not.</p>

<p>I think it’s reasonable to decide that the power of notebook-based tooling is worth the barrier to entry and loss of frictionless collaboration. But if you, like me, think those are valuable qualities in a default toolkit, then the question becomes: how do we raise the skill ceiling closer to (or even equivalent to) notebook tools, so that experienced data reporters don’t feel like they’re trapped in the bronze age while everyone else shows off the latest ironworks?</p>

<p>This isn’t an impractical goal. Unknown to a lot of people, after years of stagnation, spreadsheets have made huge jumps in capability, to the point that I think they can be almost as expressive as a notebook or even a query language like SQL. I’ve already written about <a href="https://dataviz.chalkbeat.org/2022/02/07/she-has-the-range.html">how to use named ranges and array formulas</a> to clarify lookups and filters. Now I’d like to talk about two new Lisp-inspired functions that can help close the gap between Sheets and other data tools: LET and LAMBDA.</p>
<h2 id="let-the-right-one-in">LET the right one in</h2>

<p>Of the two, LET is easier to understand. It’s basically a wrapper to assign a local name to a value, but only within the scope of the current cell. For example, here’s a cell that computes the post-tax price for an item:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">=</span><span class="nx">LET</span><span class="p">(</span>
  <span class="nx">price</span><span class="p">,</span> <span class="p">.</span><span class="mi">99</span><span class="p">,</span>
  <span class="nx">tax_rate</span><span class="p">,</span> <span class="mf">0.07</span><span class="p">,</span>
  <span class="nx">price</span> <span class="o">+</span> <span class="p">(</span><span class="nx">price</span> <span class="o">*</span> <span class="nx">tax_rate</span><span class="p">))</span>
</code></pre></div></div>

<p>In this case, we define two local variables, <code class="language-plaintext highlighter-rouge">price</code> and <code class="language-plaintext highlighter-rouge">tax_rate</code>, with the values .99 and .07, respectively. The final argument to LET is the calculation using those variables (the “return value” in other languages). You can start to see how this might make even simple formulas more self-documenting, but we can really start to see the benefits of LET if we need to use conditionals or other branching code.</p>

<p>Let’s say that we’re computing school testing aggregates based on student data (a named range called <code class="language-plaintext highlighter-rouge">student_scores</code>), but we want to suppress any result that’s based on fewer than 10 individuals. Without LET, that formula would probably look like this:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">=</span><span class="nx">IF</span><span class="p">(</span><span class="nx">COUNT</span><span class="p">(</span><span class="nx">FILTER</span><span class="p">(</span><span class="nx">student_scores</span><span class="p">,</span> <span class="nx">school</span> <span class="o">=</span> <span class="nx">A2</span><span class="p">))</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">,</span> <span class="dl">"</span><span class="s2">-</span><span class="dl">"</span><span class="p">,</span> <span class="nx">AVERAGE</span><span class="p">(</span><span class="nx">FILTER</span><span class="p">(</span><span class="nx">student_scores</span><span class="p">,</span> <span class="nx">school</span> <span class="o">=</span> <span class="nx">A2</span><span class="p">)))</span>
</code></pre></div></div>

<p>In this formula, the filter clause is repeated in two places, once for the count test and again for the actual result. If our selection criteria get more complicated (say, our sheet contains multiple test subjects and we only want one of them), we’ll have to update both filters in sync. Contrast this with a version that uses LET:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">=</span><span class="nx">LET</span><span class="p">(</span>
  <span class="nx">scores</span><span class="p">,</span> <span class="nx">FILTER</span><span class="p">(</span><span class="nx">student_scores</span><span class="p">,</span> <span class="nx">school</span> <span class="o">=</span> <span class="nx">A2</span><span class="p">),</span>
  <span class="nx">IF</span><span class="p">(</span><span class="nx">COUNT</span><span class="p">(</span><span class="nx">scores</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">,</span> <span class="dl">"</span><span class="s2">-</span><span class="dl">"</span><span class="p">,</span> <span class="nx">AVERAGE</span><span class="p">(</span><span class="nx">scores</span><span class="p">)))</span>
</code></pre></div></div>

<p>Now we only have to maintain our filter in one place, and there’s a lot less noise in the formula, since we can assign human-readable names to sections of it. Note the indentation: you can add newlines to your formulas in Sheets by pressing Ctrl-Enter (on Windows) or Command-Enter (on a Mac). I like to put each variable declaration on its own line in LET statements for greater legibility.</p>
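<p>If you think in code rather than formulas, the LET version is roughly analogous to assigning the filtered list to a local variable. The JavaScript below is only an analogy, with made-up records and a hypothetical <code class="language-plaintext highlighter-rouge">schoolAverage</code> function; it isn’t something you would paste into Sheets.</p>

```javascript
// Rough analogue of the LET formula: name the filtered scores once,
// then reuse that name for both the count check and the average.
function schoolAverage(records, school) {
  const scores = records.filter(r => r.school === school).map(r => r.score);
  if (scores.length < 10) return "-";
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Made-up data: 12 students at school "A" and 3 at school "B".
const records = [
  ...Array.from({ length: 12 }, (_, i) => ({ school: "A", score: 70 + i })),
  ...Array.from({ length: 3 }, (_, i) => ({ school: "B", score: 90 + i }))
];

console.log(schoolAverage(records, "A")); // 75.5
console.log(schoolAverage(records, "B")); // suppressed: fewer than 10 students
```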

<h2 id="leg-of-lambda">Leg of LAMBDA</h2>

<p>While LET provides local variables, LAMBDA gives us the ability to compose functions for reuse. You can call a lambda function immediately, give it a local name using LET, or you can use the new “named function” panel to make them globally available to any cell in your sheet. LAMBDA expressions can also use other functions as inputs. For example, here’s a SUPPRESS function:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">=</span><span class="nx">LAMBDA</span><span class="p">(</span>
  <span class="nx">result_array</span><span class="p">,</span> <span class="nx">fn</span><span class="p">,</span>
  <span class="nx">IF</span><span class="p">(</span><span class="nx">COUNT</span><span class="p">(</span><span class="nx">result_array</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mi">10</span><span class="p">,</span> <span class="dl">"</span><span class="s2">-</span><span class="dl">"</span><span class="p">,</span> <span class="nx">fn</span><span class="p">(</span><span class="nx">result_array</span><span class="p">)))</span>
</code></pre></div></div>

<p>Once this is assigned to a named function in my workbook, I can call this in a LET, passing in my scores and a lambda that says what to do with the unsuppressed results:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">=</span><span class="nx">LET</span><span class="p">(</span>
  <span class="nx">scores</span><span class="p">,</span> <span class="nx">FILTER</span><span class="p">(</span><span class="nx">student_scores</span><span class="p">,</span> <span class="nx">school</span> <span class="o">=</span> <span class="nx">A2</span><span class="p">),</span>
  <span class="nx">aggregate</span><span class="p">,</span> <span class="nx">LAMBDA</span><span class="p">(</span><span class="nx">input</span><span class="p">,</span> <span class="nx">AVERAGE</span><span class="p">(</span><span class="nx">input</span><span class="p">)),</span>
  <span class="nx">SUPPRESS</span><span class="p">(</span><span class="nx">scores</span><span class="p">,</span> <span class="nx">aggregate</span><span class="p">))</span>
</code></pre></div></div>

<p>I can change the aggregation by swapping out the lambda that I pass into SUPPRESS, such as asking for the standard deviation instead of the average:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">=</span><span class="nx">LET</span><span class="p">(</span>
  <span class="nx">scores</span><span class="p">,</span> <span class="nx">FILTER</span><span class="p">(</span><span class="nx">student_scores</span><span class="p">,</span> <span class="nx">school</span> <span class="o">=</span> <span class="nx">A2</span><span class="p">),</span>
  <span class="nx">aggregate</span><span class="p">,</span> <span class="nx">LAMBDA</span><span class="p">(</span><span class="nx">input</span><span class="p">,</span> <span class="nx">STDEV</span><span class="p">(</span><span class="nx">input</span><span class="p">)),</span>
  <span class="nx">SUPPRESS</span><span class="p">(</span><span class="nx">scores</span><span class="p">,</span> <span class="nx">aggregate</span><span class="p">))</span>
</code></pre></div></div>

<p>Sadly, although it would be nice to just pass in the aggregate function directly, built-in functions must be wrapped in a LAMBDA.</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// this won't work</span>
<span class="o">=</span><span class="nx">SUPPRESS</span><span class="p">(</span><span class="nx">E</span><span class="p">:</span><span class="nx">E</span><span class="p">,</span> <span class="nx">SUM</span><span class="p">)</span>

<span class="c1">// wrapping SUM in a lambda creates a callable function</span>
<span class="o">=</span><span class="nx">SUPPRESS</span><span class="p">(</span><span class="nx">E</span><span class="p">:</span><span class="nx">E</span><span class="p">,</span> <span class="nx">LAMBDA</span><span class="p">(</span><span class="nx">v</span><span class="p">,</span> <span class="nx">SUM</span><span class="p">(</span><span class="nx">v</span><span class="p">)))</span>
</code></pre></div></div>
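<p>For comparison, here is roughly how the same higher-order pattern looks in JavaScript, where functions are ordinary values and can be passed in without a wrapper. This is an analogy only: the scores are made up, <code class="language-plaintext highlighter-rouge">suppress</code> is a hypothetical stand-in for the SUPPRESS named function, and it checks the array length rather than counting numeric cells the way COUNT does.</p>

```javascript
// Analogue of the SUPPRESS named function: take an array and a callback,
// and only run the callback when there are at least 10 values.
const suppress = (values, fn) => (values.length < 10 ? "-" : fn(values));

const sum = values => values.reduce((total, v) => total + v, 0);
const average = values => sum(values) / values.length;

// Made-up scores for illustration.
const scores = [71, 68, 90, 85, 77, 62, 88, 93, 79, 81, 74, 66];

console.log(suppress(scores, average));
console.log(suppress(scores, sum)); // sum is passed directly, no wrapper needed
console.log(suppress(scores.slice(0, 5), average)); // suppressed: too few values
```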

<p>It’s worth noting, if you’re curious, that LET is effectively “syntax sugar” over LAMBDA: you can get the same effect by creating a function and immediately calling it with values to fill its arguments. The following two formulas do the same thing, but the LET formulation is probably easier for most people to reason about, since the variable values are closer to their name declarations.</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">=</span><span class="nx">LAMBDA</span><span class="p">(</span><span class="nx">x</span><span class="p">,</span> <span class="nx">y</span><span class="p">,</span> <span class="nx">x</span> <span class="o">+</span> <span class="nx">y</span><span class="p">)(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>

<span class="o">=</span><span class="nx">LET</span><span class="p">(</span><span class="nx">x</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="nx">y</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="nx">x</span> <span class="o">+</span> <span class="nx">y</span><span class="p">)</span>
</code></pre></div></div>
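<p>The same equivalence exists in most languages with first-class functions. As a loose JavaScript analogy (not Sheets syntax), an immediately-invoked function and a pair of local bindings produce the same result:</p>

```javascript
// LAMBDA(x, y, x + y)(1, 2): define a function and call it immediately.
const viaLambda = ((x, y) => x + y)(1, 2);

// LET(x, 1, y, 2, x + y): bind the names first, then evaluate the expression.
const viaLet = (() => {
  const x = 1;
  const y = 2;
  return x + y;
})();

console.log(viaLambda, viaLet); // 3 3
```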

<p>LAMBDA in turn seems to be sugar over ARRAYFORMULA, at least according to some of the error messages I’ve seen when working with it. For that reason, while testing, it’s often a good idea to develop a LAMBDA from a series of cell operations and then integrate them into a final form for your named function definition, just as you’d step-debug a complex procedure in a traditional language.</p>

<h2 id="case-study-1-school-transfer-counts">Case study #1: school transfer counts</h2>

<p>One interesting thing about LET is that its binding expressions are cumulative–any variable definition can use the previous variables in its declaration. Even if you’re not using them in multiple places, you can make a very complex formula easier to understand by breaking down its component parts into named values and processing them line by line.</p>

<p>Here’s a real-world case: one of our bureaus had a spreadsheet containing individual student records from two points in time: where they were enrolled a number of years ago, their grade level at the time, and where they were enrolled more recently (assuming they were still enrolled).</p>

<p>I wanted to find what percentage of students at a given school ten years ago (<code class="language-plaintext highlighter-rouge">$A2</code>, the row prefix) had ended up at a specific type of school now (<code class="language-plaintext highlighter-rouge">C$1</code>, the column header). The result is basically a pivot, but the cell values are computed based on two input sets: the number that are still enrolled, and all students originally at a given school.</p>

<p><img src="/assets/images/2023-06-26-slouching-toward-pandas/image-0.png.jpg" alt="A Sheets table with columns reading &quot;from&quot;, &quot;school name&quot;, &quot;charter&quot;, and rows with data for each school ID" /></p>

<p>The formula uses two named functions set up in the workbook for convenience’s sake: the MATCHGRADES named function only finds students in grades PE-2, and COUNTFILTERED correctly handles counting non-numeric values (<code class="language-plaintext highlighter-rouge">COUNTA(FILTER(x))</code> will return 1 for no matches, since it counts the error as a value). The definitions for those look like this:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// MATCHGRADES</span>
<span class="o">=</span><span class="nx">MATCH</span><span class="p">(</span><span class="nx">grade</span><span class="p">,</span> <span class="p">{</span> <span class="dl">"</span><span class="s2">PE</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">PK</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">K</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">1</span><span class="dl">"</span><span class="p">,</span> <span class="dl">"</span><span class="s2">2</span><span class="dl">"</span> <span class="p">},</span> <span class="mi">0</span><span class="p">)</span>

<span class="c1">// COUNTFILTERED</span>
<span class="o">=</span><span class="nx">IF</span><span class="p">(</span><span class="nx">ISERROR</span><span class="p">(</span><span class="nx">filtered</span><span class="p">),</span> <span class="mi">0</span><span class="p">,</span> <span class="nx">COUNTA</span><span class="p">(</span><span class="nx">filtered</span><span class="p">))</span>
</code></pre></div></div>

<p>Here’s the final formula for cell C2, which I then dragged out to fill the complete table:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">=</span><span class="nx">LET</span><span class="p">(</span>
  <span class="nx">in_school</span><span class="p">,</span> <span class="nx">ARRAYFORMULA</span><span class="p">(</span><span class="nx">students_from</span> <span class="o">=</span> <span class="nx">$A2</span><span class="p">),</span>
  <span class="nx">in_grade</span><span class="p">,</span> <span class="nx">ARRAYFORMULA</span><span class="p">(</span><span class="nx">MATCHGRADES</span><span class="p">(</span><span class="nx">students_grade</span><span class="p">)),</span>
  <span class="nx">still_enrolled</span><span class="p">,</span> <span class="nx">ARRAYFORMULA</span><span class="p">(</span><span class="nx">students_enrolled</span> <span class="o">=</span> <span class="nx">TRUE</span><span class="p">),</span>
  <span class="nx">went_to</span><span class="p">,</span> <span class="nx">ARRAYFORMULA</span><span class="p">(</span><span class="nx">students_to_type</span> <span class="o">=</span> <span class="nx">C$1</span><span class="p">),</span>
  <span class="nx">enrolled</span><span class="p">,</span> <span class="nx">FILTER</span><span class="p">(</span><span class="nx">students_id</span><span class="p">,</span> <span class="nx">in_school</span><span class="p">,</span> <span class="nx">in_grade</span><span class="p">,</span> <span class="nx">went_to</span><span class="p">,</span> <span class="nx">still_enrolled</span><span class="p">),</span>
  <span class="nx">all</span><span class="p">,</span> <span class="nx">FILTER</span><span class="p">(</span><span class="nx">students_id</span><span class="p">,</span> <span class="nx">in_school</span><span class="p">,</span> <span class="nx">in_grade</span><span class="p">,</span> <span class="nx">still_enrolled</span><span class="p">),</span>
<span class="nx">COUNTFILTERED</span><span class="p">(</span><span class="nx">enrolled</span><span class="p">)</span> <span class="o">/</span> <span class="nx">MAX</span><span class="p">(</span><span class="nx">COUNTFILTERED</span><span class="p">(</span><span class="nx">all</span><span class="p">),</span> <span class="mi">1</span><span class="p">))</span>
</code></pre></div></div>

<p>Breaking this formula down to its component pieces, <code class="language-plaintext highlighter-rouge">in_school</code>, <code class="language-plaintext highlighter-rouge">in_grade</code>, <code class="language-plaintext highlighter-rouge">still_enrolled</code>, and <code class="language-plaintext highlighter-rouge">went_to</code> are all arrays containing true/false results for each student in the source data:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">in_school</code> - did the student attend the school for this pivot row?</li>
  <li><code class="language-plaintext highlighter-rouge">in_grade</code> - were they in the grades we want to look at?</li>
  <li><code class="language-plaintext highlighter-rouge">still_enrolled</code> - are they still enrolled?</li>
  <li><code class="language-plaintext highlighter-rouge">went_to</code> - did they end up at the school type in this pivot column (charter, neighborhood, alternative, etc.)?</li>
</ul>

<p>From these arrays, we create two filtered lists of student IDs: one that matches on all four conditions, and one that matches on school, grade, and enrollment (but for any school type). With those two filtered sets, the last line computes the final percentage: of the students who attended a given school in the desired grades and are still enrolled, the share who ended up at a school of a given type.</p>
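<p>For readers who think more easily in code than in formulas, here is a rough JavaScript sketch of the same calculation. The sample rows, the field names, and the <code class="language-plaintext highlighter-rouge">matchGrades</code> rule are all made up for illustration; they stand in for the named ranges and the MATCHGRADES function, and are not the real data.</p>

```javascript
// Hypothetical sample rows; the field names mirror the named ranges in the formula.
const students = [
  { id: 1, from: "School A", grade: 8, enrolled: true, toType: "charter" },
  { id: 2, from: "School A", grade: 8, enrolled: true, toType: "neighborhood" },
  { id: 3, from: "School A", grade: 8, enrolled: false, toType: "charter" },
  { id: 4, from: "School B", grade: 8, enrolled: true, toType: "charter" }
];

// Assumed stand-in for MATCHGRADES: pretend we only care about eighth graders.
const matchGrades = (grade) => grade === 8;

function shareByType(rows, school, type) {
  // "all": same origin school, wanted grade, still enrolled, any school type
  const all = rows
    .filter((s) => s.from === school)
    .filter((s) => matchGrades(s.grade))
    .filter((s) => s.enrolled);
  // "enrolled": the subset of those students who ended up at this school type
  const enrolled = all.filter((s) => s.toType === type);
  // Math.max(..., 1) plays the role of MAX(COUNTFILTERED(all), 1): no division by zero
  return enrolled.length / Math.max(all.length, 1);
}

console.log(shareByType(students, "School A", "charter")); // 0.5
```

<p>Each call corresponds to one cell of the pivot table: the school comes from the row label and the school type from the column header.</p>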

<p>Without LET, I’d probably have needed a column in each row for the total enrolled students at each origin school, then columns for the counts of enrolled students in each category, then a duplicate set of columns that find the percentage from those figures. The result would be more explicit, but also a lot less compact. Each person will need to determine where the line is for that trade-off, but it’s exciting to have the options. On a team with some fluency in this technique, I don’t think this example is excessive.</p>
<h2 id="case-study-2-filling-sparse-columns">Case study #2: Filling sparse columns</h2>

<p>Perhaps knowing that if you wave something called LAMBDA in front of a bunch of nerds, they’re going to expect that you give them functional programming tools, Sheets does include some higher-order functions now like MAP and REDUCE. Most of the time, these are not terribly useful, since we already have first-class support for ARRAYFORMULA and a handy set of aggregate functions available to us, but it’s good to have them around just in case.</p>

<p>However, one unambiguously helpful lambda-dependent function is SCAN, which generates one output value for each cell in a range, based on both that cell’s value and the previous output–it’s like a REDUCE that shows its work. You can use this for a number of cool tricks, but the most common is to fill in sparse columns.</p>

<p>Take, for example, the NAEP test results over time. If you run <a href="https://www.nationsreportcard.gov/ndecore/shareredirect?su=NDE&amp;sb=MAT&amp;gr=4&amp;fr=2&amp;yr=2022R3-2019R3-2017R3-2015R3-2013R3-2011R3-2009R3-2007R3-2005R3-2003R3-2000R3-2000R2-1996R3-1996R2-1992R2-1990R2&amp;sc=MRPCM&amp;ju=NT-NL&amp;vr=TOTAL-false&amp;st=MN-MN&amp;sht=REPORT&amp;urls=xplore&amp;mi=false&amp;svt=true&amp;nd=0&amp;vl=SHORT&amp;yo=DESC&amp;inc=NONE&amp;up=true&amp;rrl=SAMPLE%7CSAMPLE%7C1--JURISDICTION%7CJURISDICTION%7C2--TOTAL%7CVARIABLE%7C3&amp;rtl=&amp;sm=false">this query</a> in the data explorer, you’ll get a table with individual rows for national and city results for each year–but only the first row will have the year displayed:</p>

<p><img src="/assets/images/2023-06-26-slouching-toward-pandas/image-1.png.jpg" alt="a table of values where the year is only listed for the first row in which it occurs" /></p>

<p>When you copy and paste this into a sheet for visualization, you’ll want to fill in the missing year values, but clicking on each one to fill down is almost as tiresome as adding the values by hand. It’s possible to write an IF that will fill in the values, but you’ll have to make sure that fills the entire column, which can be a problem if there are gaps on both sides. Instead, here’s SCAN to the rescue:</p>

<div class="language-javascript highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">=</span><span class="nx">SCAN</span><span class="p">(</span><span class="nx">A1</span><span class="p">,</span> <span class="nx">A1</span><span class="p">:</span><span class="nx">A22</span><span class="p">,</span> <span class="nx">LAMBDA</span><span class="p">(</span><span class="nx">acc</span><span class="p">,</span> <span class="nx">val</span><span class="p">,</span> <span class="nx">IF</span><span class="p">(</span><span class="nx">ISBLANK</span><span class="p">(</span><span class="nx">val</span><span class="p">),</span> <span class="nx">acc</span><span class="p">,</span> <span class="nx">val</span><span class="p">)))</span>
</code></pre></div></div>

<p>The SCAN function takes three arguments. The first is a starting value, which for our purposes is just the first cell of the range (2022). The second is the range we want to cover, which in this case is our sparse column (I’ve selected the rows from 2022 through 2000). Finally, we give it a lambda that receives the previous output (<code class="language-plaintext highlighter-rouge">acc</code>, for “accumulator”) and the current cell value. If the current value is blank, we repeat the last value. Otherwise, we replace it with the new year.</p>

<p><img src="/assets/images/2023-06-26-slouching-toward-pandas/image-2.png.jpg" alt="the prior table in Sheets, with years filled in on every row" /></p>
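<p>If it helps to see the mechanics outside of Sheets, here is a small JavaScript sketch of what SCAN is doing. The <code class="language-plaintext highlighter-rouge">scan</code> helper and the sample column are my own illustration, not anything built into the language, and blank cells are represented as empty strings.</p>

```javascript
// A generic scan: like reduce, but it keeps every intermediate result.
function scan(initial, values, fn) {
  const output = [];
  let acc = initial;
  for (const val of values) {
    acc = fn(acc, val);
    output.push(acc);
  }
  return output;
}

// Hypothetical sparse column: the year appears only on the first row of each group.
const sparseYears = [2022, "", "", 2019, "", 2017];

// Same logic as the LAMBDA: keep the previous output when the cell is blank.
const filledYears = scan(sparseYears[0], sparseYears, (acc, val) =>
  val === "" ? acc : val
);

console.log(filledYears); // [2022, 2022, 2022, 2019, 2019, 2017]
```

<p>The output has exactly one entry per input cell, which is what separates a scan from a reduce that collapses everything down to a single value.</p>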

<p>It used to be that if I didn’t want to do this kind of data cleanup by hand, I’d have to switch over to the Apps Script editor and write some extension code, then add a note for other team members in case they needed to recreate the results. Now, everything is right there in the sheet, and I don’t need to context switch or remember the <code class="language-plaintext highlighter-rouge">SpreadsheetApp</code> API.</p>

<p>Essentially, LET and LAMBDA don’t just make our formulas more readable: they also create options for extensibility that would have previously required leaving the workbook for another tool.</p>]]></content><author><name>Thomas Wilburn</name></author><summary type="html"><![CDATA[Using LET and LAMBDA to raise the spreadsheet skill&nbsp;ceiling]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://dataviz.chalkbeat.org/assets/images/2023-06-26-slouching-toward-pandas/inksplatter.png.jpg" /><media:content medium="image" url="https://dataviz.chalkbeat.org/assets/images/2023-06-26-slouching-toward-pandas/inksplatter.png.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Scraping ASP-based dashboards at the protocol level</title><link href="https://dataviz.chalkbeat.org/2022/10/28/asp-scraping.html" rel="alternate" type="text/html" title="Scraping ASP-based dashboards at the protocol level" /><published>2022-10-28T00:00:00+00:00</published><updated>2022-10-28T00:00:00+00:00</updated><id>https://dataviz.chalkbeat.org/2022/10/28/asp-scraping</id><content type="html" xml:base="https://dataviz.chalkbeat.org/2022/10/28/asp-scraping.html"><![CDATA[<p>We’re generally lucky, as journalists, to live in a time when open data portals and releases have become common reporting tools. However, “generally” is not “always,” and there are still plenty of times where “open” data has to be more forcibly extracted from a web-based dashboard. Here at Chalkbeat, we write our fair share of scrapers, depending on the bureau and the local public records regime.</p>

<p>Scraping is not a crime, and <a href="https://themarkup.org/merchandise/shirt">I’ve got the t-shirt to prove it</a>. But just because it’s legal doesn’t mean it’s easy, or that the administrators of data sites make programmatic access a priority. Even if they’re not intentionally obfuscating data, the tools that are used for government data release were often designed in ways that make extraction difficult.</p>

<p>Sites using Microsoft’s ASP framework are particularly notorious for being opaque to simple scraping techniques. Faced with one of these dashboards, many data journalists turn to browser automation tools, like Selenium, that let them script clicks and keystrokes. However, this approach brings its own challenges:</p>

<ul>
  <li>Our fleshy, human eyes can easily see when the page is “interactable,” but it can be difficult to tell a computer when a page is ready to receive clicks, meaning that the script will either try to click/type too quickly or incorporate error-prone delays.</li>
  <li>Pages are often not designed to be used programmatically, so we have to figure out selectors for inputs that may not be marked clearly, or might not exist until JavaScript creates them.</li>
  <li>A browser will try to load images and other resources, many of which we don’t “need” for the data and which slow down our process.</li>
  <li>Selenium requires a lot of memory and isn’t tremendously stable, which often means it’s tough to run on an EC2 instance or another virtual machine for automatic scraping at regular intervals.</li>
</ul>

<p>As an alternative, instead of pretending to be a <em>user</em> with a keyboard and mouse, we should pretend to be a <em>browser client</em> that only knows how to send and receive messages. Essentially, we will drop down to the actual HTTP exchange layer.</p>

<p>Scraping via pure requests is more abstract and requires us to understand a bit more about the client/server protocol, but it reduces the process to a simple set of back-and-forth transactions: we send a message, record the response, rinse and repeat.</p>

<p>Ironically, the “smarter” and more interactive a page is in terms of client-side code, the more likely that it’s backed by simple HTTP endpoints that the scripts (and in many cases, native apps) are talking to, and which we can also contact for data directly.</p>

<p>In contrast, ASP dashboards are difficult not because the underlying software is sophisticated, but because it was designed and implemented before developers standardized around a more expressive use of HTTP.</p>

<h2 id="understanding-http">Understanding HTTP</h2>

<p>MDN has <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview">a reasonably good guide to the HTTP protocol</a> if you’re interested in the details, but for our purposes we can think of it as a simple call and response. Every time you load a page, the browser sends a request to the server, and receives a response in return. This may trigger additional requests (e.g., a page may contain images, scripts, or stylesheets that the browser needs to download).</p>

<p>An HTTP request can be broken down into a few parts:</p>

<ul>
  <li>The URL path being requested from the server.</li>
  <li>A method “verb” expressing the action for the request, most commonly GET or POST.</li>
  <li>Headers that add metadata to the message, including cookies and the preferred language.</li>
  <li>An optional body section that can contain data being sent to the server.</li>
</ul>

<p>Most of the time, in a browser, there’s no body and the request method will be GET, since we’re just asking to download a file. POST requests are used to upload to the server or make changes, and usually do include a body with the data that the user wants to submit (such as a file upload or the contents of a form).</p>
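<p>In a script, those same four parts show up as arguments when we build a request. The sketch below uses the standard <code class="language-plaintext highlighter-rouge">Request</code> class (available in browsers and, as far as I know, in Node 18 or later). Nothing is sent over the network until a request is handed to <code class="language-plaintext highlighter-rouge">fetch()</code>, and the <code class="language-plaintext highlighter-rouge">example.com</code> form address is a placeholder.</p>

```javascript
// A plain page load: URL, method, and headers, with no body.
const pageRequest = new Request("https://www.chalkbeat.org/", {
  method: "GET",
  headers: { "Accept-Language": "en-US,en;q=0.5" }
});

// A form submission: the same parts, plus a body carrying the form data.
// The URL and the field name here are placeholders.
const formRequest = new Request("https://example.com/form", {
  method: "POST",
  headers: { "Content-Type": "application/x-www-form-urlencoded" },
  body: "district=01"
});

console.log(pageRequest.method, new URL(pageRequest.url).pathname);
console.log(formRequest.method, formRequest.headers.get("Content-Type"));
```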

<p>Here’s a typical request made to Chalkbeat, for example. You can see the method, path, and headers. As noted, since it’s a GET, there’s no message body after the headers.</p>

<div class="language-http highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nf">GET</span> <span class="nn">https://www.chalkbeat.org/</span> <span class="k">HTTP</span><span class="o">/</span><span class="m">1.1</span>
<span class="na">Host</span><span class="p">:</span> <span class="s">www.chalkbeat.org</span>
<span class="na">User-Agent</span><span class="p">:</span> <span class="s">Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0</span>
<span class="na">Accept</span><span class="p">:</span> <span class="s">text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8</span>
<span class="na">Accept-Language</span><span class="p">:</span> <span class="s">en-US,en;q=0.5</span>
<span class="na">Accept-Encoding</span><span class="p">:</span> <span class="s">gzip, deflate, br</span>
<span class="na">DNT</span><span class="p">:</span> <span class="s">1</span>
<span class="na">Connection</span><span class="p">:</span> <span class="s">keep-alive</span>
<span class="na">Upgrade-Insecure-Requests</span><span class="p">:</span> <span class="s">1</span>
<span class="na">Sec-Fetch-Dest</span><span class="p">:</span> <span class="s">document</span>
<span class="na">Sec-Fetch-Mode</span><span class="p">:</span> <span class="s">navigate</span>
<span class="na">Sec-Fetch-Site</span><span class="p">:</span> <span class="s">cross-site</span>
</code></pre></div></div>

<p>The server processes the request and sends back a response that’s structured in a similar way. Responses don’t have a path or method, but they include a numerical status code indicating success or failure (such as <code class="language-plaintext highlighter-rouge">200 OK</code> or the infamous <code class="language-plaintext highlighter-rouge">404 NOT FOUND</code>). They also have headers for things like modification date, and they almost always include a body section containing the data we asked for in our request.</p>

<p>Here’s the raw server response to our request, trimmed for length:</p>

<div class="language-http highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">HTTP</span><span class="o">/</span><span class="m">1.1</span> <span class="m">200</span> <span class="ne">OK</span>
<span class="na">Content-Type</span><span class="p">:</span> <span class="s">text/html;charset=UTF-8</span>
<span class="na">Connection</span><span class="p">:</span> <span class="s">keep-alive</span>
<span class="na">Date</span><span class="p">:</span> <span class="s">Mon, 24 Oct 2022 14:28:23 GMT</span>
<span class="na">Server</span><span class="p">:</span> <span class="s">istio-envoy</span>
<span class="na">strict-transport-security</span><span class="p">:</span> <span class="s">max-age=31536000; includeSubdomains;</span>
<span class="na">x-powered-by</span><span class="p">:</span> <span class="s">Brightspot</span>
<span class="na">x-envoy-upstream-service-time</span><span class="p">:</span> <span class="s">983</span>
<span class="na">x-envoy-decorator-operation</span><span class="p">:</span> <span class="s">brightspot-cms-verify.chalkbeat.svc.cluster.local:80/*</span>
<span class="na">Vary</span><span class="p">:</span> <span class="s">Accept-Encoding</span>
<span class="na">X-Cache</span><span class="p">:</span> <span class="s">Hit from cloudfront</span>
<span class="na">Via</span><span class="p">:</span> <span class="s">1.1 b75f3304a39fe185ba1556322bdff970.cloudfront.net (CloudFront)</span>
<span class="na">X-Amz-Cf-Pop</span><span class="p">:</span> <span class="s">ORD58-P2</span>
<span class="na">X-Amz-Cf-Id</span><span class="p">:</span> <span class="s">yvRz34BFe8NslGGux897su17I-QKBj-XRFGDCNj23Bjx8-pmqkaUlw==</span>
<span class="na">Age</span><span class="p">:</span> <span class="s">45</span>
<span class="na">Content-Length</span><span class="p">:</span> <span class="s">414837</span>

<span class="cp">&lt;!DOCTYPE html&gt;</span>
<span class="nt">&lt;html</span> <span class="na">class=</span><span class="s">"Page"</span> <span class="na">lang=</span><span class="s">"en"</span> <span class="na">data-sticky-header</span>
<span class="nt">&gt;</span>
<span class="nt">&lt;head&gt;</span>
(... body continues with the rest of the HTML)
</code></pre></div></div>
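<p>The same three pieces (status, headers, and body) are what a script gets back from <code class="language-plaintext highlighter-rouge">fetch()</code>. One detail worth knowing for scrapers is that <code class="language-plaintext highlighter-rouge">fetch()</code> resolves normally even for a 404, so we have to check the status ourselves. Here is a minimal sketch, using hand-built <code class="language-plaintext highlighter-rouge">Response</code> objects in place of real network traffic; the <code class="language-plaintext highlighter-rouge">describeResponse</code> helper is just for illustration.</p>

```javascript
// Summarize the parts of a response that a scraper usually checks first.
function describeResponse(response) {
  const type = response.headers.get("Content-Type");
  return `${response.status} ${response.ok ? "success" : "failure"} (${type})`;
}

// Hand-built responses standing in for what fetch() would return.
const found = new Response("page body goes here", {
  status: 200,
  headers: { "Content-Type": "text/html;charset=UTF-8" }
});
const missing = new Response("not found", {
  status: 404,
  headers: { "Content-Type": "text/plain" }
});

console.log(describeResponse(found));
console.log(describeResponse(missing));
```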

<p>We don’t typically look at raw HTTP requests (I had to install a proxy in order to do so). However, if you open up the dev tools and look at the network tab, you can see all of these in your browser in a more readable form:</p>

<p><img src="/assets/images/2022-10-28-asp-scraping/image-0.png.jpg" alt="A picture of a request in the dev tools" />
<em>“Headers” will show you the request method, response status code, and the headers that were sent by the browser and back by the server. Click “Response” to see the message body.</em></p>

<p>When scraping a website, although it’s tempting to look at the DOM inspector, it’s often easier to just go straight to the network tab and see what messages are being sent when interacting with the page (especially with the filter set to “Fetch/XHR”, which will only show requests made by client-side JavaScript). To me, this is similar to asking for machine-readable data in a FOIA instead of trying to OCR a document scan – the latter is possible, but it’s usually a lot more trouble.</p>

<h2 id="how-a-normal-site-uses-http">How a normal site uses HTTP</h2>

<p>A typical web application does almost all its interactions using the GET method (meaning, the browser wants to read data but doesn’t want to write it) and differentiates between views using the URL. As a refresher, here are the parts of a URL (image courtesy of MDN):</p>

<p><img src="/assets/images/2022-10-28-asp-scraping/image-1.png.jpg" alt="Diagram labeling the parts of a URL" /></p>

<p>Different data views are usually expressed using either the path or query parameters. Either way, you have a specific URL that you can construct and request in order to scrape data out of the service. Since you’re not changing anything, you can just use GET instead of <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods">one of the other HTTP verbs</a>, and you don’t have to send a request body to the server. This pattern – where a resource is made available at a given address regardless of other requests that have been made – is often referred to as RESTful. It’s popular because it’s easy to implement and friendly to caching (meaning that it’s cheap).</p>
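<p>Scraping a site like that mostly comes down to building URLs. As a rough sketch, assuming a made-up endpoint at <code class="language-plaintext highlighter-rouge">data.example.org</code> with hypothetical <code class="language-plaintext highlighter-rouge">district</code> and <code class="language-plaintext highlighter-rouge">year</code> query parameters, it might look like this:</p>

```javascript
// Hypothetical endpoint and parameter names, for illustration only.
function buildReportURL(district, year) {
  const url = new URL("https://data.example.org/reports/expenditures");
  url.searchParams.set("district", String(district).padStart(2, "0"));
  url.searchParams.set("year", String(year));
  return url.toString();
}

// One address per view, and each one can be requested independently of the others.
const urls = [1, 2, 3].map((district) => buildReportURL(district, 2018));
console.log(urls);

// Each of these could then be fetched with a plain GET, in any order:
// const response = await fetch(urls[0]);
```

<p>Because every view has its own address, there’s no session to keep track of: if one request fails, we can retry it without replaying anything that came before.</p>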

<h2 id="how-asp-pages-use-http">How ASP pages use HTTP</h2>

<p>The first thing you will notice when trying to scrape an ASP-based site is that you’ll see repeated requests made to the server for different data, but all using the same URL. In order to extract data from these, we have to learn to send our requests in a slightly different format.</p>

<p>For example, let’s load the <a href="https://www.nycenet.edu/offices/d_chanc_oper/budget/dbor/sber/FY2018/FY2018_Default.aspx">NYC school-based expenditure report dashboard</a>, then select District 1 to look at. The page will refresh, and the document request will look like this:</p>

<p><img src="/assets/images/2022-10-28-asp-scraping/image-2.png.jpg" alt="A POST request made to the dashboard URL" /></p>

<p>The request URL is the same as the page we’re on, but notice that the method is a POST, not a GET. That usually means that we’re sending data to the server in the request body.</p>

<p>We can see what we sent by clicking over to the Payload tab:</p>

<p><img src="/assets/images/2022-10-28-asp-scraping/image-3.png.jpg" alt="The top of a request form, ending in a big parameter hash" />
<em>(many lines of __VIEWSTATE later…)</em>
<img src="/assets/images/2022-10-28-asp-scraping/image-4.png.jpg" alt="The bottom of a request" /></p>

<p>When we changed the drop-down to select the district, instead of simply requesting data from a new URL, the page made a form submission with a number of parameters corresponding to the dashboard settings and a unique value assigned to our viewing session (the <code class="language-plaintext highlighter-rouge">__VIEWSTATE</code>). If you inspect the page, you can find the hidden form inputs that it uses to generate this submission:</p>

<p><img src="/assets/images/2022-10-28-asp-scraping/image-5.png.jpg" alt="HTML page in dev tools with hidden inputs shown" /></p>

<p>What this boils down to is that ASP pages are not inherently <em>stateless</em> the way that most HTTP endpoints are. Instead, the server is maintaining a session for you and each transaction has to match that session. You can imagine this as though the dashboard has its own browser tab open, identical to yours visually, and when you click a button, the page tells the server to perform the actual click in its tab, and then it sends the updated page back to you. If you try to make a request that doesn’t match the server’s idea of where you “are” on the site, it’ll usually just send the index page back to you as a fallback.</p>

<p>Luckily, even though this is a dizzyingly overcomplicated way to run a server, it’s not terribly hard to send these events from our side. We just need to make our own POST request that includes the values that the server embedded in the form, updated with the correct input parameters. Step-by-step, that process usually looks something like this:</p>

<ol>
  <li><strong>Get the initial page, and pull out the hidden input values into a state object.</strong> We could be more discriminating, but the easiest way to do this for ASP is to just query for inputs with <code class="language-plaintext highlighter-rouge">type="hidden"</code> set on them, and create an object out of their <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">value</code> attributes.</li>
  <li><strong>Update the filter values in that object.</strong> Usually I find these values by poking around in the network tab and seeing what changes when I update the form – <em>this</em> form input produces <em>that</em> form value in the POST submission.</li>
  <li><strong>POST that object to the server and pull data out of the response.</strong></li>
  <li>(optional) Update our state object to match the response, then repeat from step 2.</li>
</ol>

<p>You can see an example scraper for a couple of NYC finance pages <a href="https://gist.github.com/thomaswilburn/a54b691498184ca90c17367a3abba709">in this Gist</a>. Of particular note is the <code class="language-plaintext highlighter-rouge">getASPValues</code> function, which extracts the hidden form inputs:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">async</span> <span class="kd">function</span> <span class="nx">getASPValues</span><span class="p">(</span><span class="nx">target</span><span class="p">)</span> <span class="p">{</span>
  <span class="kd">var</span> <span class="nx">response</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">fetch</span><span class="p">(</span><span class="nx">target</span><span class="p">);</span>
  <span class="kd">var</span> <span class="nx">html</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">response</span><span class="p">.</span><span class="nx">text</span><span class="p">();</span>
  <span class="kd">var</span> <span class="nx">$</span> <span class="o">=</span> <span class="nx">cheerio</span><span class="p">.</span><span class="nx">load</span><span class="p">(</span><span class="nx">html</span><span class="p">);</span>
  <span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="p">{};</span>
  <span class="kd">var</span> <span class="nx">inputs</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="s2">`input[type="hidden"]`</span><span class="p">);</span>
  <span class="k">for</span> <span class="p">(</span><span class="kd">var</span> <span class="nx">i</span> <span class="k">of</span> <span class="nx">inputs</span><span class="p">)</span> <span class="p">{</span>
    <span class="nx">data</span><span class="p">[</span><span class="nx">i</span><span class="p">.</span><span class="nx">attribs</span><span class="p">.</span><span class="nx">name</span><span class="p">]</span> <span class="o">=</span> <span class="nx">i</span><span class="p">.</span><span class="nx">attribs</span><span class="p">.</span><span class="nx">value</span> <span class="o">||</span> <span class="dl">""</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="k">return</span> <span class="nx">data</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>There are also some values that I snooped from the network tab, such as the <code class="language-plaintext highlighter-rouge">__EVENTTARGET</code> (which tells the server that we changed the district drop-down):</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">const</span> <span class="nx">sberFormValues</span> <span class="o">=</span> <span class="p">{</span>
  <span class="na">__EVENTTARGET</span><span class="p">:</span> <span class="dl">"</span><span class="s2">ctl00$ContentPlaceHolder1$Input_District</span><span class="dl">"</span><span class="p">,</span>
  <span class="nx">ctl00$ContentPlaceHolder1</span><span class="na">$reportnumber</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
  <span class="nx">ctl00</span><span class="na">$Fiscal_Year</span><span class="p">:</span> <span class="dl">"</span><span class="s2">SELECT_A_YEAR</span><span class="dl">"</span>
<span class="p">};</span>
</code></pre></div></div>

<p>We can then combine our ASP form values, our event constants, and our specific filter setting into a form submission and POST it to scrape a given district page:</p>

<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">var</span> <span class="nx">getSBER</span> <span class="o">=</span> <span class="k">async</span> <span class="kd">function</span><span class="p">(</span><span class="nx">asp</span><span class="p">,</span> <span class="nx">district</span><span class="p">)</span> <span class="p">{</span>
  <span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="p">{</span>
    <span class="p">...</span><span class="nx">asp</span><span class="p">,</span>
    <span class="p">...</span><span class="nx">sberFormValues</span><span class="p">,</span>
    <span class="nx">ctl00$ContentPlaceHolder1</span><span class="na">$Input_District</span><span class="p">:</span> <span class="nb">String</span><span class="p">(</span><span class="nx">district</span><span class="p">).</span><span class="nx">padStart</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="dl">"</span><span class="s2">0</span><span class="dl">"</span><span class="p">)</span>
  <span class="p">};</span>
  <span class="c1">// generate a form-encoded POST body</span>
  <span class="kd">var</span> <span class="nx">body</span> <span class="o">=</span> <span class="nx">formEncode</span><span class="p">(</span><span class="nx">data</span><span class="p">);</span>
  <span class="kd">var</span> <span class="nx">response</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">fetch</span><span class="p">(</span><span class="nx">SBER</span><span class="p">,</span> <span class="p">{</span>
    <span class="na">method</span><span class="p">:</span> <span class="dl">"</span><span class="s2">POST</span><span class="dl">"</span><span class="p">,</span>
    <span class="nx">headers</span><span class="p">,</span>
    <span class="nx">body</span>
  <span class="p">});</span>
  <span class="kd">var</span> <span class="nx">html</span> <span class="o">=</span> <span class="k">await</span> <span class="nx">response</span><span class="p">.</span><span class="nx">text</span><span class="p">();</span>
  <span class="kd">var</span> <span class="nx">$</span> <span class="o">=</span> <span class="nx">cheerio</span><span class="p">.</span><span class="nx">load</span><span class="p">(</span><span class="nx">html</span><span class="p">);</span>
  <span class="kd">var</span> <span class="nx">rows</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="dl">"</span><span class="s2">.CSD_HS_Detail</span><span class="dl">"</span><span class="p">);</span>
  <span class="kd">var</span> <span class="nx">scraped</span> <span class="o">=</span> <span class="p">[];</span>
  <span class="k">for</span> <span class="p">(</span><span class="kd">var</span> <span class="nx">row</span> <span class="k">of</span> <span class="nx">rows</span><span class="p">)</span> <span class="p">{</span>
    <span class="kd">var</span> <span class="p">{</span> <span class="nx">children</span> <span class="p">}</span> <span class="o">=</span> <span class="nx">row</span><span class="p">;</span>
    <span class="kd">var</span> <span class="nx">cells</span> <span class="o">=</span> <span class="nx">children</span><span class="p">.</span><span class="nx">map</span><span class="p">(</span><span class="nx">c</span> <span class="o">=&gt;</span> <span class="nx">$</span><span class="p">(</span><span class="nx">c</span><span class="p">).</span><span class="nx">text</span><span class="p">());</span>
    <span class="nx">scraped</span><span class="p">.</span><span class="nx">push</span><span class="p">(</span><span class="nx">cells</span><span class="p">);</span>
  <span class="p">}</span>
  <span class="k">return</span> <span class="nx">scraped</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
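<p>The <code class="language-plaintext highlighter-rouge">formEncode</code> helper and the <code class="language-plaintext highlighter-rouge">headers</code> object aren’t shown in the excerpt above. The following is not necessarily how the Gist defines them, but a minimal sketch of what they need to do, assuming a standard URL-encoded form body, could be built on <code class="language-plaintext highlighter-rouge">URLSearchParams</code>:</p>

```javascript
// Hypothetical stand-in for the formEncode() helper used above.
function formEncode(data) {
  const params = new URLSearchParams();
  for (const [name, value] of Object.entries(data)) {
    // Form values are always sent as text, so numbers are converted to strings.
    params.append(name, String(value));
  }
  return params.toString();
}

// This header tells the server how the POST body is encoded.
const headers = { "Content-Type": "application/x-www-form-urlencoded" };

const body = formEncode({
  ctl00$ContentPlaceHolder1$reportnumber: 1,
  ctl00$ContentPlaceHolder1$Input_District: "01"
});

// The "$" in each ASP control name is percent-encoded as "%24" in the output.
console.log(body);
```

<p>Whatever encoder you use, the field names have to survive the round trip exactly: the server matches them against its own control IDs.</p>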

<p>This code is in JavaScript, but the equivalent in Python using BeautifulSoup is pretty straightforward. I recommend <a href="https://requests.readthedocs.io/en/latest/user/quickstart/#more-complicated-post-requests">using Requests</a> to send your POST, since it will automatically encode the data for you and set the correct header to match the form content type. The Gist linked above <a href="https://gist.github.com/thomaswilburn/a54b691498184ca90c17367a3abba709#file-scrape-py">also includes a port of the scraper to Python</a> for comparison.</p>

<p>Although it seems complicated compared to typical scraping tasks, once you’ve written a few of these and the building blocks are more familiar, you’ll be able to write ASP scrapers very quickly. And as you become accustomed to working through the network protocol, not the UI level, you’ll be impressed by how much of this same technique can be adapted to other scraping tasks, including script-generated data views and native applications.</p>]]></content><author><name>Thomas Wilburn</name></author><summary type="html"><![CDATA[How to replace Selenium and browser automation with HTTP exchanges]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://dataviz.chalkbeat.org/assets/images/2022-10-28-asp-scraping/header.jpg" /><media:content medium="image" url="https://dataviz.chalkbeat.org/assets/images/2022-10-28-asp-scraping/header.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Updates to the Dailygraphics Rig</title><link href="https://dataviz.chalkbeat.org/2022/08/18/dg-next-updates.html" rel="alternate" type="text/html" title="Updates to the Dailygraphics Rig" /><published>2022-08-18T00:00:00+00:00</published><updated>2022-08-18T00:00:00+00:00</updated><id>https://dataviz.chalkbeat.org/2022/08/18/dg-next-updates</id><content type="html" xml:base="https://dataviz.chalkbeat.org/2022/08/18/dg-next-updates.html"><![CDATA[<p>I didn’t choose to join Chalkbeat solely because the team was already using the open-source tools <a href="https://blog.apps.npr.org/2019/01/04/dailygraphics-update-and-next.html">I had written at NPR</a>, but I have to admit that it did make the transition a lot easier. One of the virtues of a successful open-source project is that it makes knowledge transferable. 
While the NPR <a href="https://github.com/nprapps/interactive-template">Interactive Template</a> and <a href="https://github.com/nprapps/dailygraphics-next">Dailygraphics Rig</a> are not de facto standards, they’re common enough throughout the industry (and similar enough to related tools from other newsrooms) that learning them has benefits.</p>

<p>Of course, the other virtue of open-source software is that improvements don’t have to be limited to just the originating team — they can also come from users, and then make their way out to the broader community when they’re merged upstream. To that end, we’ve been working at Chalkbeat on adding <a href="https://github.com/chalkbeat/dailygraphics-next">new capabilities to Dailygraphics Next</a>, making it an even more powerful tool for building charts and graphs on the web.</p>

<h2 id="supporting-google-docs-and-archieml">Supporting Google Docs and ArchieML</h2>

<p>Historically, the Dailygraphics workflow has loaded text in from spreadsheets along with any other data. This works for fairly small and uniform pieces of content, like headlines and chatter, but it’s awkward for any substantial amount of text.</p>

<p>Here at Chalkbeat, we do most of our traditional charts through Datawrapper, falling back on the rig for graphics that don’t fit into a standard template, prioritize more interaction, or (in the case of our <a href="https://dataviz.chalkbeat.org/2022/02/18/flowchart.html">NYC COVID policy flowchart</a>) both. That means it has become more common for us to need structured text, such as in our <a href="https://tn.chalkbeat.org/2022/7/18/23268665/memphis-shelby-county-schools-board-election-voter-guide-tennessee-safety-equity-mental-health">school board voter guides</a>.</p>

<p>We added Docs support (with built-in ArchieML parsing via <a href="https://github.com/nprapps/betty">Betty</a>) as a “secret” feature — it’s not built into the default UI the way that Sheets are, but can be added to a template or a graphic manifest to enable a similar “open document” button on any graphic. Once added, you can access the raw file via <code class="language-plaintext highlighter-rouge">TEXT.raw</code> in your HTML templates, and the parsed ArchieML object will be available as <code class="language-plaintext highlighter-rouge">TEXT.parsed</code>.</p>
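
<p>For readers who haven’t seen ArchieML before, here is a rough sketch of the kind of object a parsed document turns into. The <code class="language-plaintext highlighter-rouge">parseFlatArchie</code> function is a hypothetical stand-in written for this illustration, not Betty or the real ArchieML parser, and it only handles simple <code class="language-plaintext highlighter-rouge">key: value</code> lines with dot-notation nesting. The sample document text is also made up.</p>

```javascript
// Simplified illustration of ArchieML's "key: value" lines. This is NOT the
// real parser: arrays, {scopes}, multi-line values, and comments are left out.
function parseFlatArchie(text) {
  const result = {};
  for (const line of text.split("\n")) {
    // Keys may contain letters, digits, underscores, hyphens, and dots.
    const match = line.match(/^\s*([A-Za-z0-9_.-]+)\s*:(.*)$/);
    if (!match) continue; // lines without a key are treated as ignorable prose
    const path = match[1].split(".");
    const last = path.pop();
    let target = result;
    // Dots in a key create nested objects, e.g. "candidate.name".
    for (const part of path) {
      if (typeof target[part] !== "object" || target[part] === null) {
        target[part] = {};
      }
      target = target[part];
    }
    target[last] = match[2].trim();
  }
  return result;
}

// Made-up document text, standing in for the raw contents of a Google Doc.
const raw = [
  "headline: Meet the candidates",
  "This sentence has no key, so it is skipped.",
  "candidate.name: Jane Doe",
  "candidate.district: 4"
].join("\n");

const parsed = parseFlatArchie(raw);
console.log(parsed.headline);
console.log(parsed.candidate.district);
```

<p>In a rig template, the same kind of lookup would presumably go through <code class="language-plaintext highlighter-rouge">TEXT.parsed</code>, so a <code class="language-plaintext highlighter-rouge">headline</code> key in the Doc would be read as <code class="language-plaintext highlighter-rouge">TEXT.parsed.headline</code>.</p>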

<p>There are of course advantages in terms of structure — ArchieML makes it possible to build out freeform and nested data structures in a way that a spreadsheet can’t. It also gives us better options for migrating graphics to or from the Interactive Template, which added Docs support a few years back. But the real win for this feature is how much easier it is to edit and collaborate, since reporters and editors can add specific comments or track changes much more easily than in a cramped grid cell. If you’re doing any kind of narrative presentation in a graphics embed, it’s worth checking out!</p>

<h2 id="infrastructure-upgrades">Infrastructure upgrades</h2>

<p>Last year, new users of Dailygraphics started to notice warnings in the console for the <a href="https://github.com/thlorenz/mold-source-map">mold-source-map</a> package — nothing fatal, but alarming. This package turns out to be part of the venerable Browserify bundler, which the rig used to combine and compile client-side scripts. One of the earliest script compilation tools from the modern JavaScript tooling era, Browserify was clearly suffering from some developer neglect compared to newer tools like Webpack or Rollup.</p>

<p>The Dailygraphics rig does have slightly different needs from most front-end projects, which means we can’t just pull in off-the-shelf bundler configuration. For one thing, it’s building an arbitrary set of bundles based on the contents of the graphics repo, not a single well-known output. It also needs to compile dependencies from modules in a non-standard location, since the graphics are kept outside the rig itself. And it should be able to do so entirely in-memory, without producing files on disk, since Dailygraphics Next doesn’t store any of its compiled artifacts locally.</p>

<p>After some experimentation, we were able to move pretty seamlessly from Browserify to Rollup as a bundler, without breaking compatibility with older graphics at either Chalkbeat or at NPR. This change should put the rig onto firmer ground moving forward, including creating an easier path for moving to the newest versions of D3 (which is only distributed using the ES module <code class="language-plaintext highlighter-rouge">import</code> syntax that Rollup natively supports).</p>

<p>While we were in the guts of the rig’s dependencies, we were also able to update some other infrastructure: it now uses the newest version of the AWS client libraries, which we hope will make it more reliable when publishing and synchronizing graphics.</p>

<h2 id="csv-and-offline-support">CSV and offline support</h2>

<p>If you’ve ever had to set up a fresh newsroom dataviz toolchain, you know that one of the most painful parts of the process is getting authorization right. While the Google integration of the Dailygraphics rig is one of its biggest selling points, it’s also a cryptic trudge through OAuth tokens, environment variables, and a constantly changing service console. The need to authorize against a Google account also means that even with aggressive caching, it’s often difficult to use the rig if you’re traveling or on a less-reliable network connection.</p>

<p>Spurred by a summer of travel and relocation, team member Kae Petrin and I decided to take this opportunity to give the rig the ability to work with local data instead of pulling from Sheets. We did so by adding CSV support to its data pipeline based on a new manifest key.</p>

<p>If you want, you can now create graphics templates that never talk to Google at all, just by swapping references to the <code class="language-plaintext highlighter-rouge">COPY</code> variable in the HTML layer over to the new <code class="language-plaintext highlighter-rouge">CSV</code> object. This also plays nicely with the rig’s synchronization support: large data files can be excluded from source control, synchronized to S3, processed externally, and accessed as local CSV.</p>

<p>Hand in hand with this functionality, there’s now a new <code class="language-plaintext highlighter-rouge">--offline</code> flag for the rig that disables the authentication checks it normally performs. This “airplane mode” will still reach out to the Google Drive API if you load a graphic that includes a <code class="language-plaintext highlighter-rouge">"sheet"</code> or <code class="language-plaintext highlighter-rouge">"doc"</code> key in its manifest, but it won’t ever redirect you to the “authorize account” screen if your connection drops momentarily. That makes it a handy option for developers who want a fully local workflow, or who need the rig to just back off a little bit while on the road.</p>
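
<p>As a rough mental model of that behavior, here is a minimal sketch. The function names and the shape of the <code class="language-plaintext highlighter-rouge">options</code> object are hypothetical and are not taken from the rig’s source code. The rule that an online rig redirects whenever it lacks a valid token is also a simplifying assumption.</p>

```javascript
// Hypothetical helpers sketching the behavior described above; these names
// are illustrative and do not come from the Dailygraphics rig itself.

// A graphic still talks to the Drive API when its manifest declares a
// "sheet" or "doc" key, even in offline mode.
function usesGoogle(manifest) {
  return "sheet" in manifest || "doc" in manifest;
}

// In offline mode the rig never sends you to the "authorize account" screen.
// Outside of offline mode, this sketch assumes a missing or expired token
// triggers the redirect.
function shouldRedirectToAuth(options, hasValidToken) {
  if (options.offline) return false;
  return !hasValidToken;
}

// Example: a sheet-backed graphic loaded while offline with no valid token.
const manifest = { sheet: "example-sheet-id" }; // placeholder ID
console.log(usesGoogle(manifest));
console.log(shouldRedirectToAuth({ offline: true }, false));
```

<p>A graphic whose manifest has neither key would skip the Google request entirely, which is what makes a fully local workflow possible.</p>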

<h2 id="join-us-on-the-cutting-edge-or-not">Join us on the cutting edge (or not)</h2>

<p>Do these additions sound interesting to you? If so, feel free to pull from our <a href="https://github.com/chalkbeat/dailygraphics-next">Dailygraphics fork</a>, which is under active development. However, if you’d prefer a little more vetting, many of these features have already been merged upstream into the <a href="https://github.com/nprapps/dailygraphics-next">main NPR version of the rig</a>, where they’re tested by the team there before being enabled. At the time of this writing, the Rollup bundler and AWS upgrades have been merged in, and Docs support is in a branch undergoing testing.</p>]]></content><author><name>Thomas Wilburn</name></author><summary type="html"><![CDATA[Adding support for Google Docs, CSV files, and a new JavaScript bundler]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://dataviz.chalkbeat.org/assets/images/2022-08-18-dg-next-updates/header.jpg" /><media:content medium="image" url="https://dataviz.chalkbeat.org/assets/images/2022-08-18-dg-next-updates/header.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>