Chalkbeat Data Team

Data Crimes of 2023

2023-12-19T00:00:00+00:00

As data journalists, we’re used to patching together a workflow from whatever tools we can lay our hands on, but there’s a fine line between MacGyver and MacGruber, and sometimes we cross it. Since we think it’s a good thing to share not only our triumphs, but also our hall of shame, here are some of our most duct-taped solutions to reporting problems from 2023.

(Suspiciously, almost all of these are related to processing dashboards.)

Thomas: Deleting 16,000 empty columns in Excel

Most people don’t know that Excel supports a maximum of 16,384 columns (that’s 2^14, for my real binary fans). It’s hard to imagine a scenario in which you would ever need that many, which is why it took me by surprise when I ran a set of spreadsheet files through a CSV converter in order to merge them, and discovered that the file size had ballooned from ~1MB to half a gigabyte, the vast majority of which was trailing commas.

When I opened the files in Excel, they looked normal: columns A through Q had the data that I would expect, and there was nothing to the right of Q. However, when I moused over the header and tried to resize the last column, I discovered that what I thought was a normal header row at the top of the table actually extended all the way out to near-infinity (with labels like “Cell 13678”), and it had simply been hidden by setting the column width to zero past that point:

oh no

Excel can handle these files just fine, because it doesn’t care about sparse ranges, but any of the tools that I normally use for combining data (such as openpyxl or xlsx2csv) try to get the full width of the sheet, see these hidden header cells extending off into the distance, and work themselves into a frenzy adding empty commas to every line of actual data.

I tried a few processing tricks to handle these files, before finally turning to my tool of last resort: Visual Basic for Applications, the macro language that’s built into Excel itself. Making a new workbook that contained the filenames of all the input files, I then ran this subroutine in the script editor to open each one, delete the trailing headers, and re-save the file.

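Public Sub trimSheets()
Dim i As Integer
Dim wb As Workbook
Dim file As String
Dim sheet As Worksheet
' Column A of the active sheet lists the 13 input file paths, one per row
For i = 1 To 13
    file = Cells(i, 1).Value
    Set wb = Workbooks.Open(file)
    Set sheet = wb.ActiveSheet
    ' Delete the hidden header cells to the right of column Q
    sheet.Range("R1:XFD10").Delete
    wb.Save
    wb.Close
Next i
End Sub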

It’s a hideous, utterly unscalable monstrosity, but it did the trick. Feeding these new XLSX files to our standard tools produced CSV files that I could safely combine into a single dataset.
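If the bloated CSVs already exist, another option is to trim them after the fact. Here is a minimal Python sketch of that idea, assuming the junk columns have header labels but no data beneath them. The function name and the sample rows are made up for illustration, and it expects at least a header row.

```python
import csv
import io


def trim_trailing_columns(rows):
    """Drop columns to the right of the last one that holds real data.

    The header row is ignored when measuring width, since in files like
    these the header is exactly what runs off toward column XFD.
    """
    header, *data = rows
    width = 0
    for row in data:
        # Index just past the last non-empty cell in this row.
        used = max((i + 1 for i, cell in enumerate(row) if cell.strip()), default=0)
        width = max(width, used)
    return [row[:width] for row in rows]


# Hypothetical sample: two real columns, then header-only filler.
sample = "name,count,Cell 3,Cell 4\nAdams,12,,\nBaker,7,,\n"
trimmed = trim_trailing_columns(list(csv.reader(io.StringIO(sample))))
print(trimmed)
```

This only helps once a converter has managed to produce the CSV at all, so it complements the macro rather than replacing it.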

Kae: Filtering and re-aggregating state report cards

Normally, data dashboards suck, but they have one thing going for them: all the data is formatted the same way over time, with the same column names/headers and consistent output.

Not so for one state. The download files boast 12 separate tabs — which change names from year-to-year — and 1,000+ columns per tab. Those also change names. Fun.

On the plus side, this problem led to a nifty piece of reusable code.

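import glob
import re

import pandas as pd

# output_dir and cleaned_dir are folder paths defined elsewhere
dataset = ['General']

for data in dataset:
   # Collect every year's CSV export for this tab
   all_files = glob.glob(output_dir + '*' + data + '*.csv',recursive=True)
   print(all_files)
   all_df = []
   for f in all_files:
       df = pd.read_csv(f, sep=',')
       # One-off renames for columns that changed names between years
       bad_columns = {"Student Enrollment - Total": "# Student Enrollment",
                      '# Grade 9 Total' : '# Grade 9'}
       # Standardize the "Number of ..." prefix before applying the one-offs
       df.rename(columns=lambda x: re.sub('Number of ','# ',x), inplace=True)
       df.rename(columns=bad_columns, inplace=True)
       all_df.append(df)
   # Stack the years, then keep only district and statewide rows
   df = pd.concat(all_df, sort=False)
   df = df[(df.Type == 'District') | (df.Type == 'Statewide')]
   df.to_csv(cleaned_dir + data + '_processed.csv',sep=',')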

Sometimes, it only takes one run-through to pull all the relevant data.

This particular time, it took four separate versions of this code, because the state apparently stopped rolling up school-level data to the district level in some (but not all) years. I had to sum all the schools vertically, then group/sum all the relevant student group columns horizontally, then recreate the unique district ID from the school ID hash. Oh, and some of the 2018 data was just in a 2023 column with “(2018)” appended to the standard column name.
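The vertical-then-horizontal roll-up can be sketched in plain Python. This is an illustration, not the code from the project: the column names and counts are invented, and it assumes each row already carries a district ID, whereas the real data needed that ID reconstructed from the school ID first.

```python
from collections import defaultdict

# Hypothetical school-level rows; column names are invented for illustration.
schools = [
    {"district_id": "D1", "school_id": "D1-001", "# Grade 9 Male": 40, "# Grade 9 Female": 45},
    {"district_id": "D1", "school_id": "D1-002", "# Grade 9 Male": 10, "# Grade 9 Female": 5},
    {"district_id": "D2", "school_id": "D2-001", "# Grade 9 Male": 30, "# Grade 9 Female": 20},
]
group_columns = ["# Grade 9 Male", "# Grade 9 Female"]


def roll_up(rows, columns):
    """Sum school rows into one row per district, then total the groups."""
    districts = defaultdict(lambda: dict.fromkeys(columns, 0))
    for row in rows:
        for col in columns:
            # Vertical step: add each school's count to its district.
            districts[row["district_id"]][col] += row[col]
    for totals in districts.values():
        # Horizontal step: combine the student group columns.
        totals["# Grade 9"] = sum(totals[col] for col in columns)
    return dict(districts)


district_totals = roll_up(schools, group_columns)
print(district_totals["D1"])
```

Keeping the two steps separate makes it easier to spot-check either one against the raw file.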

After all that, a particular school district disputed the state’s numbers, claiming they were off by as much as a factor of 10 in one school year. So I pulled out the calculator on the raw, pre-processing dataset. The state’s numbers may well be wrong! But thankfully, our totals weren’t, which saved us from having to re-run the whole pipeline.

Thomas: Extracting ACLU data through the dev tools debugger…

If you’re writing about laws targeting trans students, there are a few datasets available but none of them are particularly great. The easiest to cite in a journalism context is probably the ACLU’s “Mapping Attacks on LGBT Rights” (although, as always, it’s the table we want and not the map). They don’t make this downloadable in a machine-readable format, but it’s scrapable, right?

Well, not really. The page is built in Vue and actually assembled in the browser, with data being passed to the interactive sections via attributes:

There are literally thousands of lines of JSON being jammed into the middle of the HTML here. Setting aside any critique of the architecture, this is pretty annoying.

If our goal was to have regular updates of this dataset, we’d need to write a scraper that loads this page, finds these Vue-specific attributes, converts all the HTML entities back into strings, and hopes that none of this implementation changes. That’s doable, and not even particularly difficult. But we really only needed this information once. So instead, I opened the browser dev tools, found a reference to this.bills in the JavaScript code, set a breakpoint, and just copied the value out of the console.
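For the record, the scraper route might look roughly like the following sketch, which uses only the Python standard library. The tag name, the `:bills` attribute name, and the sample markup are assumptions for illustration, and fetching the page itself is left out.

```python
import json
from html.parser import HTMLParser


class AttributeJSONParser(HTMLParser):
    """Collect JSON stored in a named attribute anywhere in the page."""

    def __init__(self, attribute):
        super().__init__()
        self.attribute = attribute
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # HTMLParser has already turned entities like &quot; back into
            # plain characters by the time the value reaches us.
            if name == self.attribute and value:
                self.found.append(json.loads(value))


# Hypothetical markup: the real tag and attribute names may differ.
page = '<bill-map :bills="[{&quot;state&quot;: &quot;TX&quot;, &quot;bill&quot;: &quot;HB 1&quot;}]"></bill-map>'
parser = AttributeJSONParser(":bills")
parser.feed(page)
bills = parser.found[0]
print(bills)
```

Any scraper like this is only as stable as the page's markup, which is the main argument for the one-time debugger approach.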

I don’t think the ACLU was being intentionally obfuscatory (they just built on the defaults that the framework provided), but perhaps this is a good reminder to us all that there’s not really any such thing as a “secure” JavaScript application if you’re willing to take a crowbar to it in the debugger.

Kae: Tracking schools that change operators, and districts, and also operators and districts

One state that we’ve covered likes to raise existential questions about the nature of schools and districts.

For example, what if a school was a school in a district? And then it was also its own district, for no apparent reason? But it’s also still in a public school district that same year?

And then what if a charter school operator took it over, so it actually became its own district? Also, for some reason, it was also still in the public school district? What if this school just disappeared from school-level reporting entirely for a year, then came back? What if, for some reason, the exact same school was two schools simultaneously in the same year, and one was a district and one was a school?

Oh, and what if some grades were also just a separate school, but only in the data, not in the actual school? What if one school was just a school, no district, and some other school was just a district, not a school? What if all of that was also different between Reading test and ELA/Math test reporting?

What then? Huh?

Normally, the “what if a school were a district, actually” problem is bad enough. But when we’re also trying to legitimately track schools over time for accountability purposes, it becomes a whole extra nightmare.

This is a common data question in a few different cities, where special “turnaround” programs shifted struggling district schools into the hands of charter school operators, or shifted them into specific district-run programs with similar purposes. From a data perspective, these are (usually) considered different schools. But from a student experience perspective, it’s often the same building and the same school.

So how do we capture both the incongruence and the continuity — and answer questions about the long-term outcomes of schools changing hands?

Basic questions like, “well, did the charter operator improve student outcomes?” become pretty gnarly to answer. That’s partially because charter schools often have fewer reporting requirements than district schools, and partially because it’s hard to tell what trends are attributable to a new operator, vs. other outside factors. (Like, I don’t know, a pandemic, remote learning, or a change in test format.)

But some states really want to make a problem out of this.

Anyway, the main solution to this is “lots and lots and lots of data crosswalks, utterly unreplicable copy-pasting, and by God you better hope you wrote all this down,” but I’ll let the project folder speak for itself.

Thomas: Unzipping the poor man’s GZIP

As Kae notes above, dashboards are the ruin of many a data journalist’s week. Over time, you develop a stable of tricks that you can use to extract what you need from them, within limits. So it’s almost a pleasure when, from time to time, you run into something puzzling and new.

My usual routine with a dashboard is to immediately look at the network requests it makes, and see if there’s a straightforward endpoint I can simply copy. To my surprise, one state was loading data from pairs of CSV files: one that contained only numbers, and the other being an “index” file with only one column of text strings:

It took a little while to figure out what was going on, but by searching for the values that we could see in the dashboard page, we eventually realized that any time there’s a whole number in the first file, it represents the value at that row number in the second. So if you see “0” in the data, that actually means “1.8%” pulled from the first row of the index file, “1” translates to “1.9%”, and so on down the list.
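Once the scheme is understood, decoding it takes only a few lines. The sketch below assumes the index file has no header row and that lookups are zero-based, as in the example above. The sample file contents and the function name are invented.

```python
import csv
import io


def decode(data_rows, index_values):
    """Swap whole-number cells for the string at that row of the index."""
    decoded = []
    for row in data_rows:
        decoded.append([
            # Anything that isn't a valid row number is passed through as-is.
            index_values[int(cell)]
            if cell.isdecimal() and int(cell) < len(index_values)
            else cell
            for cell in row
        ])
    return decoded


# Hypothetical stand-ins for the two files the dashboard requested.
index_csv = "1.8%\n1.9%\n2.4%\n"
data_csv = "0,2\n1,0\n"

index_values = [row[0] for row in csv.reader(io.StringIO(index_csv))]
data_rows = list(csv.reader(io.StringIO(data_csv)))
print(decode(data_rows, index_values))
```

Passing unknown cells through untouched is a deliberate hedge: it makes a wrong guess about the encoding visible in the output instead of raising an error halfway through.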

The best I can imagine is that this approach is meant to be some sort of compression: a lookup table of repeated values, loosely like the Huffman coding that’s used in ZIP and MP3, which gives the most common values the shortest codes by way of a binary tree. But there’s no point to this, since browsers and servers already use GZIP compression for HTTP responses. It doesn’t really seem like it’s meant to hide anything. Maybe the developer just thought it was clever. And I’ll say this: while it’s still a data crime, it did make pulling data from this dashboard feel a little more fun, for once.

Scraping Sans Selenium

2023-10-23T00:00:00+00:00

This talk was originally presented at SRCCON 2023.

Download video as MP4 (182MB)

View slides

Showing what’s not there

2023-09-26T00:00:00+00:00

Municipal and state data is often imperfect. At its best, it’s riddled with suppression, missing values, or missing important categories or identifiers. At its worst, agencies export and maintain it incorrectly: it’s full of duplication errors, or an Excel formula got dropped or deleted, or there are 1,600 columns with unclear names and no documentation. That’s assuming you can even obtain it in a usable format, instead of trying to dredge meaning out of a PDF of slapped-together JPEGs.

These problems are only compounded the moment that you need to compare anything between cities or states, each with their own distinct problems.

Sometimes, analysis and text are sufficient and give you room for necessary disclaimers about data quality. Other times, even extensive cleaning, clever restructuring, or other more intensive troubleshooting can’t change the fact that the data has some fundamental, underlying issues.

If you’re trying to then visualize this data, you run into more problems. A chart, even with a lengthy note, communicates certainty in a way that a caveat-filled paragraph doesn’t. Audiences tend to take an image at face value.

Here are a few ways that Chalkbeat navigates this.

Showing non-comparable data on the same topic

Okay, so you’ve got a bunch of data. It’s pretty interesting data! No one else has put it all together! Assembling it would be novel work that could shed light on an important issue! But there’s no standardized tracking, not every agency even has it, and the agencies that do have it define it all differently.

You can’t let audiences compare it. It’s not comparable.

Which means you can’t put it all on the same axis.

What next?

Don’t let people compare the data

One solution: small multiples. This is a great way to show the landscape and variability in data without encouraging people to draw conclusions that just don’t exist.

If you want to, you can also add visual flags, notes, and other cues to emphasize just how different the data is.

Here’s one way Chalkbeat used this method to collate teacher retention data, which has been a subject of much debate in the education world:

You’ll notice that instead of laying out the line charts side-by-side, which we do more often when the data is at least slightly consistent but topically different, we require an interactive step to switch between datasets.

This graphic has a few key advantages:

The one-view-at-a-time format prevents audiences from comparing data that doesn’t use the same time-frames, collections processes, and definitions.
That extra “interaction” step cues audiences that they’re switching into something that is different, even if they don’t understand the full extent of how or why.
Individual notes let us define and describe the data on a case-by-case basis.

Show how and why it can’t be compared

But if the data is so not-comparable, or has additional collection and methodological issues, you might not even be able to do that.

In that case, it might be worth a graphic that highlights the uneven landscape — and the sheer impossibility of reporting all the data. You could highlight the different criteria for collection, states that don’t release data, school districts where a given survey wasn’t distributed, and so on.

We did that here:

This graphic doesn’t even dig into the fact that there are half a dozen different definitions of “nonbinary student” in this data, so those states that do collect it may be categorizing different groups of students.

But it highlights the uneven landscape: a mix of released data, unreleased data, and non-collection at the state level. It also neatly sidesteps the issue that the actual data itself is pretty mediocre.

A similar concept is also applicable anywhere that measured reporting thresholds are different, or where municipal data doesn’t make it up to state or federal reporting due to noncompliant collection methods (or just plain shoddy reporting).

Showing suppression and other missing data

When there are systematic flaws in something we want to chart, the solution is often to go with a really limited visual that doesn’t require using the problem data. This often involves abandoning over-time trends and other common visual approaches.

However, if you get a bit creative, you can analyze and visualize the quality of the data instead of the content of the data. This is especially appropriate when the poor-quality data illuminates an accountability failure.

Show what we do have

The simplest solution, of course, is to work around the data.

When we wanted to visualize the tiny handful of students reported as nonbinary by state departments of education across the U.S., we ran into a few issues. Different timeframes, for one. But also, because the student populations were so small, the over-time trend data was tiny at the absolute level, and extremely subject to looking like it had huge surges at the proportion level (from, say, 12 to 38 kids — a 217% increase that accounts for less than 0.1% of a state’s K-12 population).
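The gap between those two framings is easy to check with arithmetic. In this short sketch, the counts of 12 and 38 come from the example above, while the enrollment total is a made-up round number and not any state's real figure.

```python
def percent_change(before, after):
    """Relative change from before to after, as a percentage."""
    return (after - before) / before * 100


def share(count, population):
    """Count as a percentage of the whole population."""
    return count / population * 100


# 12 and 38 come from the example above; the enrollment total is a
# made-up round number, not any state's real figure.
enrollment = 900_000
print(round(percent_change(12, 38)))  # 217
print(share(38, enrollment) < 0.1)    # True
```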

As an extra fun problem, some school districts across the same states started letting students change their gender markers to nonbinary at different times — and others might not even have a way for students to report that data, even though the state has started collecting it. So the actual reporting entities change quickly year-over-year. On the flip side, other states added the option all at once at the state level, but students might not know it existed for a year or so.

This poses extra ethical issues due to widespread misinformation about increases in populations of trans youth. A misstep in visualization could enable bad-faith, out-of-context social media screenshots and contribute to viral misinformation.

So, we removed the trend data in favor of first-year and most-recent-year snapshots.

This restriction controls for the major problems with the data, but still gives some idea of the current reporting landscape.

On the other hand, when dealing with student testing data for online schools in Colorado, it simply didn’t make sense to do that — the snapshot data and the trend data were both bad because many of the schools were so new, and eligible students weren’t taking the tests. That meant that even though a lot of the data was reported, it wasn’t really meaningful or representative.

Instead we just chose a metric that every school did have to report for all students: graduation rates.

This isn’t always an option, of course, but especially when you’re working with big state releases, there may be multiple datasets that answer the same question.

In our case, we just wanted to know, “how do the students at online schools perform compared to students at brick-and-mortar schools?” Any number of datasets could answer that question. We just went with the one that showed the most complete picture of student outcomes, instead of the more traditional test score metric.

Show what we don’t have — and how prevalent that is

If all else fails, sometimes it’s better to visualize the information that is missing — demonstrating concretely what’s there and what’s not.

On the Colorado online schools story above, we found that we weren’t the only people having problems evaluating the performance of the schools. The schools threw up a bunch of “could not evaluate due to incomplete data” flags for state education department analysts, too.

The state releases categorized performance information each year. This ended up serving as a nice shorthand, which we were able to use to show the full extent of how little insight anyone has into the performance of these schools. It also emphasized that this problem was unique to these schools.

But in other instances, no one is doing that calculation for you. Still — you can do it yourself.

In the case of this government dashboard in Indiana, that’s what we ended up doing. The Indiana Department of Education started tracking median income of high school graduates in one of several attempts to understand student outcomes. In the dashboard, the data looks straightforward. And it’s presented as a comprehensive income tracking resource.

But if you check the data definitions, there’s a red flag: the median is based only on students who have “sustained employment,” which itself has three definitional requirements.

Okay, so maybe we can show general employment and sustained employment?

Except… how is this data collected? Well, it’s based on workers… employed only in Indiana… and whose employers participate in unemployment insurance. Which means that any graduates who move out-of-state aren’t tracked at all. And not even all workers in Indiana are tracked, on top of the “sustained” vs. more general employment definitions.

So instead, we FOIA’d the raw numbers, did some simple subtraction, then converted the results to proportions to show exactly how incomplete the data is:

It really emphasizes how unrepresentative the median is — and doesn’t even require explaining all the complicated limitations of the dashboard.

Knowing when something’s just too bad to work with

It’s of course also important to consider when data is so bad that you might take on liability by trying to use, fix, or interpret it. And it’s extra important to consider when visualization might lend credibility to trends that speak more to the randomness of bad data than any sort of reality.

Large margins of error caused by incomplete or suppressed data are an obvious one. In education data, for instance, I frequently check the suppressed totals against the unsuppressed totals, and make sure that the portion of unreported student data in any subselection of schools doesn’t exceed a reasonable threshold. If a third of relevant students’ data is omitted, it’s probably not a very useful metric.
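That check can be sketched in a few lines of Python. It assumes the overall total is published separately from the school-level counts. The numbers, the function name, and the use of None to mark a suppressed cell are all illustrative choices.

```python
def suppressed_share(reported_total, school_counts):
    """Portion of students missing once suppressed cells are dropped.

    school_counts holds one entry per school; None marks a suppressed value.
    """
    visible = sum(count for count in school_counts if count is not None)
    return (reported_total - visible) / reported_total


# Hypothetical numbers: the state reports 300 students in total,
# but two of the five school-level counts are suppressed.
counts = [120, None, 75, None, 15]
missing = suppressed_share(300, counts)
THRESHOLD = 1 / 3  # the "a third of students" rule of thumb from above

print(f"{missing:.0%} of students unreported")
print("usable" if missing < THRESHOLD else "probably not a useful metric")
```

The threshold itself is an editorial call, so it is worth keeping as a named value that can be argued over rather than burying it in a comparison.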

But there are subtler signs, too.

If a data collection is new, year-over-year trend data might be complete but subject to the whimsies of inconsistent implementations, slow rollouts, and spotty instruction on how to collect and report the data, especially if the data deals with a small subpopulation. If the data definitions changed for only one year — as we see frequently with COVID-19 data — it may be reasonable to make that one year visually distinct, but if the data definitions change every year, or multiple years in a row, a visual trend is just noise.

Maybe you’ve assembled data from multiple states or regional agencies that all track roughly the same thing but use totally different definitions, or are staffed inconsistently for in-person data collection like inspections. In that case, the data might be usable on an individual basis, but any assembled dataset could imply comparisons that just aren’t there.

Deciding what’s too bad is often more of an editorial call than an exact science, but the core questions to ask yourself include:

Is so much data missing that any visualization distorts reality?
Is it possible that the visual will overstate random noise, or accidentally create the impression of a trend that may or may not exist?
If new agencies, geographies, domains, etc, are being introduced to the tracking, is there any way to normalize for it — e.g. with a per capita calculation or some other metric?
How much would we need to explain this to an audience? Is it possible to explain it clearly and accurately, or would any data work be so full of caveats that it renders the important information incomprehensible?
As a journalist, are you comfortable standing by and defending the judgment calls you made to interpret the data?

Why and when is this worth all the time and effort?

It’s also worth considering whether the data’s low quality makes the story extra important.

Sometimes in newsrooms, if you pitch a data story with a million caveats attached, an editor’s response is: We can’t run a story if the data’s no good.

It’s a reasonable first impulse, in some senses. But it’s also a major oversight from a journalistic perspective: What’s the point of a free press, if we’re not uncovering new information and shedding light on problems that the public has overlooked?

In an increasingly data-driven age, problems with data are problems for people.

For instance, though the U.S. Census and other data reporting on multiracial people is notoriously inconsistent, they’re a growing portion of the U.S. that shouldn’t be ignored. And oversights in data collection can have significant impacts for individuals, because they can affect research and funding. Entirely missing data also means that government oversight measures can’t be enforced.

Part of the reason we encounter this so often at Chalkbeat is that we’re a mission-driven publication that covers public education through an equity-focused lens. That means we’re specifically interested in showing what’s happening to students who have been historically underserved by their public schools. And that often means that our data work tries to look at students, schools, and issues that are omitted or overlooked not just by other journalists, but by the systems that track and create data on schools to start with.

So when there is data that’s flawed, but it’s on especially undercovered groups, or on an issue that is complicated to track, it can be well worth the time to figure out some way to report within the limitations of the existing data — as long as you keep in mind why it matters, and who it matters to.

Building an interactive graphic for teacher salaries

2023-08-14T00:00:00+00:00

Coming into this summer, I knew I wanted to work with Chalkbeat’s dailygraphics rig. I thought it would be a great opportunity to learn about building visualizations with code, since I’d primarily used tools like Tableau, Flourish, and Datawrapper in the past.

Building custom visualizations isn’t always a good investment compared to those kinds of out-of-the-box tools. For a lot of local stories, it makes sense to stick with Datawrapper and not reinvent the wheel. However, for stories where we had a lot of data series, like changes in average NAEP scores for 13-year-olds by subject and racial or ethnic group, I worked in the rig to have more control over axes, colors, and small multiple grouping.

One story in particular called for something more individualized. At the end of June, Memphis reporter Laura Testino asked the data team to help her visualize changes to how Memphis Shelby County Schools pays its teachers. The district moved from a 31-step payscale to an 18-step scale, with new brackets for teachers with an Ed.S. degree.

The goal was to have a graphic that was teacher-friendly and contained all the pertinent information. We decided to make an interactive where teachers could select their current salary range, and the graphic would return the new salary range. Rather than use one of our pre-existing dailygraphics templates, we started from scratch using JavaScript and lit-html. The latter lets us create a strong link between JavaScript values and the structure or content of the page, where updating one changes the other.

The basics

The first thing we did was set up a box for teachers to input their salary and indicate what kind of degree they had. Both inputs updated a state object with values for the numerical salary and pay ladder. A separate variable for Ed.S. degrees was controlled by a checkbox, and could be true or false. By using a setState function in the template code, changes would update the values and then re-render the template based on the new state.

<label for="salary-input">Salary:</label>
<input
    id="salary-input"
    .value=${state.salary}
    @input=${e => setState("salary", e.target.value)}
>
<label for="degree-select">Degree:</label>
<select
  id="degree-select"
  .value=${state.oldDegree}
  @input=${e => setState("oldDegree", e.target.value)}
>
    <option value="bach">B.A.</option>
    <option value="mast">M.A.</option>
    <option value="doc">Doctorate</option>
</select>
<input 
  id="eds-check"
  type="checkbox"
  .checked=${state.eds}
  @input=${e => setState("eds", e.target.checked)}>
<label for="eds-check">You have an Ed.S.?</label>

The graphic takes the old salary, then displays the old salary band and new salary band. A salaryMatch function retrieves numbers from the data provided to us by Memphis Shelby County Schools, and then assigns the salary to a range by finding the last step whose salary is less than or equal to the input, falling back to the first step if none match.

function salaryMatch() {
  var oldDegree = "old_" + state.oldDegree;
  var newDegree = "new_" + state.oldDegree;
  if (state.eds && state.oldDegree != "doc"){
    newDegree = "new_eds";
  }
  var first = window.DATA[0];
  var salary = Number(String(state.salary).replace(/[\$,]/g, ""));
  let found = window.DATA.findLast(d => d[oldDegree] <= salary) || first;
  let previous = {
    degree: oldDegree,
    step: found.old_step,
    salary: found[oldDegree]
  };
  let current = {
    degree: newDegree,
    step: found.new_step,
    salary: found[newDegree]
  };
  let raise = current.salary - previous.salary;
  return {
    previous,
    current,
    raise,
    row: found
  };
}

The oldDegree and newDegree variables match the headers in the dataset, which have names like old_bach (aka old pay scale for teachers with bachelor’s degrees) and new_bach (new pay scale!). If someone with a bachelor’s or master’s degree indicates that they have an Ed.S., the Ed.S. salary overrides the salary ranges for those degrees, but it doesn’t override a doctorate.

The output from this function is returned to the user like this:

<h3>Your result:</h3>
<div class="result-grid">
  <div class="old">
    <h4>${str("old_step")}: ${previous.step}</h4>
    $${cash(previous.salary)}
  </div>
  <span>&raquo;</span>
  <div class="raise">
    + $${cash(result.raise)}<br>
    (${(result.raise / previous.salary * 100).toFixed(1)}%)
  </div>
  <span>&raquo;</span>
  <div class="new">
    <h4>${str("new_step")}: ${current.step}</h4>
    $${cash(current.salary)}
  </div>
</div>

So now, our hypothetical teacher knows what their new salary band is. But how much of a raise do they get? And how does that compare to other salary bands?

Iteration!

If you want to show each row of a dataset on a page, you have to iterate, or loop through the data. In a lit-html template, this is done with JavaScript’s array map() method.

In the early stages of this graphic, we iterated through the old salary ranges and displayed them as an ordered list in addition to the user’s specific salary band.

<ol>
  ${window.DATA.map(i => html`
    <li>
      B.A. $${dollars.format(i.old_bach)} |
      M.A. $${dollars.format(i.old_mast)} | 
      Doctorate $${dollars.format(i.old_doc)}
    </li>`)}
</ol>

This laid the groundwork for us to make a bar chart that showed the amount of raise for each salary band, calculated by subtracting the old salary band minimum from the new band minimum. Each row of the dataset gets a bar scaled so that the maximum raise takes up 100% of its grid container.

We wanted the graphic to emphasize the old salary band that matched the user’s salary, highlighting the resulting raise. In code, this translates to if-then logic; if the current salary variable falls within a certain range, then the graphic will visually emphasize that range. Conditional ternary operators in lit-html change the CSS style when the salary range matches.

Because we’re using a CSS grid to lay out our bar chart and its labels, we assign the “current” class in multiple places to bold text and highlight the bar. We also add a visually-hidden label to the user’s selected salary band for accessibility purposes, since bold styling isn’t announced by most screen readers.

${bands.map(b => {
  var highlight = b.current ? "current" : "not";
  return html`
  <div class="step ${highlight}">
    ${b.step}
    ${b.current ? html`<span class="sr-only">(your step)</span>` : ""}
  </div>
  <div class="bar-container ${highlight}">
    <div class="bar"
      style="width: ${b.raise / max * 100}%">
    </div>
  </div>
  <div class="amount ${highlight}">$${cash(b.raise)}</div>
  <div class="percent ${highlight}">${b.per_change}%</div>
`})}

And that’s it!

Takeaways

Although I had some familiarity with JavaScript before this, and I knew how important it was in web development, working on this graphic showed me how to actually develop an interactive graphic and understand the different moving parts.

Tools like Datawrapper let users look up information in a table or highlight specific areas of a graph, but they don’t make it easy to change what you’re looking at based on what a user wants, or to find something that isn’t explicitly included in the dataset, both of which are key to showing a specific salary within a range. So even though it’s more difficult to use, this is why we have the rig: to have a bit more control over how we present data to readers, and to keep me from blowing a gasket when I can’t get Datawrapper to cooperate.

Working with a reporter to understand the context for these changes and with senior data editor Thomas Wilburn made it possible to develop a graphic that was both visually appealing and useful to readers.

Slouching toward Pandas

2023-06-26T00:00:00+00:00

Why use spreadsheets for newsroom data? They may have been our best option in the Lotus 1-2-3 days, but in 2023 data journalists have many more choices. Notebook tools like Jupyter or Observable offer expressive syntax that can fall back to a real programming language, they’re visually appealing, and they’re often extremely powerful. As a result, they’re now the default for many teams.

Spreadsheets have not historically had most of these advantages, but they do have an almost-subterranean skill floor. I hate to see a reporter manually add numbers from two cells together with a calculator and then type the result into a third cell. But they can do that if they want to contribute to my analysis, whereas most of them are never going to use a Pandas dataframe. If they want to learn to do some basic cell calculations, that’s not so hard. Google Sheets are also deeply collaborative and shareable in a way that most notebooks are not.

I think it’s reasonable to decide that the power of notebook-based tooling is worth the barrier to entry and loss of frictionless collaboration. But if you, like me, think those are valuable qualities in a default toolkit, then the question becomes: how do we raise the skill ceiling closer to (or even equivalent to) notebook tools, so that experienced data reporters don’t feel like they’re trapped in the bronze age while everyone else shows off the latest ironworks?

This isn’t an impractical goal. Unknown to a lot of people, after years of stagnation, spreadsheets have made huge jumps in capability, to the point that I think they can be almost as expressive as a notebook or even a query language like SQL. I’ve already written about how to use named ranges and array formulas to clarify lookups and filters. Now I’d like to talk about two new Lisp-inspired functions that can help close the gap between Sheets and other data tools: LET and LAMBDA.

LET the right one in

Of the two, LET is easier to understand. It’s basically a wrapper to assign a local name to a value, but only within the scope of the current cell. For example, here’s a cell that computes the post-tax price for an item:

=LET(
  price, .99,
  tax_rate, 0.07,
  price + (price * tax_rate))

In this case, we define two local variables, price and tax_rate, with the values .99 and .07, respectively. The final argument to LET is the calculation using those variables (the “return value” in other languages). You can start to see how this might make even simple formulas more self-documenting, but we can really start to see the benefits of LET if we need to use conditionals or other branching code.

Let’s say that we were computing school testing aggregates based on student data (a named range called student_scores), but we want to suppress any result that’s based on fewer than 10 individuals. Without LET, that formula would probably look like this:

=IF(COUNT(FILTER(student_scores, school = A2)) < 10, "-", AVERAGE(FILTER(student_scores, school = A2)))

In this formula, the filter clause is repeated in two places, once for the length test and again for the actual result. If our selection criteria get more complicated (say, our sheet contains multiple test subjects and we only want one of them), we’ll have to update both filters in sync. Contrast this with a version that uses LET:

=LET(
  scores, FILTER(student_scores, school = A2),
  IF(COUNT(scores) < 10, "-", AVERAGE(scores)))

Now we only have to maintain our filter in one place, and there’s a lot less noise in the formula, since we can assign human-readable names to sections of it. Note the indentation: you can add newlines to your formulas in Sheets by pressing Ctrl-Enter (on Windows) or Command-Enter (on a Mac). I like to put each variable declaration on its own line in LET statements for greater legibility.

Leg of LAMBDA

While LET provides local variables, LAMBDA gives us the ability to compose functions for reuse. You can call a lambda function immediately, give it a local name using LET, or you can use the new “named function” panel to make them globally available to any cell in your sheet. LAMBDA expressions can also use other functions as inputs. For example, here’s a SUPPRESS function:

=LAMBDA(
  result_array, fn,
  IF(COUNT(result_array) < 10, "-", fn(result_array)))

Once this is assigned to a named function in my workbook, I can call this in a LET, passing in my scores and a lambda that says what to do with the unsuppressed results:

=LET(
  scores, FILTER(student_scores, school = A2),
  aggregate, LAMBDA(input, AVERAGE(input)),
  SUPPRESS(scores, aggregate))

I can change the aggregation by swapping out the lambda that I pass into SUPPRESS, such as asking for the standard deviation instead of the average:

=LET(
  scores, FILTER(student_scores, school = A2),
  aggregate, LAMBDA(input, STDEV(input)),
  SUPPRESS(scores, aggregate))

Sadly, although it would be nice to just pass in the aggregate function directly, built-in functions must be wrapped in a LAMBDA.

// this won't work
=SUPPRESS(E:E, SUM)

// wrapping SUM in a lambda creates a callable function
=SUPPRESS(E:E, LAMBDA(v, SUM(v)))
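For comparison, here’s a rough JavaScript analogue of SUPPRESS. It’s a sketch, not anything from the workbook: the names suppress, sum, and average are made up, and it assumes every entry in the list is a number (where COUNT in Sheets would skip non-numeric cells). The point is only that the second argument is itself a function, which is the role the LAMBDA wrapper plays in Sheets.

```javascript
// Hypothetical JavaScript analogue of the SUPPRESS named function.
// `values` stands in for result_array, and `fn` for the lambda argument.
function suppress(values, fn) {
  // fewer than 10 results: hide the aggregate
  return values.length < 10 ? "-" : fn(values);
}

// two interchangeable aggregates to pass in
var sum = list => list.reduce((total, v) => total + v, 0);
var average = list => sum(list) / list.length;

var small = [71, 85, 90];
var large = [70, 72, 74, 76, 78, 80, 82, 84, 86, 88];

console.log(suppress(small, average)); // "-", since there are only three scores
console.log(suppress(large, average)); // 79
console.log(suppress(large, sum)); // 790
```

Unlike Sheets, JavaScript lets you pass a function by name without wrapping it first.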

It’s worth noting, if you’re curious, that LET is effectively “syntax sugar” over LAMBDA: you can get the same effect by creating a function and immediately calling it with values to fill its arguments. The following two formulas do the same thing, but the LET formulation is probably easier for most people to reason about, since the variable values are closer to their name declarations.

=LAMBDA(x, y, x + y)(1, 2)

=LET(x, 1, y, 2, x + y)
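The same equivalence shows up in JavaScript, which may make it easier to see. This is only an analogy, not a description of how Sheets evaluates formulas: the first expression defines an arrow function and calls it immediately, and the second names its values before using them.

```javascript
// LAMBDA(x, y, x + y)(1, 2): define a function and call it right away
var viaLambda = ((x, y) => x + y)(1, 2);

// LET(x, 1, y, 2, x + y): name the values first, then compute with them
var viaLet;
{
  let x = 1;
  let y = 2;
  viaLet = x + y;
}

console.log(viaLambda, viaLet); // 3 3
```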

LAMBDA in turn seems to be sugar over ARRAYFORMULA, at least according to some of the error messages I’ve seen when working with it. For that reason, while testing, it’s often a good idea to develop a LAMBDA from a series of cell operations and then integrate them into a final form for your named function definition, just as you’d step-debug a complex procedure in a traditional language.

Case study #1: school transfer counts

One interesting thing about LET is that its binding expressions are cumulative–any variable definition can use the previous variables in its declaration. Even if you’re not using them in multiple places, you can make a very complex formula easier to understand by breaking down its component parts into named values and processing them line by line.

Here’s a real-world case: one of our bureaus had a spreadsheet containing individual student records from two points in time: where they were enrolled a number of years ago, their grade level at the time, and where they were enrolled more recently (assuming they were still enrolled).

I wanted to find the percentage of students at a given school ten years ago ($A2, the row prefix) who had ended up at a specific type of school now (C$1, the column header). The result is basically a pivot, but the cell values are computed based on two input sets: the number that are still enrolled, and all students originally at a given school.

The formula uses two named functions set up in the workbook for convenience’s sake: the MATCHGRADES named function only finds students in grades PE-2, and COUNTFILTERED correctly handles counting non-numeric values (COUNTA(FILTER(x)) will return 1 for no matches, since it counts the error as a value). The definitions for those look like this:

// MATCHGRADES
=MATCH(grade, { "PE", "PK", "K", "1", "2" }, 0)

// COUNTFILTERED
=IF(ISERROR(filtered), 0, COUNTA(filtered))

Here’s the final formula for cell C2, which I then dragged out to fill the complete table:

=LET(
  in_school, ARRAYFORMULA(students_from = $A2),
  in_grade, ARRAYFORMULA(MATCHGRADES(students_grade)),
  still_enrolled, ARRAYFORMULA(students_enrolled = TRUE),
  went_to, ARRAYFORMULA(students_to_type = C$1),
  enrolled, FILTER(students_id, in_school, in_grade, went_to, still_enrolled),
  all, FILTER(students_id, in_school, in_grade, still_enrolled),
  COUNTFILTERED(enrolled) / MAX(COUNTFILTERED(all), 1))

Breaking this formula down to its component pieces, in_school, in_grade, still_enrolled, and went_to are all arrays containing true/false results for each student in the source data:

in_school - did the student attend the school for this pivot row?
in_grade - were they in the grades we want to look at?
still_enrolled - are they still enrolled?
went_to - did they end up at the school type in this pivot column (charter, neighborhood, alternative, etc.)?

From these arrays, we create two filtered lists of student IDs, one that matches on all conditions and one that matches on school, grade, and enrollment (but for any school type). With those two filtered sets, the last line computes the final percentage: of the students from a given school, in the desired grades, who are still enrolled, the share who are now at a school of a given type.
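If the array logic is easier to follow in a general-purpose language, here’s a rough JavaScript sketch of the same calculation. The record fields (from, grade, enrolled, toType), the school names, and the sample rows are all invented for illustration, standing in for the students_* named ranges. It also uses one test function per condition where the formula builds a true/false array.

```javascript
// Hypothetical student records standing in for the named ranges
var students = [
  { id: 1, from: "Adams", grade: "K", enrolled: true, toType: "charter" },
  { id: 2, from: "Adams", grade: "1", enrolled: true, toType: "neighborhood" },
  { id: 3, from: "Adams", grade: "2", enrolled: false, toType: "" },
  { id: 4, from: "Adams", grade: "5", enrolled: true, toType: "charter" },
  { id: 5, from: "Baker", grade: "PK", enrolled: true, toType: "charter" }
];

// the same grade list that MATCHGRADES checks against
var GRADES = ["PE", "PK", "K", "1", "2"];

function shareByType(records, school, type) {
  // one test per condition, mirroring the named values in the LET
  var inSchool = s => s.from == school;
  var inGrade = s => GRADES.includes(s.grade);
  var stillEnrolled = s => s.enrolled === true;
  var wentTo = s => s.toType == type;
  // the two filtered sets: every qualifying student, then those at this school type
  var all = records.filter(s => inSchool(s) && inGrade(s) && stillEnrolled(s));
  var enrolled = all.filter(wentTo);
  // like MAX(..., 1), this guards against dividing by zero
  return enrolled.length / Math.max(all.length, 1);
}

console.log(shareByType(students, "Adams", "charter")); // 0.5
```

Of the four Adams students, one has left the system and one is outside the grade range, so the denominator is two, and one of those two ended up at a charter.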

Without LET, I’d probably have needed a column in each row for the total enrolled students at each origin school, then columns for the counts of enrolled students in each category, then a duplicate set of columns that find the percentage from those figures. The result would be more explicit, but also a lot less compact. Each person will need to determine where the line is for that trade-off, but it’s exciting to have the options. On a team with some fluency in this technique, I don’t think this example is excessive.

Case study #2: Filling sparse columns

Perhaps knowing that if you wave something called LAMBDA in front of a bunch of nerds, they’re going to expect that you give them functional programming tools, Sheets does include some higher-order functions now like MAP and REDUCE. Most of the time, these are not terribly useful, since we already have first-class support for ARRAYFORMULA and a handy set of aggregate functions available to us, but it’s good to have them around just in case.

However, one unambiguously helpful lambda-dependent function is SCAN, which generates a sequence of values for each cell in a range based on both its value and the previous output–it’s like a REDUCE that shows its work. You can use this for a number of cool tricks, but the most common is to fill in sparse columns.

Take, for example, the NAEP test results over time. If you run this query in the data explorer, you’ll get a table with individual rows for national and city results for each year–but only the first row will have the year displayed:

When you copy and paste this into a sheet for visualization, you’ll want to fill in the missing year values, but clicking on each one to fill down is almost as tiresome as adding the values by hand. It’s possible to write an IF that will fill in the values, but you’ll have to make sure that it fills the entire column, which can be a problem if there are gaps on both sides. Instead, here’s SCAN to the rescue:

=SCAN(A1, A1:A22, LAMBDA(acc, val, IF(ISBLANK(val), acc, val)))

The SCAN function takes three arguments. The first is a starting value, which for our purposes is just the start of the range (2022). The second is the range we want to cover, which in this case is our sparse column (I’ve selected the rows corresponding to 2022 through 2000). Finally, we give it a lambda that receives the previous output (acc, for “accumulator”) and the current cell value. If the current value is blank, we repeat the last value. Otherwise, we replace it with the new year.
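JavaScript doesn’t have a built-in scan, but a minimal version is short enough to show how the accumulator carries forward. This is a sketch: the scan helper and the sample column are hypothetical, and blank cells are represented as empty strings.

```javascript
// A minimal scan: like reduce, but it keeps every intermediate result
function scan(initial, values, fn) {
  var acc = initial;
  return values.map(val => {
    acc = fn(acc, val);
    return acc;
  });
}

// Hypothetical sparse column: the year only appears on the first row of each group
var column = [2022, "", "", 2019, "", 2017];

// same logic as the lambda in the formula: repeat the last value over blanks
var filled = scan(column[0], column, (acc, val) => val === "" ? acc : val);

console.log(filled.join(", ")); // 2022, 2022, 2022, 2019, 2019, 2017
```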

It used to be that if I didn’t want to do this kind of data cleanup by hand, I’d have to switch over to the Apps Script editor and write some extension code, then add a note for other team members in case they needed to recreate the results. Now, everything is right there in the sheet, and I don’t need to context switch or remember the SpreadsheetApp API.

Essentially, LET and LAMBDA don’t just make our formulas more readable: they also create options for extensibility that would have previously required leaving the workbook for another tool.

Scraping ASP-based dashboards at the protocol level

2022-10-28T00:00:00+00:00

We’re generally lucky, as journalists, to live in a time when open data portals and releases have become common reporting tools. However, “generally” is not “always,” and there are still plenty of times where “open” data has to be more forcibly extracted from a web-based dashboard. Here at Chalkbeat, we write our fair share of scrapers, depending on the bureau and the local public records regime.

Scraping is not a crime, and I’ve got the t-shirt to prove it. But just because it’s legal doesn’t mean it’s easy, or that the administrators of data sites make programmatic access a priority. Even if they’re not intentionally obfuscating data, the tools that are used for government data release were often designed in ways that make extraction difficult.

Sites using Microsoft’s ASP framework are particularly notorious for being opaque to simple scraping techniques. Faced with one of these dashboards, many data journalists turn to browser automation tools, like Selenium, that let them script clicks and keystrokes. However, this approach brings its own challenges:

Our fleshy, human eyes can easily see when the page is “interactable,” but it can be difficult to tell a computer when a page is ready to receive clicks, meaning that the script will either try to click/type too quickly or incorporate error-prone delays.
Pages are often not designed to be used programmatically, so we have to figure out selectors for inputs that may not be marked clearly, or might not exist until JavaScript creates them.
A browser will try to load images and other resources, many of which we don’t “need” for the data and which slow down our process.
Selenium requires a lot of memory and isn’t tremendously stable, which often means it’s tough to run on an EC2 instance or another virtual machine for automatic scraping at regular intervals.

As an alternative, instead of pretending to be a user with a keyboard and mouse, we should pretend to be a browser client that only knows how to send and receive messages. Essentially, we will drop down to the actual HTTP exchange layer.

Scraping via pure requests is more abstract and requires us to understand a bit more about the client/server protocol, but it reduces the process to a simple set of back-and-forth transactions: we send a message, record the response, rinse and repeat.

Ironically, the “smarter” and more interactive a page is in terms of client-side code, the more likely that it’s backed by simple HTTP endpoints that the scripts (and in many cases, native apps) are talking to, and which we can also contact for data directly.

In contrast, ASP dashboards are difficult not because the underlying software is sophisticated, but because it was designed and implemented before developers standardized around a more expressive use of these APIs.

Understanding HTTP

MDN has a reasonably good guide to the HTTP protocol if you’re interested in the details, but for our purposes we can think of it as a simple call and response. Every time you load a page, the browser sends a request to the server, and receives a response in return. This may trigger additional requests (e.g., a page may contain images, scripts, or stylesheets that the browser needs to download).

An HTTP request can be broken down into a few parts:

The URL path being requested from the server
A method “verb” expressing the action for the request, most commonly GET or POST.
Headers that add metadata to the message, including cookies and the preferred language.
An optional body section that can contain data being sent to the server.

Most of the time, in a browser, there’s no body and the request method will be GET, since we’re just asking to download a file. POST requests are used to upload to the server or make changes, and usually do include a body with the data that the user wants to submit (such as a file upload or the contents of a form).
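To make those parts concrete, here’s a small sketch that builds (but never sends) one request of each kind, using the Request class that’s available in browsers and in recent versions of Node. The example.org address and the form fields are placeholders, not a real service.

```javascript
// A GET: just a URL, a method, and some headers, with no body
var page = new Request("https://www.chalkbeat.org/", {
  method: "GET",
  headers: { "Accept-Language": "en-US,en;q=0.5" }
});

// A POST: the same kind of message, plus a body carrying form data.
// The URL and the field names here are placeholders.
var form = new Request("https://example.org/search", {
  method: "POST",
  headers: { "Content-Type": "application/x-www-form-urlencoded" },
  body: "district=01&year=2022"
});

// print the method, path, and whether a body is attached
for (var request of [page, form]) {
  var { pathname } = new URL(request.url);
  console.log(request.method, pathname, request.body ? "has a body" : "no body");
}
```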

Here’s a typical request made to Chalkbeat, for example. You can see the method, path, and headers. As noted, since it’s a GET, there’s no message body after the headers.

GET https://www.chalkbeat.org/ HTTP/1.1
Host: www.chalkbeat.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
DNT: 1
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: cross-site

The server processes the request and sends back a response that’s structured in a similar way. Responses don’t have a path or method, but they include a numerical status code indicating success or failure (such as 200 OK or the infamous 404 NOT FOUND). They also have headers for things like modification date, and they almost always include a body section containing the data we asked for in our request.

Here’s the raw server response to our request, trimmed for length:

HTTP/1.1 200 OK
Content-Type: text/html;charset=UTF-8
Connection: keep-alive
Date: Mon, 24 Oct 2022 14:28:23 GMT
Server: istio-envoy
strict-transport-security: max-age=31536000; includeSubdomains;
x-powered-by: Brightspot
x-envoy-upstream-service-time: 983
x-envoy-decorator-operation: brightspot-cms-verify.chalkbeat.svc.cluster.local:80/*
Vary: Accept-Encoding
X-Cache: Hit from cloudfront
Via: 1.1 b75f3304a39fe185ba1556322bdff970.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: ORD58-P2
X-Amz-Cf-Id: yvRz34BFe8NslGGux897su17I-QKBj-XRFGDCNj23Bjx8-pmqkaUlw==
Age: 45
Content-Length: 414837

<!DOCTYPE html>
<html class="Page" lang="en" data-sticky-header
>
<head>
(... body continues with the rest of the HTML)

We don’t typically look at raw HTTP requests (I had to install a proxy in order to do so). However, if you open up the dev tools and look at the network tab, you can see all of these in your browser in a more readable form:

“Headers” will show you the request method, response status code, and the headers that were sent by the browser and back by the server. Click “Response” to see the message body.

When scraping a website, although it’s tempting to look at the DOM inspector, it’s often easier to just go straight to the network tab and see what messages are being sent when interacting with the page (especially with the filter set to “Fetch/XHR”, which will only show requests made by client-side JavaScript). To me, this is similar to asking for machine-readable data in a FOIA instead of trying to OCR a document scan – the latter is possible, but it’s usually a lot more trouble.

How a normal site uses HTTP

A typical web application does almost all its interactions using the GET method (meaning, the browser wants to read data but doesn’t want to write it) and differentiates between views using the URL. As a refresher, here are the parts of a URL (image courtesy of MDN):

Different data views are usually expressed using either the path or query parameters. Either way, you have a specific URL that you can construct and request in order to scrape data out of the service. Since you’re not changing anything, you can just use GET instead of one of the other HTTP verbs, and you don’t have to send a request body to the server. This pattern – where a resource is made available at a given address regardless of other requests that have been made – is often referred to as RESTful. It’s popular because it’s also easy to implement, and friendly to caching (meaning that it’s cheap).
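As a sketch of what that looks like in practice, here’s how you might build those URLs for a scraper. The endpoint, path layout, and parameter names are all hypothetical; a real service will have its own, which you can usually read straight out of the network tab.

```javascript
// Placeholder endpoint, standing in for whatever the real service uses
var base = "https://data.example.org/api/schools/";

function districtURL(district, year) {
  // the path picks the resource...
  var url = new URL(`district/${district}`, base);
  // ...and the query parameters narrow the view
  url.searchParams.set("year", year);
  url.searchParams.set("format", "json");
  return url.toString();
}

console.log(districtURL(1, 2022));
// https://data.example.org/api/schools/district/1?year=2022&format=json
```

Because each address stands on its own, scraping a service like this is mostly a loop over the values you care about, with one GET per URL.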

How ASP pages use HTTP

The first thing you will notice when trying to scrape an ASP-based site is that you’ll see repeated requests made to the server for different data, but all using the same URL. In order to extract data from these, we have to learn to send our requests in a slightly different format.

For example, let’s load the NYC school-based expenditure report dashboard, then select District 1 to look at. The page will refresh, and the document request will look like this:

The request URL is the same as the page we’re on, but notice that the method is a POST, not a GET. That usually means that we’re sending data to the server in the request body.

We can see what we sent by clicking over to the Payload tab:

(many lines of __VIEWSTATE later…)

When we changed the drop-down to select the district, instead of simply requesting data from a new URL, the page actually made a form submission with a number of parameters corresponding to the dashboard settings and a unique value assigned to our viewing session (the __VIEWSTATE). If you inspect the page, you can actually find the hidden form inputs that it uses to generate this:

What this boils down to is that ASP pages are not inherently stateless the way that most HTTP endpoints are. Instead, the server is maintaining a session for you and each transaction has to match that session. You can imagine this as though the dashboard has its own browser tab open, identical to yours visually, and when you click a button, the page tells the server to perform the actual click in its tab, and then it sends the updated page back to you. If you try to make a request that doesn’t match the server’s idea of where you “are” on the site, it’ll usually just send the index page back to you as a fallback.

Luckily, even though this is a dizzyingly overcomplicated way to run a server, it’s not terribly hard to send these events from our side. We just need to make our own POST request that includes the values that the server embedded in the form, updated with the correct input parameters. Step-by-step, that process usually looks something like this:

1. Get the initial page, and pull out the hidden input values into a state object. We could be more discriminating, but the easiest way to do this for ASP is to just query for inputs with type="hidden" set on them, and create an object out of their name and value attributes.
2. Update the filter values in that object. Usually I find these values by poking around in the network tab and seeing what changes when I update the form – this form input produces that form value in the POST submission.
3. POST that object to the server and pull data out of the response.
4. (optional) Update our state object to match the response, then repeat from step 2.

You can see an example scraper for a couple of NYC finance pages in this Gist. Of particular note is the getASPValues function, which extracts the hidden form inputs:

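var cheerio = require("cheerio");

// Load a page and collect its hidden form inputs into a plain object
async function getASPValues(target) {
  var response = await fetch(target);
  var html = await response.text();
  var $ = cheerio.load(html);
  var data = {};
  // ASP keeps its state (such as the view state) in hidden inputs
  var inputs = $(`input[type="hidden"]`);
  for (var i of inputs) {
    // inputs without a value attribute become empty strings
    data[i.attribs.name] = i.attribs.value || "";
  }
  return data;
}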

There are also some values that I snooped from the network tab, such as the __EVENTTARGET (which tells the server that we changed the district drop-down):

const sberFormValues = {
  __EVENTTARGET: "ctl00$ContentPlaceHolder1$Input_District",
  ctl00$ContentPlaceHolder1$reportnumber: 1,
  ctl00$Fiscal_Year: "SELECT_A_YEAR"
};

We can then combine our ASP form values, our event constants, and our specific filter setting into a form submission and POST it to scrape a given district page:

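// SBER (the dashboard URL), headers, and formEncode are defined elsewhere in the scraper
var getSBER = async function(asp, district) {
  // later keys win, so our constants and filter override the scraped values
  var data = {
    ...asp,
    ...sberFormValues,
    // districts are submitted as two-digit strings, such as "01"
    ctl00$ContentPlaceHolder1$Input_District: String(district).padStart(2, "0")
  };
  // generate a form-encoded POST body
  var body = formEncode(data);
  var response = await fetch(SBER, {
    method: "POST",
    headers,
    body
  });
  var html = await response.text();
  var $ = cheerio.load(html);
  // each matching row becomes an array of the text from its child nodes
  var rows = $(".CSD_HS_Detail");
  var scraped = [];
  for (var row of rows) {
    var { children } = row;
    var cells = children.map(c => $(c).text());
    scraped.push(cells);
  }
  return scraped;
}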

This code is in JavaScript, but the equivalent in Python using BeautifulSoup is pretty straightforward. I recommend using Requests to send your POST, since it will automatically encode the data for you and set the correct header to match the form content type. The Gist linked above also includes a port of the scraper to Python for comparison.
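On the JavaScript side, that encoding step has to be done by hand. Here’s a minimal sketch of what helpers like formEncode and headers might look like, assuming the built-in URLSearchParams class does the escaping. These are stand-ins written for this example, and the versions in the Gist may differ.

```javascript
// Hypothetical stand-ins for the helpers that the scraper relies on
function formEncode(data) {
  // URLSearchParams handles the escaping: "$" becomes %24, and spaces become "+"
  return new URLSearchParams(data).toString();
}

// tells the server that the body is an ordinary form submission
var headers = {
  "Content-Type": "application/x-www-form-urlencoded"
};

var body = formEncode({
  ctl00$ContentPlaceHolder1$Input_District: "01",
  ctl00$Fiscal_Year: "SELECT_A_YEAR"
});

console.log(body);
// ctl00%24ContentPlaceHolder1%24Input_District=01&ctl00%24Fiscal_Year=SELECT_A_YEAR
```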

Although it seems complicated compared to typical scraping tasks, once you’ve written a few of these and the building blocks are more familiar, you’ll be able to write ASP scrapers very quickly. And as you become accustomed to working through the network protocol, not the UI level, you’ll be impressed by how much of this same technique can be adapted to other scraping tasks, including script-generated data views and native applications.

Updates to the Dailygraphics Rig

2022-08-18T00:00:00+00:00

I didn’t choose to join Chalkbeat solely because the team was already using the open-source tools I had written at NPR, but I have to admit that it did make the transition a lot easier. One of the virtues of a successful open-source project is that it makes knowledge transferable. While the NPR Interactive Template and Dailygraphics Rig are not de facto standards, they’re common enough throughout the industry (and similar enough to related tools from other newsrooms) that learning them has benefits.

Of course, the other virtue of open-source software is that improvements don’t have to be limited to just the originating team — they can also come from users, and then make their way out to the broader community when they’re merged upstream. To that end, we’ve been working at Chalkbeat on adding new capabilities to Dailygraphics Next, making it an even more powerful tool for building charts and graphs on the web.

Supporting Google Docs and ArchieML

Historically, the Dailygraphics workflow has loaded text in from spreadsheets along with any other data. This works for fairly small and uniform pieces of content, like headlines and chatter, but it’s awkward for any substantial amount of text.

Here at Chalkbeat, we do most of our traditional charts through Datawrapper, falling back on the rig for graphics that don’t fit into a standard template, prioritize more interaction, or (in the case of our NYC COVID policy flowchart) both. That means it has become more common for us to need structured text, such as in our school board voter guides.

We added Docs support (with built-in ArchieML parsing via Betty) as a “secret” feature — it’s not built into the default UI the way that Sheets are, but can be added to a template or a graphic manifest to enable a similar “open document” button on any graphic. Once added, you can access the raw file via TEXT.raw in your HTML templates, and the parsed ArchieML object will be available as TEXT.parsed.

There are of course advantages in terms of structure — ArchieML makes it possible to build out freeform and nested data structures in a way that a spreadsheet can’t. It also gives us better options for migrating graphics to or from the Interactive Template, which added Docs support a few years back. But the real win for this feature is how much easier it is to edit and collaborate, since reporters and editors can add specific comments or track changes much more easily than in a cramped grid cell. If you’re doing any kind of narrative presentation in a graphics embed, it’s worth checking out!

Infrastructure upgrades

Last year, new users of Dailygraphics started to notice warnings in the console for the mold-source-map package — nothing fatal, but alarming. This package turns out to be part of the venerable Browserify bundler, which the rig used to combine and compile client-side scripts. One of the earliest script compilation tools from the modern JavaScript tooling era, Browserify was clearly suffering from some developer neglect compared to more modern tools like Webpack or Rollup.

The Dailygraphics rig does have slightly different needs from most front-end projects, which means we can’t just pull in off-the-shelf bundler configuration. For one thing, it’s building an arbitrary set of bundles based on the contents of the graphics repo, not a single well-known output. It also needs to compile dependencies from modules in a non-standard location, since the graphics are kept outside the rig itself. And it should be able to do so entirely in-memory, without producing files on disk, since Dailygraphics Next doesn’t store any of its compiled artifacts locally.

After some experimentation, we were able to move pretty seamlessly from Browserify to Rollup as a bundler, without breaking compatibility with older graphics at either Chalkbeat or at NPR. This change should put the rig onto firmer ground moving forward, including creating an easier path for moving to the newest versions of D3 (which are only distributed using the ES module import syntax that Rollup natively supports).

While we were in the guts of the rig’s dependencies, we were also able to update some other infrastructure: it now uses the newest version of the AWS client libraries, which we hope will make it more reliable when publishing and synchronizing graphics.

CSV and offline support

If you’ve ever had to set up a fresh newsroom dataviz toolchain, you know that one of the most painful parts of the process is getting authorization right. While the Google integration of the Dailygraphics rig is one of its biggest selling points, it’s also a cryptic trudge through OAuth tokens, environment variables, and a constantly changing service console. The need to authorize against a Google account also means that even with aggressive caching, it’s often difficult to use the rig if you’re traveling or on a less-reliable network connection.

Spurred by a summer of travel and relocation, team member Kae Petrin and I decided to take this opportunity to give the rig the ability to work with local data instead of pulling from Sheets. We did so by adding CSV support to its data pipeline based on a new manifest key.

If you want, you can now create graphics templates that never talk to Google at all, just by swapping references to the COPY variable in the HTML layer over to the new CSV object. This also plays nicely with the rig’s synchronization support: large data files can be excluded from source control, synchronized to S3, processed externally, and accessed as local CSV.

Hand-in-hand with this functionality, there’s now a new --offline flag for the rig that disables the authentication checks it normally performs. This “airplane mode” will still reach out to the Google Drive API if you load a graphic that includes a "sheet" or "doc" key in its manifest, but it won’t ever redirect you to the “authorize account” screen if your connection drops momentarily, making it a handy option for developers who want a fully local workflow, or who need the rig to just back off a little bit while on the road.

Join us on the cutting edge (or not)

Do these additions sound interesting to you? If so, feel free to pull from our Dailygraphics fork, which is under active development. However, if you’d prefer a little more vetting, many of these features have already been merged upstream into the main NPR version of the rig, where they’re tested by the team there before being enabled. At the time of this writing, the Rollup bundler and AWS upgrades have been merged in, and Docs support is in a branch undergoing testing.

Utilization, iteration, and visualization

2022-08-12T14:41:58+00:00

Public schools around the country have seen declining enrollment, and small schools spend far more per student, making them a financial strain on districts. However, school closure decisions are controversial and often disproportionately impact marginalized communities.

One school district facing declining enrollment and tough school closure decisions is Jeffco Public Schools, and Chalkbeat Colorado wanted to help local readers understand how those decisions would be made, and how their school specifically would be affected.

Chalkbeat reporter Yesenia Robles obtained data on enrollment and utilization (the percent of a building’s capacity that’s in use) from the district and identified five key takeaways from the data, which we then set out to visualize.

One of the takeaways — that schools with lower enrollment and utilization tend to have a higher concentration of students living in poverty — presented a series of challenges. First, there were too many schools to visualize all of them individually, so I tried out this graphic presenting averages.

However, explaining what this means gets convoluted quickly. Let’s break down what’s going on in those grouped bars:

Low (schools with less than 60% utilization and enrollment under 250)
- Average utilization: 50%
- Students eligible for free and reduced price lunch: 50%
High (schools with more than 60% utilization and enrollment over 250)
- Average utilization: 80%
- Students eligible for free and reduced price lunch: 23%

It shows the disparity between high and low utilization schools, but you really have to think about what those bars are showing — it’s a lot of numbers to keep track of.
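
For anyone who wants to reproduce that kind of grouped average, here is a minimal sketch of the arithmetic. The rows are invented (picked only so the averages land on the figures listed above), the field names are hypothetical, and it takes a simple per-school mean rather than weighting by enrollment, which may not match how the district data was actually summarized.

```javascript
// Hypothetical rows; the real district data is not reproduced here.
const schools = [
  { name: "A", enrollment: 180, utilization: 45, frl: 62 },
  { name: "B", enrollment: 220, utilization: 55, frl: 38 },
  { name: "C", enrollment: 410, utilization: 85, frl: 20 },
  { name: "D", enrollment: 380, utilization: 75, frl: 26 },
  { name: "E", enrollment: 300, utilization: 50, frl: 40 }, // fits neither group
];

// Thresholds taken from the group definitions listed above.
function groupOf(school) {
  if (school.utilization < 60 && school.enrollment < 250) return "low";
  if (school.utilization > 60 && school.enrollment > 250) return "high";
  return null; // mixed cases are left out of both averages
}

const mean = (values) => values.reduce((sum, v) => sum + v, 0) / values.length;

// Builds one summary row per group: school count plus the two averages.
function summarize(rows) {
  const result = {};
  for (const group of ["low", "high"]) {
    const members = rows.filter((row) => groupOf(row) === group);
    result[group] = {
      count: members.length,
      avgUtilization: mean(members.map((m) => m.utilization)),
      avgFrl: mean(members.map((m) => m.frl)),
    };
  }
  return result;
}

const summary = summarize(schools);
console.log(summary);
```

Even in code, the result is four percentages across two groups, which is the same bookkeeping problem the chart hands to readers.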

To avoid the issue of having so many percentages floating around, I tried a different approach, presenting the number of schools in each category serving mostly-FRL-eligible or mostly-not-FRL-eligible students.

But, as bureau chief Erica Meltzer pointed out, that high bar on “high utilization” paired with the headline on the chart makes it look like it’s showing a lot of students on free and reduced price lunch if you’re not reading every label carefully.

That brought me back to my original idea, which used a bullet chart to pair the two metrics together. It was way too long to be meaningful or useful, especially since most of our audience reads our stories on mobile devices, so I hadn’t even bothered to send it to Yesenia and Erica at first.

Sharing this bad graphic paid off, because it clearly captured the disparity we were interested in, and we were able to use it as a jumping-off point to discuss how to pick representative schools and make a more manageable version. We settled on showing only the five schools with the highest utilization rates and the five with the lowest, and that is the version we ultimately published.
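
The selection step itself is small. As a rough sketch, assuming each school is an object with a `utilization` field (a hypothetical name, as is `pickExtremes`), it amounts to a sort and two slices:

```javascript
// Returns the n lowest and n highest rows by utilization, without mutating the input.
function pickExtremes(rows, n = 5) {
  const sorted = [...rows].sort((a, b) => a.utilization - b.utilization);
  return {
    lowest: sorted.slice(0, n),
    highest: sorted.slice(-n).reverse(), // highest first
  };
}

// Hypothetical utilization figures, for illustration only.
const sample = Array.from({ length: 12 }, (_, i) => ({
  name: `School ${i + 1}`,
  utilization: 40 + i * 5,
}));

const picks = pickExtremes(sample);
console.log(picks.lowest.map((s) => s.name));
console.log(picks.highest.map((s) => s.name));
```

If there are fewer than ten rows, the two lists will overlap, so a real version would want to guard against that.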

Mocking up several ideas and trying out different approaches — even ones that were obviously not going to be good final choices — helped to jumpstart the brainstorming process and land on a better visualization faster. The newsroom’s subject matter expertise ensured that the graphic we settled on was methodologically appropriate and easy to understand, and frequent communication prevented me from spinning my wheels in less useful directions.

How we turned an elementary school curriculum challenge into a dataset

2022-05-23T14:15:00+00:00

Last year, Tennessee enacted a law intended to restrict K-12 classroom discussions about the legacy of slavery, racism, and white privilege. Then Williamson County Schools, a district south of Nashville, received forty complaints about texts in its K-5 English language arts curriculum.

The district later released a 113-page report that detailed the challenges, their outcomes, and the decision committee’s reasons.

Chalkbeat reporter Marta W. Aldrich found two Williamson parents, numerous experts, and teachers to dig into the county’s curriculum debate, which often circled the idea that some texts were simply “age-inappropriate” to teach in early elementary school. These voices captured the human aspect of the story, as well as the conversations around policy and educational practice.

But we also wanted to analyze the complaints themselves. It was a matter of fact-checking, as well as adding crucial context. A giant report like this is rich with data, if you can decide how to convert it from plain text into a structured dataset.

We wanted to quantify: Who has challenged the books? Why? What were common themes and phrases? Did the complaints’ contents actually line up with what experts and parents are observing in the story?

The answers became the introductory visualization in our final story, How the age-appropriate debate is altering curriculum in Tennessee and nationwide.

The viz juxtaposed concrete data against the euphemistic talking points that have become common in covering curriculum debates. This analysis equipped readers with the knowledge of what, exactly, sources were talking about (or around) in the following story.

Working backwards from tables and semi-structured data

The report offers a few obvious starting points: data viz! Tabular data lets us start building a dataset of books challenged, the grades they’re taught in, and the challenge outcome.

The report also contains a few other tables, including breakout aggregates of the outcome results by status and a list of books that were adjusted or removed.

Even better, each individual book has a full page evaluation report, which we can use to add more detail to the committee’s decision-making process. Particularly interesting: categories indicating whether books were on topic, valuable for educational purposes, or objectionable. This also allowed us to fact-check the tabular data up above. (We caught two discrepancies and were able to reconcile them by comparing the tables to the individual report pages.)

All of these are great! They are the starting blocks for a structured dataset that contains some basic information on the challenges and their outcomes. But, overall, it doesn’t tell us much about what’s actually in the books, why parents filed complaints, or how the committee evaluated the texts.

Finding themes

Instead, we started looking at the text itself. There are programmatic ways to do this using keyword extraction, but it was simple enough to manually catalog thirty books laid out in one consistent document format.

We debated digging into the more detailed complaints, which are appended to the main summary documents. The detailed complaints had a few advantages: they were original written complaints, rather than summaries or selected quotations filtered through the appeal board, and they contained more expansive comments about specific pages or lines.

However, when we attempted to analyze those complaints, they were (unsurprisingly, for a public comment process) often unclear or vague about their objections. Many addressed teachers’ materials that aren’t published with the books; we could not consistently access those. Some complaints just seemed random or irrelevant. For instance, one complaint noted, “It’s all about changing seasons & Chamaeleon,” to which the committee responded, “The committee does not share the concern of the complainants.” Another simply read “CA,” to which the committee responded, “The committee does not understand this concern.”

In other words, lots of junk data.

So instead, we focused on the committee’s summaries. Though they carried the downside of an added filter, the committees did directly quote the complaints, and they had already done the work of identifying items relevant to the curriculum.

I added booleans for keywords as we encountered them — “agenda,” “conditioning,” “oppression” — and summarized other themes or concepts that materialized — “negative about white people,” “age inappropriate,” “dark or scary.” A few themes showed up once (“boring to boys”) but never came up again, whereas others reappeared consistently.
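
We did this cataloging by hand, but the keyword columns are the part that could be approximated in code. The sketch below is not what we ran: `flagKeywords` is a hypothetical helper, the sample sentence is invented, and the summarized themes (like “age inappropriate”) were judgment calls that a simple text match would not reproduce.

```javascript
// Keywords quoted in the paragraph above.
const KEYWORDS = ["agenda", "conditioning", "oppression"];

// Returns one boolean per keyword for a single block of summary text.
function flagKeywords(text, keywords = KEYWORDS) {
  const lower = text.toLowerCase();
  const flags = {};
  for (const keyword of keywords) {
    // \b anchors the match to the start of a word, so "agenda" also catches "agendas".
    flags[keyword] = new RegExp(`\\b${keyword}`).test(lower);
  }
  return flags;
}

// Invented summary text, for illustration only.
const flagged = flagKeywords(
  "Complainants said the text pushes an agenda and teaches oppression."
);
console.log(flagged);
```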

We also tracked themes in the outcomes sections. I created several text and note sections, then boiled them down into a few more booleans based on the committee’s defense of each book (“historical value” or “taught in context”) and the ultimate outcomes (“language changes during teaching,” “omit pages,” “make counselors available”).

With a full dataset of trends in front of us, it was easier to see patterns in the complaints.
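
Once every theme is a boolean column, spotting those patterns is mostly counting. Here is a minimal sketch, assuming one object per book with a true/false field per theme. The rows and column names are invented stand-ins, not our actual dataset.

```javascript
// Invented rows standing in for the hand-built dataset: one object per book,
// with a boolean column per theme.
const books = [
  { title: "Book 1", agenda: true, ageInappropriate: true, darkOrScary: false },
  { title: "Book 2", agenda: false, ageInappropriate: true, darkOrScary: true },
  { title: "Book 3", agenda: false, ageInappropriate: true, darkOrScary: false },
];

// Counts how many rows have each boolean column set to true,
// sorted from most to least common.
function tallyThemes(rows) {
  const counts = {};
  for (const row of rows) {
    for (const [column, value] of Object.entries(row)) {
      if (value === true) counts[column] = (counts[column] || 0) + 1;
    }
  }
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}

const tally = tallyThemes(books);
console.log(tally);
```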

This also gave us an easy plain-text way to send the findings to reporters and editors. They were able to review what we left out, then recommend including or excluding other items based on how the story’s reporting developed.

As I decided what to highlight, I asked a few questions:

What’s the basic background that we need to add?

This became the books broken out by grade level, an overview of the residents who filed the complaints, and the ultimate outcomes.

What information is necessary to understand the broadest themes of the data?

This became a few of the filters, including some elements that weren’t foregrounded in the main story. For instance, sexual themes and violence didn’t really come up in the reporting, but the complaints mentioned them regularly, and they often coincided with complaints about “age inappropriate” discussions of race — so it seemed important to include.

These themes, taken as a group, also starkly emphasized the difference between common explanations for book bans and the actual content of the challenges against the books. For instance, books challenged specifically for containing “critical race theory,” being age-inappropriate, or being racist included two historical accounts of Ruby Bridges attending an all-white elementary school as a black child in New Orleans at age 6. (One of those books was her autobiography.)

The Williamson County committee summarized what they heard from parents in the hearing: “Themes of segregation and racism are not appropriate for this age group or grade level in which 7–8-year-olds do not yet have the maturity or capacity to think critically.” A book about Sylvia Mendez’s elementary school experiences received similar commentary.

Though we couldn’t capture all that detail in the viz, it was important to describe the core: “Parents accused three of those books of being racist because they depicted racism. Those books each recounted historical fights to desegregate schools.”

What details are interesting or notable within the context of the story?

We ended up excluding a lot of the details on how the committee justified their decisions, because they just weren’t that remarkable on their own. But the recommendation to make counselors available struck us as worth including, especially because the books that received that recommendation were somewhat surprising.

We also highlighted some rarer complaints. For instance, books depicting the U.S.’s historic mistreatment of Native Americans and immigrants were accused of being “anti-American,” which only came up in relation to three books. But these phrases seemed particularly relevant to broader conversations about race and how it is taught, so we included them.

We also ended up ignoring flags that might have seemed initially like they should be included. For instance, the phrase “critical race theory” seems like it would be relevant to a story on curriculum restrictions for “critical race theory.” However, the specific phrase only came up four times. And in two instances, it was used in confusing (and arguably definitionally incorrect) ways. Most complaints used vaguer language about “hate” and “division” to refer to similar ideas.

Wrapping it up

At its core, this is just due-diligence reporting. Plus, treating the review outcomes as a dataset gives us a systematic way to look at a curriculum challenge. Though there are ways to automate this kind of keyword search, it didn’t take long to read the entire report.

But it’s also important to take these extra steps to dig into the actual content of the texts. A lot of coverage of curriculum debates has leaned heavily on soundbites, without exploring the exact educational materials being challenged under euphemisms like “age-inappropriate” or “divisive.” By digging into those complaints as data, we hope to model a more accurate methodology for culture-war reporting.

Expressive rendering and UI in Tarot

2022-03-14T00:00:00+00:00

It is a truth universally acknowledged that a newsroom in possession of a data team must be in want of a social card generator.

This kind of tool is an easy win if you’re just starting out in a strange newsroom: it’s immediately useful, highly visible, not terribly difficult to build, and it helps make inroads with everyone from the engagement team to the managing editors. Vox has one, as does NPR. Politico rewrote NPR’s Quotable from scratch for theirs. It was one of my first projects at the Seattle Times, which then got a Svelte adaptation at the Star-Tribune.

While they’re easy to write, building a social card generator that’s future-proof is less straightforward. For example, the Seattle Times tool places and renders the card’s contents using hand-written JavaScript, which means it’s very fast and can do some fun interactive image positioning tricks, but additional layouts need to be manually written (and so they never were). By contrast, NPR’s Lunchbox tool renders a styled block of the HTML document to an image using an external library. That makes it easy to create new looks using CSS, but there’s still only really one combination of text elements, and you’re at the mercy of the rendering library in terms of display support.

To build Chalkbeat’s social card generator, Tarot, I wanted something that offered flexibility without sacrificing control. The resulting architecture supports multiple templates with customized color themes, including any number of content blocks and customized design elements. It also automatically provides alt text for all its templates, to make it easier to publish accessible quotes. All of this is accomplished by designing a domain-specific language of custom elements that can both render to a canvas and provide an interface for users to design the card.

Painting with brushes

The fundamental building block in Tarot is a “brush” class that elements inherit from. It provides some utility functions, like unpacking padding strings and converting normalized coordinates into pixels. But its most important job is to provide two stub functions that subclasses must override: getLayout() and draw().
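
To make that concrete, here is a simplified sketch of what such a base class might look like. It is not Tarot’s actual source: the real brushes are custom elements, while this version is a plain class so it can run anywhere, and the method names `unpackPadding` and `denormalize` are hypothetical. Throwing from the stubs is also an assumption about how “must override” is enforced.

```javascript
// Simplified stand-in for the base class (a plain class, not a custom element).
class Brush {
  // Expands a CSS-style shorthand ("10", "10 0", "0 0 20", "1 2 3 4")
  // into [top, right, bottom, left] pixel values.
  static unpackPadding(padding = "0") {
    const [top, right = top, bottom = top, left = right] = padding
      .trim()
      .split(/\s+/)
      .map(Number);
    return [top, right, bottom, left];
  }

  // Converts normalized (0 to 1) coordinates into pixels for a given canvas size.
  static denormalize(x, y, canvas) {
    return [x * canvas.width, y * canvas.height];
  }

  // Subclasses must override both of these stubs.
  getLayout(context) {
    throw new Error(`${this.constructor.name} must implement getLayout()`);
  }

  draw(context, config) {
    throw new Error(`${this.constructor.name} must implement draw()`);
  }
}

console.log(Brush.unpackPadding("0 0 20"));
console.log(Brush.denormalize(0.5, 0.5, { width: 1200, height: 630 }));
```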

Templates for each card are written as a set of brush elements in an HTML file, and the application splats these into a form element when loading the layout. To render a card, Tarot loops through the form element’s children and calls draw() on each child, passing in the 2D context and a config object setting the color theme. There are brush elements for text, colored rectangles, special logos, and user-customized photos.
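
In rough terms, the render pass can be as small as the sketch below. The fake context, the theme object, and the two stand-in brushes are all hypothetical, included so the loop can run without a browser. Clearing the canvas first is an assumption rather than something confirmed from Tarot’s source.

```javascript
// Calls draw() on each top-level brush in template order.
// In the browser, `children` would be the form element's children.
function render(children, context, config) {
  context.clearRect(0, 0, context.canvas.width, context.canvas.height);
  for (const brush of children) {
    brush.draw(context, config);
  }
}

// Stand-in for the 2D context that records calls instead of painting.
const calls = [];
const fakeContext = {
  canvas: { width: 1200, height: 630 },
  clearRect: (...args) => calls.push(["clearRect", ...args]),
  fillText: (...args) => calls.push(["fillText", ...args]),
};

const theme = { text: "#222222", textAlt: "#006677" }; // hypothetical color theme
const brushes = [
  { draw: (ctx, config) => ctx.fillText("BREAKING NEWS", 600, 31) },
  { draw: (ctx, config) => ctx.fillText("Headline", 240, 315) },
];

render(brushes, fakeContext, theme);
console.log(calls.map((call) => call[0]));
```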

Here’s the template for drawing a simple breaking news alert:

<text-brush
  anchor="top center"
  color="textAlt" bold
  x=".5" y=".05"
  padding="10 0"
  size="36" noform
  id="eyebrow"
  value="BREAKING NEWS"
></text-brush>

<image-brush
  src="./assets/Chalkline-teal-dark.png"
  x=".5" y=".15" width="180"
  recolor="textAlt"
></image-brush>

<text-brush
  id="headline"
  size="60" bold
  anchor="middle left"
  x=".2" y=".5" width=".6"
  value="This is the headline of a breaking news story, which spans multiple lines."
>Headline</text-brush>

<logo-brush x=".5" y=".9" color="text"></logo-brush>

A card generated from the above markup.

Each brush is positioned and configured using attributes, including a default value. Most coordinates are expressed as normalized values from zero to one, so that we’re not locked into a single canvas size, although “aesthetic” settings like padding and image width are in pixels. It’s not dissimilar to working in a format like SVG, which is intentional: markup like this should be readable (and editable) by designers and other non-development team members, in case we want to add new templates or tweak an existing setup.

Remember, however, that we only loop through the children of the form element when rendering. This gives us the ability to define wrapper elements that delegate rendering to their children in specific ways, such as our vertical spacer and stack brushes. When their draw() method is called, these brushes use the getLayout() methods to find the heights and widths of their child brushes, then set the canvas transform before calling draw() on each child to create “flex” layouts (with equal spacing between items) and linear text sequences, respectively.
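
The positioning math behind those two wrappers is simple once the child heights are known. This is a sketch of the idea only: `stackOffsets` and `spacerOffsets` are hypothetical names, the heights are made up, and treating the spacer like CSS `space-between` (no gap before the first or after the last item) is my shorthand for “equal spacing between items.” The real brushes get their measurements from getLayout() and apply the offsets as canvas transforms.

```javascript
// Stack: children follow one another with no gaps.
function stackOffsets(heights) {
  let y = 0;
  return heights.map((height) => {
    const top = y;
    y += height;
    return top;
  });
}

// Spacer: leftover space is split equally between children.
function spacerOffsets(heights, available) {
  const used = heights.reduce((sum, h) => sum + h, 0);
  const gap = heights.length > 1 ? (available - used) / (heights.length - 1) : 0;
  let y = 0;
  return heights.map((height) => {
    const top = y;
    y += height + gap;
    return top;
  });
}

// Made-up heights for a series logo, a quote block, and a footer logo.
console.log(stackOffsets([100, 300, 80]));
console.log(spacerOffsets([100, 300, 80], 630));
```

Each offset is the top edge of a child, so the wrapper can translate the canvas by that amount before calling the child’s draw().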

Here’s a quote card template, which combines two wrapper elements to space blocks equally apart, while stacking the quote, horizontal rule, and attribution in the middle:

<vertical-spacer padding="20">

  <series-logo id="series" color="accent" x=".5"></series-logo>

  <vertical-stack dx="20" anchor="top" x=".1">
    
    <text-brush
      id="quote" quoted wordcount
      size="60" width=".8"
      padding="0 0 20"
      value="Insert quote text here."
      >Quotation</text-brush>

    <image-brush
      recolor="accent" align="left"
      src="./assets/Chalkline-teal-dark.png"
      ></image-brush>
    
    <text-brush
      id="name" size="48"
      bold color="textAlt"
      padding="12 0 4"
      value="Firstname Lastname"
      >Attribution</text-brush>
    
    <text-brush
      id="title" size="36"
      italic color="textAlt"
      value="Title/affiliation/etc."
      >Title</text-brush>

  </vertical-stack>

  <logo-brush x=".5" align="top" color="text"></logo-brush>

</vertical-spacer>

Vertical spacers and stacks ensure that the quote text block is always vertically centered in the available space, regardless of whether a series logo is present or if the Chalkbeat logo includes a bureau tag.

Although this pattern seems obvious in retrospect, it was a relatively late addition. I had originally designed brushes to have a “follows” attribute, in which each element would use another as an anchor for positioning. Getting this to work required a lot of duplicated code, especially as different brushes would size themselves in different ways or need to be vertically centered against multiple other blocks. When I realized that a wrapper could handle all of those cases in a single place, it radically simplified the code — and the markup inside the templates.

Form begets function

While the HTML templates mimic the display tree of an SVG, we also need a way for application users to add the text or image associated with each brush, or it’s not much use to us. However, this is where custom elements can truly shine, since they can display their own UI via their shadow DOM. A text brush, for example, provides a <textarea> and a word count display (unless the noform attribute is set). As a result, the template for each layout in Tarot isn’t just an abstract display tree – it’s also the visible form that users interact with in order to generate the final card.

Everything on the right between the theme selector and the download button is generated by the display template.

Brush elements are responsible for monitoring their own UI and dispatching “update” events when those change (usually using invalidate(), which is inherited from the Brush base class). Tarot watches the form element for these events, and schedules a render when it sees them come in. Wrapper elements place their children in a <slot> so that they’re still in the light DOM, and the events bubble up normally.
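
Here is one way that event-to-render handoff could be wired up, as a sketch under a few assumptions: `watchForUpdates` is a hypothetical name, the scheduler is passed in (in a browser it would typically be requestAnimationFrame), and collapsing several updates into one render is my guess at what “schedules a render” implies. A bare EventTarget stands in for the form element.

```javascript
// Coalesces any number of "update" events into one render per scheduled frame.
function watchForUpdates(target, render, schedule) {
  let pending = false;
  target.addEventListener("update", () => {
    if (pending) return; // a render is already queued
    pending = true;
    schedule(() => {
      pending = false;
      render();
    });
  });
}

// Stand-ins: an EventTarget for the form, and a manual queue in place of
// requestAnimationFrame so the example can run outside a browser.
const form = new EventTarget();
const queue = [];
let renderCount = 0;

watchForUpdates(form, () => renderCount++, (callback) => queue.push(callback));

form.dispatchEvent(new Event("update"));
form.dispatchEvent(new Event("update")); // same frame: no second render queued
queue.shift()(); // simulate the frame firing
console.log(renderCount);
```

In the real DOM, the brush would dispatch its event with `bubbles: true` so that a single listener on the form hears every child.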

The result is a UI that largely runs itself. The code that loads templates asynchronously and handles this render loop is currently a relatively unstructured (but short) “top-level” module. Individual elements are loaded from files that contain just the rendering code and UI templates for a particular brush, making them nicely encapsulated (no giant, monolithic render() here) and potentially testable. A separate definitions module sets up color themes and provides utility functions for accessing them.

At times, building a canvas app this way felt like cheating. Coupling the render layer with its configuration in shadow DOM means that there’s no need to worry about keeping the UI in sync with the design – if you place a new item in the display tree, it’ll add form elements to match automatically. It’s hard for me to think of another front-end framework that could pair both sides so easily.

Tarot reading

One weakness of social cards, and of social media promotion in general, is that they’re often inaccessible: putting text into images to work around character limits or pull in readers runs the risk of excluding screen readers. I wanted to make sure that we avoided that, so all brush elements in Tarot have an “alt” getter property that returns their text contents, if any. Wrapper elements return the combined alt text of all their children. The resulting block is shown at the bottom of the UI, right under the download button.
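
The getter pattern looks roughly like this. These are plain stand-in classes rather than Tarot’s actual elements, and joining the children’s text with newlines is an assumption about how the pieces get combined.

```javascript
// A brush with text contents returns them as its alt text.
class TextBrush {
  constructor(value) {
    this.value = value;
  }
  get alt() {
    return this.value.trim();
  }
}

// A purely decorative brush has nothing to contribute.
class RectBrush {
  get alt() {
    return "";
  }
}

// A wrapper combines the alt text of its children, skipping empty ones.
class WrapperBrush {
  constructor(children) {
    this.children = children;
  }
  get alt() {
    return this.children
      .map((child) => child.alt)
      .filter(Boolean)
      .join("\n");
  }
}

const card = new WrapperBrush([
  new TextBrush("Insert quote text here."),
  new RectBrush(),
  new WrapperBrush([new TextBrush("Firstname Lastname")]),
]);
console.log(card.alt);
```

Because wrappers only read `alt` from their children, they nest without any extra bookkeeping.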

Still, although Tarot solved a lot of problems, at times its DOM markup fell into an uncanny valley where it looked like HTML or SVG, but didn’t act that way. It was definitely the right decision to support CSS-style position strings for padding, where you can specify up to four values for the top, right, bottom, and left sides. But mimicking CSS also meant realizing how easy it is to take browser layout for granted. I don’t know if I’ve ever appreciated flexbox as much as when I had to build my own version.

The final interesting aspect of Tarot is the way it neatly highlights the value of class inheritance. In JavaScript, classes have always gotten a skeptical treatment, and parts of the community have certainly leaned more into the functional side of the language. However, being able to share functionality across brush elements – and to create specialized subclasses for image recoloring in the logo brushes – shows that there’s a place for classical inheritance even in an application of this size.

For example, the elements for images, the Chalkbeat logo, and series logo selection all need to recolor their rendering to match the theme, so that a purple card in the “Taro” theme doesn’t have an illegible green series logo on top. By designing the image brush to do this recoloring, the logos are able to simply inherit that functionality while overriding the UI and drawing functions for their particular use cases.
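
Stripped down to its skeleton, the inheritance looks something like the sketch below. To keep it runnable without a canvas, these stand-in classes work on a flat RGBA array (the same layout as canvas ImageData) instead of taking a 2D context, and the recoloring approach, the theme object, and its colors are all assumptions rather than Tarot’s actual code.

```javascript
class ImageBrush {
  // Replaces the color of every pixel with the given color, keeping alpha.
  recolor(pixels, [r, g, b]) {
    const out = Uint8ClampedArray.from(pixels);
    for (let i = 0; i < out.length; i += 4) {
      out[i] = r;
      out[i + 1] = g;
      out[i + 2] = b;
      // out[i + 3] (alpha) is left untouched
    }
    return out;
  }

  draw(pixels, theme) {
    return this.recolor(pixels, theme.accent);
  }
}

class LogoBrush extends ImageBrush {
  // Overrides drawing to use a different theme color, but inherits recolor().
  draw(pixels, theme) {
    return this.recolor(pixels, theme.text);
  }
}

// Hypothetical theme colors and a two-pixel green "logo" (one opaque, one transparent).
const taroTheme = { accent: [90, 40, 140], text: [255, 255, 255] };
const greenPixels = [0, 128, 0, 255, 0, 128, 0, 0];

console.log(Array.from(new ImageBrush().draw(greenPixels, taroTheme)));
console.log(Array.from(new LogoBrush().draw(greenPixels, taroTheme)));
```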

Ultimately, social card generators don’t have to be revolutionary. At best they’re icebreakers, at worst they’re utilitarian. But there’s something to be said for executing small things with elegance, and I think Tarot manages that quite well.