Diving into Data Journalism: Strategies for getting started or going deeper
A few years ago, the digital revolution sparked upheaval — and in many newsrooms, concern.
As technology brought data journalism and other new practices into newsrooms, some editors and publishers fretted that revered principles of shoe-leather reporting, experience, and intuition might decline or even disappear.
But when the fear subsided, it became clear that the fundamentals of what made a good work of journalism remained the same.
The new practice of data journalism is not a completely new type of journalism. Rather, think of data analysis as simply part of journalism for the modern world. Reporting has always involved numbers. Today, technology enables journalists to use numbers less anecdotally and more authoritatively, and to uncover otherwise invisible stories.
In that sense, data is essential to making the journalism of today stronger than what came before.
Consider, for example, the 2015 Pulitzer Prize winners. The Public Service winner was anchored by data about domestic violence against women in South Carolina. The Investigative Reporting co-winners revealed data about lobbyist donations and Medicare payments. The Explanatory Reporting winner tracked and visualized companies avoiding taxes. All are stories that would have been less convincing or entirely undiscovered without data analysis.
This paper, part of the American Press Institute’s series of Strategy Studies, will address how to incorporate data into the reporting that’s already happening at your news organization, as well as how to grow and sustain the practice, despite challenges of funding and staff time.
Based on months of interviews with data journalism practitioners, reviews of published guidelines and teachings, and other extensive research, this paper will provide a practical guide to understanding the following:
- The rise of data reporting
- How is data journalism different from what journalists have always done?
- How do you get started, or do more if you have already begun?
- How do you establish these practices newsroom-wide and make them sustainable?
- What are the challenges and possible pitfalls, and how can you avoid them?
- Appendix: Resources for learning on your own
The rise of data reporting
“Data and journalism have become deeply intertwined, with increased prominence,” journalist Alex Howard wrote in a report for the Tow Center at Columbia University entitled “Debugging the backlash to data journalism.” “To make sense of the data deluge, journalists today need to be more numerate, technically literate and logical.”
These changes come to the industry’s great advantage, said Steve Doig, who teaches at the Cronkite School of Journalism at Arizona State University. Twenty years ago, Doig was a reporter at the Miami Herald, painstakingly looking for patterns in data stored on 9-track tapes.
“Data is not a stranger to journalism,” he said. Back in the 1960s and even earlier, before reporters had easy access to personal computers, they were doing computer-assisted reporting projects. The name of the practice has evolved — precision journalism, computer-assisted reporting, data-driven reporting. But “the key word in ‘data journalism’ is ‘journalism,’” Doig said in an interview.
The people who do this work stress that data projects begin and end with traditional journalism know-how: how to find a story, how to find impact and human interest, how to explain concepts to the public.
“There have been reporters going to libraries, agencies, city halls and courts to find public records about nursing homes, taxes, and campaign finance spending for decades,” Howard wrote. “The difference today is that in addition to digging through dusty file cabinets in court basements, they might be scraping a website, or pulling data from an API.”
Doig was an early experimenter in computer-assisted reporting back when that phrase could be taken more literally: he did reporting assisted by computers when even computers were scarce, back in the 1980s. Data aficionados sometimes had to borrow computers from universities, he said.
When Hurricane Andrew roared through the Florida coast in 1992, Doig was a Miami Herald reporter who had been experimenting with SAS and census data on the Herald’s behemoth mainframe.
“When our reporters mentioned that the county was doing a house-by-house inventory of damage, I realized I could match that with the property tax roll and look for patterns,” he said. The property tax roll provided Doig with what he called “the one true smoking gun of my career”: the newer the home, the more likely it was to be destroyed in the storm.
This discovery led the Herald to investigate building code inspections — “literally millions” of them — showing inspectors carrying out, on paper, at least, an impossible 60 to 70 inspections a day.
From there, the reporters moved to campaign finance, finding that a huge chunk of contributions in Florida came from the construction industry. The paper had to hire a data punch house to transform 10 years of campaign finance data from paper to structured data.
At the end of all this work, all these floppy disks, all these piles of paper, came the 1993 Pulitzer Prize for Public Service, the Pulitzers’ highest award.
“A key part of the project was trying to get beyond the finger pointing that was going on,” Doig said. “The data allowed us to find that evidence.”
Today, Doig said, reporters can do that analysis with nothing but Internet access and free tools on a laptop. Not only that, but the public has come to expect it.
“To me, the real heart of learning [data reporting] is so that we can fulfill our watchdog function,” he said. “It’s a necessary part of journalism to be able to analyze data that particularly the government collects, and use it to basically measure whether the government is doing its job or not.”
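As a rough illustration of the kind of matching Doig describes, the sketch below joins a damage inventory to a property tax roll on a shared parcel identifier and then compares destruction rates by the decade each home was built. Everything here is an assumption made for illustration: the field names (`parcel_id`, `destroyed`, `year_built`) and the handful of rows are invented, and this is not the Herald's data or method, which ran in SAS on a mainframe.

```python
from collections import defaultdict

# Hypothetical, made-up records for illustration only (not the Herald's data).
damage_inventory = [
    {"parcel_id": "A1", "destroyed": True},
    {"parcel_id": "A2", "destroyed": False},
    {"parcel_id": "B1", "destroyed": True},
    {"parcel_id": "B2", "destroyed": True},
    {"parcel_id": "C1", "destroyed": False},
]
tax_roll = [
    {"parcel_id": "A1", "year_built": 1962},
    {"parcel_id": "A2", "year_built": 1958},
    {"parcel_id": "B1", "year_built": 1985},
    {"parcel_id": "B2", "year_built": 1988},
    {"parcel_id": "C1", "year_built": 1971},
    {"parcel_id": "D9", "year_built": 1990},  # on the tax roll, but never inspected
]


def destruction_rate_by_decade(damage_rows, tax_rows):
    """Join the two tables on parcel_id and return the share destroyed per decade built."""
    # Build a lookup so each damage record can find its construction year.
    year_by_parcel = {row["parcel_id"]: row["year_built"] for row in tax_rows}

    totals = defaultdict(int)
    destroyed = defaultdict(int)
    for row in damage_rows:
        year = year_by_parcel.get(row["parcel_id"])
        if year is None:
            continue  # no match in the tax roll; a real project would log these
        decade = year // 10 * 10
        totals[decade] += 1
        if row["destroyed"]:
            destroyed[decade] += 1

    return {decade: destroyed[decade] / totals[decade] for decade in sorted(totals)}


rates = destruction_rate_by_decade(damage_inventory, tax_roll)
for decade, rate in rates.items():
    print(f"{decade}s: {rate:.0%} of inspected homes destroyed")
```

With so few invented rows the percentages mean nothing. The point is the shape of the work: match two public records on a common key, group, and look for a pattern worth reporting out.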
Not only are journalists and the general public far more familiar with data now, but data has proliferated: computer technology and automation have created a world of accurate, recorded data that simply didn’t exist before.
Tools are easier, computers are faster, and “every day, millions and millions of rows of data are coming out of federal governments, state governments, city governments,” ProPublica editor Scott Klein said.
“It’s the journalist’s responsibility to help people make sense of that,” he continued. “I believe it’s a responsibility of local newsrooms to help people be empowered by all of this data.”
Data journalism has also exploded in popularity in the last few years, thanks in part to work such as Nate Silver’s perfect prediction of the 2012 presidential election. At the same time, some data journalism pioneers still work in the field as editors or professors, carrying its foundations to a generation raised on computers.
Jacob Harris spent several years on a New York Times data team. He said the plethora of new tools and publication platforms make data journalism much easier, but they also make it easier to “start churning this stuff out without really thinking about it.”
Keeping step with the proliferation of data journalism in general, meritless data projects have been proliferating on the Internet: suspicious scientific surveys, for instance, or maps of fun facts.
“It’s super easy to put dots on a map at this point,” Harris said. The key ingredient that’s missing, as we will see in this paper, is journalism.
How data journalism is different
The definition of data journalism is at once both painfully simple and frustratingly vague.
“When you say data journalism, it means something different to just about everyone,” said Aron Pilhofer, visual editor at The Guardian.
In his Tow Center paper, Alex Howard offered a more detailed definition for data journalism: “gathering, cleaning, organizing, analyzing, visualizing and publishing data to support the creation of acts of journalism.”
Those who practice it do tend to agree on one principle: data journalism is, first and foremost, journalism. It simply uses data as a source in addition to humans.
Pilhofer and others tend to delineate a few categories, each with its own skills and job descriptions. While these may vary or overlap depending on who you’re talking to, they tend to fall roughly along these lines:
- Acquisition: Getting data, whether that means scraping a website, downloading a spreadsheet, filing a public records request or some other means
- Analysis: Doing calculations or other manipulations on data you’ve got, to look for patterns, stories or clues
- Presentation: Publishing data in an informative and engaging way. Infographics, news apps and web design are all examples of this
Not all of these categories might fit a strict definition of reporting, but they all do constitute journalism, said Sarah Cohen, who leads a data team at The New York Times. Even news app developers who spend their days writing code are journalists, Cohen says, because they’re writing code in order to explain and communicate information to the public.
“The necessary skill for a data journalist is journalism and some interest in data,” Cohen said.
What reporters can do with data that they can’t with traditional reporting
While data brings its own challenges, as we will discuss later, it also offers some opportunities that are impossible or harder to get at in more traditional forms of reporting.
Data allows journalists to more authoritatively verify claims
The clearest advantage data has over other sources is that it’s fact, said Sarah Cohen at The New York Times. It’s an actual counted number of fatalities, for instance, or tax dollars or potholes. There’s not as much need to rely on anecdotal evidence when you have the real evidence in front of you.
Take a story by the Associated Press from earlier this year, which used a congressman’s Instagram account as a source for an investigation. This particular politician had been taking flights on his donors’ private jets, and billing the public for it, suggesting an overly cozy or even illicit relationship with his top donors.
The reporters had found the scoop by comparing the location data on his Instagram posts to public data on flight records. These days, Cohen said, “anything can be data.”
Data allows journalists to tackle bigger stories
With data, size no longer matters: reporters can easily get ahold of information ranging from the granular to the global. It might be just as easy to get budgets for every county in the state as it is to get one for just your county, opening up a wealth of new possibilities for exploration.
Hilary Niles said this capacity gives newsrooms an “investigative edge” they wouldn’t have otherwise, especially small or medium-sized newsrooms. Niles works as a data consultant and freelancer in Vermont, advising newsrooms on using data and doing her own freelance reporting.
Back in 1992, Steve Doig, a reporter at the Miami Herald, had to examine millions of building code inspections using a computer program called SAS. His investigation revealed the state had been extraordinarily lax about its inspections. Such a monumental task would have been impossible if his team of reporters had to work only with the inspection reports on paper.
Data makes it easier to find new stories
With data, reporters can suss out patterns and follow up on leads in a way they can’t with verbal stories or anecdotes.
While reporters should still use their journalistic judgment, Jue Yang said, data offers a view that doesn’t lean so heavily on instinct or personal judgment. Yang is a technologist-in-residence at the City University of New York, where she helps shape its innovative Social Journalism program. “Computers are great when it comes to discovering things faster or discovering things you didn’t expect,” she said.
Data enables journalists to better illuminate murky issues
Data can also support or oppose an existing claim, or theory, or even an urban legend. Kuang Keng Kuek Ser is a consultant who coaches small- or medium-sized newsrooms that want to start using data.
Keng shares an example from The Guardian: after a series of riots in the UK caught international attention, the government claimed the riots were unrelated to poverty, and The Guardian wanted to investigate.
“But the question is, how do we know?” The Guardian wrote in an article explaining their work. “If poverty affects health, education and crime, could it be a factor in the events of last week?”
It’s tough to definitively say, because someone could easily make a claim either way. The Guardian’s solution was to get hold of the police records of everyone arrested in the riot, and map out their home addresses. The reporting compared those addresses to a map of impoverished areas, which it obtained through other public data. In the end, The Guardian found that some of the government’s claims were true, while some were not.
Like Steve Doig, a Miami Herald reporter who found a clear connection between building inspections and hurricane damage, The Guardian used hard numbers to clear up what had been an issue of finger pointing.
“The core of data journalism, on at least the analysis end, is looking for patterns,” Doig said. “The patterns are going to be what tells the story.”
Just as data can illuminate a murky social issue, it can also quantify it, which contributes valuable information to the social discourse.
In 1989, even before Doig was doing his hurricane investigation, reporters in Atlanta were trying to investigate rumors of racial discrimination in bank loans. Using six years’ worth of lender reports, the Atlanta Journal-Constitution was able to show African-Americans were denied bank loans at rates far exceeding those for whites. The paper became one of the first to win a Pulitzer for an investigation using data.
The Atlanta reporters already had anecdotes about racial discrimination, Doig said, but the data allowed them to go beyond that and establish clear patterns – even illuminating the quantity and scale of the problem.
Data can offer detail and distance
Jacob Harris, who now works as an innovation specialist at the General Services Administration’s 18F project, said data allows more capacity for showing the ‘near’ and ‘far’ view of a topic. In other times, he said, a man on the street interview would be the ‘near’ and an expert interview would be the ‘far.’ There’s not so much need to rely solely on expert testimony when data can provide the ‘far’ or ‘macro’ view more precisely.
On the other hand, the scale of the data itself can be overwhelming for the audience. While data on every police force in the United States can offer a “far” view for a story, no reader is actually going to sift through all that information if it’s put in front of them. But the web allows them to “look at their own ‘near,’” Harris said.
Harris gave the example of ProPublica’s “Surgeon Scorecard,” a news app that lets users find data on their own doctor, hospital or town. In this way, ProPublica distills data on tens of thousands of doctors and millions of dollars of Medicare payments into whatever fits each reader.
Data offers the potential to be more transparent
At the same time, there may be a reason to share a huge data set with an audience. Data sources and web technology have made it possible for journalists to be transparent as they never have been before. Reporters can even share how they reached their conclusions, or allow readers to come to their own. “Transparency is the new objectivity” became a saying among journalists. Blogger David Weinberger wrote about it for KMWorld in 2009.
“Outside of the realm of science, objectivity is discredited these days as anything but an aspiration,” he wrote. “If you don’t think objectivity is possible, then presenting information as objective means hiding the biases that inevitably are there. It’d be more accurate and truthful to acknowledge those biases, so that readers can account for them in what they read.”
Bill Kovach and Tom Rosenstiel made a similar case in 2001 in The Elements of Journalism, when they argued that scientific-style transparency was the lost meaning of objectivity.
Today, transparency is a common concept at many organizations. Jeremy Singer-Vine and his data team at BuzzFeed published an investigation earlier this year showing that migrants who came to the U.S. on skilled labor visas were being exploited by their employers. They went on to publish not just the raw data, but the calculations they’d done to reach their findings, allowing their readers to check their work and form their own conclusions.
“It’s important to show our work,” Singer-Vine said. “Readers should see where this is coming from and not just trust our word.”
Data can make reporting more efficient
Reporters frequently collect information from the same sources over and over again: building permits, police reports, census surveys. Obtaining and organizing this information can be made infinitely more efficient, even totally automatic, by keying in to the data behind the reports.
Derek Willis, a developer at ProPublica, found himself constantly checking the Federal Election Commission’s website for new campaign filings. He automated this process, bit by bit, until he had a program that checked for new filings every 15 minutes, and alerted him to interesting ones. “I don’t miss a thing,” he said.
A little programming knowledge had not only made Willis’s task more accurate and efficient, but also freed up his time for other reporting tasks.
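The general pattern behind that kind of automation is simple polling: fetch the current list on a schedule, compare it with what has already been seen, and raise an alert only for new items that pass some test. The sketch below is a minimal version under stated assumptions. It is not Willis’s program and does not call any real Federal Election Commission service; `fetch`, `is_interesting` and `alert` are hypothetical functions the reporter would supply, and the filing records are invented.

```python
import time


def find_new_filings(current, seen_ids):
    """Return filings whose id has not been seen before, and remember them."""
    new = []
    for filing in current:
        if filing["id"] not in seen_ids:
            seen_ids.add(filing["id"])
            new.append(filing)
    return new


def watch(fetch, is_interesting, alert, interval_seconds=15 * 60, max_checks=None):
    """Poll fetch() repeatedly and call alert() for each new, interesting filing.

    fetch, is_interesting and alert are supplied by the caller (hypothetical here).
    max_checks=None means run until interrupted.
    """
    seen_ids = set()
    checks = 0
    while max_checks is None or checks < max_checks:
        for filing in find_new_filings(fetch(), seen_ids):
            if is_interesting(filing):
                alert(filing)
        checks += 1
        if max_checks is None or checks < max_checks:
            time.sleep(interval_seconds)


# Demo with a stand-in data source: each call returns the filings "currently" listed.
snapshots = iter([
    [{"id": 1, "amount": 500}],
    [{"id": 1, "amount": 500}, {"id": 2, "amount": 250000}],
])
alerts = []
watch(
    fetch=lambda: next(snapshots),
    is_interesting=lambda filing: filing["amount"] >= 100000,
    alert=alerts.append,
    interval_seconds=0,  # no waiting in the demo; the default is 15 minutes
    max_checks=2,
)
print(alerts)
```

In real use, `fetch` would download or scrape the source being watched and `alert` might send an email or chat message. Keeping those pieces as plain functions makes the comparison logic easy to test without touching the network.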
What might a data journalism team look like?
Before ProPublica, Willis worked on the Upshot, a data and analysis blog at The New York Times. The Upshot is one of the Times’ four data teams, which fall roughly along the data journalism categories we discussed earlier: acquisition, analysis and presentation. The presentation side is split into separate teams for visualization and news apps.
Besides the Upshot, which analyzes and presents data in innovative and attention-grabbing ways, the Times has a data visualization team, a news apps team and a computer-assisted reporting team, which works mostly on data acquisition and analysis for investigations.
BuzzFeed has a single data team, consisting of Jeremy Singer-Vine and two other reporters, nested inside its investigative unit. The three team members spend a lot of their time helping on data projects with other parts of the newsroom, such as the science desk.
At The Guardian, a newsroom known for pushing data journalism forward, the data projects team is only two people, who spend most of their time working with other reporters. Dozens of other reporters also use data on the Data Blog and Visuals teams.
Other newsrooms have a single data reporter.
Jaimi Dowdell, a training director at the Investigative Reporters and Editors organization, said having a single data reporter can be challenging because editors want that reporter to be everything: reporter and editor, features and daily writer, trainer and evangelist.
“I feel like that does set the data person up for failure a little bit,” she said, “because you just can’t [be everything].”
The most successful teams, our reporting suggests, tend to be those that perform some mix of their own stories, collaborations with other reporters, training with other staff members and what Pilhofer called the “evangelism” role: raising the level of data literacy across the newsroom.
What it always boils down to, though, is not the size or caliber of your team – or even the existence of one. All that’s needed for data journalism is a journalist with a little interest in data.
For that reason, editors and publishers shouldn’t necessarily think of a “data journalist” as a unique person who should be headhunted, or even necessarily a separate team that needs to be put together. Outlets that are more successful at maintaining data use in their stories tend to have their reporters incorporate data into what they’re already doing.
“Thinking of it as a complement to everything else, rather than a standalone thing, probably helps,” BuzzFeed data editor Jeremy Singer-Vine said. Data, he emphasized, is just another skill that helps reporters tell stories.
How to get started with data journalism
Once you’ve decided data is something your staff should be able to handle, the question is how to incorporate it into their workflow. Every newsroom is already busy, and many are strapped for funding and staff. This section will address how to train journalists in data journalism while ensuring that it gets folded into the work they’re already doing.
What skills should journalists have?
Across the board, those who practice it told me there are two basic skills needed to get started as a data journalist: the ability to engage in critical thinking and basic familiarity with spreadsheets.
Most reporters already have critical thinking skills. (Although, as we will see in the “challenges” section, they need to learn to apply them to data sources as well as traditional human sources.)
In the case of data journalism, it means the ability to treat numbers as skeptically as you would any other source, said Cheryl Phillips, a professional in residence at Stanford.
One oft-cited example of what not to do is FiveThirtyEight’s story on kidnappings in Nigeria. The data blog published an article, an animated map and other story elements, that demonstrated a dramatic increase in kidnappings in that country, which was a relevant topic at that time because of news of the kidnapping of hundreds of teenage girls.
The problem was, the data was based on the number of recorded news stories, not kidnappings themselves.
“You cannot assert that there are more kidnappings just because the media is running more stories about them,” data visualization expert Alberto Cairo wrote for Nieman Lab. “It might be that you’re seeing more stories simply because news publications are increasingly interested in this beat.”
FiveThirtyEight had successfully analyzed the data it had, in the sense that the reporter had calculated and mapped out changes in the numbers. But it hadn’t thought critically about what the limits of the data were.
“(Journalists) need to know how to interview data, how to ask questions of data,” Cheryl Phillips at Stanford University said. “They can do all that with a spreadsheet, honestly.”
The second core skill is simpler: command of basic spreadsheet use.
What that means, according to Derek Willis of ProPublica, is learning to be a user or a creator of spreadsheets, rather than simply a viewer of them. That means not just reading a table of data, but being able to manipulate and organize it into new forms. Knowing enough basic math to calculate, say, percent change, is another rudimentary skill.
At Arizona State University, Steve Doig leads an online course that teaches these skills in a few hours. He goes into a few more advanced tactics, but the spreadsheet basics are:
- Sorting: Rearranging the rows of data in a certain order. This allows you to find, for example, the highest salary in the state or the lowest crime rate in the country.
- Filtering: Narrowing down the data to only the parts you’re interested in. This allows you to see, for example, only campaign donors in your state rather than the whole country.
- Basic math: Simple calculations like addition and division enable you to find, for example, how much a budget has increased over the year before.
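For readers who eventually move beyond a spreadsheet, the same three operations can be sketched in a few lines of Python. The county names and budget figures below are invented purely for illustration, and `percent_change` is a helper written for this example rather than a built-in function.

```python
# Hypothetical rows, as they might appear in a spreadsheet of county budgets.
budgets = [
    {"county": "County A", "state": "VT", "last_year": 200000, "this_year": 230000},
    {"county": "County B", "state": "VT", "last_year": 50000, "this_year": 45000},
    {"county": "County C", "state": "NH", "last_year": 120000, "this_year": 150000},
]


def percent_change(old, new):
    """Percent change from old to new: (new - old) / old, expressed in percent."""
    return (new - old) * 100 / old


# Sorting: largest current budget first.
by_size = sorted(budgets, key=lambda row: row["this_year"], reverse=True)

# Filtering: keep only the rows for one state.
vermont_only = [row for row in budgets if row["state"] == "VT"]

# Basic math: add a calculated column, as one would with a spreadsheet formula.
for row in budgets:
    row["pct_change"] = percent_change(row["last_year"], row["this_year"])

for row in by_size:
    print(f'{row["county"]} ({row["state"]}): {row["this_year"]:,} ({row["pct_change"]:+.1f}%)')
```

In a spreadsheet, the equivalent percent-change formula would be the new value minus the old value, divided by the old value, with the cell formatted as a percentage.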
Helena Bengtsson, who does data stories and staff training at The Guardian, said two of those functions – sorting and simple math – account for the work behind most data stories. “So I can teach anybody to do 80 percent of all data journalism in under half a day,” she said.
These basic functions can be done with Microsoft Excel or its free alternative Google Sheets. “Excel is still the tool I mostly use, and I’ve been doing this 20 years,” Bengtsson said.
If people are interested enough to go beyond Excel, Bengtsson said, they can move into something more specialized.
For instance, investigative reporting would lead them to the analysis side, learning to use tools like SQL and relational databases. If they’re more interested in the presentation side, they could explore visualization tools like Google Fusion Tables.
That’s the point where they would specialize in one of the previously mentioned categories of data journalism – acquisition, analysis and presentation. But at its foundation, data journalism requires only two skills: critical thinking and basic spreadsheet knowledge.
One of them journalists should already have. The other they can learn in an afternoon.
How to hire people with these skills, or train your existing staff
Hiring managers at newsrooms can require that new hires know data skills, ProPublica developer Derek Willis said, and they probably should, if only to send the message that this is something they’re invested in. But reality is not that simple.
“There’s a pipeline problem,” Willis said. Not many journalists learn these skills, at least not formally.
USNews’s Lindsey Cook wrote about the dearth of data teaching in journalism education for Source, a blog for journalism coders.
“It happens every year, just the same,” she wrote. “Papers are posted to a board at NICAR seeking journalists with tech skills; journalists tweet encouragements that any young person wanting a job in journalism should learn data and coding. Look at all these jobs! This is what the young whippersnappers should learn! If only there were more of this!”
Cook said the journalism industry – and education system – have a lot of catching up to do when it comes to data, just like every other form of technology. The landscape changed so fast that old methods are crashing and burning.
Take the stereotype that journalists are bad at math. Cook said as a journalism student she almost always heard that stereotype tossed out by visiting journalists who came to speak to the students.
Students “have been told by everyone they admire in journalism that you don’t need math, when that’s not the reality of the field,” she said. “And that’s really hurting us when it comes to data journalism.”
Adding to the problem, she said, is that the old model of hiring a college grad and working with them intensely for a year or so has diminished. Instead, institutions hire young journalists expecting them to have skills right off the bat – and, often, lay off older journalists who lack the digital skills newsrooms now seek.
Hiring managers, then, are left with experienced professionals and new journalism grads, both of whom may lack data training.
The next best thing is to train the staff already in place.
Newsrooms around the world offer different approaches to this kind of training, including:
- Workshops taught by outside contractors
- Data “boot camps”
- Workshops taught by the data team
- Collaborations between data teams and other reporters
- Assigning reporters to teach themselves online
- Calling on support networks
At The Guardian, Helena Bengtsson has had the most success with a combination of outreach attempts. Her team holds workshops for anyone who’s interested in learning Excel, pitches their own data stories, and does what Pilhofer called “aggressive collaboration:” working directly with individual reporters to make their stories better.
Nonetheless, Bengtsson said, it’s probably the wrong approach to try to get absolutely everyone on board with data reporting. Rather, the data “evangelists” should target people in the newsroom who seem most open to learning new skills, and most likely to actually use them.
Flor Coelho, a data editor at La Nacion, a large newspaper in Argentina, suggested that these candidates aren’t necessarily the most tech-savvy reporters, but people who like to innovate. They’re the ones that will actually try something new, and keep trying.
It’s not always the young “techie” reporters who pick it up, Bengtsson agreed. One of her most successful students, she said, was a social science reporter in her 50s. Bengtsson helped her quantify freedom of information responses related to sexual harassment on campuses, and the reporter took it from there. “She ‘got it,’” Bengtsson said. “How (data) could help her.”
When reaching out to work with individual reporters, Bengtsson said, it’s most important to truly collaborate, meaning use techniques and tools the reporter will understand and be able to use themselves. “Collaboration means trust,” she said.
That’s why she proposed newsrooms equip everyone – editors included – with the basic spreadsheet knowledge discussed earlier. Since spreadsheets are the necessary foundation for any data work, they act like a “gateway” to other forms of data journalism, like visualization, investigation or writing code for news apps.
Bengtsson said the voluntary training sessions at The Guardian have attracted people from all over the company, not just the editorial staff. “People are very receptive here,” she said. She teaches a few advanced tools like Pivot Tables and formulas, but focuses on the spreadsheet basics.
Lindsey Cook, at USNews, stressed that these workshops should be voluntary. For years, reporters were told by their bosses that they needed to learn to use Twitter. To many of them, it seemed like extra work just for the sake of extra work. “That’s kind of a dangerous loop to get into,” Cook said.
Instead, those doing the training should get a sense of the reporters’ workflow, and make sure the data skills fit into it and make it more efficient. She also recommended that data’s so-called evangelists pitch data skills as less work, rather than more, and work one on one as much as possible.
“It’s important to remember you can’t make anyone do something they don’t want to do,” she said. “It’s hard to make someone sit in a class who doesn’t want to sit in a class.”
An ideal model, she said, might be something similar to coding schools: intensive series of classes that teach specific skills over a period of six or eight weeks. The only thing like that in journalism is data boot camps.
IRE offers computer-assisted reporting boot camps several times a year at its home base in Columbia, Mo. The week-long, intensive training sessions introduce professional journalists to data skills ranging from basic spreadsheet knowledge to visualization. Attendance costs several hundred dollars, but fellowships and other financial aid are available.
Another way to teach data skills is to bring the trainer straight to the newsroom. IRE offers workshops like this, and Keng, the data consultant, does the same as a freelancer.
Keng visits small- and medium-sized newsrooms, where he coaches people on how to use data to report more deeply and efficiently. He said one of his first tasks is to help reporters and editors realize that data is for them.
“There’s a misconception or conception among them that they think data journalism is very hard to do, or expensive,” he said. “One of the challenges is actually to change the perception that only big organizations like The New York Times or Washington Post can do data journalism.”
The second challenge, he said, is pulling journalists out of their regular routine to spend time learning it.
After the workshops, Keng makes a point of putting the reporters in touch with networks like IRE, Hacks/Hackers or the Global Investigative Journalism Network, who can offer technical or logistical support.
Derek Willis at ProPublica suggested local journalism schools or the outlet’s own alumni network – staffers who have moved on to other newsrooms – as further options for support networks. If nothing else, he said, it’s important that reporters know their problems are not unique.
If all else fails, editors can simply give a reporter time to learn on their own. That takes dedication on the part of the reporter, but many of the best data journalists out there are self-taught. The next section addresses the best ways to tackle teaching yourself data skills.
Learning on your own: how reporters and students can get started in this field, or teach themselves
One way to start out, NICAR training director Jaimi Dowdell suggested, could be to start with a municipal budget. In most cases, a data source like that is simple to understand, and in every case, it’s possible to get. In the U.S., at least, government budgets are always public.
Sarah Cohen, now at The New York Times, wrote the book “Numbers in the Newsroom” to help reporters get over a fear of math that kept them from learning data skills or even how to read reports like budgets. On another level, though, she said, she hoped it would change the culture a little bit by implying that this is part of a journalist’s job.
Luckily, she said, journalism schools today are so focused on finding people jobs that they aren’t propagating the “journalists are bad at math” stereotype as much as they used to. “It’s less common to joke about it as a charming thing,” she said.
“Numbers in the Newsroom” is an excellent walkthrough of data and simple calculations – “nothing above third-grade math.” It includes an entire chapter on how to analyze a budget. That tutorial, and others, can be found in the appendix to this paper.
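Two simple budget checks are confirming that the line items add up to the stated total and comparing planned spending with actual spending. The short Python sketch below illustrates that arithmetic. It is not taken from Cohen’s book: the departments, dollar figures and function names are invented for illustration, and a real budget would be loaded from a spreadsheet.

```python
# Hypothetical budget lines; every figure here is invented for illustration.
budget = [
    {"department": "Police", "planned": 1_200_000, "actual": 1_350_000},
    {"department": "Parks", "planned": 300_000, "actual": 270_000},
    {"department": "Roads", "planned": 500_000, "actual": 500_000},
]
STATED_TOTAL = 2_100_000  # the total printed in the (hypothetical) budget document


def check_total(line_items, stated_total):
    """Return the gap between the stated total and the sum of planned line items."""
    return stated_total - sum(item["planned"] for item in line_items)


def percent_vs_plan(item):
    """Percent by which actual spending differs from planned spending."""
    return (item["actual"] - item["planned"]) * 100 / item["planned"]


gap = check_total(budget, STATED_TOTAL)
if gap != 0:
    print(f"Line items differ from the stated total by ${gap:,}")

for item in budget:
    print(item["department"], f"{percent_vs_plan(item):+.1f}% vs. plan")
```

A nonzero gap or a large percentage is not a story in itself. It is a question to take back to the budget office.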
Scott Klein, an editor at ProPublica, describes Cohen’s book as an invaluable resource in his classroom at the New School. Another good reason to start with a budget, he said, is that it’s journalistically valid. Klein recommended reporters start with a data set they’re legitimately interested in – and examine it for actual journalism, not just practice.
A lot of online learning courses, like those that teach you how to code, walk the learner through hypothetical situations like, “how to make a peanut butter sandwich” or how to construct a simple game. Instead, Klein said, journalists should commit to doing a journalism project right from the start, and follow it through.
Jue Yang, who teaches at CUNY, said reporters need to tap into “that startup mentality: JFDI, just effing do it.”
“If you really want to be the innovator, rather than just catching up, you got to just start doing things,” she said.
The problem with jumpstarting data journalism efforts, 18F’s Jacob Harris said, is that editors want data stories to be fast, cheap and accurate. “It’s hard to get all three,” he said. That’s why you see so many stories using the same data sets over and over again, like:
- The census
- Bureau of Labor Statistics
- FBI crime statistics
- Campaign finance filings
- Local budgets
These data sets can be obtained by any reporter, and any reporter is likely to find a story in them, because they can be localized. You can read about these data sources in more detail in the appendix to this paper.
The National Institute for Computer-Assisted Reporting, a subsidiary organization of IRE, offers many national data sets cleaned, organized and ready to be analyzed. NICAR charges for access to the data, but the fees are scaled to the size of the news organization.
NICAR training director Dowdell said the ideal training situation would be for mid-level managers – those who deal with copy on a day-to-day basis – to be at least a little familiar with data, so they can edit data stories and understand the limits placed on them. If nothing else, she said, reporters can take it upon themselves to learn.
“Sometimes you have to invest a little of your own time,” Dowdell said. “Over time you’re going to get more time, more support.”
Online courses, support networks and data sets with training wheels are all good ways to get started. The next challenge is to fit it into a reporter’s daily life.
Where in the workflow does this go?
A lot of reporters, and their editors, feel like they can’t fit in data reporting when they’re so busy covering news that’s already happening.
“And that’s a very real scenario,” ProPublica developer Derek Willis acknowledged. “What I would say, though, is… done right, this is not an either-or kind of thing.”
While there is an initial time investment to learning data skills, he said, data can actually help reporters become more efficient and free up more of their time. If they are really struggling to fit in data reporting, Willis said, they should look at what they’re already doing.
Business reporters, for example, often pull up new business permits. As it is, they’re pulling up data in the form of paper or PDF documents. If they requested that as structured data – that is, something more like a spreadsheet – it would be far easier to analyze and look for patterns. Every reporter, Willis said, has data like that they’re already looking at – they’re just not looking at it in the right way.
“Take a look at the way you’re already collecting information that you consider valuable,” Willis said. “You can make that task easier just by altering the way that you collect and store information in-house.”
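To illustrate what structured data makes possible, here is a minimal Python sketch. The permit records, the column names (`business_type`, `neighborhood`) and the helper `count_by` are all invented for illustration. A real export from a permitting office would have its own layout.

```python
import csv
import io
from collections import Counter

# Hypothetical permit records, as they might arrive if requested as a
# spreadsheet (CSV) rather than paper or PDF. All values are invented.
PERMITS_CSV = """permit_id,business_type,neighborhood
1001,restaurant,Downtown
1002,salon,Eastside
1003,restaurant,Downtown
1004,retail,Eastside
1005,restaurant,Northside
"""


def count_by(csv_text, column):
    """Count how many rows share each value in the given column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in rows)


type_counts = count_by(PERMITS_CSV, "business_type")
area_counts = count_by(PERMITS_CSV, "neighborhood")

# Print the most common business types first.
for business_type, n in type_counts.most_common():
    print(business_type, n)
```

The same tally done by hand from a stack of PDFs would take far longer and be harder to repeat when next month’s permits arrive.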
Niles, the freelancer, said a few of her successful investigations have come from analyzing data the state had already collected, but simply hadn’t analyzed. “There’s just a ton of data that gets collected,” she said. “Just insights sitting there on the table waiting for somebody to find them.”
She suggested reporters ask themselves, “what is a question that comes up a lot on my beat?” Data could answer or illuminate it. Or, “what is a report I frequently read on my beat?” Data could automate it or make it simpler to look for trends.
If reporters make a habit of these things, she said, requesting and analyzing data should fit naturally into their workflow. “You want it to be as seamless as possible,” she said. “That’s the goal.”
Incorporating new skills into workflow, she said, also provides the key to the next challenge: sustainability.
How to establish data reporting newsroom-wide
Learning spreadsheet basics is one thing, but establishing data as a steady practice is another. Those who teach data journalism agree there are some steps needed to get the most out of these data skills: namely, establishing habits and collaboration across all levels of staff.
Establishing the ‘data state of mind’
Hilary Niles, who works as a data freelancer, said she would like to think sustainability can happen from the ground up – originating with reporters – but there needs to be editorial support. “There needs to be buy-in from the editors,” she said.
One way to support that grassroots sustainability is establishing a “data state of mind.” IRE and others use this phrase to describe the awareness that if you are looking into a topic, there is data on it somewhere out there. And not only that, but you can get it and examine it.
Derek Willis at ProPublica recommended that journalists practice simply requesting information in the form of data. Budgets and crime rates, he said, are examples of information every newsroom should be getting as data, making it easier to examine later. In addition, governments are releasing information in this format more and more.
Getting into a mindset of asking for data is one of the most important factors in becoming a data-savvy newsroom.
Data journalism “is more a way of thinking” than merely a technique, Bengtsson said. When reporters first started using phones to report stories, they didn’t say, “I’m now going to do telephone journalism,” she said. It was merely a technological progression in what they were already doing.
Keng, the data consultant, said a little data knowledge goes a long way. For example, Keng said, reporters might not realize how easy it is to make their own charts or web presentations, reducing the workload on the graphics or IT teams.
All of these advantages, however, require an initial time investment, followed by additional time allowance from higher-ups. Without that sustained support, training would come to naught. Editors who suspect data is of no use, and don’t afford their reporters enough time to work on data stories, can make those suspicions a self-fulfilling prophecy.
How to grow and sustain data capabilities, despite financial and staff limitations
Besides fears that data journalism is too hard or too time-consuming, there’s a prevailing idea that it’s prohibitively expensive. Some of this is rooted in fact, but the expense is less a feature of data work than of newsrooms that haven’t managed to keep up with changing technology in general.
“Unfortunately, in many newsrooms there’s been successive resistance to lots of different kinds of technology,” Huffington Post technology and society editor Alex Howard said. “The very rapid change in delivery and distribution, which is now owned by tech companies, has put a lot of these papers in a difficult place.”
Newsrooms experience what Howard called “technical debt” – an unavoidable inheritance of the technology that was purchased and used by the generation before, for better or worse.
This isn’t just an issue for data work, then, but rather all technology from computers to email to social media. Content management systems, the programs that organize and publish content, were designed for print news.
One solution to overcome this technical debt, Howard said, is to pull from outside sources like GitHub. GitHub is a website that hosts a universe of open-source tools, meaning tools whose code is public and that are generally free to use. The only obstacle, he said, is the understanding and skills needed to put them to work.
The most common tool for data use is one that most newsrooms probably already have installed, no matter how deep their technical debt: Microsoft Excel. While Excel isn’t technically a free tool, it’s a part of the Microsoft Office suite, installed on practically every office computer since the ’90s.
“Most of our work is done in Excel,” said Flor Coelho, a data and multimedia editor at La Nacion.
To expand beyond Excel’s borders, though, the team experimented with free tools like Google Sheets, an alternative to Excel, and Google Fusion Tables, software for making data visualizations.
While Excel is technically a paid program, free alternatives like Google Sheets exist. Thanks to rapidly spreading technology and the open source mindset, a reporter could go through her whole career using only free tools.
“I may use Navicat Essentials (a paid app) to do some joining and analysis, if there’s something I can’t do easily in Excel,” said Niles, the freelancer. The $40 program helped her score one of her best stories: that the state of Vermont had no idea how much it was spending on IT services.
Niles got the story by making her own database out of public data that the state didn’t look at. “There’s just insights sitting there on the table waiting for somebody to find them,” she said. “We identified at least six follow-up stories to mine from the database.”
Not only did her data work provide a wealth of stories, Niles said, but her client, Vermont Public Radio, received accolades. A few donors called to say that Niles’s story was why they were renewing their memberships for the station.
Keng, the data consultant, said it’s important to convey to newsrooms how data projects can help their bottom line. If their business model is advertising-based, for example, data projects can increase their traffic. If they depend on funding from foundations, data can make a story more widely cited or published.
Take advantage of free tools
Ideally, journalists can in turn contribute to the open source community. An example is a tool called Tabula, which finds data tables inside PDFs and scrapes them out into Excel format, so they can be analyzed. A team of journalists created the program with help from organizations like La Nacion and the Knight Foundation. The journalist-coders made the tool because they needed one, then expanded that to share it with other journalists or anyone who needs to scrape data out of a PDF.
Free and open-source tools like Google Sheets and Tabula are a great way to start overcoming your newsroom’s technical debt. To sustain data work, though, actual foundations are needed. This can mean an investment of time and staff labor, where employees learn to use data for their work. It can also mean a financial investment, in the form of new software or tools.
Willis, at ProPublica, said this cost isn’t as prohibitive as it used to be. In the industry’s earlier days, he said, you had to pay for software to make data-based projects like maps.
“But for most things these days,” he said, “neither the software nor the hardware cost a lot of money to do. Technical costs are much, much less than they used to be.” When he was at The New York Times, Willis said, most of the software they used on the data team was open-source and free.
“(But) there’s no escaping that for many of these skills that if you don’t have them then there is an upfront cost in time,” he went on. “A lot of times people will get scared off by the initial investment,” he said. “(But) it’ll pay off both in terms of the kinds of stories you’re able to do, and being able to build on those kinds of stories.”
All too often, he said, editors think of it as a one-off project, like a feature story, where the reporter invests several days of work and then washes his hands of the project.
“I think that’s a mistake, because I think the payoff is definitely not ephemeral,” Willis said. “It’ll pay off both in terms of the kinds of stories you’re able to do and being able to build on those kinds of stories.”
Willis gave the example of his computer program that checks the FEC website for him. “That gives me a competitive advantage as a reporter,” he said.
Like Niles, who derived half a dozen stories from one database, Willis had made himself – and, by extension, his newsroom – more efficient.
“If you’re doing something repetitive with a computer, then you’re probably doing it wrong,” he said. “There’s probably a better way to do it.”
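The paper does not describe how Willis’s program works, so the Python sketch below shows only the general idea behind this kind of automation: remember what you have already seen, and report only what is new. The filing IDs, committee names and the function `find_new_filings` are invented for illustration. The step that actually retrieves filings is left out because it depends on the site being checked.

```python
def find_new_filings(previously_seen, latest):
    """Return the filings in `latest` whose IDs have not been seen before."""
    return [f for f in latest if f["filing_id"] not in previously_seen]


# IDs recorded on an earlier run (hypothetical values).
seen_ids = {"F-001", "F-002"}

# What the latest check returned (hypothetical values).
latest_filings = [
    {"filing_id": "F-002", "committee": "Committee A"},
    {"filing_id": "F-003", "committee": "Committee B"},
    {"filing_id": "F-004", "committee": "Committee A"},
]

new_filings = find_new_filings(seen_ids, latest_filings)
for filing in new_filings:
    print("New filing:", filing["filing_id"], filing["committee"])

# Remember these IDs so the next run reports only newer filings.
seen_ids.update(f["filing_id"] for f in new_filings)
```

Run on a schedule, a check like this replaces the chore of reloading a page and scanning it by eye.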
Bridge the gap between reporters and editors
Once committed, reporters still need time and space to practice their skills, and that message doesn’t always reach editors and publishers.
A way to get buy-in from higher-ups, Huffington Post editor Alex Howard said, is to have “measures of success” in order for data work to be sustainable. Measurements like Google rankings, web analytics, ad revenue and monetization can all influence the higher-ups at an outlet, and guide editors on where to allocate resources.
Data journalist Hilary Niles said when she’s pitching a data story to a newsroom, she always lays out how it will benefit their bottom line. She posits that a data visualization adds value to a text story, while previously unused data can drive web traffic.
Niles is a fan of creating and uploading databases, then maintaining them. Lots of data, like the salaries of public figures, local budgets and crime statistics, can be updated every year with new data from the state.
For her story for Vermont Public Radio earlier this year, Niles obtained a bunch of data from the state on its IT spending. She organized it, analyzed it and did some shoe leather reporting, all of it leading to at least six more stories. “Putting it into a structured format allowed for much keener analysis that revealed a virtual mine of public interest stories,” Niles said.
Flor Coelho at La Nacion agreed that you can get many stories from one database, and that can in turn make your reporting, and data reporting in general, more time- and cost-effective.
For instance, La Nacion keeps an updated database of complaints phoned in to the Buenos Aires government. Whenever a new mayoral election comes along, they can ask the database, “what are people most upset about with the way the city is run?”
“So that gives you original content,” Coelho said. “You can ask different questions to the databases.”
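As a rough illustration of asking different questions of one database, the Python sketch below loads a few invented complaint records into an in-memory SQLite database and queries them two ways. The table layout, categories and years are assumptions made for the example and do not describe La Nacion’s actual database.

```python
import sqlite3

# Build a small in-memory database of hypothetical complaint records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE complaints (year INTEGER, category TEXT)")
conn.executemany(
    "INSERT INTO complaints (year, category) VALUES (?, ?)",
    [
        (2014, "potholes"),
        (2014, "street lighting"),
        (2015, "potholes"),
        (2015, "trash collection"),
        (2015, "potholes"),
    ],
)

# Question 1: what do people complain about most?
top_categories = conn.execute(
    """
    SELECT category, COUNT(*) AS n
    FROM complaints
    GROUP BY category
    ORDER BY n DESC, category ASC
    """
).fetchall()

# Question 2: is the volume of complaints rising year to year?
per_year = conn.execute(
    "SELECT year, COUNT(*) FROM complaints GROUP BY year ORDER BY year"
).fetchall()

print(top_categories)
print(per_year)
conn.close()
```

Once the records are in place, each new question costs one more query, not another round of data gathering.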
Establishing databases to be used again and again can make data journalism sustainable, just like establishing some base knowledge of data can be used again and again for different stories and sources.
Starting with free – or common – tools like Excel and Google Sheets can help publishers overcome technical debt, even though it may be necessary to pay for some software down the road. More integral to sustained data work, though, isn’t the money spent on fancy software, but the time given to reporters to practice and learn and experiment.
Challenges of data journalism
Now that we’ve gained some basic data knowledge and are putting it to use, we need to take care not to get entangled in a mistake or misunderstanding. The downsides to data reporting are, for the most part, identical to those of regular reporting: misleading or biased sources, honest error, and so on.
Data, however, does require some bulletproofing not necessary for more traditional projects. This section will address how data presents both new and traditional challenges for journalists.
Ethical concerns for reporting with data
The Associated Press recently announced it would be incorporating data standards into the 2017 AP Stylebook, further signaling data’s formal role in 21st century journalism. Significantly, these standards won’t be limited to language and style, but will include ethical standards.
The ethics standards of journalism – don’t break laws, don’t lie, lessen harm – all still apply to data use, Arizona State University professor Steve Doig said. “The data is just another source,” he said. “It doesn’t absolve you of the same kind of ethical considerations that you’re supposed to be taking.”
Just like stealing mail from a mailbox isn’t ethically acceptable, stealing data from a website isn’t, either.
One of the most important edicts for using data, Doig said, is to not use it out of context. As all the practitioners liked to drive home, you have to think critically about your data. One paper he cited had published some data on infant mortality in their city: the mortality rate was astronomically higher in one low-income neighborhood.
But after publishing it, he said, they found that that neighborhood housed a large teaching hospital, where sick infants were brought from all over the state. You have to do your shoe-leather reporting, Doig said.
Talking to the human sources behind the data is a must, practitioners told me. Doig shared an example from his own experience: his outlet had published data showing that a small number of convicted criminals had received no jail time. The reporters later found out from the court clerks that the criminals had simply been assigned community service instead, which just hadn’t been entered into the database.
All data, particularly the kind used for journalism, has its root in human sources. As a result, it is subject to human error, biases and fallacies.
Data also always requires context.
Alberto Cairo, Knight Chair at the University of Miami, recommends going a step further and talking to experts who are experienced in analyzing the data. “It’s not just a matter of asking a couple of researchers some questions while you write a blog post,” he wrote for Nieman Lab. “It’s also a matter of doing your reporting in collaboration with those researchers, as they’re the ones that know the data really well.”
“This isn’t a particularly groundbreaking notion: The publications who regularly conduct solid data and investigative journalism nowadays, like ProPublica, work this way on a regular basis.”
Earlier, we noted that one of the advantages of data is it lets reporters process very large sets of information. But when it comes to using data as a source, the single most important thing to remember is that the data comes from and involves human beings. It is unwise, often inaccurate, and potentially unethical to simply obtain numbers and publish them. Former New York Times developer Jacob Harris addressed this in an essay called “Connecting with the dots.”
“It’s super easy to put dots on a map at this point,” Harris said in an interview with API. “It’s easy to forget that they’re still people.”
When Cheryl Phillips worked at the Seattle Times, a devastating mudslide killed 43 people. The paper could easily and quickly have mapped the houses affected by the disaster, but that could have come across as callous, she said.
“You have to remember there are individuals in those data points,” she said. The Seattle Times published the map a full week after the disaster, and it included on-the-ground reporting with photos, profiles and stories of the victims.
“We wanted to publish something more fully formed and that helped tell the story of the tragedy in a more sensitive way,” Phillips said.
On the flip side, there are privacy and sensitivity issues in publishing a data set that displays every single individual.
Phillips gave the example of salary databases: while public salaries are public information and therefore liable to be published, journalists should think about whether there’s a journalistic impetus to do so. The Seattle Times, she said, obtained the data but only published newsworthy items like excessive overtime or changes over time.
API’s Jeff Sonderman wrote a piece for Poynter outlining the difference between what journalists can and should publish when it comes to data. A map of gun owners, published by a newspaper in New York, was a case in point.
“Data can be wrong, misleading, harmful, embarrassing or invasive,” Sonderman wrote. “Presenting data as a form of journalism requires that we subject the data to a journalistic process.”
The Guardian’s digital editor Aron Pilhofer scorned what he called the “data porn” that freewheels on the Internet: word clouds, pretty pictures, dots splattered on a map.
“Journalism has to have a nut graf,” Pilhofer said. “A reason for people to care.”
Tips for avoiding disaster
Most of the flaws with data are the same as with human sources, too: error, bias, unreliability, misunderstandings.
The following tips distill the counsel of data veterans on how to avoid being led astray.
Don’t jump to conclusions
Even after you think you’ve found a trend or a connection, continue to be as skeptical as possible. Think of FiveThirtyEight’s story on Nigerian kidnappings: how could they have avoided it?
Once you’ve got the numbers nailed down, step outside the numbers and look at your findings critically. Could there be any confounding variables, or issues that could cause a change that appears to be caused by something else?
Practitioners suggest investigating data sources for biases, hidden variables, privacy or legality issues, or anything else that could possibly lead you to a wrong conclusion. “It’s easy to believe in the pretty spreadsheet,” Guardian data editor Helena Bengtsson said.
Most say you should also confer with an expert or another person who is familiar with the data. As in Steve Doig’s story about criminals apparently getting off scot-free, there may be something in the data you never thought of. If the stakes are high enough – like if there may be legal liability – Pilhofer suggested sharing an entire finding and body of work with an expert or the source itself.
Investigate the data before you report on it
Aron Pilhofer, at the Guardian, insists journalists should know the data “inside and out” before they analyze it. Don’t assume anything: what the column titles mean, what the outliers are, whether there are any parts missing.
Stanford’s Cheryl Phillips recommended figuring out what she called the “shape of the data”: blanks, outliers, patterns and limitations.
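A first look at blanks and outliers can be sketched in a few lines of Python. The salary values below are invented, the function `describe_column` is hypothetical, and the outlier rule used here (more than 1.5 times the interquartile range beyond the quartiles) is one common rule of thumb, not a method attributed to Phillips.

```python
import statistics


def describe_column(values):
    """Summarize blanks, range and possible outliers in one spreadsheet column."""
    filled = [v for v in values if v is not None and str(v).strip() != ""]
    blanks = len(values) - len(filled)
    numbers = [float(v) for v in filled]

    # Quartiles split the sorted values into four equal-sized groups.
    q1, _median, q3 = statistics.quantiles(numbers, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread

    return {
        "blanks": blanks,
        "min": min(numbers),
        "max": max(numbers),
        "outliers": [x for x in numbers if x < low or x > high],
    }


# Hypothetical salary column, read as text the way a spreadsheet export often is.
salaries = ["39800", "41000", "", "42000", "43500", "45200",
            "44100", "40500", "   ", "410000"]

summary = describe_column(salaries)
print(summary)
```

A flagged value such as 410000 may be a typo with an extra zero, or it may be the story. Either way it is a reason to call the agency that keeps the data.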
As always, 18F innovation specialist Jacob Harris said, vet the data like you would any human source: “You (would) think, maybe the source has an agenda, maybe the source is lying, maybe the source doesn’t know what they’re talking about,” he said.
He also suggested keeping a detailed log of what you did with the data – what columns you moved, calculations you performed, and so on.
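One lightweight way to keep such a log is to have the analysis write it for you. The Python sketch below wraps each transformation so that a description and the before-and-after row counts are recorded automatically. The records, names and helper functions are invented for illustration.

```python
def apply_step(rows, description, func, log):
    """Run one transformation and note what it did to the row count."""
    result = func(rows)
    log.append(f"{description}: {len(rows)} rows in, {len(result)} rows out")
    return result


def drop_blank_salaries(rows):
    """Remove rows where the salary field is empty."""
    return [r for r in rows if r["salary"].strip() != ""]


def keep_police(rows):
    """Keep only rows for the Police department."""
    return [r for r in rows if r["department"] == "Police"]


# Hypothetical payroll records.
rows = [
    {"name": "A. Rivera", "department": "Police", "salary": "52000"},
    {"name": "B. Chen", "department": "Parks", "salary": ""},
    {"name": "C. Okafor", "department": "Police", "salary": "61000"},
    {"name": "D. Smith", "department": "Parks", "salary": "38000"},
]

log = []
cleaned = apply_step(rows, "Dropped rows with blank salary", drop_blank_salaries, log)
police = apply_step(cleaned, "Kept only Police department", keep_police, log)

for entry in log:
    print(entry)
```

The log doubles as a methodology note: an editor or an outside expert can read it and see exactly which records were excluded and why.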
Clearly explain the data to your audience
While data may be something of a miracle source of information, it still has its flaws. Be frank with your viewers or readers about incomplete data, differing interpretations, margin of error or anything else that could affect their understanding of your conclusions. Don’t overstate your case.
When it comes down to it, Pilhofer said, you and your data analysis are your own source. “And you better be right. And that can be kind of scary.”
So many of these pitfalls sound obvious, Harris said, and yet, anyone could easily fall into their traps, even experienced data journalists. “I still think that skepticism and paranoia are the best two things you could have on your side,” he said. “I know I could easily fall into similar mistakes myself.”
Don’t republish conclusions formed by someone else
Reporters should be extremely skeptical, Harris said, of surveys or studies that are given to them with the analysis and conclusions already done. Oftentimes, it’s just a startup or PR company trying to get some exposure.
He wrote about a particularly cringeworthy one in Source: a range of publications ran a very dubious study claiming that Democrats watched more porn than Republicans. The source was a porn website.
“Remember that skepticism is your truest friend if you want to call yourself a journalist,” he wrote. “It’s not hard to see the flaws in a flimsy study if you are predisposed to contemplate all the ways in which the data is probably bad rather than tacitly accepting it as good.”
Ideally, he said, journalists would obtain the raw data behind each survey and study and do their own analysis, as well as investigating the source.
Most importantly, do the groundwork reporting
All data journalists stressed that you can’t do a single project with just data and not journalism. Hitting the pavement, making phone calls, talking to sources are always necessary.
Freelancer Hilary Niles said when she did her bulletproofing on her public radio story, the state’s disarray when it came to understanding their own data became part of the story. “The conventional reporting required in order to compile the database also revealed real gaps in accountability,” she said. “I think this also illustrates the importance of coupling data reporting with traditional reporting in order to draw the most complete picture possible.”
As always, the key to reporting with data is that it’s simply reporting. And with data becoming ever more omnipresent, it’s no longer something that can be demarcated as a method separate from more old-fashioned reporting.
Luckily, the proliferation of tools and of data itself makes this kind of reporting easier and easier. API’s Strategy Study on encouraging innovation in the newsroom quoted the legendary 20th Century journalist Hodding Carter: “This is the most exciting time ever to be a journalist – if you are not in search of the past.”
- Using Excel to do Precision Journalism by Steve Doig: Here you can learn the “basic spreadsheet knowledge” we talked about – sorting, filtering and basic calculations – in about one to two hours. If you’re interested, Doig goes on to cover formulas and tables, which are slightly more advanced.
- Google Sheets support from Google: If you don’t have Excel, you can do the same functions with Google’s free online version, Google Sheets, once you make a Google account.
- Numbers in the Newsroom by Sarah Cohen: This book, which can also be purchased as an ebook, addresses all the data scenarios a beginning reporter might face, and walks the reader through the math needed to complete them.
- Data Journalism 101: Self-guided training by Michael Berens: This three-hour webinar walks you through basics of data reporting like filing records requests and finding stories.
- The Art and Science of Data-Driven Journalism by Alex Howard: If you’re interested in the development of data journalism as a profession, this paper does an excellently thorough job of explaining its history, good practices and further learning resources.
- Hacks/Hackers: Hacks/Hackers is a global organization with almost 100 individual chapters around the world. Each chapter does its own events, but their goal is to bridge the gap between journalism and technology.
- NICAR list serv: You can email all the members of the NICAR email list for help or advice on a data project or other tech-related journalism issue. It has almost 2,000 members and is one of the most active journalism list servs out there.
- Civic hacking groups, like Code for America: Code for America is a national group with local chapters that try to liberate data for use by the public. These chapters and other “civic hacking” groups are often knowledgeable and eager to help with a data project or to acquire data for use by journalists.
- Alumni (former staffers): Derek Willis, at ProPublica, suggested getting in touch with former staffers at your newsroom who have moved on to other outlets. They are often happy to help with training or advice for free.
- Local budgets: “Numbers in the Newsroom” (above) has a chapter on finding stories in a municipal budget, which can usually be downloaded from your local government’s website. Cohen suggests checking the budget’s math, or comparing the planned spending to actual spending.
- NICAR data library: For a small fee, NICAR will provide a national data set that has been cleaned and organized by journalists. These cheap, clean, national data sets allow you to localize a story, such as finding local businesses in a database of workplace accidents.
- The U.S. census: The Census is well-liked among data journalists for having some of the most well-organized and well-explained data to come from the government. Navigating the website can be difficult at first, because there are so many layers of data, but it lets you contact experts who are singularly helpful.
- Bureau of Labor Statistics: Like the Census, the BLS has above-average presentation and explanation of its data, and a lot of it. These numbers include unemployment statistics, industry data and other financial information.
- FBI UCR statistics: UCR statistics are the only nationally-collected numbers on crimes like murder and burglary. However, using this data can be precarious: because each police agency reports crimes individually, the resulting data could be incomplete, erroneous or nonconformant with other agencies.
- FEC disclosures: Campaign finance, including political donations and advertising dollars, can always be localized, and is a readymade news story when election time rolls around. The website offers raw data or walkthroughs in the form of presentations and graphics.