After getting your hands on a data set, the hardest part of incorporating data analysis into your beat is getting started — and avoiding beginners’ pitfalls along the way.
From scrambled columns to unintelligible field names, every file you receive comes with challenges for new and experienced data reporters alike.
We talked to Sean Mussenden, chief of the data and graphics bureau at the University of Maryland’s Capital News Service, about 10 mistakes to avoid while you establish a workflow and get comfortable with data sets in your day-to-day reporting.
Mistake No. 1 – Overestimating the meaning of your data
Before even opening the file, data reporters should think carefully about the potential limitations of a data set, and what the data can and cannot tell you about a topic.
“I think the tendency that a lot of people who are beginning to do this sort of work have is [to think] that humans are fallible but numbers are ironclad,” Mussenden said. “The data is only as good as how it’s collected.”
Pay attention to where the information comes from for key database fields and make sure you can trust the source, Mussenden said. For example, when dealing with demographic data, self-reported racial categories are much more accurate than a third-party observation.
If there’s any hesitation about the integrity of a field’s data, either don’t use it or explain its limitations to readers in the text or in a sidebar. Reporters should also think carefully before using data sets with fewer than 100 entries, as any small change could drastically affect findings.
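To see why small data sets are fragile, here is a minimal Python sketch of the arithmetic. The counts are made up for illustration; the point is only that one changed record moves a rate far more in a small file than in a large one.

```python
def rate_shift(matches: int, total: int) -> float:
    """Return how many percentage points a rate moves if one more record matches."""
    before = matches / total * 100
    after = (matches + 1) / total * 100
    return after - before

# Hypothetical counts: the same 20 percent rate in a small and a large data set.
small = rate_shift(10, 50)        # 50 records
large = rate_shift(1000, 5000)    # 5,000 records

print(f"50 records: one changed record moves the rate {small:.2f} points")
print(f"5,000 records: one changed record moves the rate {large:.2f} points")
```

In this invented example, a single record shifts the rate by about 2 percentage points in the 50-row file and by about 0.02 points in the 5,000-row file.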
At all stages of a data project, it’s vital to “real world check” findings against other sources and your cumulative knowledge from working a beat, Mussenden said. If your findings seem shocking, think carefully about whether the data could be misleading and do your best to verify at least part of the trend with a different source.
Mistake No. 2 – Not checking the file type
Knowing both the type and size of a data file will help you decide which programs to use to work with it. Ideally, your data will come as an Excel spreadsheet file (.xlsx) and take up less than 700 megabytes of space in the program. Almost every data set you encounter should work in Excel. If you’re working with a file larger than that, consider using Microsoft Access, or another database program that runs on SQL.
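If you are comfortable with a little scripting, the type-and-size check can be sketched in Python. This is only an illustration: the function names are hypothetical, and the 700-megabyte threshold is simply the rule of thumb quoted above, not a hard limit.

```python
from pathlib import Path

# Rule-of-thumb threshold from the text above; adjust for your own setup.
SPREADSHEET_LIMIT_MB = 700

def suggest_tool(filename: str, size_bytes: int) -> str:
    """Suggest a starting tool from a file's extension and size."""
    ext = Path(filename).suffix.lower()
    size_mb = size_bytes / (1024 * 1024)
    if ext in {".xlsx", ".xls", ".csv"}:
        if size_mb < SPREADSHEET_LIMIT_MB:
            return "spreadsheet program"
        return "database program"
    return "check the format before opening"

def suggest_tool_for_path(path: str) -> str:
    """Run the same check on a file that exists on disk."""
    p = Path(path)
    return suggest_tool(p.name, p.stat().st_size)

print(suggest_tool("CrimeStats.xlsx", 5 * 1024 * 1024))
print(suggest_tool("AllPermits.csv", 900 * 1024 * 1024))
```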
A .csv (comma-separated values) file will also open in Excel and work with all of the program’s features. However, if your data set has more than one sheet or you’ve added a sheet while working, make sure to change the file type to an Excel workbook file before saving or you’ll lose everything but the sheet that you’re currently on. Occasionally other programs, such as MySQL databases, will require you to change a workbook file to a CSV before you upload it.
Some data will download as a plain text file (.txt), although that file type isn’t as common as the others. On its own, a text file is the data without the benefit of organized columns and rows. Usually you can open it in Excel, save it as a workbook or CSV and go from there.
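The same text-to-CSV conversion can be done outside Excel. Here is a minimal Python sketch using the standard `csv` module. It assumes the text file separates its columns with tabs, which is common but not guaranteed, so check your file first. The sample row is invented.

```python
import csv
import io

def text_to_csv(text: str, delimiter: str = "\t") -> str:
    """Convert delimited plain text into CSV text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Trim stray spaces around each value before writing it back out.
        writer.writerow([cell.strip() for cell in row])
    return out.getvalue()

# Made-up sample: one header line and one record, separated by tabs.
sample = "name\tcity\nAcme, Inc.\tBaltimore\n"
print(text_to_csv(sample))
```

Note that the writer puts quotation marks around "Acme, Inc." so the comma inside the name is not mistaken for a column break.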
If you received or found your data as a PDF, you’ll need a converter tool such as Tabula to break the data out into usable rows and columns. Most are free, fast and require only a bit of editing with basic Excel formulas to get the job done.
Mistake No. 3 – Not cleaning the data first
You’ve likely waited so long for the data set that it’s tempting to jump right in and get to work, but the first few hours of most data projects should involve cleaning up the data to make sure it’s usable, Mussenden said. He recommends running spreadsheets through an Internet-based, open-source tool such as OpenRefine to weed out any small discrepancies within fields (e.g. ATT and AT&T in an employer field).
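To give a feel for the kind of discrepancy hunting described above, here is a simplified Python sketch. It is not how OpenRefine itself works internally; it is a stand-in that strips punctuation and capitalization so that spellings like ATT and AT&T land in the same group for a human to review. The employer list is invented.

```python
import re
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Reduce a value to lowercase letters and digits so near-duplicates match."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def find_variants(values):
    """Group distinct spellings that share the same fingerprint."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value.strip())
    # Only report fingerprints with more than one distinct spelling.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}

employers = ["AT&T", "ATT", "at&t ", "Verizon", "Verizon"]  # made-up field values
print(find_variants(employers))
```

A check like this only flags candidates. Deciding whether two spellings really refer to the same employer is still a reporting judgment.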
Other common tasks while cleaning data sets include splitting first and last names into separate fields or splitting full addresses into usable chunks. There’s an Excel formula out there for just about any task.
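As an illustration of the name-splitting task, here is a rough Python sketch. It assumes names arrive either as "First Last" or "Last, First" and treats the final word as the last name. Suffixes such as "Jr." and multi-word surnames will trip it up, so spot-check the results.

```python
def split_name(full_name: str) -> tuple[str, str]:
    """Split a full name into (first, last) under the assumptions above."""
    name = " ".join(full_name.split())  # collapse stray whitespace
    if "," in name:
        # "Last, First" format
        last, first = name.split(",", 1)
        return first.strip(), last.strip()
    if " " not in name:
        # A single word: treat it as a first name with no last name.
        return name, ""
    first, last = name.rsplit(" ", 1)
    return first, last

print(split_name("Sean Mussenden"))
print(split_name("Mussenden, Sean"))
```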
Mistake No. 4 – Not indexing your fields
Most data sets will come organized in a meaningful way, whether it be alphabetically, by date or something else. But the way the data set is organized when you get it is rarely the best way to spot the trends you’re looking for, even if that’s usually not intentional.
The sort feature in Excel is a powerful way to reorder and analyze your data, but can mess it up beyond repair if you don’t “index” your fields before sorting. To avoid the headache, create a new column to the far left or far right of your data and label it “index.” Then, fill in numbers starting at 1 and counting up through the end of the rows. To undo your sort, just sort by this new column from smallest to largest and Excel will put your data back the way it was.
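The index trick works the same way in any tool. Here is a small Python sketch of the idea, using made-up records: number the rows, sort them however you like, then sort by the index to get the original order back.

```python
def add_index(rows):
    """Return copies of the rows with an 'index' field numbering them from 1."""
    return [{"index": i, **row} for i, row in enumerate(rows, start=1)]

# Made-up records in the order they arrived.
rows = [
    {"name": "Carol", "arrests": 7},
    {"name": "Alice", "arrests": 12},
    {"name": "Bob", "arrests": 3},
]

indexed = add_index(rows)

# Sort to look for a trend, largest first.
by_arrests = sorted(indexed, key=lambda r: r["arrests"], reverse=True)

# Undo the sort by ordering on the index column, smallest to largest.
restored = sorted(by_arrests, key=lambda r: r["index"])

print([r["name"] for r in by_arrests])
print([r["name"] for r in restored])
```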
Mistake No. 5 – Assuming you know what the field names mean
Regardless of how simple or complex your data set seems, always request the “data dictionary,” a list of all of the fields in your data set, their names and what type of information is in there, such as dates, numbers or phrases. If there isn’t a data dictionary, call the agency or office it came from and ask to talk about the fields with whoever maintains the database or file.
Even if everything seems obvious, it pays to double check. “I almost always call someone to talk about the fields before I start working,” Mussenden said. “Even if my initial impressions were right, I usually learn something new.”
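One way to prepare for that phone call is to list the fields in your file that the data dictionary does not explain. The Python sketch below does that comparison. The column names and definitions are invented for illustration and do not come from any real agency.

```python
def undocumented_fields(header, data_dictionary):
    """Return the column names that have no entry in the data dictionary."""
    documented = {name.strip().lower() for name in data_dictionary}
    return [name for name in header if name.strip().lower() not in documented]

# Hypothetical header row and data dictionary, invented for illustration.
header = ["INCIDENT_ID", "OFF_CD", "RPT_DT", "DIST"]
data_dictionary = {
    "incident_id": "Unique number for each report",
    "off_cd": "Offense code; see the agency's code table",
    "rpt_dt": "Date the report was filed, not the date of the offense",
}

# Anything printed here is a question to ask whoever maintains the data.
print(undocumented_fields(header, data_dictionary))
```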
Mistake No. 6 – Not saving each major change as a new copy
Data analysis is one of the only types of reporting in which you can lose all of your hard-fought victories if you hit save after making a mistake. To avoid data catastrophes, make a folder for the project and label each subsequent version of the data with a number and the date (for example CrimeStatsOriginal, CrimeStats1_June20).
Never permanently modify the original data set, so you’ll always have a clean version to come back to for reference. It’s also important to always know where the original file is in case the agency disputes what it gave you or claims that you unfairly modified the data before your analysis.
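If you script any part of your workflow, the naming habit above can be automated. The following Python sketch is one possible approach, not a standard tool: `save_version` is a hypothetical helper that copies a file to a numbered, dated name and refuses to overwrite an earlier version. The demo runs in a temporary folder with an invented file.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional

def save_version(current: Path, base_name: str, version: int,
                 today: Optional[date] = None) -> Path:
    """Copy the working file to a numbered, dated name in the same folder."""
    today = today or date.today()
    stamp = f"{today.strftime('%B')}{today.day}"  # for example June20
    target = current.with_name(f"{base_name}{version}_{stamp}{current.suffix}")
    if target.exists():
        # Refuse to overwrite an earlier version.
        raise FileExistsError(f"{target.name} already exists; use the next number")
    shutil.copy2(current, target)
    return target

# Demo in a throwaway folder so nothing real is touched.
with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "CrimeStatsOriginal.csv"
    original.write_text("offense,count\nburglary,12\n")
    copy = save_version(original, "CrimeStats", 1, today=date(2024, 6, 20))
    demo_name = copy.name
    # The original file is only read, never changed.
    original_untouched = original.read_text() == "offense,count\nburglary,12\n"

print(demo_name)
```

Because the helper only ever copies, the original file stays exactly as the agency provided it.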
Mistake No. 7 – Doing too much at a time
Data analysis can be extremely difficult to double check. With that in mind, it’s important to work slowly and take frequent breaks. Mussenden prefers to break for at least 10 minutes per hour, but you’ll find your own pace and rhythm as you go.
Despite all of the programs and expertise out there, or maybe even in another part of your newsroom, there’s no quick fix for realizing you’ve made a mistake but you’re not sure where. At that point, your only choice will be to throw out all of your progress and start again from one of the earlier versions you’ve saved. Instead of letting it get to that point, think fast, work slow and take breaks.
Mistake No. 8 – Not involving your editor
Your editor probably doesn’t sit in on your interviews and most likely doesn’t want to watch you shift columns in Excel for an hour, but they do need to have an active role in any data-driven project you do.
Successful data reporters should take detailed notes on what they do each day, both for their own benefit and in case an editor wants to review their progress. It also helps tremendously to “logic check” your workflow with a trusted colleague to make sure you aren’t missing anything obvious or jumping to conclusions.
When working on complicated projects, students in Mussenden’s bureau often block out ideas and processes on whiteboards and then go over their strategy with editors before starting each major phase of work. Frequent meetings keep everyone involved in the project on the same page as far as deadlines and potential roadblocks go, and frequently lead to interesting ideas from other reporters who don’t directly deal with the data.
Mistake No. 9 – Treating visualizations like an end goal
Most data projects start off with a standard list of calculations such as finding means, medians, ranges and minimums and maximums in the data set. But after crunching the easy numbers, it can be tricky to tell which direction to explore next.
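For reference, that standard list of calculations looks like this in Python, using the built-in `statistics` module. The salary figures are made up; they are chosen to show how a single outlier pulls the mean well away from the median.

```python
import statistics

def summarize(values):
    """Return the standard first-pass statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
    }

salaries = [31000, 34000, 35000, 38000, 250000]  # made-up figures
summary = summarize(salaries)

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the mean is 77,600 while the median is 35,000, a gap that is itself a prompt to look more closely at the largest value.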
Mussenden recommends creating some easy visualizations, like graphs and charts within the Excel program, to help spot patterns that can lead to story ideas or more questions.
“This helps me to see trends in the data that I couldn’t see with numbers alone,” he said. You may end up improving the exploratory visualizations and publishing them with the story as well.
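An exploratory chart does not have to be polished. As a rough stand-in for an Excel chart, this Python sketch draws a plain-text bar chart from a set of counts. The yearly figures are invented, and `text_bar_chart` is a hypothetical helper written for this example.

```python
def text_bar_chart(counts, width=40):
    """Return one line per category with a bar scaled to the largest value."""
    largest = max(counts.values())
    label_width = max(len(label) for label in counts)
    lines = []
    for label, value in counts.items():
        # Scale each bar relative to the largest value; no bar if all are zero.
        bar = "#" * round(value / largest * width) if largest else ""
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return "\n".join(lines)

# Made-up yearly counts.
burglaries = {"2021": 120, "2022": 135, "2023": 240}
print(text_bar_chart(burglaries))
```

Even a crude chart like this makes a jump in the final year easier to spot than a column of numbers does.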
Mistake No. 10 – Not knowing when to ask for help
Even if you’re the only data reporter in your newsroom, you have plenty of resources when you get stuck on a project.
Most programs that you’ll work with have online tutorials or user forums that give good advice (an Excel forum or a MySQL forum is a good place to start). If you’re an IRE/NICAR member, you can also view tip sheets for specific topics or email the listserv with a question.
For more tech-based questions, Mussenden said to not hesitate to reach out to other reporters on Twitter through direct messages. Prominent tech reporters often tweet advice, and may be willing to answer questions by email as well.
For more, check out API’s new Strategy Study: “Diving into Data Journalism: Strategies for getting started or going deeper”