Understanding the Data

Big data, open data, metadata, data analysis, data visualisation, data this, data that, data everything. These may not be phrases instinctively associated with journalism, but in the last few years a revolution in journalism has been taking place. Data journalism is all the rage. Dare I say it, it is on its way to becoming – wait for it – mainstream.

Websites and tools making it easier to find, share, work with, analyse and visualise data online are springing up left, right and centre. (Data journalism expert and author Paul Bradshaw offers a comprehensive series of blog posts on the various aspects of data journalism here – follow the link at the bottom of each post to see the next part).

Websites offering tutorials, online courses and an endless list of resources on data journalism have risen to prominence of late. In the last 18 months alone, three ebooks introducing the world of data journalism have been published and data journalism even has its own hashtag – a true sign that something has gone big – #ddj. Of course, Tim Berners-Lee did see this coming, so who is going to argue with the guy who invented the world wide web? Certainly not me.

Looking beyond the slick new tools that one can use to find, analyse and produce a strikingly beautiful visualisation of a dataset in no time, it is essential to take a step back first.

The most important part of making data journalism work, and what has been one of the main pitfalls within traditional journalism, is understanding the data itself. To be in a position to analyse data, journalists need to know what each figure represents and how it was calculated. The way data has been collected and calculated in its rawest form will impact on its reliability and what conclusions can be drawn from it.

Is the average the mean, the median or the mode? The three are often used interchangeably, and can be chosen to suit the argument that accompanies a report. Are there any differences in how the data we are comparing against was collected? Did it come from a different survey that used different sampling techniques and thus risks skewing our comparison?
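To make that concrete, here is a minimal sketch in Python of how the three kinds of ‘average’ can describe the same set of numbers very differently (the salary figures are invented purely for illustration):

```python
# A minimal sketch using Python's standard library. The salaries below are
# made-up numbers, chosen only to show how one outlier separates the averages.
from statistics import mean, median, mode

salaries = [18_000, 20_000, 20_000, 22_000, 25_000, 30_000, 250_000]

print(f"mean:   {mean(salaries):,.0f}")    # 55,000 - dragged up by the single outlier
print(f"median: {median(salaries):,.0f}")  # 22,000 - the middle, 'typical' value
print(f"mode:   {mode(salaries):,.0f}")    # 20,000 - the most common value
```

A single outlier drags the mean well above what most of the entries look like, while the median and mode stay put – which is exactly why ‘the average’, reported on its own, tells you very little.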

Last week I read Darrell Huff’s ‘How to Lie with Statistics’, a short yet timeless classic that examines all the possible misuses of data, whether intended or unintended. At the end, Huff sets out the key criteria journalists need to take into account when working with data – his “five simple questions”. They may have been written half a century ago, but they still hold true for modern data journalists.

1) Who says so?

2) How does he know?

3) What’s missing?

4) Did somebody change the subject?

5) Does it make sense?

Not all of these questions will apply to every dataset that newsrooms or journalists have access to, but questioning the data and its ‘background’, so to speak, is paramount.

Stopping to ask why a particular anomaly is present in the data, or whether the data actually makes sense, is probably not done enough within the media – certainly in the UK, and probably further afield.

To use an example from last month, data obtained by the Huffington Post UK through a Freedom of Information request showed nearly 300,000 “attempts to access websites categorised as pornography” made from computers within the UK Parliament in the past year. Many newspapers went with the statistic of 820 attempts per day.

The data was broken down as follows:

Attempts to access websites classed as pornography on the Parliamentary Network

  • May 2012: 2,141
  • June 2012: 2,261
  • July 2012: 6,024
  • August 2012: 26,952
  • September 2012: 15,804
  • October 2012: 3,391
  • November 2012: 114,844
  • December 2012: 6,918
  • January 2013: 18,494
  • February 2013: 15
  • March 2013: 22,470
  • April 2013: 55,552
  • May 2013: 18,346
  • June 2013: 397
  • July 2013: 15,707

The data shows a massive spike in attempts during November 2012, a figure that certainly should have come under greater scrutiny. Most reports do mention the discrepancy but do not go as far as attempting to explain it. It turns out that the most logical explanation, which appeared later on Twitter and was set out in a blog post by FOI and Data Protection expert Jon Baines, was that:

“the November 2012 spike coincided with intense political and media interest in the topic of sexual offences, following as the scandal involving Jimmy Savile broke. This is very plausible, and suggests that, far from users of parliamentary systems shirking their responsibilities by browsing for smut, they were actually trying – apparently unsuccessfully, and probably with no small frustration – to find out more about a serious and current news item. But that makes for a dull story.”

While this does not seem to have been officially verified, it certainly seems the most rational reason for the figures – but as Jon Baines suggests, this is much less of a story.

Also, given that in the whole of February there appear to have been only 15 attempts to access “websites classed as pornography”, an average of more than 800 attempts per day seems rather misleading – for February it works out at around one attempt every two days.
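As a rough back-of-the-envelope check – a sketch only, using nothing but the monthly figures quoted above and treating February 2013 as a 28-day month – the numbers bear this out:

```python
# A rough sanity check on the monthly figures listed above, showing how the
# November 2012 spike dominates any headline "per day" average.
from statistics import mean

monthly_attempts = {
    "May 2012": 2_141, "Jun 2012": 2_261, "Jul 2012": 6_024,
    "Aug 2012": 26_952, "Sep 2012": 15_804, "Oct 2012": 3_391,
    "Nov 2012": 114_844, "Dec 2012": 6_918, "Jan 2013": 18_494,
    "Feb 2013": 15, "Mar 2013": 22_470, "Apr 2013": 55_552,
    "May 2013": 18_346, "Jun 2013": 397, "Jul 2013": 15_707,
}

values = list(monthly_attempts.values())
total = sum(values)

print(f"total:               {total:,}")                                      # 290,970 across these 15 months
print(f"mean per month:      {mean(values):,.0f}")                            # ~19,400
print(f"November 2012 share: {monthly_attempts['Nov 2012'] / total:.0%}")     # ~39% of all recorded attempts
print(f"February 2013 rate:  {monthly_attempts['Feb 2013'] / 28:.2f} per day")  # ~0.54, roughly one every two days
```

Close to two fifths of all the recorded attempts fall in November 2012 alone, so any flat ‘per day’ figure is really describing the spike, not typical behaviour.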

No doubt the scope for good, data-driven journalism is huge: big and open data are more readily available, while the new tools to visualise them quickly and effectively are a real asset for journalists.

However, we should not forget or disregard the first and most important step in the data journalist’s process.

Understanding the data.
