Getting data from a PDF

Tabula tutorial

Tabula, one of the five tools I examine in my article

We live in a world where PDF is king. Perhaps we could even go as far as calling it the tyranny of the PDF.

Developed in the early 90s as a way to share documents among computers running incompatible software, the Portable Document Format (PDF) offers a consistent appearance on all devices, ensuring content control and making it difficult for others to copy the information contained within.

However, for a data journalist whose job depends on being able to extract bulk data for analysis and visualisation, the PDF as the filetype of choice does not tend to go down well.

In a field of journalism where the spreadsheet needs to rule the roost, I look at a few ways of turning data enclosed within PDFs into spreadsheets (Excel XLS or CSV), primed for data analysis.

What’s always important to remember in trying to get data out of PDF files is that there is no single catch-all way that works for every occasion. Sometimes it’s just a matter of trying each one until you find the one that works.

For the rest of the article and the tutorials, published on the Interhacktives website, please click here. 


How to verify a photo – Google image search

When it comes to photos and the internet, the mantra is “if it looks too good to be true, then it probably isn’t true”.

Surprisingly, that’s something many people, including journalists, often ignore, and as a result they tend not to apply the care that online photo verification requires.

From disaster movie wallpapers masquerading as Hurricane Sandy pictures to an image of a radioactively mutated 50-metre giant squid, internet photo hoaxes are regular occurrences, with journalists often joining gullible social media users in falling into the trap.

There are a number of ways to test whether a photo is genuine, and while some cleverly photoshopped images may escape scrutiny, it is always worth checking up on an image you have doubts about.

This is where Google image search comes in handy.

Google image search

While it is not completely foolproof, running a picture through Google’s image archive should be the first stage of the process, and it will often allow you to dismiss a photo as being fake in a matter of seconds.

A Google image search can be done in one of two ways: either by dragging the photo to a tab or window in your browser open on the Google image search page, or by using a Chrome extension.

Google image search – Method 1

Go to the Google image search page and click the image button (PHOTO 1). Then simply drag the photo into the image search box, as in the video.

Google image search – Method 2

Install this Chrome extension and right-click on an image found online to run it through Google image search.

Google image search extension

The results it gives will show the extensive history of the image, including when and where this image has appeared previously and what other, visually similar images are available. This can offer an indication of whether the image has been tampered with.

In our example, the photograph of the shark supposedly swimming in the streets of New Jersey can easily be identified as fake, since articles linking to the image were available a long time before Hurricane Sandy took place.

This may seem like an easy example to pick, and it probably is. However, a lot of the fake images that start trending at the time of, for example, an extreme weather event have appeared online in the past, either as an internet hoax or images of an actual event that may have taken place at a completely different place and time. This just shows how easily they can be debunked with a simple Google image search.

Here’s more on that from this Storyful blog post, taking you through how easy it was to debunk fake photographs that appeared during the lead-up to Hurricane Sandy.

Other useful resources for photo verification:

Tin Eye reverse image search

Poynter – Three ways to spot if an image has been manipulated

This article was initially published here

The ultimate guide to liveblogging

Example of a liveblog

BBC liveblog of the Boston Marathon bombings

If there is one thing that covering breaking news online has brought the world, it’s the liveblog. Liveblogging has become the default format for engaging audiences in ongoing news stories, allowing websites to compete with rolling news TV coverage.

According to a City Journalism School study conducted last year, liveblogs on the Guardian website were receiving 300 per cent more views and 233 per cent more visitors than conventional online news articles on the same subject. They also outperformed online picture galleries, getting 219 per cent more visitors.

Whether it’s an extreme weather event, a breaking financial story, a US election or events like the Oscars and the Grammys, a liveblog is the go-to format to cover it.

I recently had the opportunity to be one of the official livebloggers at the digital journalism conference news:rewired, organised by

Here are a few tips and lessons I learnt about liveblogging.

Don’t be fooled by the name – prepare prepare prepare

While the word liveblog may suggest that all the work takes place in real time, one of the most important things in a great liveblog is the effort you put in beforehand. Researching the story, finding the individuals involved and keeping up to date with any latest developments from them, setting up your social media lists and preparing links to give context are all essential to a high quality liveblog. Making sure you actually do this before an event gets started, or as soon as a story breaks, will also set you up for a much easier time when things get going.

The day before news:rewired, I prepared notes on each talk that I would be liveblogging, including the Twitter handles of those speaking, as well as some background on the organisations and their work, anticipating that some of the articles may come up during the discussion. Having the link ready and primed to post instead of having to look for it while the discussion was going on was a real help.

preparing for a liveblog

Preparation is key to a good liveblog

Some liveblog platforms even give you the option of creating a ‘raw’ feed where you can prepare posts and pull them into the live feed when you need to, making it even simpler when the time comes. ScribbleLive, the platform we used, offered this option and it made life so much easier.

Don’t try to do everything on your own – Use the crowd

using the crowd in a liveblog

Use the crowd to help you liveblog an event

You can’t do everything alone – and nor should you try to. Especially when reporting on a short live event like a conference or a football match, there is real value in gathering and presenting a variety of opinions. There are always other people there who can offer a different – and at times more specialist – perspective and really enhance your liveblog while ensuring you don’t miss anything. Remembering that it’s not necessarily a competition and linking to others will make your liveblog much more relevant. During the news:rewired conference, Sarah Marshall, an experienced former reporter and currently social media editor at the Wall Street Journal (WSJ), was covering the event through an open Google Doc. Linking to it regularly and checking it for anything I may have missed was a real asset.

Vary your style

It took me a couple of attempts to get a good variety in the length of my posts. In the first liveblog I did, I tried too hard to get everything on a particular issue into a single post before pushing it out, while in the second, as a reaction, I was perhaps posting too frequently, interrupting the flow and providing contextless posts.

The ideal is to post short, snappy, interesting facts as they come in, as you would with breaking news when covering a news liveblog. Varying it with longer texts providing analysis or background whenever possible is vital to keep the audience engaged, while links to additional information are also welcome. Liveblogs of cricket matches do this very well, providing over-by-over updates as well as a lot of analysis, background, audience engagement and plenty of randomness, which is always fun.

Make it visual

liveblog visual

Keeping a liveblog visual is key

Long and continuous blocks of text are hard to keep up with, while breaking up the text with headings, subheadings, bullet points and lists – as well as adding images and videos – is crucial in providing some colour and scannability to whatever you are covering. So there are a number of reasons to focus on keeping your liveblog visual. Whether it’s breaking news, where you want to give a feeling of the situation on the ground, or an awards ceremony, where you want to give your audience an insight into what is happening there, keep the photos and videos coming. Keep tabs on Instagram hashtags (using third-party services like Gramfeed and Statigram), Twitter search filtered just by pictures and video, and Vines, and use your own phone to take a quick snap if that’s an option – anything to keep the content interesting and your audience engaged.

Think of the future generations

Well, maybe not quite future generations, but it’s worth realising that a liveblog can often provide a reference long after an event. Turning a liveblog into an ‘as it happened’ post allows people an in-depth catch-up on a news story or an event. Taking this into account, it’s important to add a summary, key points and links at the top after the event. Collecting links and useful bookmarks could be extremely helpful for readers who want to read up on particular topics later, when they have more time on their hands.

Here are my bookmarks from news:rewired, collected with pinboard and worth taking a look at if you’re interested in what was said during the conference.

Each talk also has its own tag, so for the BuzzFeed keynote speech on making shareable content, the tag is shareablecontent, for the data journalism on a budget talk the tag is ddj, for short form video it is imaginatively tagged shortformvideo and for Instagram – wait for it – it’s instagram.

What I learnt from my first hackathon

journalists and developers working together

The Interhacktives team busy working on their idea at Build the News

Photo credit: MattieTK/Flickr 

Well, officially it wasn’t a hackathon but, as a journalist and not a developer, Build the News – a two-day event for student journalists and web developers to team up and compete in the production of a digital journalism project – was as close to one as I’d got.

Known as events where computer programmers and others involved in software development collaborate to produce projects in a short and intense period of time, hackathons (or hackdays) where journalists and coders work in tandem offer great potential for the future of digital newsrooms.

Since September I’ve become a big fan of Hacks/Hackers meetups here in London, but there is a fundamental difference between listening to others speak about something they’ve done or built (however helpful and eye-opening it can be) and actually attempting to build it yourself with them.

Build the News, which was organised by The Times digital development team this weekend, was a great experience, facilitating some brilliant innovative ideas and projects, as well as plenty of fun.

I thought I’d write a few brief observations from the weekend while it was still fresh in my mind, so I apologise beforehand in case this post turns out a little incoherent; it’s been a long week. These are things I’ve learnt from Build the News to apply to future journalism work and projects, not necessarily lessons for future hackathons.

1) Make friends with developers

By far the most important lesson. Talk to them, sit with them, listen to them talk about how they’ve done something – even though you may spend the next few hours trying to understand exactly what they said. In order to understand what is and is not possible to do online, this is step one. Learn about APIs and what they do, ask what GitHub is and how it works (still trying to figure both of those out myself), try to find out as much as you can about the different programming languages and generally make a real effort to keep up instead of switching off, as you may naturally be inclined to do. I am still struggling with this, but the only real option is to learn through interaction. This is certainly no easy process, but it is one that true digital journalists cannot afford to ignore.

2) Open source your work as a journalist

Journalists are traditionally fiercely protective of the process behind their work and, given the nature of the industry, often with good reason. The developer community generally takes a very different approach to this, and it is certainly one we could learn from. Publicly opening up the thinking behind your work, as well as the process involved in the development phase of your idea, generally gives you a different perspective on what you are doing, while generating interest and the potential for feedback. Halfway through the event our group, which consisted of a few students from City University’s Interactive Journalism MA, belatedly set up a Tumblr blog to detail our experience, thoughts, ideas and the overall progression of our project. This turned out to be extremely helpful to us in terms of understanding our own idea better by communicating it, as well as encouraging interaction. Obviously opening up your work process in journalism is not always applicable, but whenever possible, the more you do it, the more you and others can potentially learn and improve.

screenshot from interhacktives build the news tumblr

3) Be nosy

People – and especially tech-savvy journalists and programmers – all use technology in different ways. Staring at someone’s screen while they are working may seem bad etiquette, but it can offer an insight into a new world. Whether it’s a programme, an app, a browser extension or even something as minor as a keyboard shortcut, you can learn a new way of doing something online, improve how you use a particular programme or save time on the mundane tasks you run on your device. Look at how others work and don’t be afraid to ask what they are doing and how they did it when they talk about, or show you, their work. I feel the need to clarify – always within reason.

I’m sure there are many more points I could make, but I’ll stop here and hope to write an updated post once I’ve been to a few more similar events.

Interview with Sarah Hartley, Editor of Contributoria

screenshot of Contributoria

Contributoria: A new model for journalism?

What do Edward Snowden, Corn Exchanges, erotica for women and fish and chips all have in common?

They are just some of the topics that feature in the first issue of Contributoria, the new crowdfunded collaborative journalism platform launched last month.

Backed by the Guardian Media Group (GMG), Contributoria allows its community of journalists and interested readers to decide what articles they would like to see written and support each pitch accordingly – with the community involved in all stages of an article’s development.

The underlying aim is to enable the creation of transparent, high-quality collaborative journalism that might otherwise not have been produced.

I spoke to editor Sarah Hartley about how Contributoria is making collaborative, transparent journalism work.

You can read the interview and more about Contributoria here

Finding meaning in the metrics

In the week or so when the Internet was going through its ‘best *insert literally any word here* of 2013’ phase, I jumped on the bandwagon and wrote a post about the best-received stories of the year gone by on the interhacktives website.

*Interhacktives is the website of the students on the Interactive Journalism MA at City University London.

Essentially the post was born out of a growing desperation to keep the site updated during a slow couple of weeks when work experience and the badly needed Christmas holidays put writing for interhacktives on the back burner for a while.

At the same time, this quick analysis of the type of articles that do well for the website offered valuable insight for refining our content strategy and providing added focus for the year ahead.

After Adam Tinworth, our lecturer for the Social Media and Community Engagement module, pointed to what we could learn from analysing this type of data, I thought it was worth delving into the analytics a little further to find some more meaning in the metrics.

The most-read post published on interhacktives – by some distance – was an article on the top ten tools for data journalism. Not only did it receive more pageviews, but, equally importantly, readers spent around two and a half times longer on that post (7 minutes and 25 seconds) than the average for all pages on the site (2 minutes and 56 seconds).
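As a quick check on that comparison, here is a minimal Python sketch that converts the two time-on-page figures quoted above into seconds and works out the ratio. The helper name `to_seconds` and the variable names are my own, chosen for illustration; only the two durations come from the analytics.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes-and-seconds reading into total seconds."""
    return minutes * 60 + seconds


# Figures quoted in the paragraph above (variable names are illustrative).
top_post = to_seconds(7, 25)      # time on the top ten tools post
site_average = to_seconds(2, 56)  # average time on page across the site

# How many times longer readers stayed on the top post.
ratio = top_post / site_average
print(f"{top_post}s vs {site_average}s: {ratio:.2f} times the site average")
```

The result comes out at just over 2.5, which is in line with the "around two and a half times" figure.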

This is a great example of a type of article that can live through time and keep getting pageviews months after initial publication.

As data driven journalism becomes more popular in the industry and as upcoming journalists join media professionals in trying to stay up to date with the skillset needed to do some basic-level data analysis and visualisation, the article’s prominence is no surprise.

pageviews - trends over time

Google Analytics showing pageviews over time for the Top 10 Data Journalism Tools

A look at its popularity over the months shows spikes at different times. It is also interesting that new tweets about it from other sources kept coming long after its initial publication.

screenshot of a tweet about the top ten journalism article

Top Ten Data Journalism Tools

Another popular story, the third most viewed of 2013, was in fact a post written in April 2012 about making a website compliant with EU cookie law. Looking at the source of its traffic over the 13-month period, more than two out of three views came from Google, as this was a topic that bloggers and others were presumably still searching for. It is in this context that this article’s enduring popularity makes sense.

traffic source data for interhacktives article

Traffic source data for the third most popular post of 2013 on interhacktives

Two more recent how-to guides, on making a choropleth map and on using Raw to make advanced data visualisations, have been regularly generating traffic over the last couple of months, often featuring on the trending content widget on the homepage.

Rank Article
1 Top Ten Tools for Data Journalism
2 Who did it best: Data coverage of the 2013 local elections
3 Your website, now illegal: How to comply with the EU cookie law
4 How to Make a choropleth map with google fusion
5 A beginners pre-guide to data journalism
6 Friday Interview: Anne Marie Tomchak, presenter of BBC Trending
7 How to make an alluvial diagram
8 Making data accessible: Interview with Nick Scott from
9 Pivot Tables are your best friend
10 Interview with Andrew Hill of CartoDB

Most interviews, for example, may receive attention at the time, especially via social media, but are unlikely to keep generating traffic in bulk over time.

Google analytics graph of an interview article on interhacktives

Trend over time graph for an interview article

In fact, analysis of the top 10 articles from January 1st 2013 to February 1st 2014 in terms of pageviews shows that the majority are timeless, durable pieces of content of use to readers beyond their publication date. They are what you would describe as ‘stock content’.

Fellow coursemate Sophie Murray Morris offers an excellent analysis of stock and flow content, a concept originating from economics.

Here’s what she says:

“…stock content is durable. Examples of stock content include podcasts, videos, guides and research work.

Flow content is the stream of daily and sub-daily updates. For instance, news articles, surveys, live blogs and social media updates.

While flow content helps to keep newspapers or brands in the public eye, stock content drives steady and continuous traffic to websites over a long period of time. This is why it is really important not to remove good-quality archived content from a website. Good quality archived content can still drive views in if people are researching the topic, for instance.”

Articles explaining the difference between Sunni and Shia Muslims are a great example of stock content that will likely drive views long after the publication date.

screenshot of google search difference between sunni and shia

Difference between sunni and shia google search

As the question will undoubtedly crop up regularly over time, any explainers on the issue will regularly attract traffic. The first two search results are from the BBC, from 2009 and 2011, while the Economist’s May 2013 guide comes in third. Both websites will certainly get regular hits from these one-off explainers based on people’s searches.

Perhaps in the media industry content such as explainers, how-to guides and reviews of apps and tools is often perceived to be of secondary importance, there to accompany a major development or news piece. That may be so, but given the nature of the internet, such pieces can live much longer online than the news article and are a core part of the journalistic task of informing the population.

At interhacktives we have perhaps been guilty of not focusing enough on this and on the potential the website offers to create long-lasting stock content based on the skills we are regularly taught and experiment with as part of the course.

Over the next few months, that is something we should perhaps turn our attention to a little more, leaving a lasting legacy on the interhacktives website and hopefully ensuring traffic for the site many months after our involvement with it ends.

How I fell in love with Reddit: the power of the app

For its hardcore fans – and there are many of them about – it’s Reddit’s no-frills, simple, old-school design that’s a major part of its appeal.

For me however, it was just the reason I could never get into it.

I get it: Reddit is a website delivering an almost entirely content-based experience and focusing on functionality over aesthetics.

As one redditor replied when someone dared to ask why reddit was so popular given how ugly it is:

I think people like the massive set of sub’s and the intelligent conversation. For that you don’t have to be picky about formatting. It may not be the prettiest, but it works fine for many people.

If you want pretty sidebars go to Facebook, you’ll just give up the IQ level and actual answers to questions.

Reddit screenshot

How Reddit looks on a browser

Honestly, I do get it. But however much I tried to force myself to use Reddit and become part of what is undoubtedly a fascinating community, we just didn’t click. Even after the major controversy and fierce criticism of Reddit in the aftermath of the Boston bombings, it remains a serious social news hub, at times an excellent platform for debate and a treasure trove of interesting and rather random material, very useful from the perspective of a journalist and someone interested in community engagement.

Despite all that, I just could not get past its ugliness. Its browser version is clunky, archaic and uninspiring.

The turning point in my relationship with Reddit was the moment I started experimenting with iPad apps for Reddit.

There is no official client and the third-party apps available are far from perfect. However, they are pretty.

Reddit for iPad

iAlien for Reddit, the free app I’ve been using, is available on both iPhone and iPad. While according to reviews from more experienced Redditors the comment editing and deleting functions within the app are seriously lacking, its smooth, slick and very attractive user interface has finally given me a window into the weird and wonderful world of Reddit that I want to browse through for hours on end.

I’ve learnt that putting an inflatable blood pressure meter around your neck and pumping it is probably not a great idea, laughed at this very flat Russian dwarf hamster sitting on a couch and found this truly heartwarming and powerful image and accompanying story of a 65-year-old man with Down’s syndrome.

I am slowly understanding this fascination with “the front page of the internet”, albeit in a way that most traditional redditors would consider sacrilegious.

At the moment, I’m everything the true Reddit fan despises: a true lurker. However, this is just the start. As I get hooked on Reddit, I’m sure I will begin contributing as well; it’s just a matter of time.

Diversifying user experience on other platforms

Other than how fickle I am, my Reddit experience indicates the power that mobile and tablet apps for social media have in diversifying the user experience, attracting different audiences with different ideas about, and needs in, using a platform.

Offering an identical experience on different platforms may work for some sites, applications or social networks, but reaching out to a completely different type of user may also be worth looking into.

Another example of this is Flow, an iPad compatible Instagram app. The application, which is specifically crafted for the iPad’s screen and describes itself as “the missing iPad app for Instagram”, was made by digital design studio Codegent because they “couldn’t wait any longer for an iPad compatible Instagram app so we built our own”.

flow app from iPad screenshot

Flow app for iPad

Thinking beyond the use and the platforms that networks and apps were originally made for could be an opportunity worth looking into at the very least, and at best a huge growth area for reaching a different audience.