Getting data from a PDF

Tabula tutorial

Tabula, one of the five tools I examine in my article

We live in a world where PDF is king. Perhaps we could even go as far as calling it the tyranny of the PDF.

Developed in the early 1990s as a way to share documents among computers running incompatible software, the Portable Document Format (PDF) offers a consistent appearance on all devices. That gives authors control over their content, but it also makes it difficult for others to copy the information contained within.

However, for a data journalist whose job depends on being able to extract bulk data for analysis and visualisation, the PDF as the file type of choice does not tend to go down well.

In a field of journalism where the spreadsheet needs to rule the roost, I look at a few ways of turning data enclosed within PDFs into spreadsheets (Excel XLS or CSV), primed for data analysis.

What’s always important to remember when trying to get data out of PDF files is that there is no single catch-all method that works on every occasion. Sometimes it’s just a matter of trying each tool until you find the one that works.
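
For readers who script their extraction, the "try each one until one works" approach might be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the extractor functions are hypothetical stand-ins, not the API of Tabula or any other real tool, and each is assumed to take a PDF path and return a list of rows. The sketch tries them in order, keeps the first non-empty result and writes it out as CSV.

```python
import csv
from typing import Callable, List, Optional, Sequence, Tuple

Table = List[List[str]]
# Hypothetical signature: an extractor takes a PDF path and returns rows.
Extractor = Callable[[str], Table]


def first_working_extraction(
    pdf_path: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Optional[Tuple[str, Table]]:
    """Try each (name, extractor) pair in order; return the first non-empty result."""
    for name, extract in extractors:
        try:
            rows = extract(pdf_path)
        except Exception:
            # This tool could not handle the file, so move on to the next one.
            continue
        if rows:
            return name, rows
    return None  # No tool produced any rows.


def write_csv(rows: Table, csv_path: str) -> None:
    """Write the extracted rows to a CSV file ready for a spreadsheet."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


# Stand-in extractors for illustration only; real ones would wrap actual tools.
def tool_that_fails(pdf_path: str) -> Table:
    raise ValueError("could not parse " + pdf_path)


def tool_that_works(pdf_path: str) -> Table:
    return [["name", "amount"], ["Alice", "10"]]


if __name__ == "__main__":
    result = first_working_extraction(
        "report.pdf",  # hypothetical file name
        [("failing tool", tool_that_fails), ("working tool", tool_that_works)],
    )
    if result is not None:
        tool_name, table = result
        write_csv(table, "report.csv")
        print("Extracted with", tool_name)
```

Getting rows back without an error does not mean the table is right, so the resulting CSV should still be checked against the original PDF by eye.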

For the rest of the article and the tutorials, published on the Interhacktives website, please click here.