Extract Tables from PDFs with Tabula
Sometimes the dataset you are interested in is available only as a PDF document. Don’t despair: you can likely use Tabula to extract tables and save them as CSV files.
Tabula is a free tool that runs on Java, and is available for Mac, Windows, and Linux computers. It runs on your local machine and does not send your data to the cloud, so you can also use it for sensitive documents.
Note: Keep in mind that PDFs generally come in two flavors, image-based and text-based. You know your PDF is text-based if you can use your cursor to select and copy-paste text. These are great for Tabula. Image-based PDFs are those that were created by scanning documents. Before they can be processed with Tabula, you will need to use optical character recognition (OCR) software, such as Adobe Acrobat, to create a text-based PDF.
Set Up Tabula
Download the newest version of Tabula. You can use the download buttons on the left-hand side of the page, or scroll down to the Download & Install Tabula section to download a copy for your platform.
Unlike most other programs, Tabula does not require installation. Just unzip the downloaded archive and double-click the icon. If prompted with a security message (such as “Tabula is an app downloaded from the internet. Are you sure you want to open it?”), follow the instructions to proceed (on a Mac, click Open; you might have to go to System Preferences > Security & Privacy and resolve the issue there).
Your default system browser should open, as shown in Figure 4.5.
The URL will be something like http://127.0.0.1:8080/, meaning Tabula is running on your local machine. 127.0.0.1, also known as localhost, is the address that always refers to your own machine, and 8080 is the port number. It’s okay if you see a different port, most likely because 8080 was taken by some other program running on your computer. If for any reason you decide to use a different browser, just copy and paste the URL into it.
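To make the parts of that address concrete, here is a minimal Python sketch that splits the URL into its host and port and checks that the host points back to your own computer. It is not part of Tabula and uses only Python's standard library. The function name describe_local_url is our own, chosen for illustration.

```python
import ipaddress
from urllib.parse import urlsplit


def describe_local_url(url):
    """Return (host, port, is_local) for a URL such as Tabula's local address."""
    parts = urlsplit(url)
    host = parts.hostname  # e.g. "127.0.0.1" or "localhost"
    port = parts.port      # e.g. 8080, or None if the URL has no explicit port
    try:
        # 127.0.0.1 (and the rest of 127.0.0.0/8) is the loopback range.
        is_local = ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not a numeric IP address; treat the name "localhost" as local.
        is_local = host == "localhost"
    return host, port, is_local


print(describe_local_url("http://127.0.0.1:8080/"))
# prints ('127.0.0.1', 8080, True)
```

If Tabula opens on a different port, only the second value changes; the host still refers to your own machine.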
Load a PDF and Autodetect Tables
Since the beginning of the Covid-19 pandemic, the Department of Public Health in Connecticut has been issuing daily PDFs with case and death counts by town. For this demonstration, we will use one of those PDFs from May 31, 2020.
- Select the PDF you want to extract data from by clicking the blue Browse… button.
- Click Import. Tabula will begin analyzing the file.
- As soon as Tabula finishes loading the PDF, you will see a PDF viewer with individual pages. The interface is fairly clean, with only four buttons in the header.
- The easiest first step is to let Tabula autodetect tables. Click the Autodetect Tables button in the header. You will see that each table is highlighted in red, as shown in Figure 4.6.
Manually Adjust Selections and Export
- Click the green Preview & Export Extracted Data button to see how Tabula thinks the data should be exported.
- If the preview tables don’t contain the data you want, try switching between the Stream and Lattice extraction methods in the left-hand sidebar.
- If the tables still don’t look right, or you wish to remove some tables that Tabula auto-detected, click the Revise selection button. That will bring you back to the PDF viewer.
- Now you can Clear All Selections and manually select the tables of interest (or parts of tables) by clicking and dragging over them.
- If you want to “copy” a selection to some or all pages, use the Repeat this Selection dropdown, which appears in the lower-right corner of your selection, to propagate changes. This is extremely useful if your PDF consists of many similarly formatted pages.
- Once you are happy with the result, you can export it. If you have only one table, we recommend using CSV as the export format. If you have more than one table, consider switching the export format to zip of CSVs. This way each table will be saved as an individual file, rather than all tables being combined inside one CSV file.
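To illustrate what the zip of CSVs format gives you, here is a minimal Python sketch that reads every CSV file in such an archive into its own list of rows, using only the standard library. It is not part of Tabula. The function name read_zip_of_csvs, the file names table-0.csv and table-1.csv, and the sample values are all made up for illustration; a real export will use its own file names and contain your own data.

```python
import csv
import io
import zipfile


def read_zip_of_csvs(zip_source):
    """Return {file name: list of rows} for each CSV in a zip archive.

    zip_source may be a path to a .zip file or a binary file-like object.
    """
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                # utf-8-sig also strips a byte-order mark if one is present.
                text = archive.read(name).decode("utf-8-sig")
                tables[name] = list(csv.reader(io.StringIO(text, newline="")))
    return tables


# Build a tiny in-memory archive with made-up data to stand in for an export.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("table-0.csv", "Town,Cases\nTown A,5\n")
    archive.writestr("table-1.csv", "Town,Deaths\nTown A,0\n")
buffer.seek(0)

tables = read_zip_of_csvs(buffer)
for name, rows in tables.items():
    print(name, "header:", rows[0], "data rows:", len(rows) - 1)
```

Because each table stays in its own file, the header row of one table never gets mixed into the data rows of another, which is the main reason to prefer this format when a PDF contains several tables.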
Once you have exported your data, you can find it in the Downloads folder on your computer (or wherever you chose to save it). It is ready to be opened in Google Sheets or Microsoft Excel, analyzed, and visualized! In the following section, we will look at how to clean up messy datasets with OpenRefine.