Extract Tables from PDFs with Tabula
It sometimes happens that the dataset you are interested in is only available as a PDF document. Don’t despair: you can likely use Tabula to extract tables and save them as CSV files. Keep in mind that PDFs generally come in two flavors: text-based and image-based. If you can use your cursor to select and copy-paste text in your PDF, then it’s text-based, which is great because you can process it with Tabula. But if you cannot select and copy-paste items inside a PDF, then it’s image-based, meaning it was probably created as a scanned version of the original document. In that case, you first need to use optical character recognition (OCR) software, such as Adobe Acrobat Pro or another OCR tool, to convert the image-based PDF into a text-based PDF. Furthermore, Tabula can only extract data from tables, not charts or other types of visualizations.
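If you have many PDFs to sort through, the select-and-copy test can be roughly approximated in code. The sketch below is only a heuristic and is not part of Tabula: it assumes the third-party pypdf library is installed, and the function names and the 20-character threshold are our own illustrative choices.

```python
def looks_text_based(page_texts, min_chars=20):
    """Heuristic: True if any page yielded at least `min_chars` of text.

    `min_chars` is an arbitrary illustrative threshold, not a standard value.
    """
    return any(len(text.strip()) >= min_chars for text in page_texts)


def pdf_page_texts(pdf_path):
    """Return the extractable text of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party; imported lazily on purpose

    reader = PdfReader(pdf_path)
    # Treat a missing result as an empty string so the heuristic stays simple.
    return [page.extract_text() or "" for page in reader.pages]


# Usage with a hypothetical file name:
# print(looks_text_based(pdf_page_texts("my-report.pdf")))
```

A result of False suggests the file is probably image-based and needs OCR first, but a quick manual check with your cursor remains the more reliable test.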
Tabula is a free tool that runs on Java in your browser, and is available for Mac, Windows, and Linux computers. It runs on your local machine and does not send your data to the cloud, so you can also use it for sensitive documents.
To get started, download the newest version of Tabula. You can use the download buttons on the left-hand side, or scroll down to the Download & Install Tabula section to download a copy for your platform. Unlike most other programs, Tabula does not require installation. Just unzip the downloaded archive, and double-click the icon. If you work on a Mac, you may see a warning that states, “Tabula is an app downloaded from the internet. Are you sure you want to open it?” If so, click Open. Alternatively, you may have to go to System Preferences > Security & Privacy > General tab, and click the Open Anyway button in the lower half of the window to open the app the first time.
When you start up Tabula, your default system browser will open, as shown in Figure 5.7. Tabula runs on your local computer, not the internet. The URL in the browser will be something like http://127.0.0.1:8080/. The first portion, 127.0.0.1, is the localhost address that points back to your own computer, and 8080 refers to the port number. If you see a different port number, that’s fine; it just means that the default port is already in use by some other program on your computer. If for any reason you decide to use a different browser, just copy and paste the URL into it.
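To make the two parts of that address concrete, here is a minimal sketch that uses only the Python standard library to split the example URL into its host and port, and to confirm that the host is a loopback (local) address. The variable names are ours, and the port on your machine may differ.

```python
from ipaddress import ip_address
from urllib.parse import urlsplit

url = "http://127.0.0.1:8080/"  # example address from the text above

parts = urlsplit(url)
host = parts.hostname  # the part before the colon
port = parts.port      # the number after the colon

# Loopback addresses refer back to the computer you are working on.
# Note: ip_address() expects a numeric address, not a name like "localhost".
is_local = ip_address(host).is_loopback

print(host, port, is_local)
```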
Now let’s upload a sample text-based PDF and detect any tables we wish to extract. At the beginning of the Covid-19 pandemic, the Department of Public Health in Connecticut issued data on cases and deaths only in PDF document format. For this demonstration, you can use our sample text-based PDF from May 31, 2020, or provide your own.
- Select the PDF you want to extract data from by clicking the blue Browse… button.
- Click Import. Tabula will begin analyzing the file.
- As soon as Tabula finishes loading the PDF, you will see a PDF viewer with individual pages. The interface is fairly clean, with only four buttons in the header.
- Click the Autodetect Tables button to let Tabula look for relevant data. The tool highlights each table it detects in red, as shown in Figure 5.8.
Now let’s manually adjust our selected tables and export the data.
- Click the green Preview & Export Extracted Data button to see how Tabula thinks the data should be exported.
- If the preview tables don’t contain the data you want, try switching between the Stream and Lattice extraction methods in the left-hand sidebar.
- If the tables still don’t look right, or you wish to remove some tables that Tabula auto-detected, click the Revise selection button. That will bring you back to the PDF viewer.
- Now you can Clear All Selections and manually select the tables you want. Click and drag to draw a selection around each table of interest (or part of a table).
- If you want to “copy” a selection to some or all pages, you can use the Repeat this Selection dropdown, which appears in the lower-right corner of your selections, to propagate changes. This is extremely useful if your PDF consists of many similarly formatted pages.
- Once you are happy with the result, you can export it. If you have only one table, we recommend using CSV as the export format. If you have more than one table, consider switching the export format in the drop-down menu to zip of CSVs. This way each table will be saved as an individual file, rather than all tables inside one CSV file.
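If you ever need to repeat these steps for many files, the same Stream and Lattice choice can be scripted. The following is a rough sketch, not the Tabula app itself: it assumes the separate tabula-py package (a Python wrapper around Tabula’s extraction engine) and Java are installed, and the helper names and the file name are hypothetical.

```python
def method_options(method):
    """Translate a method name into keyword arguments for tabula-py's read_pdf."""
    if method == "lattice":
        return {"lattice": True, "stream": False}
    if method == "stream":
        return {"lattice": False, "stream": True}
    raise ValueError(f"unknown extraction method: {method!r}")


def extract_tables(pdf_path, method="lattice", pages="all"):
    """Return a list of pandas DataFrames, one per detected table."""
    import tabula  # provided by the tabula-py package; imported lazily

    return tabula.read_pdf(
        pdf_path,
        pages=pages,
        multiple_tables=True,
        **method_options(method),
    )


# Usage with a hypothetical file name; try the other method if results look off:
# tables = extract_tables("sample.pdf", method="stream")
# tables[0].to_csv("table-0.csv", index=False)
```

As in the browser interface, neither method is always right, so it is worth comparing the output of both on a page or two before processing a whole document.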
Once you have exported your data, you can find it in the Downloads folder on your computer (or wherever you chose to save it), where it is ready to open with a spreadsheet tool to analyze and visualize.
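If you chose the zip of CSVs option and prefer to inspect the tables with code rather than a spreadsheet, the standard library is enough. This is a minimal sketch: the function name is ours, the file names inside a real export may differ, and the tiny in-memory archive with placeholder rows exists only so the example runs without an actual export.

```python
import csv
import io
import zipfile


def read_csv_tables(archive):
    """Return {file name: list of rows} for every CSV in a zip archive.

    `archive` may be a file path or a binary file-like object.
    """
    tables = {}
    with zipfile.ZipFile(archive) as zf:
        for name in sorted(zf.namelist()):
            if not name.lower().endswith(".csv"):
                continue  # skip anything that is not a CSV file
            with zf.open(name) as raw:
                # utf-8-sig also tolerates a byte-order mark, if present.
                text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
                tables[name] = list(csv.reader(text))
    return tables


# Stand-in archive with placeholder rows (not real data), built in memory.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("table-0.csv", "Town,Cases\nTown A,1\n")
    zf.writestr("table-1.csv", "Town,Deaths\nTown A,0\n")
demo.seek(0)

tables = read_csv_tables(demo)
for name, rows in tables.items():
    print(name, "header:", rows[0], "data rows:", len(rows) - 1)
```

To use it on a real export, pass the path of the downloaded zip file instead of `demo`.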
In the following section, we will look at how to clean up messy datasets with OpenRefine.