We use GitHub issues to keep track of all issues. Please do not report bugs or issues in this blog’s comments. Instead, post them on GitHub as an issue. Before submitting a comment with an issue, please use GitHub search to look for existing issues (both open and closed) that may be similar.
The PDF (Portable Document Format) was born out of The Camelot Project to create “a universal way to communicate documents across a wide variety of machine configurations, operating systems and communication networks”. Basically, the goal was to make documents viewable on any display and printable on any modern printer. PDF was built on top of PostScript (a page description language), which had already solved this “view and print anywhere” problem. PDF encapsulates the components required to create a “view and print anywhere” document. These include characters, fonts, graphics and images.
A PDF file defines instructions to place characters (and other components) at precise x,y coordinates relative to the bottom-left corner of the page. Words are simulated by placing some characters closer than others. Similarly, spaces are simulated by placing words relatively far apart. How are tables simulated then? You guessed it correctly — by placing words as they would appear in a spreadsheet.
The PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. Sadly, a lot of open data is stored in PDFs, which were not designed for tabular data in the first place!
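To make the positioning idea concrete, here is a minimal sketch of how words could be recovered from individually placed characters on a single text line. It is only an illustration: the `Glyph` tuple, the function `group_glyphs_into_words`, the `gap_threshold` value and the sample coordinates are all assumptions made for this example, and this is not how any particular PDF library does it.

```python
from typing import List, Tuple

# Each glyph is (character, x position of its left edge, width), in PDF points.
# This tuple layout is a hypothetical simplification for the example.
Glyph = Tuple[str, float, float]


def group_glyphs_into_words(glyphs: List[Glyph], gap_threshold: float = 2.0) -> List[str]:
    """Join glyphs on one text line into words wherever the gap is small."""
    words: List[str] = []
    current = ""
    previous_right = None
    for char, x, width in sorted(glyphs, key=lambda g: g[1]):
        # A gap wider than the threshold is read as a space between words.
        if previous_right is not None and x - previous_right > gap_threshold:
            words.append(current)
            current = ""
        current += char
        previous_right = x + width
    if current:
        words.append(current)
    return words


# Made-up coordinates: two tightly packed groups separated by a wide gap.
line = [
    ("N", 72.0, 6.0), ("o", 78.0, 5.0),
    ("Q", 95.0, 6.0), ("t", 101.0, 3.0), ("y", 104.0, 5.0),
]
print(group_glyphs_into_words(line))
```

The same trick, applied with a larger threshold, is roughly how columns of a table could be told apart from ordinary spaces, which is also why the result depends so heavily on the chosen threshold.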
Camelot: PDF table extraction for humans
Today, we’re pleased to announce the release of Camelot, a Python library and command-line tool that makes it easy for anyone to extract data tables trapped inside PDF files! You can check out the documentation at Read the Docs and follow the development on GitHub.
How to install Camelot
Installation is easy! After installing the dependencies, you can install Camelot using pip (the recommended tool for installing Python packages):
$ pip install camelot-py
How to use Camelot
Extracting tables from a PDF using Camelot is very simple. Here’s how you do it. (Here’s the PDF used in the following example.)
>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
>>> tables[0].df # get a pandas DataFrame!
You can also check out the command-line interface.
Why use Camelot?
- Camelot gives you complete control over table extraction by letting you tweak its settings.
- Bad tables can be discarded based on metrics like accuracy and whitespace, without ever having to manually look at each table.
- Each table is a pandas DataFrame, which seamlessly integrates into ETL and data analysis workflows.
- You can export tables to multiple formats, including CSV, JSON, Excel and HTML.
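As a rough illustration of discarding bad tables by their metrics, the sketch below filters tables using the accuracy and whitespace values from parsing_report. The function name `keep_good_tables` and the threshold values are assumptions made for this example and are not part of Camelot; the stand-in objects only mimic the `parsing_report` attribute so that the snippet runs without a PDF.

```python
from types import SimpleNamespace


def keep_good_tables(tables, min_accuracy=90.0, max_whitespace=50.0):
    """Return only the tables whose parsing report passes both thresholds."""
    good = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            good.append(table)
    return good


# Stand-ins for extracted tables (hypothetical values, not real results).
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 80.0, "order": 2, "page": 1}),
]
kept = keep_good_tables(fake_tables)
print([t.parsing_report["order"] for t in kept])
```

With real output, the same function could presumably be applied to the list returned by camelot.read_pdf, with thresholds tuned to the documents at hand.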
Okay, but why another PDF table extraction library?
TL;DR: Total control for better table extraction
Many people use open-source (Tabula, pdf-table-extract) and closed-source (smallpdf, pdftables) tools to extract tables from PDFs. But these tools either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy. As a result, people end up writing ad-hoc table extraction scripts for each type of PDF table.
We created Camelot to offer users complete control over table extraction. If you can’t get your desired output with the default settings, you can tweak them and get the job done!
You can check out a comparison of Camelot’s output with other open-source PDF table extraction libraries.
The longer read
We’ve often needed to extract data trapped inside PDFs.
The first tool that we tried was Tabula, which has nice user and command-line interfaces, but it either worked perfectly or failed miserably. When it failed, it was difficult to tweak the settings — such as the image thresholding parameters, which influence table detection and can lead to a better output.
We also tried closed-source tools like smallpdf and pdftables, which worked slightly better than Tabula. But then again, they also didn’t allow tweaking and cost money.
When these full-blown PDF table extraction tools didn’t work, we tried pdftotext (an open-source command-line utility). pdftotext extracts text from a PDF while preserving the layout, using spaces. After getting the text, we had to write Python scripts with complicated regexes (regular expressions) to convert the text into tables. This wasn’t scalable, since we had to change the regexes for each new table layout.
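To give a feel for that approach, here is a simplified sketch rather than the actual scripts: it splits layout-preserved text into cells wherever two or more spaces appear in a row. The sample text is made up, and the function name `text_to_rows` is hypothetical.

```python
import re

# Made-up text, shaped like layout-preserved output where columns are padded with spaces.
layout_text = """\
Item          Qty   Price
Green tea      12    4.50
Rice            7    1.20
"""


def text_to_rows(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # A single space stays inside a cell ("Green tea"); wider gaps separate cells.
        rows.append(re.split(r"\s{2,}", line.strip()))
    return rows


rows = text_to_rows(layout_text)
for row in rows:
    print(row)
```

A rule like this breaks as soon as two columns are separated by a single space or a cell is empty, which is the kind of layout-specific patching that made the regex route hard to scale.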
We clearly needed a tweakable PDF table extraction tool, so we started developing one in December 2015. We started with the idea of giving the tool back to the community, which had given us so many open-source tools to work with.
We knew that Tabula classifies PDF tables into two classes. It has two methods to extract these different classes: Lattice (to extract tables with clearly defined lines between cells) and Stream (to extract tables with spaces between cells). We named Camelot’s table extraction flavors, Lattice and Stream, after Tabula’s methods.
Tabula uses a combination of scraping the vector elements and raster lines. Since we wanted to use Python, OpenCV was the obvious choice to do image processing. After more exploration, we settled on morphological transformations, which gave the exact line segments. From here, representing the table trapped inside a PDF was straightforward.
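To illustrate what a morphological transformation does, here is a dependency-free sketch of a horizontal opening (an erosion followed by a dilation) on a tiny binary image. It is not Camelot’s code, which relies on OpenCV; the function `open_horizontal`, the kernel width and the sample grid are all hypothetical and chosen only for this example.

```python
def open_horizontal(image, width):
    """Morphological opening with a 1 x `width` kernel.

    Keeps only horizontal runs of set pixels that are at least `width` long.
    """
    opened = []
    for row in image:
        out = [0] * len(row)
        for start in range(len(row) - width + 1):
            # Erosion: the kernel fits here only if every pixel under it is set.
            if all(row[start:start + width]):
                # Dilation: paint the whole kernel footprint back.
                for col in range(start, start + width):
                    out[col] = 1
        opened.append(out)
    return opened


# 1 = dark pixel. Only the middle row is long enough to be a ruling line.
page = [
    [0, 1, 0, 0, 0, 1, 0],  # isolated specks, like bits of text
    [1, 1, 1, 1, 1, 1, 1],  # a full-width horizontal line
    [0, 0, 1, 1, 0, 0, 0],  # a short stroke
]
lines = open_horizontal(page, width=4)
for row in lines:
    print("".join("#" if pixel else "." for pixel in row))
```

Running the same operation with a tall, narrow kernel would keep vertical lines instead, and the intersections of the two sets of lines give a natural starting point for locating cells.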
To get more information on how Lattice and Stream work in Camelot, check out the “How It Works” section of the documentation.
How we use Camelot
We’ve battle tested Camelot by using it in a variety of projects, both for one-off and automated table extraction.
For Atlan Grid, our curated data from 600+ sources and partners, we identified open data sources (primarily PDF reports) for each of the 17 Sustainable Development Goals. For example, one of our sources for Goal 3 (“Good Health and Well-Being for People”) is the National Family Health Survey (NFHS) report released by IIPS. To get data from these PDF sources, we created an internal web interface built on top of Camelot, where our data analysts could upload PDF reports and extract tables in their preferred format.
We also set up an ETL workflow using Apache Airflow to track disease outbreaks in India. The workflow scrapes the Integrated Disease Surveillance Programme (IDSP) website for weekly PDFs of disease outbreak data, and then it extracts tables from the PDFs using Camelot, sends alerts to our team, and loads the data into a data warehouse.
To infinity and beyond!
Camelot has some limitations. (We’re developing solutions!) Here are a couple of them:
- When using Stream, tables aren’t autodetected. Stream treats the whole page as a single table, which gives bad output when there are multiple tables on the page.
- Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, “If you can click-and-drag to select text in your table in a PDF viewer… then your PDF is text-based”.)
You can check out the GitHub repository for more information.
You can help too — every contribution counts! Check out the Contributor’s Guide for guidelines around contributing code, documentation or tests, reporting issues and proposing enhancements. You can also head to the issue tracker and look for issues labeled “help wanted” and “good first issue”.
We urge organizations to release open data in a “data friendly” format like CSV. But while tables are trapped inside PDF files, there’s Camelot 🙂
Note: This blog was updated on 2nd November 2018 after we learnt that Tabula uses a combination of scraping the vector elements and raster lines, and not the Hough Transform as mentioned in this blog.
Photo by Jason Wong on Unsplash
37 Comments
Hello Vinayak,
camelot.read_pdf('foo.pdf') is not working. Is there any change in the latest version? I downloaded it just today.
It gives an error that it cannot find the file (even though the file is present there).
Hi Ravender! Can you add this issue on the issue tracker[1], with all the necessary information as specified in the contributing guidelines[2]? That will help me fix it sooner. Thanks!
[1] https://github.com/atlanhq/camelot/issues
[2] https://camelot-py.readthedocs.io/en/master/dev/contributing.html#filing-issues
Why don’t you guys compare PDFPlumber’s extraction with Camelot’s? Also, in your results you are not able to extract merged cells properly. CSV will not be able to handle it, so you might need to think of Excel output.
Is there any way to access table and cell coordinates, like in Tabula, where we get JSON in the format specified below?
{u'width': 44.12999725341797, u'top': 166.88, u'height': 10.020000457763672, u'text': u'suraj', u'left': 102.62}
Hey Suraj! I see that you opened the same issue on GitHub, and I’ve answered it there. Here’s the link: https://github.com/atlanhq/camelot/issues/172
Hi Vinayak,
I am not able to import camelot after a successful install. Can you please help me out?
—————————————————————–
>>> import camelot
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vineet/anaconda/lib/python2.7/site-packages/camelot/__init__.py", line 8, in <module>
    from .io import read_pdf
  File "/home/vineet/anaconda/lib/python2.7/site-packages/camelot/io.py", line 4, in <module>
    from .handlers import PDFHandler
  File "/home/vineet/anaconda/lib/python2.7/site-packages/camelot/handlers.py", line 9, in <module>
    from .parsers import Stream, Lattice
  File "/home/vineet/anaconda/lib/python2.7/site-packages/camelot/parsers/__init__.py", line 4, in <module>
    from .lattice import Lattice
  File "/home/vineet/anaconda/lib/python2.7/site-packages/camelot/parsers/lattice.py", line 18, in <module>
    from ..image_processing import (adaptive_threshold, find_lines,
  File "/home/vineet/anaconda/lib/python2.7/site-packages/camelot/image_processing.py", line 5, in <module>
    import cv2
ImportError: No module named cv2
>>>
—————————————————————–
You need to install OpenCV to run Camelot, which you can do by installing Camelot with “pip install camelot-py[cv]”. If you face any issue, please file it on GitHub.
Hi Vinayak,
I am getting the following error:
  File "C:\Users\UserName\AppData\Roaming\Python\Python36\site-packages\camelot\image_processing.py", line 38, in adaptive_threshold
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
error: OpenCV(3.4.4) C:\projects\opencv-python\opencv\modules\imgproc\src\color.cpp:181: error: (-215:Assertion failed) !_src.empty() in function 'cv::cvtColor'
I am using it as follows:
import camelot
tables = camelot.read_pdf("C:\\Users\\UserName\\Desktop\\foo.pdf")
Please tell me how to solve the issue.
Hi Bhupender, looks like you have an incompatible OpenCV version. Can you check out the Contributor’s Guide (https://camelot-py.readthedocs.io/en/master/dev/contributing.html#bug-reports) and file an issue on the GitHub repo (https://github.com/atlanhq/camelot/issues)? We can take it from there.
Hi Vinayak,
I have some PDFs where a table starts on page 1 and ends on page 2. That is, a table in the PDF spans 2 pages. Tabula doesn’t seem to give a good result in that case. Is something like this feasible with Camelot?
Hi Vedant, Camelot can give you pandas DataFrames for both tables, which you can then append in your Python code.
Hi Vinayak,
I am getting an error while running your example. Can you help me out?
PS.: I installed python3-ghostscript
Traceback (most recent call last):
  File "C:\Users\tomas\Desktop\Google Drive\bolsa\python\short term\pGSPack\lib\site-packages\camelot\parsers\lattice.py", line 193, in get_executable
    raise ValueError
ValueError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:/Users/tomas/Desktop/Google Drive/lang/z-pdfcamelot.py", line 6, in <module>
    tables = camelot.read_pdf(pdf)
  File "C:\Users\tomas\Desktop\Google Drive\bolsa\python\short term\pGSPack\lib\site-packages\camelot\io.py", line 101, in read_pdf
    tables = p.parse(flavor=flavor, suppress_stdout=suppress_stdout, **kwargs)
  File "C:\Users\tomas\Desktop\Google Drive\bolsa\python\short term\pGSPack\lib\site-packages\camelot\handlers.py", line 156, in parse
    t = parser.extract_tables(p, suppress_stdout=suppress_stdout)
  File "C:\Users\tomas\Desktop\Google Drive\bolsa\python\short term\pGSPack\lib\site-packages\camelot\parsers\lattice.py", line 361, in extract_tables
    self._generate_image()
  File "C:\Users\tomas\Desktop\Google Drive\bolsa\python\short term\pGSPack\lib\site-packages\camelot\parsers\lattice.py", line 220, in _generate_image
    gs = get_executable()
  File "C:\Users\tomas\Desktop\Google Drive\bolsa\python\short term\pGSPack\lib\site-packages\camelot\parsers\lattice.py", line 206, in get_executable
    'Please make sure that Ghostscript is installed'
camelot.parsers.lattice.GhostscriptNotFound: Please make sure that Ghostscript is installed and available on the PATH environment variable
Hi Tomas! As the error indicates, you don’t have Ghostscript installed. Please follow the instructions here to install it: https://camelot-py.readthedocs.io/en/master/user/install-deps.html
Hello vinayak
I am trying to use your code, but it is throwing an import error: “ImportError: cannot import name 'TableList' from 'camelot.core' (C:\Users\NITESH\PycharmProjects\pdf_to_excel\venv\lib\site-packages\camelot\core\__init__.py)”. Could you please suggest how to fix it? I imported cv2 before importing camelot.
Thanks in Advance.
Hi Nitesh,
You can find the fix in this issue’s comments. https://github.com/atlanhq/camelot/issues/142
Hi Vinayak,
Camelot is really good and works perfectly. The only problem I faced while converting PDF to Excel is that every header in the Excel file is repeated twice.
Hi RP,
Please check out the Contributor’s Guide (https://camelot-py.readthedocs.io/en/master/dev/contributing.html#filing-issues) for guidance on filing issues, and open an issue on GitHub.
Hi,
Thanks for the good work. I have one doubt: I am uploading a PDF with multiple pages using read_pdf.
I am getting the output of only one page. How do I get the output of every page?
Hey Kallol, can you please post this on GitHub issues or Stack Overflow? You can check out the Contributor’s Guide (https://camelot-py.readthedocs.io/en/master/dev/contributing.html#filing-issues) for guidance around getting support and filing issues.
It does not work
#!/usr/bin/env python
import camelot
tables = camelot.read_pdf('upower15.pdf')
print tables
Hey Jose, can you please post this on GitHub issues? You can check out the Contributor’s Guide (https://camelot-py.readthedocs.io/en/master/dev/contributing.html#filing-issues) for guidance around getting support and filing issues.
Thanks for the very informative article Vinayak.
Camelot is very effective in extracting tables from PDFs, and I was successful in implementing your code on multiple PDFs. I was hoping to see if there is a way to extract the name of the table (usually either above or below the table in PDFs) along with the table itself. By the way, I see that the page number can be extracted at the moment.
Hi Vijay, thanks for reaching out. Right now, there’s no clean way to extract table titles.
Hi, is it possible to extract other contents of a PDF along with tables using Camelot?