Extracting tables from PDFs with Python

1 Tabula-py

https://github.com/chezou/tabula-py 1.9k stars

pip install tabula-py

import tabula
# pdf_path is the path to the PDF file; read_pdf returns a list of DataFrames
dfs = tabula.read_pdf(pdf_path, stream=True)

It relies on a Java library, so a Java runtime has to be installed, which makes the development environment fairly heavy.

2 Camelot

https://github.com/atlanhq/camelot 3.5k stars

https://github.com/camelot-dev/camelot 2.4k stars

pip install camelot-py

It requires OpenCV and OpenGL, which are troublesome to install and make the setup fairly heavy.
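For comparison with the Tabula-py snippet, here is a minimal sketch of what a Camelot call might look like. The function name `extract_tables_with_camelot` and its default arguments are made up for illustration; `flavor="stream"` is chosen only to mirror `stream=True` above, and `"lattice"` may work better for tables with ruled lines.

```python
def extract_tables_with_camelot(pdf_path, pages="1", flavor="stream"):
    """Return every table Camelot finds on the given pages as a DataFrame.

    Hypothetical helper for illustration; not part of Camelot itself.
    """
    # Imported lazily so this module still loads where Camelot is not installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    # Each detected table exposes its contents as a pandas DataFrame via .df
    return [tables[i].df for i in range(len(tables))]
```

Usage would be along the lines of `dfs = extract_tables_with_camelot("report.pdf")`, where `report.pdf` is a placeholder file name.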

3 pdfplumber

https://github.com/jsvine/pdfplumber 4.7k stars

pip install pdfplumber

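This is a general-purpose PDF parsing library, not one dedicated to tables. It can extract tables, but the output is rough: extra spaces, superfluous columns, and so on, which you have to clean up yourself. The row and column counts do line up, though, so the result can be converted directly to a DataFrame.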
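The cleanup described above could be sketched roughly as follows. The helpers `clean_table` and `extract_first_table` are hypothetical names introduced here, and the sketch assumes that the first row of the extracted table is the header and that "superfluous columns" means columns that are empty in every row; real PDFs may need different rules.

```python
def clean_table(rows):
    """Collapse stray whitespace and drop columns that are empty in every row.

    `rows` is a list of lists of str or None, the shape that
    pdfplumber's page.extract_table() returns.
    """
    # None becomes "", and runs of spaces/newlines collapse to a single space.
    cleaned = [[" ".join((cell or "").split()) for cell in row] for row in rows]
    if not cleaned:
        return []
    # Pad short rows so every row has the same number of columns.
    width = max(len(row) for row in cleaned)
    padded = [row + [""] * (width - len(row)) for row in cleaned]
    # Keep only the columns that contain text in at least one row.
    keep = [i for i in range(width) if any(row[i] for row in padded)]
    return [[row[i] for i in keep] for row in padded]


def extract_first_table(pdf_path, page_index=0):
    """Extract the largest table on one page and return it as a DataFrame."""
    # Third-party imports are kept local so clean_table works without them.
    import pandas as pd
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        raw = pdf.pages[page_index].extract_table()
    rows = clean_table(raw) if raw else []
    if not rows:
        return pd.DataFrame()
    # Assumption: the first row holds the column names.
    header, *body = rows
    return pd.DataFrame(body, columns=header)
```

Keeping `clean_table` free of any PDF or pandas dependency makes it easy to adjust the cleanup rules and check them on small hand-written inputs.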