Python code exception reference table: Python table error (Pandas error?)

Latest recommended article published on 2022-09-20 13:43:50

Choice林


After some reading online I have decided to use tabula-py to extract tables from pdf files. We use Anaconda and I just installed tabula-py 1.1.1.

I wanted to start out with a simple script and see what it would do with a single page pdf file with some text and two tables ("table_p16.pdf").

The code:

from tabula import read_pdf

df = read_pdf("table_p16.pdf")

The error:

Picked up JAVA_TOOL_OPTIONS: -Djava.security.properties=c:\Windows\Sun\Java\Deployment\sam.security
Traceback (most recent call last):
  File "H:/Personlich/SVN/blademat_tb/blademat_toolbox/utility/read_pdf.py", line 41, in <module>
    df = read_pdf("table_p16.pdf")
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\tabula\wrapper.py", line 117, in read_pdf
    return pd.read_csv(io.BytesIO(output), **pandas_options)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2208, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 9, saw 9

Things I have tried:

Since the error seems to point to a problem with pandas, I tried to read a single-page PDF with one table. The same error occurs.

Set the user variable PATH to Java. This did not change anything. I can't set the system variable PATH to Java, since it is currently used for our SVN program.

Different code lines, with the same error:

df = read_pdf(r"table_p9.pdf")

df = read_pdf("table_p9.pdf", output_format='json')

I hope someone can chip in and help me figure out where the problem lies. It could be a Java issue, but I am not that familiar with the required Java interaction. Your help is much appreciated.

Edit

I tried different tables and some seem to be working. It has been difficult to identify what type of tables work. Some with 'merged' columns and others with 'merged' rows seem to work. But clearly not all. Also, I have not been able to read multiple tables (2 or 3) using the argument multiple_tables=True.

Is there any source on what kinds of tables Tabula can handle? This makes me wonder whether Tabula is the right program to use. After all the reading I did, I was under the impression that Tabula would be good at this. The tables it seems to struggle with are not complex.

Is there a clear and simple source on how to maximize the use of Tabula? Or otherwise tips on how to deal with tables that Tabula struggles with?

Regards,

Gabriel

Solution

Here is a rough guideline for tabula (or tabula-py) options.

1) Merged cells in a table with ruling lines

You can use the lattice=True option. In lattice mode, tabula uses the ruling lines of the table to find cell boundaries. Note that you may need some post-editing, such as a fillna, for merged cells. In my experience, some merged columns are extracted left-justified.

AFAIK, it's pretty hard for tabula to extract merged cells when the table has no ruling lines.
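The post-editing step for merged cells can be illustrated without tabula at all. The following is a minimal sketch in plain Python, assuming a merged cell comes back as a value in its first row and empty cells (None or "") in the rows below it; `forward_fill` is a hypothetical helper name, not part of tabula-py or pandas.

```python
def forward_fill(rows, column):
    """Copy the last non-empty value of `column` down into empty cells below it."""
    filled = []
    last = None
    for row in rows:
        row = list(row)  # copy so the caller's data is not modified
        if row[column] in (None, ""):
            row[column] = last  # empty cell: reuse the value seen above
        else:
            last = row[column]  # non-empty cell: remember it for later rows
        filled.append(row)
    return filled


# Column 0 was a merged cell spanning two rows in each group (made-up data).
rows = [["A", 1], [None, 2], ["B", 3], ["", 4]]
print(forward_fill(rows, 0))  # [['A', 1], ['A', 2], ['B', 3], ['B', 4]]
```

On a pandas DataFrame, the same effect is usually achieved with a forward fill on the affected column.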

The general tuning options for tabula are lattice, stream, and guess.
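As a rough sketch of how these three options can be passed to tabula-py: `build_options` and `extract` below are hypothetical helper names, and the return type of `read_pdf` (a single DataFrame or a list of DataFrames) depends on the tabula-py version, so treat this as an illustration and not a definitive recipe.

```python
def build_options(mode, guess=True):
    """Return keyword arguments for tabula.read_pdf for one extraction mode."""
    if mode not in ("lattice", "stream"):
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "lattice": mode == "lattice",  # use ruling lines to find cells
        "stream": mode == "stream",    # use whitespace between columns
        "guess": guess,                # let tabula guess the table area
    }


def extract(path, mode="lattice", pages=1):
    """Run tabula-py on `path` with the chosen mode (needs tabula-py and Java)."""
    # Imported lazily so this file can be loaded without tabula-py installed.
    from tabula import read_pdf

    return read_pdf(path, pages=pages, **build_options(mode))


print(build_options("lattice"))
```

A call such as `extract("table_p16.pdf", mode="stream")` would then be a quick way to compare both modes on the same page.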

2) Multiple tables on one or more pages

This is a tabula-py-specific option: you have to use multiple_tables=True.

By default, tabula-py extracts tables via CSV. This approach benefits from pandas.read_csv features such as column-name inference, but read_csv assumes a single table (the same number of columns in every row) in the PDF. When the rows have different numbers of columns, pandas.read_csv raises a ParserError.

With the multiple_tables option, on the other hand, tabula-py creates the DataFrames via JSON, which can represent multiple tables.
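To see why two tables of different widths trip up a CSV parser, here is a small standard-library sketch. It flags rows that have more fields than the first row, which is roughly the condition behind "Expected 8 fields in line 9, saw 9". The name `too_wide_rows` and the sample data are made up for illustration, and pandas' own rules for inferring the expected width are more involved than this.

```python
import csv
import io


def too_wide_rows(csv_text):
    """Return (line_number, field_count) for rows wider than the first row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    expected = len(rows[0])  # width of the first row, used as the reference
    return [
        (number, len(row))
        for number, row in enumerate(rows, start=1)
        if len(row) > expected
    ]


# A 3-column table followed directly by a 4-column table, as tabula might emit.
sample = "a,b,c\n1,2,3\nw,x,y,z\n4,5,6,7\n"
print(too_wide_rows(sample))  # [(3, 4), (4, 4)]
```

If a check like this reports rows on a CSV dump of the page, the page most likely contains more than one table, and multiple_tables=True is worth trying.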

One more option: from tabula-py 1.3.0, you can use Tabula app templates with tabula-py. With the area data from a template, you can extract tables more accurately.
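A minimal sketch of the template route, assuming a template JSON file has already been exported from the Tabula app. The wrapper name `extract_with_template` and both file names in the usage note are placeholders, and running it requires tabula-py 1.3.0 or later plus Java.

```python
def extract_with_template(pdf_path, template_path):
    """Extract tables using the areas stored in a Tabula app template."""
    # Imported lazily so this file can be loaded without tabula-py installed.
    from tabula import read_pdf_with_template

    # Each selection saved in the template is extracted as its own table.
    return read_pdf_with_template(pdf_path, template_path)
```

For example, `extract_with_template("table_p16.pdf", "table_p16.tabula-template.json")` would apply the saved selections to the PDF from the question, assuming such a template file exists.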
