Python exception with tabula-py: Python Tabula error (pandas error?)

After some reading online I have decided to use tabula-py to extract tables from pdf files. We use Anaconda and I just installed tabula-py 1.1.1.

I wanted to start out with a simple script and see what it would do with a single page pdf file with some text and two tables ("table_p16.pdf").

The code:

from tabula import read_pdf

df = read_pdf("table_p16.pdf")

The error:

Picked up JAVA_TOOL_OPTIONS: -Djava.security.properties=c:\Windows\Sun\Java\Deployment\sam.security

Traceback (most recent call last):
  File "H:/Personlich/SVN/blademat_tb/blademat_toolbox/utility/read_pdf.py", line 41, in <module>
    df = read_pdf("table_p16.pdf")
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\tabula\wrapper.py", line 117, in read_pdf
    return pd.read_csv(io.BytesIO(output), **pandas_options)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 709, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 455, in _read
    data = parser.read(nrows)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 1069, in read
    ret = self._engine.read(nrows)
  File "C:\Users\xxxxxxxxxxxx\AppData\Local\Continuum\Anaconda3\envs\test_env\lib\site-packages\pandas\io\parsers.py", line 1839, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 902, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 924, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 978, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2208, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 9, saw 9

Things I have tried:

Since the error seems to point to a problem with pandas, I tried to read a single-page pdf with one table. The same error occurs.

Set the user variable PATH to Java. This did not change anything. I can't set the system variable PATH to Java, since it is currently used for our SVN program.

Different code lines, with the same error:

df = read_pdf(r"table_p9.pdf")

df = read_pdf(r"table_p9.pdf")

df = read_pdf("table_p9.pdf", output_format='json')

I hope someone can chip in and help me figure out where the problem lies. It could be a Java issue, but I am not that familiar with the required Java interaction. Your help is much appreciated.

Edit

I tried different tables and some seem to be working. It has been difficult to identify what type of tables work. Some with 'merged' columns and others with 'merged' rows seem to work. But clearly not all. Also, I have not been able to read multiple tables (2 or 3) using the argument multiple_tables=True.

Is there any source on what kinds of tables Tabula can handle? This also makes me wonder whether Tabula is the right program to use. After all the reading I did, I was under the impression that Tabula would be good at this. The tables it seems to struggle with are not complex.

Is there a clear and simple source on how to maximize the use of Tabula? Or otherwise tips on how to deal with tables that Tabula struggles with?

Regards,

Gabriel

Solution

Here is a rough guideline for tabula (or tabula-py) options.

1) Merged cells in a table with ruling lines

You can use the lattice=True option. In lattice mode, tabula uses the lines of the table to work out the cells. Note that you might need some post-editing, such as fillna, for merged cells. In my experience, some merged columns are extracted left-justified.

AFAIK, it's pretty hard for tabula to extract merged cells from a table without ruling lines.

The general tuning points for tabula are the lattice, stream, and guess options.
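The fillna post-editing mentioned above can be sketched with pandas alone. The DataFrame below is a hypothetical stand-in for what an extraction might return when a "Group" cell spans two rows in the PDF (the column names and values are made up for illustration); forward-filling is one possible way to repair it, not necessarily the right one for every table.

```python
import pandas as pd

# Hypothetical extraction result: the merged "Group" cell only appears
# in the first row it covers, leaving gaps in the rows below it.
df = pd.DataFrame(
    {
        "Group": ["A", None, "B", None],
        "Item": ["a1", "a2", "b1", "b2"],
        "Value": [1, 2, 3, 4],
    }
)

# Forward-fill the gaps so every row carries its group label.
df["Group"] = df["Group"].ffill()

print(df)
```

Forward fill only makes sense when the merged cell's value lands in the first row of the span; if it lands elsewhere, a different fill strategy would be needed.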

2) Multiple tables on one or more pages

This is a tabula-py specific option: you have to use multiple_tables=True.

By default, tabula-py tries to extract tables via CSV. This approach benefits from pandas.read_csv features such as inferring column names, but read_csv assumes a single table (with the same number of columns in every row) in the PDF. Calling pandas.read_csv on rows with differing numbers of columns causes a ParserError.
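The ParserError can be reproduced without tabula or a PDF. The sketch below builds a made-up CSV string (not real tabula output) with eight fields per line until line 9, which has nine, and feeds it to pandas.read_csv. The exact wording of the message may vary between pandas versions.

```python
import io

import pandas as pd

# Made-up CSV text: a header and seven data rows with 8 fields each,
# followed by a ninth line that has 9 fields.
header = ",".join(f"c{i}" for i in range(8))
rows = [",".join(str(i) for i in range(8)) for _ in range(7)]
ragged = ",".join(str(i) for i in range(9))
csv_text = "\n".join([header, *rows, ragged]) + "\n"

error = None
try:
    pd.read_csv(io.StringIO(csv_text))
except pd.errors.ParserError as exc:
    # read_csv gives up because the row width changed partway through.
    error = exc

print(type(error).__name__, error)
```

This is presumably what happens when two tables with different column counts end up in the same CSV output.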

With the multiple_tables option, on the other hand, tabula-py creates DataFrames via JSON, which can represent multiple tables.
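To see why JSON copes with several tables where CSV does not, here is a small sketch using only the standard library. The sample is hand-written and only assumed to resemble tabula's JSON shape (a list of tables, each with a "data" list of rows, each cell carrying a "text" value); tables_from_json is a hypothetical helper, not part of tabula-py.

```python
import json

# Hand-written sample, assumed to mimic the shape of tabula's JSON output.
sample = json.dumps(
    [
        {"data": [[{"text": "a"}, {"text": "b"}],
                  [{"text": "1"}, {"text": "2"}]]},
        {"data": [[{"text": "x"}, {"text": "y"}, {"text": "z"}],
                  [{"text": "7"}, {"text": "8"}, {"text": "9"}]]},
    ]
)


def tables_from_json(raw):
    """Return one list of text rows per table (hypothetical helper)."""
    return [
        [[cell["text"] for cell in row] for row in table["data"]]
        for table in json.loads(raw)
    ]


tables = tables_from_json(sample)
for table in tables:
    print(len(table[0]), "columns:", table)
```

Each table keeps its own width, so a two-column table and a three-column table can sit side by side without breaking the parser.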

One more option: from tabula-py 1.3.0, you can use Tabula app templates with tabula-py. With the area data from a template, you can extract tables more precisely, because the area info is accurate.
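As a rough illustration of what "area data from a template" could look like, the sketch below converts a template entry into extraction options. The field names (page, extraction_method, x1, x2, y1, y2) are my assumption about the template file format, the numbers are invented, and template_to_options is a hypothetical helper, so check a template exported from your own Tabula app before relying on any of it.

```python
import json

# Invented template entry; the keys are assumed, not confirmed by the source.
template_json = json.dumps(
    [
        {
            "page": 1,
            "extraction_method": "lattice",
            "x1": 50.0,
            "x2": 400.0,
            "y1": 100.0,
            "y2": 300.0,
        }
    ]
)


def template_to_options(raw):
    """Turn template selections into option dicts (hypothetical helper)."""
    options = []
    for selection in json.loads(raw):
        options.append(
            {
                "pages": selection["page"],
                # Area is ordered top, left, bottom, right.
                "area": [selection["y1"], selection["x1"],
                         selection["y2"], selection["x2"]],
                "lattice": selection["extraction_method"] == "lattice",
            }
        )
    return options


options = template_to_options(template_json)
print(options)
```

Each resulting dict could then be passed to read_pdf as keyword arguments (pages, area, lattice), one call per selected region.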
