Python: eliminate the extra comma (pandas "Error tokenizing data. C error: Expected 3 fields..." error)

The file from the URL in your post contains additional commas for some items in the GICS industry group column. The first occurs at line 31 in the file:

ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco

Normally, the 3rd item should be surrounded by quotes so the parser does not split on the embedded comma, such as:

ABUNDANT PRODUCE LIMITED,ABT,"Food, Beverage & Tobacco"
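If the field is quoted like that, pandas parses it without any special handling. A minimal sketch on an in-memory sample (the column names and data mirror the file in the question):

```python
import io

import pandas as pd

# A quoted 3rd field keeps the embedded comma inside one column
raw = (
    'Company name,ASX code,GICS industry group\n'
    'ABUNDANT PRODUCE LIMITED,ABT,"Food, Beverage & Tobacco"\n'
)

df = pd.read_csv(io.StringIO(raw))
print(df.shape)       # (1, 3) -- one row, three columns
print(df.iloc[0, 2])  # Food, Beverage & Tobacco
```

Since the file you are downloading is not quoted this way, the cleaning step below is needed instead.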

For this situation, because the first 2 columns appear to be clean, you can merge any additional text into the 3rd field. After this cleaning, load it into a data frame.

You can do this with a generator that will pull out and clean each line one at a time. The pd.DataFrame constructor will read in the data and create a data frame.

import pandas as pd

def merge_last(file_name, skip_lines=0):
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            if i < skip_lines:
                continue
            # split on commas; everything past the 2nd comma lands in z
            x, y, *z = line.strip().split(',')
            yield (x, y, ','.join(z))

# create a generator to clean the lines, skipping the first 2
gen = merge_last('ASXListedCompanies.csv', skip_lines=2)

# get the column names
header = next(gen)

# create the data frame
df = pd.DataFrame(gen, columns=header)
df.head()

returns:

          Company name  ASX code                     GICS industry group
0          MOQ LIMITED       MOQ                     Software & Services
1       1-PAGE LIMITED       1PG                     Software & Services
2  1300 SMILES LIMITED       ONT        Health Care Equipment & Services
3    1ST GROUP LIMITED       1ST        Health Care Equipment & Services
4         333D LIMITED       T3D      Commercial & Professional Services

And the rows with the extra commas are preserved:

df.loc[27:30]

# returns:
                           Company name  ASX code        GICS industry group
27             ABUNDANT PRODUCE LIMITED       ABT   Food, Beverage & Tobacco
28                  ACACIA COAL LIMITED       AJC                     Energy
29  ACADEMIES AUSTRALASIA GROUP LIMITED       AKG          Consumer Services
30         ACCELERATE RESOURCES LIMITED       AX8                 Class Pend
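As an aside, recent pandas versions (1.4+) can apply the same merge inside read_csv itself: with engine='python', on_bad_lines accepts a callable that receives the split fields of each bad line and returns a corrected list. A sketch on an in-memory sample:

```python
import io

import pandas as pd

raw = (
    'Company name,ASX code,GICS industry group\n'
    'ACACIA COAL LIMITED,AJC,Energy\n'
    'ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco\n'
)

# For a line with too many fields, keep the first 2 and
# re-join everything else into the 3rd column
df = pd.read_csv(
    io.StringIO(raw),
    engine='python',
    on_bad_lines=lambda fields: fields[:2] + [','.join(fields[2:])],
)
print(df.iloc[1, 2])  # Food, Beverage & Tobacco
```

This avoids the pre-cleaning pass entirely, at the cost of the slower python engine.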

Here is a more generalized generator that will merge after a given number of columns:

def merge_last(file_name, merge_after_col=2, skip_lines=0):
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            if i < skip_lines:
                continue
            spl = line.strip().split(',')
            # keep the first merge_after_col fields, merge the rest into one
            yield (*spl[:merge_after_col], ','.join(spl[merge_after_col:]))
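A quick end-to-end check of the generalized generator, run against a small temporary file whose first two lines mimic the real file's preamble before the header row (the file contents here are illustrative):

```python
import os
import tempfile

import pandas as pd

def merge_last(file_name, merge_after_col=2, skip_lines=0):
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            if i < skip_lines:
                continue
            spl = line.strip().split(',')
            # keep the first merge_after_col fields, merge the rest into one
            yield (*spl[:merge_after_col], ','.join(spl[merge_after_col:]))

# Build a sample file: 2 preamble lines, a header, and one "bad" row
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('preamble\n\n')
    f.write('Company name,ASX code,GICS industry group\n')
    f.write('ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco\n')
    path = f.name

gen = merge_last(path, merge_after_col=2, skip_lines=2)
header = next(gen)
df = pd.DataFrame(gen, columns=header)
os.unlink(path)

print(df.iloc[0, 2])  # Food, Beverage & Tobacco
```

The same call with a different merge_after_col would merge the overflow into whichever column your file breaks on.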

