Python: eliminate the extra comma (pandas "Error tokenizing data. C error: Expected 3 fields..." error)

The file from the URL in your post contains additional commas for some items in the GICS industry group column. The first occurs at line 31 in the file:

ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco

Normally, the 3rd item should be surrounded by quotes so the parser does not split on the embedded comma, such as:

ABUNDANT PRODUCE LIMITED,ABT,"Food, Beverage & Tobacco"
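If the field is quoted like that, pandas parses it without any special handling. A minimal sketch on an in-memory sample (the column names and data mirror the file in the question):

```python
import io

import pandas as pd

# A quoted 3rd field keeps the embedded comma inside one column
raw = (
    'Company name,ASX code,GICS industry group\n'
    'ABUNDANT PRODUCE LIMITED,ABT,"Food, Beverage & Tobacco"\n'
)

df = pd.read_csv(io.StringIO(raw))
print(df.shape)       # (1, 3) -- one row, three columns
print(df.iloc[0, 2])  # Food, Beverage & Tobacco
```

Since the file you are downloading is not quoted this way, the cleaning step below is needed instead.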

For this situation, because the first 2 columns appear to be clean, you can merge any additional text into the 3rd field. After this cleaning, load it into a data frame.

You can do this with a generator that will pull out and clean each line one at a time. The pd.DataFrame constructor will read in the data and create a data frame.

import pandas as pd

def merge_last(file_name, skip_lines=0):
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            if i < skip_lines:
                continue
            # split on commas; everything past the 2nd comma lands in z
            x, y, *z = line.strip().split(',')
            yield (x, y, ','.join(z))

# create a generator to clean the lines, skipping the first 2
gen = merge_last('ASXListedCompanies.csv', skip_lines=2)

# get the column names
header = next(gen)

# create the data frame
df = pd.DataFrame(gen, columns=header)
df.head()

returns:

          Company name  ASX code                     GICS industry group
0          MOQ LIMITED       MOQ                     Software & Services
1       1-PAGE LIMITED       1PG                     Software & Services
2  1300 SMILES LIMITED       ONT        Health Care Equipment & Services
3    1ST GROUP LIMITED       1ST        Health Care Equipment & Services
4         333D LIMITED       T3D      Commercial & Professional Services

And the rows with the extra commas are preserved:

df.loc[27:30]

# returns:
                           Company name  ASX code        GICS industry group
27             ABUNDANT PRODUCE LIMITED       ABT   Food, Beverage & Tobacco
28                  ACACIA COAL LIMITED       AJC                     Energy
29  ACADEMIES AUSTRALASIA GROUP LIMITED       AKG          Consumer Services
30         ACCELERATE RESOURCES LIMITED       AX8                 Class Pend
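As an aside, recent pandas versions (1.4+) can apply the same merge inside read_csv itself: with engine='python', on_bad_lines accepts a callable that receives the split fields of each bad line and returns a corrected list. A sketch on an in-memory sample:

```python
import io

import pandas as pd

raw = (
    'Company name,ASX code,GICS industry group\n'
    'ACACIA COAL LIMITED,AJC,Energy\n'
    'ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco\n'
)

# For a line with too many fields, keep the first 2 and
# re-join everything else into the 3rd column
df = pd.read_csv(
    io.StringIO(raw),
    engine='python',
    on_bad_lines=lambda fields: fields[:2] + [','.join(fields[2:])],
)
print(df.iloc[1, 2])  # Food, Beverage & Tobacco
```

This avoids the pre-cleaning pass entirely, at the cost of the slower python engine.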

Here is a more generalized generator that will merge after a given number of columns:

def merge_last(file_name, merge_after_col=2, skip_lines=0):
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            if i < skip_lines:
                continue
            spl = line.strip().split(',')
            # keep the first merge_after_col fields, merge the rest into one
            yield (*spl[:merge_after_col], ','.join(spl[merge_after_col:]))
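A quick end-to-end check of the generalized generator, run against a small temporary file whose first two lines mimic the real file's preamble before the header row (the file contents here are illustrative):

```python
import os
import tempfile

import pandas as pd

def merge_last(file_name, merge_after_col=2, skip_lines=0):
    with open(file_name, 'r') as fp:
        for i, line in enumerate(fp):
            if i < skip_lines:
                continue
            spl = line.strip().split(',')
            # keep the first merge_after_col fields, merge the rest into one
            yield (*spl[:merge_after_col], ','.join(spl[merge_after_col:]))

# Build a sample file: 2 preamble lines, a header, and one "bad" row
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('preamble\n\n')
    f.write('Company name,ASX code,GICS industry group\n')
    f.write('ABUNDANT PRODUCE LIMITED,ABT,Food, Beverage & Tobacco\n')
    path = f.name

gen = merge_last(path, merge_after_col=2, skip_lines=2)
header = next(gen)
df = pd.DataFrame(gen, columns=header)
os.unlink(path)

print(df.iloc[0, 2])  # Food, Beverage & Tobacco
```

The same call with a different merge_after_col would merge the overflow into whichever column your file breaks on.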

