TypeError: expected string or buffer

A TypeError is raised when running a large function in Python.

  • The function processes a large file containing a lot of text: it splits the file into speakers and speeches, then further processes the individual paragraphs within each speech.
  • Running the driver function fails with the following error:

    Traceback (most recent call last):
      File "<pyshell#159>", line 1, in <module>
        driver("C:/Users/mboogie/Documents/Congressional Hearings/NHTF Project/Test Set", 'CHRG-107hhrg70750.htm', 'CHRG-107hhrg70750.csv', 'Paragraphs.csv')
      File "<pyshell#158>", line 9, in driver
        speaker = re.findall("\\n Mr. [A-Z][a-z]+\.|\\n Ms. [A-Z][a-z]+\.|\\n Congressman [A-Z][a-z]+\.|\\n Congresswoman [A-Z][a-z]+\.|\\n Chairwoman [A-Z][a-z]+\.|\\n Chairman [A-Z][a-z]+\.", hearing)
      File "C:\Python27\lib\re.py", line 177, in findall
        return _compile(pattern, flags).findall(string)
    TypeError: expected string or buffer

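  1. Solution
    • The traceback points to the re.findall() call inside the driver function.
    • re.findall() expects its second argument to be a string or buffer, but at that point in the code hearing is a list.
    • Turning hearing back into a string before the call resolves the error.
    • The code had other problems as well, so it was also refactored to make it clearer and easier to follow.
    • Two possible solutions are given below; choose whichever suits your needs.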
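
The cause described above can be reproduced in a few lines. This is a minimal sketch written for Python 3, where the same mistake is reported as "expected string or bytes-like object"; the sample text and the shortened pattern are made up for illustration.

```python
import re

# Made-up sample standing in for the hearing text.
text = "\n    Mr. Smith. Thank you.\n    Ms. Jones. I agree. RESPONSE TO WRITTEN questions"
chunks = text.split("RESPONSE TO WRITTEN")  # split() returns a list of strings

# Shortened illustrative pattern, not the full one from the post.
pattern = r"\n    (?:Mr|Ms)\. [A-Z][a-z]+\."

# Passing the list itself raises TypeError.
try:
    re.findall(pattern, chunks)
    error = None
except TypeError as exc:
    error = exc

# Passing one string from the list works.
speakers = re.findall(pattern, chunks[0])
print(type(chunks).__name__, "->", error)
print(speakers)
```

Checking type(hearing) just before the re.findall() call is a quick way to confirm this kind of mistake.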

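Code examples:

# Solution 1
import os
import re
from bs4 import BeautifulSoup  # assumed: get_text() is a bs4 method

def driver(folder, input_filename, output_filename1, output_filename2):
    os.chdir(folder)
    with open(input_filename, 'r') as f:
        Hearing = f.read()
    hearing = BeautifulSoup(Hearing)
    hearing = hearing.get_text()
    # split() returns a list, but re.findall() needs a single string.
    # Assumption: only the text before the "RESPONSE TO WRITTEN" marker is wanted,
    # so keep the first element. (str() on the list would return its repr, in
    # which newlines are escaped, so the \n in the pattern would never match.)
    hearing = hearing.split("RESPONSE TO WRITTEN")[0]
    speakers = re.findall(r"\n    Mr. [A-Z][a-z]+\.|\n    Ms. [A-Z][a-z]+\.|\n    Congressman [A-Z][a-z]+\.|\n   Congresswoman [A-Z][a-z]+\.|\n   Chairwoman [A-Z][a-z]+\.|\n   Chairman [A-Z][a-z]+\.", hearing)
    speakers = list(set(speakers))  # drop duplicate speaker tags
    # ...
    # rest of the code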
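
How the list is turned back into a string matters. The sketch below compares three options on a made-up sample (Python 3): str() on the list, taking one element, and joining the elements. The sample text and the shortened pattern are assumptions for illustration only.

```python
import re

# Made-up sample: one speaker before the marker, one after it.
text = "\n    Mr. Smith. Thank you.\nRESPONSE TO WRITTEN QUESTIONS\n    Mr. Doe. Noted."
parts = text.split("RESPONSE TO WRITTEN")
pattern = r"\n    Mr\. [A-Z][a-z]+\."

as_repr = str(parts)        # repr of the list: real newlines become backslash + n
first_part = parts[0]       # only the text before the marker
rejoined = "".join(parts)   # all the text, with the marker removed

found_in_repr = re.findall(pattern, as_repr)
found_in_first = re.findall(pattern, first_part)
found_in_joined = re.findall(pattern, rejoined)

print(found_in_repr)
print(found_in_first)
print(found_in_joined)
```

str() avoids the TypeError but leaves no real newlines for the pattern to match, so indexing or joining is the safer choice depending on how much of the text should be searched.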

# Solution 2
# Refactored version of the code, intended to be clearer and easier to follow.
# This is Python 2 code (note the 'rU' and 'wb' file modes).
# HARD_WRAP, SPEAKERS, NAME, Speaker and pairwise are used below but are not
# defined in this listing; they must be supplied before the code will run.
import csv
import os
from bs4 import BeautifulSoup  # assumed: get_text() is a bs4 method

def load_hearing_response(fname, split_on='    Present:'):
    # read the HTML file and keep only the text after the last split_on marker
    with open(fname, 'rU') as inf:
        html = inf.read()
    txt  = BeautifulSoup(html).get_text()
    return txt.rsplit(split_on, 1)[-1]

def un_hard_wrap(txt, reg=HARD_WRAP):
    # remove whatever the HARD_WRAP regex matches
    return reg.sub('', txt)

def get_speeches(txt):
    # each speech runs from the end of one speaker tag to the start of the next
    speakers = [Speaker(NAME(sp), sp.start(), sp.end()) for sp in SPEAKERS.finditer(txt)]
    speakers.append(Speaker('', len(txt), None))  # tail sentinel for pairwise processing
    return [(this.name, txt[this.name_end:nxt.name_start]) for this,nxt in pairwise(speakers)]

def write_csv(fname, data, header=None):
    with open(fname, 'wb') as outf:
        out_csv = csv.writer(outf)
        if header is not None:
            out_csv.writerow(header)
        out_csv.writerows(data)

def main():
    # get text of Congressional hearing responses
    DIR = r'C:\Users\Documents\Congressional Hearings\NHTF Project\Test Set'
    txt = load_hearing_response(os.path.join(DIR, 'CHRG-107hhrg70750.htm'))
    txt = un_hard_wrap(txt)
    # break into speeches
    speeches = get_speeches(txt)
    # write (speaker, speech) pairs to a .csv file
    write_csv(os.path.join(DIR, 'CHRG-107hhrg70750.csv'), speeches, ['Speaker', 'Speech'])
    # write paragraphs of speeches to a .csv file
    paragraphs = ([para.strip()] for speaker,speech in speeches for para in speech.split('\n') if para.strip())
    write_csv(os.path.join(DIR, 'Paragraphs.csv'), paragraphs, ['Paragraphs'])

if __name__=="__main__":
    main()
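
The get_speeches step in the refactored code depends on Speaker, NAME, SPEAKERS and pairwise, none of which appear in the listing. The following is one possible way to fill them in, written for Python 3. Every definition here is an assumption for illustration, not the original author's code, and the sample text is made up. HARD_WRAP is left out because the listing gives no hint of what it should match.

```python
import re
from collections import namedtuple
from itertools import tee

# Hypothetical record: speaker name plus where the speaker tag starts and ends.
Speaker = namedtuple("Speaker", ["name", "name_start", "name_end"])

# Hypothetical pattern: an indented speaker tag at the start of a line,
# e.g. "\n    Mr. Smith." with the name captured in group 1.
SPEAKERS = re.compile(
    r"\n\s+((?:Mr\.|Ms\.|Congressman|Congresswoman|Chairwoman|Chairman) [A-Z][a-z]+)\."
)

def NAME(match):
    """Return the captured speaker name from a SPEAKERS match."""
    return match.group(1)

def pairwise(iterable):
    """Yield overlapping pairs: (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def get_speeches(txt):
    speakers = [Speaker(NAME(m), m.start(), m.end()) for m in SPEAKERS.finditer(txt)]
    # Sentinel entry so the last real speaker also has a "next" to pair with.
    speakers.append(Speaker("", len(txt), None))
    # A speech is the text between the end of one tag and the start of the next.
    return [(this.name, txt[this.name_end:nxt.name_start])
            for this, nxt in pairwise(speakers)]

sample = "\n    Mr. Smith. Thank you all.\n    Chairman Jones. The hearing is adjourned.\n"
speeches = get_speeches(sample)
for name, speech in speeches:
    print(name, "|", speech.strip())
```

The sentinel keeps the pairing logic free of special cases for the final speech. If the whole script is ported to Python 3, the CSV files should be opened in text mode with newline='' rather than 'wb'.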