TypeError: expected string or buffer

qq^^614136809

Published 2024-08-14 16:04:15

Tags: python

Article link: https://blog.csdn.net/D0126_/article/details/141193641

A TypeError is raised when running a large function in Python.

The function processes a large text file: it splits the file into speakers and their speeches, then further processes the individual paragraphs of each speech.
Calling the driver function fails with the following error:

Traceback (most recent call last):
  File "<pyshell#159>", line 1, in <module>
    driver("C:/Users/mboogie/Documents/Congressional Hearings/NHTF Project/Test Set", 'CHRG-107hhrg70750.htm', 'CHRG-107hhrg70750.csv', 'Paragraphs.csv')
  File "<pyshell#158>", line 9, in driver
    speaker = re.findall("\\n Mr. [A-Z][a-z]+\.|\\n Ms. [A-Z][a-z]+\.|\\n Congressman [A-Z][a-z]+\.|\\n Congresswoman [A-Z][a-z]+\.|\\n Chairwoman [A-Z][a-z]+\.|\\n Chairman [A-Z][a-z]+\.", hearing)
  File "C:\Python27\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

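Solution
- The traceback points to the re.findall() call inside the driver function.
- re.findall() expects its second argument to be a string (or buffer), but at that point hearing is a list, because str.split() returns a list.
- Passing a single string instead, for example one element of that list, resolves the error.
- The code had other problems as well, so it was also refactored to make it clearer and easier to follow.
- Two possible solutions are given below; choose whichever fits your needs.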
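
The diagnosis above can be reproduced in a few lines. This is a minimal sketch for Python 3, where the same mistake is reported as "expected string or bytes-like object" rather than "expected string or buffer". The sample text and the shortened pattern are made up for illustration and are not taken from the original hearing file.

```python
import re

# Hypothetical sample text standing in for the hearing transcript.
text = "\n    Mr. Smith. Thank you.\n    Ms. Jones. I agree.RESPONSE TO WRITTEN questions"
pattern = r"\n    Mr\. [A-Z][a-z]+\.|\n    Ms\. [A-Z][a-z]+\."

parts = text.split("RESPONSE TO WRITTEN")  # str.split() returns a list

# Passing the list itself reproduces the TypeError.
try:
    re.findall(pattern, parts)
    raised = False
except TypeError:
    raised = True

# str(parts) is a string, but it is the list's repr: every newline becomes the
# two characters backslash + n, so the newline-anchored pattern finds nothing.
from_repr = re.findall(pattern, str(parts))

# Searching one element of the list (a real string) works as intended.
from_element = re.findall(pattern, parts[0])

print(raised, from_repr, from_element)
```

Note that str() on the list makes the exception go away without making the search work, which is why selecting (or joining) the actual strings is the safer fix.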

Code examples:

# Solution 1
import os
import re

from bs4 import BeautifulSoup

def driver(folder, input_filename, output_filename1, output_filename2):
    os.chdir(folder)
    with open(input_filename, 'r') as f:
        html = f.read()
    hearing = BeautifulSoup(html, "html.parser").get_text()
    # split() returns a list, but re.findall() needs one string. Keeping the text
    # before the marker is an assumption; use the piece you need. (str(hearing) would
    # give the list's repr, where newlines are escaped and the pattern no longer matches.)
    hearing = hearing.split("RESPONSE TO WRITTEN")[0]
    speakers = re.findall(
        r"\n    Mr\. [A-Z][a-z]+\.|\n    Ms\. [A-Z][a-z]+\.|"
        r"\n    Congressman [A-Z][a-z]+\.|\n   Congresswoman [A-Z][a-z]+\.|"
        r"\n   Chairwoman [A-Z][a-z]+\.|\n   Chairman [A-Z][a-z]+\.",
        hearing)
    speakers = list(set(speakers))
    # ...
    # rest of the code
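
If speakers from every piece of the split text are wanted, rather than from a single piece, one option is to run the search on each string in the list. The following is only a sketch: the helper name find_speakers, the condensed pattern (which accepts any amount of indentation) and the sample text are assumptions for illustration, not part of the original code.

```python
import re

# Condensed, assumed variant of the speaker pattern: newline, indentation,
# a title, a capitalized surname and a closing period.
SPEAKER_RE = re.compile(
    r"\n\s+(?:Mr\.|Ms\.|Congressman|Congresswoman|Chairwoman|Chairman) [A-Z][a-z]+\."
)

def find_speakers(chunks):
    """Return the sorted, de-duplicated speaker labels found in a list of strings."""
    found = set()
    for chunk in chunks:
        # Each chunk is a str, so findall() accepts it.
        found.update(label.strip() for label in SPEAKER_RE.findall(chunk))
    return sorted(found)

# Hypothetical sample text.
sample = ("\n    Mr. Smith. Hello.\n    Chairman Oxley. Order."
          "RESPONSE TO WRITTEN\n    Mr. Smith. Again.")
print(find_speakers(sample.split("RESPONSE TO WRITTEN")))
```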

# Solution 2
# Refactored into small functions so the code is clearer and easier to follow.
# NOTE: pairwise, HARD_WRAP, SPEAKERS, Speaker and NAME are not defined in this
# post; they must be defined before this code will run.
# Written for Python 3; on Python 2 open the input with 'rU' and the CSV with 'wb',
# as the original did.
import csv
import os

from bs4 import BeautifulSoup

def load_hearing_response(fname, split_on='    Present:'):
    with open(fname, 'r') as inf:
        html = inf.read()
    txt = BeautifulSoup(html, "html.parser").get_text()
    # keep only the text after the last occurrence of the marker
    return txt.rsplit(split_on, 1)[-1]

def un_hard_wrap(txt, reg=HARD_WRAP):
    return reg.sub('', txt)

def get_speeches(txt):
    speakers = [Speaker(NAME(sp), sp.start(), sp.end()) for sp in SPEAKERS.finditer(txt)]
    speakers.append(Speaker('', len(txt), None))  # tail sentinel for pairwise processing
    return [(this.name, txt[this.name_end:nxt.name_start]) for this, nxt in pairwise(speakers)]

def write_csv(fname, data, header=None):
    # newline='' lets the csv module control line endings
    with open(fname, 'w', newline='') as outf:
        out_csv = csv.writer(outf)
        if header is not None:
            out_csv.writerow(header)
        out_csv.writerows(data)

def main():
    # get text of Congressional hearing responses
    DIR = r'C:\Users\Documents\Congressional Hearings\NHTF Project\Test Set'
    txt = load_hearing_response(os.path.join(DIR, 'CHRG-107hhrg70750.htm'))
    txt = un_hard_wrap(txt)
    # break into speeches
    speeches = get_speeches(txt)
    # write (speaker, speech) pairs to a .csv file
    write_csv(os.path.join(DIR, 'CHRG-107hhrg70750.csv'), speeches, ['Speaker', 'Speech'])
    # write paragraphs of speeches to a .csv file, one paragraph per row
    paragraphs = ([para.strip()] for speaker, speech in speeches for para in speech.split('\n') if para.strip())
    write_csv(os.path.join(DIR, 'Paragraphs.csv'), paragraphs, ['Paragraphs'])

if __name__ == "__main__":
    main()
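
Solution 2 uses several names that the post never defines: pairwise, Speaker, SPEAKERS, NAME and HARD_WRAP. The sketch below shows one way they might look. Everything in it is an assumption inferred from how get_speeches() and un_hard_wrap() use those names: the field names of Speaker come from the attributes accessed in get_speeches(), while both regular expressions and the sample text are guesses made for illustration. get_speeches() is repeated only so the sketch runs on its own.

```python
import re
from collections import namedtuple
from itertools import tee

def pairwise(iterable):
    """Yield overlapping pairs: (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

# Assumed record layout, inferred from the attributes used in get_speeches().
Speaker = namedtuple("Speaker", ["name", "name_start", "name_end"])

# Assumed pattern: an indented title and surname followed by a period.
SPEAKERS = re.compile(
    r"\n\s+((?:Mr\.|Ms\.|Congressman|Congresswoman|Chairwoman|Chairman) [A-Z][a-z]+)\."
)

def NAME(match):
    """Assumed helper: pull the speaker's name out of a SPEAKERS match."""
    return match.group(1)

# Assumed pattern: a line break followed directly by unindented text is a hard wrap.
HARD_WRAP = re.compile(r"\n(?=\S)")

def get_speeches(txt):
    # Same logic as in Solution 2, repeated so this sketch is self-contained.
    speakers = [Speaker(NAME(sp), sp.start(), sp.end()) for sp in SPEAKERS.finditer(txt)]
    speakers.append(Speaker("", len(txt), None))  # tail sentinel
    return [(this.name, txt[this.name_end:nxt.name_start]) for this, nxt in pairwise(speakers)]

# Hypothetical sample: the first speech is hard-wrapped after "all ".
sample = "\n    Mr. Smith. Thank you all \nfor coming.\n    Ms. Jones. I agree."
speeches = get_speeches(HARD_WRAP.sub("", sample))
print(speeches)
```

Because un_hard_wrap() replaces each match with an empty string, this guess at HARD_WRAP only joins lines cleanly when a wrapped line keeps its trailing space, as in the sample. Whether that holds for the real hearing files is not known from the post. On Python 3.10 and later, itertools.pairwise could be used in place of the hand-written helper.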