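I've had a lot of trouble with Unicode in Python over the years. I work with many text files in Japanese, so I'm familiar with using .encode("utf-8") to get Japanese text that displays as u'xxxx' back to displaying as Japanese. I'm not getting any encoding/decoding errors. But the text I read from a Unicode file, manipulate, and then write back to a new file ends up as u'xxxx' strings instead of the original Japanese text. I've tried .encode() and .decode() in multiple places, and I've tried leaving them out, with the same result every time. Any suggestions are welcome.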
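Specifically, I'm writing a spider with the Scrapy library. It takes text from the files it crawls, extracts some of that text to build the filename for a new file, and then writes the first div of the HTML file into the new file as a string.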
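What confuses me even more is that the text I use to create the filenames renders in Japanese, and so do the filenames themselves. Is it because I'm calling str() on the div that I get u'xxxx' as the contents of my file? Please see that line near the end of the code.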
Here's my complete code (please ignore the hackiness):
def parse_item(self, response):
    # Defaults used when a field cannot be found on the page
    original = 0
    author = "noauthor"
    title = "notitle"
    year = "xxxx"
    publisher = "xxxx"
    typer = "xxxx"
    ispub = 0
    # Last URL segment, with anything after the first underscore dropped
    filename = response.url.split("/")[-1]
    if "_" in filename:
        filename = filename.split("_")[0]
    if filename.isdigit():
        title = response.xpath("//h1/text()").extract()[0].encode("utf-8")
        author = response.xpath("//h2/text()").extract()[0].encode("utf-8")
        ID = filename
        bibliographic_info = response.xpath("//div[2]/text()").extract()
        for subyear in bibliographic_info:
            ispub = 0
            subyear = subyear.encode("utf-8").strip()
            if "初出:" in subyear:
                publisher = subyear.split(":")[1]
                original = 1
                ispub = 1
            if "入力:" in subyear:
                typer = subyear.split(":")[1]
            # The line after the publisher line is expected to hold the year
            if len(subyear) > 1 and (original == 1) and (ispub == 0):
                counter = 0
                # Scan forward to the first digit
                while counter < len(subyear):
                    if subyear[counter].isdigit():
                        break
                    counter += 1
                if counter != len(subyear):
                    year = subyear[counter:(counter + 4)]
                original = 0
        # This is the line I'm asking about: str() applied to the result of extract()
        body = str(response.xpath("//div[1]/text()").extract())
        new_filename = author + "_" + title + "_" + publisher + "_" + year + "_" + typer + ".html"
        file = open(new_filename, "a")
        file.write(body.encode("utf-8"))
        file.close()
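One likely cause, sketched under the assumption that this is Python 2 and that extract() returns a list of unicode strings (which is how Scrapy selectors behave): str() applied to a list produces the list's repr, so each element appears as u'\uXXXX' escapes, brackets and quotes included. That repr is plain ASCII, so the later .encode("utf-8") raises no error and the escapes are written verbatim. The sketch below joins the fragments instead and lets io.open handle the encoding. The names join_fragments, append_text, and demo.html, and the sample sentences, are hypothetical and stand in for the spider's real data.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def join_fragments(fragments):
    # Join the text fragments into a single text string.
    # str(fragments) would return the list's repr instead of the text.
    return u"".join(fragments)


def append_text(path, text):
    # io.open encodes on write, so no manual .encode("utf-8") is needed.
    with io.open(path, "a", encoding="utf-8") as handle:
        handle.write(text)


# Hypothetical stand-in for response.xpath("//div[1]/text()").extract()
fragments = [u"吾輩は猫である。", u"名前はまだ無い。"]
body = join_fragments(fragments)

# Write to a temporary directory so the sketch leaves no files behind in the cwd
demo_path = os.path.join(tempfile.mkdtemp(), u"demo.html")
append_text(demo_path, body)

# Read the file back to confirm the Japanese text survived the round trip
with io.open(demo_path, encoding="utf-8") as handle:
    round_tripped = handle.read()
```

In the spider, this would mean replacing the str(...) call with a join over the extracted list and opening the output file through io.open with an explicit encoding. Whether the first div's text should be joined with an empty string or with newlines depends on the page, so treat the separator as a guess.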