I wrote a crawler for NetEase news. Under Python 2.7 the Chinese text in the saved files displays fine, but under Python 3.5 it turns into byte literals, like this:
b'\xe5\x85\xa8\xe7\xab\x99' b'http://news.163.com/special/0001386F/rank_whole.html'
b'\xe6\x96\xb0\xe9\x97\xbb' b'http://news.163.com/special/0001386F/rank_news.html'
b'\xe5\xa8\xb1\xe4\xb9\x90' b'http://news.163.com/special/0001386F/rank_ent.html'
b'\xe4\xbd\x93\xe8\x82\xb2' b'http://news.163.com/special/0001386F/rank_sports.html'
b'\xe8\xb4\xa2\xe7\xbb\x8f' b'http://money.163.com/special/002526BH/rank.html'
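Those byte literals are the repr of bytes objects: in Python 3, str.encode() returns bytes, and formatting bytes with "%s" inserts their repr into the string being written. A minimal reproduction in the interpreter:

>>> print("%s" % "全站".encode("utf8"))
b'\xe5\x85\xa8\xe7\xab\x99'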
The crawler code is as follows:
# -*- coding: utf-8 -*-
import os
import sys
#import urllib
import requests
import re
from lxml import etree
from openpyxl import Workbook

def StringListSave(save_path, filename, slist):
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    path = save_path + "/" + filename + ".txt"
    with open(path, "w+") as fp:
        for s in slist:
            # Under Python 3, str.encode() returns bytes, and "%s" writes
            # their repr -- this line is where the b'...' output comes from.
            fp.write("%s\t\t%s\n" % (s[0].encode("utf8"), s[1].encode("utf8")))

def Page_Info(myPage):
    '''Extract (title, url) pairs of the ranking pages with a regex.'''
    mypage_Info = re.findall(r'<div class="titleBar" id=".*?"><h2>(.*?)</h2><div class="more"><a href="(.*?)">.*?</a></div></div>', myPage, re.S)
    return mypage_Info

def New_Page_Info(new_page):
    '''Regex (slow) or XPath (fast)'''
    # new_page_Info = re.findall(r'<td class=".*?">.*?<a href="(.*?)\.html".*?>(.*?)</a></td>', new_page, re.S)
    # # new_page_Info = re.findall(r'<td class=".*?">.*?<a href="(.*?)">(.*?)</a></td>', new_page, re.S) # bugs
    # results = []
    # for url, item in new_page_Info:
    #     results.append((item, url + ".html"))
    # return results
    dom = etree.HTML(new_page)
    new_items = dom.xpath('//tr/td/a/text()')
    new_urls = dom.xpath('//tr/td/a/@href')
    assert len(new_items) == len(new_urls)
    # Note: in Python 3, zip() returns a one-shot iterator, not a list.
    return zip(new_items, new_urls)

def Spider(url):
    i = 0
    print("downloading ", url)
    # The NetEase pages are served as gbk, so decode the raw bytes explicitly.
    myPage = requests.get(url).content.decode("gbk")
    # myPage = urllib2.urlopen(url).read().decode("gbk")
    myPageResults = Page_Info(myPage)
    save_path = u"网易新闻抓取"
    filename = str(i) + "_" + u"新闻排行榜"
    StringListSave(save_path, filename, myPageResults)
    i += 1
    for item, url2 in myPageResults:
        print("Downloading ", url2)
        new_page = requests.get(url2).content.decode("gbk")
        # new_page = urllib2.urlopen(url).read().decode("gbk")
        newPageResults = New_Page_Info(new_page)
        filename = str(i) + "_" + item
        StringListSave(save_path, filename, newPageResults)
        i += 1

if __name__ == '__main__':
    print("start")
    start_url = "http://news.163.com/rank/"
    Spider(start_url)
    print("end")
Solution: open the file with an explicit encoding and write the strings directly, without calling .encode():

with open(path, "w+", encoding='utf-8') as fp:
    for s in slist:
        fp.write("%s\t\t%s\n" % (s[0], s[1]))
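If the script also still needs to run under Python 2.7, a minimal sketch using io.open would work in both versions (io.open exists in 2.7 and is the same function as the built-in open in Python 3; in text mode under Python 2 it requires unicode, which the strings already are after .decode("gbk")):

import io

with io.open(path, "w", encoding="utf-8") as fp:
    for s in slist:
        # s[0] and s[1] are already unicode/str, so write them directly;
        # io.open performs the UTF-8 encoding on the way out.
        fp.write(u"%s\t\t%s\n" % (s[0], s[1]))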
Searching online, I found an explanation of this problem, which pointed out:
3. The encoding of the target file. To write the decoded network data into a new file, we need to specify that file's encoding. The write looks like this:

f.write(txt)

where txt is a string that has already been decoded. Now the key point: the target file's encoding is the culprit behind the problem in the title. If we open a file like this:

f = open("out.html", "w")

then under Windows the new file's default encoding is gbk, so the Python interpreter will encode our network data txt as gbk when writing it out. But txt is already a decoded Unicode string, and characters that gbk cannot handle fail at this step, which leads to the problem described above. The fix is to change the target file's encoding:

f = open("out.html", "w", encoding='utf-8')
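For reference, the default that open() falls back to when no encoding is passed comes from the locale, which is why the behavior differs between a Chinese Windows machine and most Linux systems; it can be checked like this (the values in the comment are typical examples):

import locale
# open() uses locale.getpreferredencoding(False) when no encoding is given;
# on Chinese Windows this is typically 'cp936' (gbk), on most Linux 'UTF-8'.
print(locale.getpreferredencoding(False))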