I've been busy (and nervous) preparing for my postgraduate re-examination these past few days, so I haven't updated the blog much. Here's a summary of the crawler code I've been working on:

- I originally planned to extract the tags with BeautifulSoup, but couldn't find a clean way to do it; XPath still suits me best. I iterate over the identical `li` tags in the page to pull out each chapter's information.
- A single thread was painfully slow; the network responses took forever. (Oddly, opening the pages in a browser isn't that slow. Worth investigating sometime.) So I reused the wrapped `Pool` from the multiprocessing module summarized last time, which hands the work out to several workers at once to save time.
- Since concurrent writes could come back out of order, I collect all the results first and only then write them to the file in list order, so I don't have to worry about scrambled chapters.
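The ordered write works because `Pool.map` already returns its results in the same order as the input list, no matter which thread finishes first. A minimal sketch of that guarantee (the toy `work` function is mine, just a stand-in for the real per-chapter fetch):

```python
from multiprocessing.dummy import Pool  # thread pool, despite the module name

# Stand-in for the real per-chapter download: just squares its input.
def work(x):
    return x * x

pool = Pool(4)
# map() blocks until all tasks finish and preserves input order.
results = pool.map(work, [1, 2, 3, 4, 5])
pool.close()
pool.join()
print(results)  # [1, 4, 9, 16, 25]
```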
```python
# Scrape the novel "盗墓笔记" from quanshu.net
import requests
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool  # thread pool, despite the module name


def getbody(dic):
    """Fetch one chapter page and attach its text to the chapter dict."""
    fullurl = url + dic['href']
    res = requests.get(fullurl, headers=header)
    html = res.content.decode('gbk')  # the site serves GBK-encoded pages
    sector = etree.HTML(html)
    s = sector.xpath('//div[@id="content"]/text()')
    print(dic['title'])  # progress indicator
    # Replace non-breaking spaces and restore line breaks
    dic['body'] = [line.replace('\xa0', ' ') + '\n' for line in s]
    return dic


# The site seems to check the request headers, so disguise them a bit
header = {
    'Connection': 'Keep-Alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Accept-Encoding': 'gzip, deflate',
    'Host': 'www.quanshu.net'
}

url = 'http://www.quanshu.net/book/9/9055/'
res = requests.get(url, headers=header)
html = res.content.decode('gbk')
sector = etree.HTML(html)

# Every chapter link sits in an identical li tag under the index div
content = sector.xpath('//div[@class="clearfix dirconone"]/li/a')
chapters = []
for a in content:
    chapters.append({'title': a.text, 'href': a.attrib['href'], 'body': ''})

# Fetch chapters concurrently; map() returns results in input order,
# so writing them out in sequence keeps the chapters sorted
pool = ThreadPool(4)
results = pool.map(getbody, chapters)
pool.close()
pool.join()

with open(r'C:\Users\sunqi\Desktop\盗墓笔记2.txt', 'w', encoding='utf-8') as f:
    for chap in results:
        f.write('\n' * 2 + '-' * 10 + chap['title'] + '-' * 10 + '\n' * 2)
        f.writelines(chap['body'])

input('Done. Press Enter to exit')
```
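For reference, the XPath pattern used to build the chapter list can be checked offline against a miniature copy of the index page's structure (the HTML snippet below is made up; only the element layout and class name mirror the real site):

```python
from lxml import etree

# Hypothetical miniature of the chapter index page.
html = '''
<div class="clearfix dirconone">
  <li><a href="1.html">Chapter 1</a></li>
  <li><a href="2.html">Chapter 2</a></li>
</div>
'''

tree = etree.HTML(html)
# Same expression as in the scraper: every chapter link under the index div
links = tree.xpath('//div[@class="clearfix dirconone"]/li/a')
chapters = [{'title': a.text, 'href': a.attrib['href']} for a in links]
print(chapters)
# [{'title': 'Chapter 1', 'href': '1.html'}, {'title': 'Chapter 2', 'href': '2.html'}]
```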