This is a fairly simple Baidu Wenku scraper, so it comes with plenty of obvious drawbacks: it's pretty crude, it can only grab Word and txt documents (don't even think about PPTs), it can't handle collapsed content, and VIP content is of course out of reach. Baidu's paywalling really is ugly, but hey, with money you can do whatever you want!
The key point is the request header: only by sending a crawler's User-Agent can you get at the content!
header = {'User-agent': 'Googlebot'}
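You can confirm the header is being set without hitting the network at all, by building the request object and inspecting it (a minimal stdlib sketch; the URL is just a placeholder):

```python
from urllib.request import Request

# build a request carrying the crawler User-Agent (the URL is a placeholder)
req = Request('https://wenku.baidu.com/view/xxx.html',
              headers={'User-agent': 'Googlebot'})
print(req.get_header('User-agent'))  # Googlebot
```

urllib stores header names capitalized as 'User-agent', which is why the lookup key matches the dict key above.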
To output the result as a Word document, you'll need the docx library!
The formatting is only barely passable, but something is better than nothing, right?!
Install it with pip (note the package is named python-docx, while the import is docx):
pip install python-docx
Reference code:
def get_word(data):
    document = Document()
    document.add_heading(data[0])
    for detail in data[1]:
        document.add_paragraph(detail)  # add a paragraph
    document.save(f'{data[0]}.docx')
Full code for reference:
# -*- coding: UTF-8 -*-
# Baidu Wenku scraper
# 2020-08-03  WeChat: huguo00289
# https://wenku.baidu.com/view/312ce9da0129bd64783e0912a216147916117e27.html
import requests
import re
from lxml import etree
from docx import Document
def get_detail(url):
    # url = 'https://wenku.baidu.com/view/312ce9da0129bd64783e0912a216147916117e27.html'
    header = {'User-agent': 'Googlebot'}
    response = requests.get(url, headers=header).content.decode('gbk')
    # print(response)
    title_ze = r'<title>(.+?)_百度文库'
    div_ze = r'<div class="content singlePage wk-container">(.+?)</div>'  # class name assumed; adjust to the page's current markup
    title = re.findall(title_ze, response, re.S)[0]
    div = re.findall(div_ze, response, re.S)[0]
    div = etree.HTML(div)
    details = div.xpath('//div//text()')
    # detail = '\n'.join(details)
    data = title, details
    print(data)
    return data
def get_word(data):
    document = Document()
    document.add_heading(data[0])
    for detail in data[1]:
        document.add_paragraph(detail)  # add a paragraph
    document.save(f'{data[0]}.docx')
if __name__ == '__main__':
    url = "https://wenku.baidu.com/view/cb02b4a91837f111f18583d049649b6648d7092e"
    text = get_detail(url)
    get_word(text)
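The title-extraction step can be sanity-checked offline against a small HTML sample, assuming the page's title tag follows the usual `(.+?)_百度文库` pattern (the sample string below is made up for illustration):

```python
import re

# made-up HTML fragment mimicking a Wenku page's <title>
sample = '<html><head><title>示例文档_百度文库</title></head></html>'
title_ze = r'<title>(.+?)_百度文库'
title = re.findall(title_ze, sample, re.S)[0]
print(title)  # 示例文档
```

Testing the regex this way is cheaper than re-fetching the page each time you tweak the pattern.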