Baidu Wenku scraper: scraping Baidu Wenku content with Python and exporting it to a Word document (low-end version)

Latest recommended article published on 2024-05-13 09:05:51

二爷记


Tags: Baidu, python, js, javascript, css

Article link: https://blog.csdn.net/minge89/article/details/108301915


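This is a fairly simple Wenku scraper, so it comes with plenty of obvious limitations. It is pretty low-end: it only handles Word and txt documents (forget about PPT), the document must not contain collapsed content, and VIP content is of course out of reach. Baidu's greed is really ugly; having money truly lets you do whatever you want!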

The key point is the request header: the content only comes back if you send a search-engine crawler's User-Agent directly!

header = {'User-agent': 'Googlebot'}
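To make the header trick concrete, here is a minimal sketch that attaches the same crawler User-Agent to a request. It uses the standard library's urllib instead of the requests library used in the full script, purely so it runs without extra installs; the names `CRAWLER_HEADERS`, `build_request`, and `fetch_html` are made up for this illustration, and the `'gbk'` default simply mirrors the decoding in the full script.

```python
import urllib.request

# Same idea as the header above: identify as a search-engine crawler.
CRAWLER_HEADERS = {'User-Agent': 'Googlebot'}


def build_request(url):
    """Return a urllib Request carrying the crawler User-Agent (not sent yet)."""
    return urllib.request.Request(url, headers=CRAWLER_HEADERS)


def fetch_html(url, encoding='gbk', timeout=10):
    """Send the request and decode the body.

    'gbk' is an assumption taken from the full script; undecodable bytes
    are replaced rather than raising an error.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode(encoding, errors='replace')
```

Calling `build_request(url)` only prepares the request; nothing goes over the network until `fetch_html(url)` is called. Whether the server still serves full content to this User-Agent is not something the sketch can guarantee.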

To export the result as a Word document, you need the docx library (python-docx)!

The formatting is admittedly far from ideal, but something is better than nothing, wouldn't you say?!

Install the docx library with pip:

pip install python_docx

Documentation: https://python-docx.readthedocs.io/en/latest/

Reference code:

from docx import Document

def get_word(data):
    # data is a (title, paragraphs) tuple
    document = Document()
    document.add_heading(data[0])

    for detail in data[1]:
        document.add_paragraph(detail)  # add a paragraph

    document.save(f'{data[0]}.docx')
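One thing to watch in the snippet above: the scraped title is used directly as the file name, so a title containing characters such as `/` or `?` can make `document.save` fail. A possible workaround, sketched under the assumption that replacing those characters with underscores is acceptable, is a small helper like the hypothetical `safe_filename` below (it is not part of the original script).

```python
import re


def safe_filename(title, fallback='document'):
    """Replace characters that are not allowed in common file systems.

    Hypothetical helper: swaps path separators, reserved punctuation and
    control whitespace for underscores, then trims spaces and dots.
    """
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title).strip(' .')
    return cleaned or fallback
```

It could then be used as `document.save(f'{safe_filename(data[0])}.docx')`.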

Full code for reference:

# -*- coding: UTF-8 -*-
# Baidu Wenku scraper
# 20200803 WeChat: huguo00289
# https://wenku.baidu.com/view/312ce9da0129bd64783e0912a216147916117e27.html

import re

import requests
from lxml import etree
from docx import Document

def get_detail(url):
    # Example: url = 'https://wenku.baidu.com/view/312ce9da0129bd64783e0912a216147916117e27.html'
    # Send a crawler User-Agent; this is what makes the content come back
    header = {'User-agent': 'Googlebot'}
    response = requests.get(url, headers=header).content.decode('gbk')
    # print(response)
    # Regex for the page title (without the "_百度文库" suffix) and for the reader block
    title_ze = r'<title>(.+?)_百度文库</title>'
    div_ze = r'<div class="bd doc-reader">(.+?)<div class="aside">'
    title = re.findall(title_ze, response, re.S)[0]
    div = re.findall(div_ze, response, re.S)[0]
    # Parse the reader block and collect every text node inside its divs
    div = etree.HTML(div)
    details = div.xpath('//div//text()')
    # detail = '\n'.join(details)
    data = title, details
    print(data)
    return data



def get_word(data):
    document = Document()
    document.add_heading(data[0])

    for detail in data[1]:
        document.add_paragraph(detail)  # add a paragraph


    document.save(f'{data[0]}.docx')

if __name__=='__main__':
    url="https://wenku.baidu.com/view/cb02b4a91837f111f18583d049649b6648d7092e"
    text=get_detail(url)
    get_word(text)
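The script above assumes both regexes always match and writes every text node, including whitespace-only ones, as its own paragraph. A slightly more defensive variant of those two steps might look like the sketch below; `extract_title` and `clean_fragments` are hypothetical helper names, and the `'untitled'` fallback is an assumption rather than anything from the original script.

```python
import re

# Same title pattern as in get_detail, compiled once.
TITLE_RE = re.compile(r'<title>(.+?)_百度文库</title>', re.S)


def extract_title(html, default='untitled'):
    """Return the page title without the '_百度文库' suffix, or a default."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else default


def clean_fragments(fragments):
    """Drop whitespace-only text nodes and trim the remaining ones."""
    return [text.strip() for text in fragments if text.strip()]
```

With these, `get_detail` could return `extract_title(response), clean_fragments(details)`, so a page with an unexpected layout yields a placeholder title instead of an `IndexError`, and the resulting Word file has no empty paragraphs.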

WeChat official account: 二爷记

Python source code and tools, shared from time to time
