Scraping Baidu Wenku with Python and Selenium

爷来辣

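For study and reference only; please credit the source when reposting. These are personal notes I keep for myself, and if they help anyone else, all the better. They are roughly written, so corrections are welcome if anything is wrong. Please go easy on me.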

Article link: https://blog.csdn.net/xujiamin0022016/article/details/86003000

Reference

https://blog.csdn.net/c406495762/article/details/72331737

Platform: Windows
Python version: Python 3.6

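The docx module for Python 3.6 differs from the one for Python 2.7, and installing it directly with pip reports a missing dependency.

First change into the PyCharm directory and install python_docx-0.8.7-py2.py3-none-any.whl:

pip install python_docx-0.8.7-py2.py3-none-any.whl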

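Then install docx (the python-docx wheel already provides the docx module that the script imports, so this step may be unnecessary):

pip install docx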

Download link for python_docx-0.8.7-py2.py3-none-any.whl:

https://download.lfd.uci.edu/pythonlibs/r5uhg2lo/python_docx-0.8.7-py2.py3-none-any.whl
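Before running the scraper it can help to confirm that the modules it imports are actually visible to the interpreter. The snippet below is a minimal sketch using only the standard library; the helper name `module_available` is made up for this illustration, and the list of module names is simply taken from the imports of the script in this post (plus lxml, which BeautifulSoup is asked to use as its parser).

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be found by the importer."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Top-level modules used by the script in this post.
    for name in ("docx", "selenium", "bs4", "lxml"):
        status = "ok" if module_available(name) else "missing"
        print(f"{name}: {status}")
```

Any module reported as missing needs to be installed before the script will start.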

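The desktop Baidu Wenku page is complex, so scraping it may return incomplete content. The script therefore sets a User-Agent to simulate access from a mobile phone, then reads the document title and page content and pages through the document.

In Chrome, the User-Agent has to be set explicitly.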
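As a rough sketch of how the User-Agent override can be isolated, the helper below builds the Chrome command-line switch and strips any surrounding quotes, since no shell is involved and stray quotes may otherwise end up in the header value. The names `user_agent_argument` and `make_mobile_driver` are hypothetical helpers for this illustration; the User-Agent string is the one used in this post. Starting the driver assumes that selenium and a matching chromedriver are installed.

```python
MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/71.0.3578.98 Mobile Safari/537.36"
)


def user_agent_argument(user_agent):
    """Build the Chrome switch that overrides the User-Agent header."""
    # Drop surrounding whitespace and quote characters.
    cleaned = user_agent.strip().strip("\"'")
    return "--user-agent=" + cleaned


def make_mobile_driver(user_agent=MOBILE_UA):
    """Start Chrome with the overridden User-Agent (needs selenium and chromedriver)."""
    # Imported lazily so that user_agent_argument() works without selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(user_agent_argument(user_agent))
    return webdriver.Chrome(options=options)
```

After creating a driver this way, `driver.execute_script("return navigator.userAgent")` is one way to check which value the browser actually reports.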

# -*- coding: utf-8 -*-
from time import sleep

from bs4 import BeautifulSoup
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH  # used to center the title
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

# Target URL
DEST_URL = 'https://wenku.baidu.com/view/8962c8dfb9f3f90f76c61b69.html'
# Mobile User-Agent, so that Baidu Wenku serves the mobile page
MOBILE_UA = ('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
             'AppleWebKit/537.36 (KHTML, like Gecko) '
             'Chrome/71.0.3578.98 Mobile Safari/537.36')
# Used to hold the document
doc_title = ''
doc_content_list = []


def find_doc(driver, init=True):
    global doc_content_list
    global doc_title
    stop_condition = False
    html = driver.page_source
    soup1 = BeautifulSoup(html, 'lxml')
    if init:
        # Get the document title
        title_result = soup1.find('div', attrs={'class': 'doc-title'})
        doc_title = title_result.get_text()
        # Scroll to the folded-page prompt and click it
        init_page = driver.find_element(By.XPATH, "//div[@class='foldpagewg-text-con']")
        driver.execute_script('arguments[0].scrollIntoView();', init_page)
        init_page.click()
        init = False
    else:
        try:
            # Button showing how much is left unread (not used)
            # page = driver.find_element(By.XPATH, "//div[@class='pagerwg-schedule']")
            # "Load more" button. Clicking "continue reading" still leads to
            # "load more", so click "load more" directly.
            next_page = driver.find_element(By.CLASS_NAME, 'pagerwg-button')
            # Scroll down to the bottom bar
            station = driver.find_element(By.XPATH, "//div[@class='bottombarwg-root border-none']")
            driver.execute_script('arguments[0].scrollIntoView(false);', station)
            # Wait in case the page loads slowly
            sleep(5)
            next_page.click()
        except WebDriverException:
            # Stop condition: the button can no longer be found or clicked
            stop_condition = True

    # Go through every paragraph tagged with class "txt", remove its spaces, and store it
    content_result = soup1.find_all('p', attrs={'class': 'txt'})
    for each in content_result:
        # Body text with spaces removed
        doc_content_list.append(each.get_text().replace(' ', ''))
    # Wait in case the page loads slowly
    sleep(5)
    if not stop_condition:
        doc_title, doc_content_list = find_doc(driver, init)
    return doc_title, doc_content_list


def save(doc_title, doc_content_list):
    document = Document()
    heading = document.add_heading(doc_title, 0)
    heading.alignment = WD_ALIGN_PARAGRAPH.CENTER  # center the title
    for each in doc_content_list:
        document.add_paragraph(each)
    # Use the first whitespace-separated token of the title in the file name
    t_title = (doc_title.split() or ['untitled'])[0]
    # Save the .docx file in the current working directory
    document.save('百度文库-%s.docx' % t_title)
    print("\n\nCompleted: %s.docx, to read." % t_title)


if __name__ == '__main__':
    options = webdriver.ChromeOptions()
    # No inner quotes around the value: no shell strips them, so they may
    # otherwise end up in the User-Agent header itself
    options.add_argument('user-agent=' + MOBILE_UA)
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(DEST_URL)
        print("**********START**********")
        title, content = find_doc(driver, True)
        save(title, content)
    finally:
        # Close the browser exactly once, even if scraping fails
        driver.quit()
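The script above recurses once per "load more" click and re-reads every `p.txt` paragraph from the page source on each pass. If the mobile page keeps earlier paragraphs in the DOM after loading more (an assumption that has not been verified here), the same paragraphs would be appended repeatedly. The following is a minimal sketch of an iterative alternative that only keeps the newly revealed tail. `collect_paragraphs`, `read_paragraphs`, `load_more`, and `max_pages` are hypothetical names for this illustration; the two callables would be thin wrappers around the Selenium and BeautifulSoup calls used in the script.

```python
def clean_text(text):
    """Remove spaces, as the script does for each paragraph."""
    return text.replace(" ", "")


def collect_paragraphs(read_paragraphs, load_more, max_pages=200):
    """Collect paragraphs page by page with a loop instead of recursion.

    read_paragraphs() returns every paragraph text currently on the page.
    load_more() tries to reveal the next page and returns False when
    there is nothing left to load. Both are supplied by the caller.
    """
    collected = []
    for _ in range(max_pages):
        current = [clean_text(t) for t in read_paragraphs()]
        # Assumption: the page only grows, so everything beyond what is
        # already stored is new content.
        collected.extend(current[len(collected):])
        if not load_more():
            break
    return collected
```

Passing the page access in as callables keeps the loop testable without a browser, and `max_pages` puts an upper bound on the number of clicks if the stop condition is never reached.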
