Python Web Scraper

Latest recommended article published on 2024-09-12 23:06:08

GDDGHS_

Tags: web scraping, python

Article link: https://blog.csdn.net/GDDGHS_/article/details/141995628


Scraping the Full Text of Dream of the Red Chamber (红楼梦)

import requests
# Import the HTML parsing module
from lxml import etree
def pater(url):
    # Request headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36"
    }
    # Send the request to the server
    response = requests.get(url, headers=headers)
    # Return the data; decode() converts a byte string (bytes) into a string (str)
    return response.content.decode()
# Table-of-contents page of the book
url = "https://www.shicimingju.com/book/hongloumeng.html"
text = pater(url)
html = etree.HTML(text)
book_names = html.xpath("//h1//text()")[0]
titles = html.xpath('//div[@class="book-mulu"]//a//text()')
hears = html.xpath('//div[@class="book-mulu"]//a//@href')
domin = "https://www.shicimingju.com"

for url in hears:
    title = titles[hears.index(url)]
    print(title)
    url = domin + url
    text = pater(url)
    # Convert the text into an Element object that XPath can query
    html = etree.HTML(text)
    content = html.xpath("//div[@class='chapter_content']//p/text()")
    with open(book_names + ".txt", 'a', encoding='utf-8') as f:
        f.write(title + '\n')
        for con in content:
            f.write(con + '\n')
        print(title)
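The loop above finds each title with hears.index(url), which rescans the list on every iteration and always returns the first match if two links happen to be identical. One possible alternative, shown here only as a sketch, is to walk both lists in step with zip() and build the absolute URL with urljoin() from the standard library. The helper name pair_chapters and the sample titles and paths below are hypothetical, chosen for illustration; they are not taken from the site.

```python
from urllib.parse import urljoin


def pair_chapters(titles, hrefs, base):
    """Pair each chapter title with its absolute URL, in page order."""
    return [(title.strip(), urljoin(base, href))
            for title, href in zip(titles, hrefs)]


# Hypothetical sample data standing in for the two XPath result lists.
sample_titles = ["Chapter 1", "Chapter 2"]
sample_hrefs = ["/book/hongloumeng/1.html", "/book/hongloumeng/2.html"]

chapters = pair_chapters(sample_titles, sample_hrefs,
                         "https://www.shicimingju.com")
for title, url in chapters:
    print(title, url)
```

Note that zip() stops at the shorter list, so this assumes the title and link lists have the same length and order.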

Code Explanation

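import requests: imports the requests library, used to send HTTP requests.
from lxml import etree: imports the etree module from the lxml library, used to parse HTML documents.
def pater(url): defines a function named pater that takes a url parameter.
headers = {...}: sets the request headers so the request looks like it comes from a browser.
response = requests.get(url, headers=headers): sends a GET request with the requests library's get method and fetches the page.
return response.content.decode(): decodes the fetched page content (a byte string) into a string and returns it. With no argument, decode() assumes UTF-8.
url = "https://www.shicimingju.com/book/hongloumeng.html": sets the address of the page to scrape.
text = pater(url): calls pater with url and gets the page content.
html = etree.HTML(text): converts the page content into an etree object for later parsing.
book_names = html.xpath("//h1//text()")[0]: extracts the book title from the page with an XPath expression.
titles = html.xpath('//div[@class="book-mulu"]//a//text()'): extracts the chapter titles.
hears = html.xpath('//div[@class="book-mulu"]//a//@href'): extracts the chapter links.
domin = "https://www.shicimingju.com": sets the site's domain, used to build the full chapter links.
for url in hears:: iterates over the list of chapter links.
title = titles[hears.index(url)]: looks up the index of the current link in the list and uses it to get the matching chapter title.
print(title): prints the chapter title.
url = domin + url: builds the full chapter link.
text = pater(url): calls pater with the new url and gets the chapter page.
html = etree.HTML(text): converts the chapter page into an etree object for later parsing.
content = html.xpath("//div[@class='chapter_content']//p/text()"): extracts the chapter body text.
with open(book_names+".txt", 'a', encoding='utf-8') as f:: opens a txt file named after the book in append mode, with utf-8 encoding.
f.write(title+'\n'): writes the chapter title to the file.
for con in content:: iterates over the paragraphs of the chapter body.
f.write(con + '\n'): writes each paragraph of the chapter body to the file.
print(title): prints the chapter title again once the chapter has been written.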
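As the walkthrough notes, response.content is a byte string and decode() turns it into text. Called without arguments, decode() assumes UTF-8 and raises UnicodeDecodeError if the page uses another encoding. The sketch below shows one way to make that step more tolerant by trying a short list of encodings in order. The function name decode_html and the choice of gb18030 as the second candidate are assumptions for illustration; the original script does not do this.

```python
def decode_html(raw, encodings=("utf-8", "gb18030")):
    """Decode page bytes by trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


# The same text decodes correctly from either encoding.
print(decode_html("红楼梦".encode("utf-8")))
print(decode_html("红楼梦".encode("gbk")))
```

In pater, this would replace the final line with return decode_html(response.content), again only as an optional variation.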
