Python crawler with XPath: scraping a full novel, a beginner-level crawler tutorial.

import requests
from lxml import etree
import time

'''
Approach:
1. Pick the novel to scrape and its entry URL.
2. Scrape the chapter links and build every chapter detail-page URL by string concatenation.
3. Scrape the book title.
4. Scrape each chapter's title and its body text.
5. Append the chapters in order and save them all into a single .txt file.
'''
# Set the request headers (pretend to be a regular browser)
headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'}

url = 'http://www.biquge.info/84_84283/'


def get_html(url):
    # Fetch the page
    html = requests.get(url, headers=headers)
    html.encoding = 'utf-8'
    html_code = html.text
    # Parse the HTML
    soup = etree.HTML(html_code)
    # Return the parsed document
    return soup


# Get the links of all chapters from the table of contents
def get_list(url):
    soup = get_html(url)
    # Find the relative href of every chapter
    list_box = soup.xpath('//*[@id="list"]/dl/dd/a/@href')
    # Build the full chapter URLs by joining each href onto the entry URL
    book_lists = []
    for i in list_box:
        book_lists.append(url + i)
    return book_lists


# Get the book title
def get_book_title(url):
    soup = get_html(url)
    book_title = soup.xpath('//*[@id="info"]/h1/text()')
    # xpath() returns a list of text nodes; take the first one and strip it
    # so the title can be used directly as a file name
    book_title = book_title[0].strip()
    return book_title


# Get the chapter title from a chapter page
def get_title(url):
    soup = get_html(url)
    title = soup.xpath('//*[@id="wrapper"]/div[4]/div/div[2]/h1/text()')
    # Return a clean string instead of the raw list of text nodes
    return title[0].strip() if title else ''


# Get the body text from a chapter page
def get_novel_content(url):
    soup = get_html(url)
    # Extract the chapter body text nodes
    content = soup.xpath('//*[@id="content"]/text()')
    return content


# Save the novel to a local .txt file
def save_novel(url):
    book_lists = get_list(url)
    book_title = get_book_title(url)
    num = 1
    with open(book_title + '.txt', 'a', encoding='utf-8') as f:
        for list_url in book_lists:
            # Write the chapter title on its own line
            chapter_title = get_title(list_url)
            f.write(chapter_title + '\n')

            # Write the chapter body, one extracted text node per line
            chapter_content = get_novel_content(list_url)
            for c in chapter_content:
                f.write(c + '\n')

            # time.sleep(2)  # optional: pause between requests to be polite to the site

            print('*** Chapter {} downloaded ***'.format(num))
            num += 1


if __name__ == '__main__':
    save_novel(url)
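
To make the XPath step easier to follow, here is a minimal, self-contained sketch of what etree.HTML plus an XPath query actually returns. The HTML fragment below is invented purely for illustration and only mimics the id="info" / id="list" structure that the expressions in the script above are written against; the real site's markup may differ.

from lxml import etree

# A made-up fragment imitating the structure the script's XPath queries expect
sample_html = '''
<div id="info"><h1>Demo Book</h1></div>
<div id="list">
  <dl>
    <dd><a href="1001.html">Chapter 1</a></dd>
    <dd><a href="1002.html">Chapter 2</a></dd>
  </dl>
</div>
'''

tree = etree.HTML(sample_html)
# text() yields a list of text nodes, @href a list of attribute values
print(tree.xpath('//*[@id="info"]/h1/text()'))      # ['Demo Book']
print(tree.xpath('//*[@id="list"]/dl/dd/a/@href'))  # ['1001.html', '1002.html']

Because every xpath() call returns a list, the functions above take the first element (or loop over the list) before writing anything to the file.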




 

Reference: a video on Bilibili plus this WeChat article: https://mp.weixin.qq.com/s?__biz=MzIxOTcyMDM4OQ==&mid=2247483927&idx=1&sn=d4c9fcb6becc3e1d26a8d8385d8c2b99&chksm=97d7bdbda0a034ab3faf0f30ed50a1e35a0a9edcceb9b2ae9a0a6c7e4efd72a64cde07df439f&token=1524452913&lang=zh_CN#rd

A more elegant implementation, very readable and with a clear structure:

https://blog.csdn.net/sinat_34937826/article/details/105562463?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-10.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-10.nonecase

 
