Scraping a Single Chapter's Content with Python

xi_xixixi_

Last modified: 2022-10-18 14:36:21

Tags: python, web scraping, programming languages

First published: 2022-10-18 14:33:07

Article link: https://blog.csdn.net/xi_xixixi_/article/details/127360627

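This post introduces data parsing in Python, covering XPath, CSS selectors, and regular expressions (re) for scraping the content of a single chapter. It discusses when to use CSS and XPath and how to extract data with each. It also explains how to convert a list to a string, how to save the data, and how to create folders automatically.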

Scraping a single chapter's content

Data parsing

1. xpath, css, re

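When should you use CSS and when XPath? Both can extract data across nested tags.
CSS selectors extract data by tag attributes.
XPath extracts data by tag nodes.
Use either one when the data you get back contains tags:
selector = parsel.Selector(response.text)  # convert the response.text string into a parseable object
selector.css('::text').get()
re: use it when the data cannot be extracted through tags. It works directly on the string data.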
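The re case can be illustrated with a minimal sketch. It assumes the value you want sits inside a script block rather than in its own tag, so a selector has nothing convenient to target. The `page_text` string and its `title` and `pid` fields are hypothetical and only stand in for real page source.

```python
import re

# Hypothetical page source: the data lives inside a <script> block,
# not in markup that a CSS or XPath selector can address directly.
page_text = '<script>var info = {"title": "Sample Title", "pid": "abc123"};</script>'

# re.search returns the first match (or None); group(1) is the captured value.
match = re.search(r'"title":\s*"(.*?)"', page_text)
title = match.group(1) if match else None

# re.findall returns every captured value as a list of strings.
pids = re.findall(r'"pid":\s*"(\w+)"', page_text)

print(title)
print(pids)
```

Because re only sees a string, it does not care whether the surrounding HTML is well formed. That is also why it is more brittle than a selector when the markup is regular.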

Extracting data with CSS

# Scrape the content of one chapter from a web page
import requests
import parsel

url = 'https://hanyu.baidu.com/shici/detail?pid=fe2332c606714577b32b30200d8440b6&from=kg0'
html = requests.get(url)
# print(html.text)
# Convert the string data into a parseable object.
selector = parsel.Selector(html.text)
# Get the title: select the title element in the browser dev tools, right-click, and copy its selector path.
title = selector.css('#poem-detail-header > h1::text').get()
print(title)

Extracting data with XPath

# Scrape the content of one chapter from a web page
import requests
import parsel
from bs4 import BeautifulSoup

url = 'https://hanyu.baidu.com/shici/detail?pid=fe2332c606714577b32b30200d8440b6&from=kg0'
html = requests.get(url)