A Simple Python Web Crawler

Latest recommended article published 2023-07-27 10:42:53

爱吃猫的鱼101


Category: Python crawlers. Tags: python, crawler

Article link: https://blog.csdn.net/qq_43369592/article/details/115409965


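Python Crawler: Gushiwen (古诗文网)

Target: the famous lines (名句) listed on Gushiwen, together with their sources

Approach:

requests for the HTTP requests
XPath for data extraction

Page analysis:

Finding the page structure

Every item to scrape sits under div[@class="sons"], so iterating over the nodes matched by that expression locates each entry, and a relative XPath query on each node then extracts the data.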
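
The iterate-then-extract pattern described above can be sketched on a small inline snippet. This is only an illustration: the markup in SAMPLE is a made-up stand-in for the real page, extract_pairs is a hypothetical helper, and the sketch uses the standard library's xml.etree.ElementTree (which supports a subset of XPath and needs well-formed markup) so that it runs without extra packages. Real pages still need lxml's etree.HTML, as in the article's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page layout may differ.
SAMPLE = """
<div class="left">
  <div class="sons">
    <div class="cont"><a href="/mingju/1">first quoted line</a><span>, </span><a href="/shiwen/1">Author, Title</a></div>
    <div class="cont"><a href="/mingju/2">second quoted line</a><span>, </span><a href="/shiwen/2">Another Author, Another Title</a></div>
  </div>
</div>
"""


def extract_pairs(markup):
    """Return (line, source) pairs found under every div with class 'sons'."""
    root = ET.fromstring(markup)
    pairs = []
    for block in root.findall(".//div[@class='sons']"):
        # a[1] is the first <a> inside its parent, a[2] the second one.
        lines = [a.text for a in block.findall(".//a[1]")]
        sources = [a.text for a in block.findall(".//a[2]")]
        pairs.extend(zip(lines, sources))
    return pairs


for line, source in extract_pairs(SAMPLE):
    print(line + "----" + source)
```

The relative queries start with a dot so that they search only inside the current block rather than the whole document.
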
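Getting the next-page URL

The next-page URL is in div[@class="pagesright"]/a[@class="amore"]/@href. Note that this href is a relative address, so it has to be joined with the site's base URL before it can be requested.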
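
One way to do the joining is urllib.parse.urljoin from the standard library, which resolves a relative href against the page it came from. The sketch below is an illustration only: absolute_url is a hypothetical helper and the href values are invented examples, not values taken from the site.

```python
from urllib.parse import urljoin

# The listing page used as the base for resolving relative links.
BASE_URL = "https://so.gushiwen.cn/mingjus/"


def absolute_url(href, base=BASE_URL):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


# Hypothetical href values, for illustration only.
print(absolute_url("/mingjus/default.aspx?page=2"))  # root-relative path
print(absolute_url("default.aspx?page=2"))           # path relative to the page
print(absolute_url("https://example.com/other"))     # already absolute, returned unchanged
```

Compared with plain string concatenation, urljoin also copes with hrefs that are already absolute or that are relative to the current directory.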

Implementation:

# Import the packages we need
import requests
from lxml import etree

# Fetch a page and hand the HTML to the parser
def get(url):
    resp = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3823.400 QQBrowser/10.7.4307.400',
    }, timeout=10)  # time out instead of hanging on a stalled connection
    if resp.status_code == 200:  # status code 200 means the request succeeded
        parse(resp.text)  # pass the HTML to parse() for extraction
    else:
        print('Failed to fetch page:', url)


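# Parse one page: save its entries, then follow the next-page link
def parse(html):
    root = etree.HTML(html)  # build the element tree from the HTML text
    # Entry rule for the listing: each div.sons block holds the entries
    divs = root.xpath('//div[@class="main3"]/div[@class="left"]/div[@class="sons"]')
    # Append to the output file; utf-8 so the Chinese text is written correctly on any platform
    with open('名句.txt', 'a', encoding='utf-8') as file:
        for div in divs:
            values = div.xpath('.//a[1]/text()')  # xpath() returns a list
            name = div.xpath('.//a[2]/text()')
            for v, n in zip(values, name):
                print(v + '----' + n, file=file)

    # xpath() returns a list, which is empty when there is no next-page link
    next_links = root.xpath('//div[@class="left"]//div[@class="pagesright"]/a[@class="amore"]/@href')
    if next_links:
        page_url = 'https://so.gushiwen.cn' + next_links[0]  # join the relative href with the site root
        get(page_url)  # fetch the next page
    else:
        print('Crawl finished')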


if __name__ == '__main__':
    get('https://so.gushiwen.cn/mingjus/')
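
The script above follows next-page links by having parse call get again, so each page adds a level of recursion. The same traversal can be written as a loop. The following is a minimal sketch under stated assumptions: crawl, fetch, extract, max_pages and delay are hypothetical names introduced here, and the fake pages exist only so the example runs without network access.

```python
import time
from urllib.parse import urljoin


def crawl(start_url, fetch, extract, max_pages=50, delay=0.0):
    """Follow next-page links in a loop and collect the extracted items.

    fetch(url) returns the page text, or None on failure.
    extract(text) returns (items, next_href), with next_href None on the last page.
    Both callables are supplied by the caller.
    """
    results = []
    seen = set()
    url = start_url
    # Stop on a missing link, a repeated URL, or after max_pages pages.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        text = fetch(url)
        if text is None:
            break
        items, next_href = extract(text)
        results.extend(items)
        url = urljoin(url, next_href) if next_href else None
        if url and delay:
            time.sleep(delay)  # pause between requests to go easy on the server
    return results


# Fake two-page site for illustration; the "page text" is simply the URL itself.
FAKE_PAGES = {
    "https://example.com/list/": (["a", "b"], "/list/?page=2"),
    "https://example.com/list/?page=2": (["c"], None),
}


def fake_fetch(url):
    return url if url in FAKE_PAGES else None


def fake_extract(text):
    return FAKE_PAGES[text]


print(crawl("https://example.com/list/", fake_fetch, fake_extract))
```

In a real run, fetch would wrap the requests call and extract would wrap the XPath queries; the seen set and max_pages guard keep the loop from running forever if the site ever links back to an earlier page.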
