Introduction to Python Web Scraping

Latest recommended article published 2024-07-27 17:39:19

cumt 方程

Tags: python, web scraping, programming languages

Article link: https://blog.csdn.net/weixin_45765291/article/details/122547556

Web scraping basics

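Installing and importing the requests module

Install the module from the command line (cmd), for example with pip install requests.
Then, in PyCharm, make it available to the project through the interpreter settings.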

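Sending a request with Python

import requests

url = "https://yz.chsi.com.cn/kyzx/jyxd"
r = requests.get(url).text

requests.get(url) returns a Response object. It is worth testing it first: if its status_code is 200, the request succeeded.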
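The status check described above can be sketched as a small helper. This is only an illustration: the name fetch_text, its session parameter and the 10-second timeout are assumptions made here, not part of the original code. The session argument is assumed to be a requests.Session() or any object with a compatible get method.

```python
def fetch_text(url, session, timeout=10):
    """Fetch url with session.get and return the body only if the status is 200.

    `session` is assumed to be a requests.Session(), for example:
        import requests
        html = fetch_text("https://yz.chsi.com.cn/kyzx/jyxd", requests.Session())
    """
    response = session.get(url, timeout=timeout)
    if response.status_code != 200:
        # Anything other than 200 is treated as a failed request here.
        raise RuntimeError(f"GET {url} returned status {response.status_code}")
    return response.text
```

Passing the session in, rather than calling requests.get inside the function, also makes the helper easy to try out without touching the network.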

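Parsing the data

from lxml import etree  # parse the data
"""
//  : select matching nodes anywhere in the document
[]  : predicate (filter condition)
/   : step down to a child element
@   : select an attribute
"""

Import the module, then use an XPath expression to select the data to scrape.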

r = requests.get(url).text
doc = etree.HTML(r)
href = doc.xpath('//ul[@class="news-list"]/li/a/@href')

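Open the page in a browser and press F12 to inspect its source.

Narrow down the location step by step: first find the ul, filter it with [@class="news-list"], then go to li, then to a, and finally use @href to extract the link.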
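The step-by-step selection can be tried offline on a small HTML string before pointing it at the real site. The markup and the href values below are made up for illustration and only mimic the structure described above.

```python
from lxml import etree

# Made-up HTML that mimics the list structure on the page (hypothetical links).
SAMPLE_HTML = """
<html><body>
  <ul class="news-list">
    <li><a href="/kyzx/jyxd/1.shtml">First article</a></li>
    <li><a href="/kyzx/jyxd/2.shtml">Second article</a></li>
  </ul>
  <ul class="other-list">
    <li><a href="/ignored.shtml">Not selected</a></li>
  </ul>
</body></html>
"""

doc = etree.HTML(SAMPLE_HTML)

# //ul                   any <ul> in the document
# [@class="news-list"]   keep only those whose class attribute matches
# /li/a                  step down to the <a> inside each <li>
# /@href                 return the value of the href attribute
hrefs = doc.xpath('//ul[@class="news-list"]/li/a/@href')

# Replacing @href with text() returns the link text instead.
titles = doc.xpath('//ul[@class="news-list"]/li/a/text()')

print(hrefs)
print(titles)
```

Only the links inside the ul with class news-list are returned; the link in the second list is filtered out by the predicate.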

Building the full URL of each article

for i in href:
    newurl = "https://yz.chsi.com.cn" + i
    html = requests.get(newurl).text
    newdoc = etree.HTML(html)
    # text() returns the text content of the heading
    title = newdoc.xpath('//div[@class="title-box"]/h2/text()')[0]
    content = newdoc.xpath('//div[@class="content-l detail"]/p/text()')
    # print(title)
    # print(" \n".join(content))
    # break
    # the with block closes the file even if the write fails
    with open(f"D:/.study/spider/{title}.txt", 'w', encoding='utf-8') as file:
        file.write("\n".join(content))
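The loop above builds each address by plain string concatenation, which works as long as every href starts with a slash. A slightly more forgiving alternative, shown here as a sketch and not as the method used in this article, is urljoin from the standard library. The helper name absolute_urls and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# The list page the hrefs were scraped from.
LIST_URL = "https://yz.chsi.com.cn/kyzx/jyxd"

def absolute_urls(hrefs, base=LIST_URL):
    """Resolve each scraped href against the page it was found on."""
    return [urljoin(base, href) for href in hrefs]

# Hypothetical hrefs: one root-relative, one already absolute.
urls = absolute_urls(["/kyzx/jyxd/1.shtml", "https://example.com/x.html"])
print(urls)
```

Root-relative links are resolved against the site, while links that are already absolute pass through unchanged.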

Each article page is then requested in turn, and the target data is selected from it.

Finally, the data is saved to a file.
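Saving can fail when a title contains characters that are not allowed in file names, or when the output directory does not exist yet. The following is a minimal sketch of one way to guard against both. The function name save_article and the choice of an underscore as the replacement character are assumptions made here, not part of the original code.

```python
import re
from pathlib import Path

def save_article(title, paragraphs, out_dir):
    """Write paragraphs to <out_dir>/<title>.txt and return the resulting path."""
    # Replace characters that Windows forbids in file names: \ / : * ? " < > |
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title).strip() or "untitled"
    out_dir = Path(out_dir)
    # Create the directory (and any missing parents) if it is not there yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{safe_title}.txt"
    path.write_text("\n".join(paragraphs), encoding="utf-8")
    return path
```

In the loop above it could be called as save_article(title, content, "D:/.study/spider") in place of the open and write calls.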