【Python】Web Scraping | The requests.get Method | China Tourism Administration

Kinononononobu！

Article link: https://blog.csdn.net/weixin_44493173/article/details/104345285

Markdown syntax tutorial: https://www.runoob.com/markdown/md-paragraph.html
Web scraping reference tutorial: http://c.biancheng.net/view/2011.html

1. requests.get: Basic Version

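GET is the most common request method. It is generally used to retrieve or query a resource, it is what most websites use, and it responds quickly.
The most basic usage: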

import requests  # import the requests package
url = 'http://www.cntour.cn/'
strhtml = requests.get(url)  # fetch the page with a GET request
print(strhtml.text)
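The basic call above waits indefinitely if the server never answers and does not notice error responses. A slightly more defensive version might look like the sketch below. The helper name `fetch_html` and the 10-second default are assumptions made for illustration, not part of the original post; `timeout` and `raise_for_status()` are standard features of Requests.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page with GET and return its decoded HTML (hypothetical helper)."""
    # Give up if the server does not respond within `timeout` seconds.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('http://www.cntour.cn/')` would then return the same text that `strhtml.text` holds in the basic version.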

2. requests.get: Advanced Version

Parse the page with Beautiful Soup.
This requires installing BeautifulSoup4 and lxml.

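import requests  # import the requests package
from bs4 import BeautifulSoup
url = 'http://www.cntour.cn/'  # address of the page to scrape
strhtml = requests.get(url)  # fetch the page with a GET request
soup = BeautifulSoup(strhtml.text, 'lxml')  # parse the HTML with the lxml parser
data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')  # every news link in the list
print(data)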

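Code walkthrough:

strhtml=requests.get(url)
The get() method of the Requests library:
The most common usage is r = requests.get(url), which builds a Request object asking the server for the resource at url.
That object is created internally by the Requests library.
The value returned in r is a Response object. It holds everything the server sent back: status code, headers, and body.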
soup=BeautifulSoup(strhtml.text,'lxml')
BeautifulSoup¹ is mainly used to extract data from web pages. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
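strhtml.text
strhtml is the raw response obtained by requests. Its body can be read in two forms: .content, which is bytes, and .text, which is a Unicode string. The encoding that applies to the text is declared in the page's header².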
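The difference between the two forms can be seen without any network access. The sketch below uses only the standard library: `raw` stands in for what `.content` would hold, and decoding it plays the role of `.text`. The sample string is taken from the headline quoted later in this post; the choice of ISO-8859-1 as the "wrong" codec is just an illustration.

```python
# Stand-in for response.content: the raw bytes a server might send.
raw = "文旅部".encode("utf-8")

# Decoding with the correct codec gives what response.text should contain.
text = raw.decode("utf-8")

# Decoding with the wrong codec does not fail, but the result is mojibake.
garbled = raw.decode("iso-8859-1")

print(type(raw).__name__, len(raw))    # bytes 9
print(type(text).__name__, len(text))  # str 3
print(garbled == text)                 # False
```

If `.text` ever looks garbled like this, Requests lets you assign the correct codec to `response.encoding` before reading `.text`.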
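soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
This line uses select (a CSS selector) on the page fetched earlier to locate the data. To find the selector, open the browser's developer tools: hover over the target data, right-click, choose "Inspect" from the context menu, and in the developer panel that opens use Copy > Copy selector to copy the path. Pasting it into PyCharm gives:
#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li:nth-child(3) > a
This is the path for the item I scraped. Because li:nth-child(3) matches only the third list item, delete :nth-child(3) to match every item in the list. Then pass the path to soup.select:
data = soup.select('the selector path')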
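To see what the copied selector matches compared with the version without `:nth-child(3)`, here is a small offline sketch. The HTML is a made-up miniature that only imitates the nesting implied by the selector, and the headlines and news IDs are hypothetical. It uses the built-in `html.parser` rather than lxml so that nothing beyond BeautifulSoup4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page structure; not the real cntour.cn markup.
html = """
<div id="main"><div><div class="mtop firstMod clearfix"><div class="centerBox">
<ul class="newsList">
<li><a href="http://www.cntour.cn/news/101/">First headline</a></li>
<li><a href="http://www.cntour.cn/news/102/">Second headline</a></li>
<li><a href="http://www.cntour.cn/news/103/">Third headline</a></li>
</ul>
</div></div></div></div>
"""

soup = BeautifulSoup(html, "html.parser")
base = "#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList"

# Every link in the list.
all_links = soup.select(base + " > li > a")
# Only the link inside the third <li>, as copied from the developer tools.
third_only = soup.select(base + " > li:nth-child(3) > a")

print(len(all_links), len(third_only))
print(third_only[0].get_text())
```

select always returns a list, even when the selector matches a single element, which is why the later code loops over `data`.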

For a detailed guide to using soup.select, see ³, which explains it clearly.

At this point we have the target fragment of HTML.

Cleaning and organizing the data
Enter the following code:

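for item in data:
    result = {
        'title': item.get_text(),  # text inside the <a> tag
        'link': item.get('href')   # value of the href attribute
    }
    print(result)  # print inside the loop so every item is shown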

My understanding of this step is that it extracts each item's title and link, because in

<a href="http://www.cntour.cn/news/13754/" target="_blank" title="文旅部关于做好疫情防控工作的通知">
文旅部关于做好疫情防控工作的通知</a>

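the title is the text inside the a tag, and get_text() extracts a tag's text.
The link is in the a tag's href attribute. get() extracts an attribute; the attribute to extract is named in the parentheses, i.e. get('href').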
Output: [screenshot in the original post]
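The two extraction calls can be tried directly on the a tag quoted above. This is a minimal sketch that parses just that fragment with the built-in `html.parser`; the `data-id` attribute is a made-up name used only to show what happens when an attribute is absent.

```python
from bs4 import BeautifulSoup

# The <a> element quoted in this post.
snippet = '''<a href="http://www.cntour.cn/news/13754/" target="_blank" title="文旅部关于做好疫情防控工作的通知">
文旅部关于做好疫情防控工作的通知</a>'''

tag = BeautifulSoup(snippet, "html.parser").a

title = tag.get_text(strip=True)  # tag text, with the leading newline stripped
link = tag.get("href")            # attribute value
missing = tag.get("data-id")      # hypothetical attribute: get() returns None

print(title)
print(link)
print(missing)
```

Unlike `tag.get('href')`, the subscript form `tag['href']` raises a KeyError when the attribute is missing, so get() is the safer choice when some links may lack an href.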

Each article link contains a numeric ID, which can be extracted with a regular expression:
Code:

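import re
for item in data:
    result = {
        'title': item.get_text(),
        'link': item.get('href'),
        'ID': re.findall(r'\d+', item.get('href'))  # raw string: every run of digits in the link
    }
    print(result)  # print inside the loop so every item is shown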

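Note that re.findall returns a list of strings, so the ID field above is a one-element list rather than a number. The sketch below runs the same pattern on the link quoted earlier and then shows one possible way to get a single integer. The stricter `/news/(\d+)/` pattern is an assumption based on that one sample URL.

```python
import re

link = "http://www.cntour.cn/news/13754/"

# Same call as in the post: all runs of digits, returned as a list of strings.
ids = re.findall(r"\d+", link)

# Alternative (assumes links look like .../news/<digits>/): capture one ID as an int.
match = re.search(r"/news/(\d+)/", link)
article_id = int(match.group(1)) if match else None

print(ids)         # ['13754']
print(article_id)  # 13754
```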

Complete code

# -*- coding: utf-8 -*-
import re
import requests
from bs4 import BeautifulSoup
url = 'http://www.cntour.cn/'
strhtml = requests.get(url)
soup = BeautifulSoup(strhtml.text, 'lxml')
# li:nth-child(3) matches only the third news item; use li > a to match them all
data = soup.select('#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li:nth-child(3) > a')
print(data)
# title and link of each matched item
for item in data:
    result = {
        'title': item.get_text(),
        'link': item.get('href')
    }
    print(result)
# the same again, plus the numeric ID extracted from the link
for item in data:
    result = {
        'title': item.get_text(),
        'link': item.get('href'),
        'ID': re.findall(r'\d+', item.get('href'))
    }
    print(result)
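The script above prints each result and then discards it. If the records are needed afterwards, one option is to separate parsing from downloading and collect the dictionaries in a list. The following is only a sketch under stated assumptions: `parse_news` is a hypothetical helper, the selector is shortened to `ul.newsList > li > a`, the sample HTML and its IDs are invented, and `html.parser` is used so the example runs offline without lxml.

```python
import re

from bs4 import BeautifulSoup

# Shortened selector (assumption); the full path from the post works the same way.
SELECTOR = "ul.newsList > li > a"


def parse_news(html):
    """Return one dict per news link found in `html` (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(SELECTOR):
        href = item.get("href", "")  # empty string if the link has no href
        rows.append({
            "title": item.get_text(strip=True),
            "link": href,
            "ID": re.findall(r"\d+", href),
        })
    return rows


# Invented sample markup standing in for the downloaded page.
sample = """
<ul class="newsList">
<li><a href="http://www.cntour.cn/news/101/">First headline</a></li>
<li><a href="http://www.cntour.cn/news/102/">Second headline</a></li>
</ul>
"""

rows = parse_news(sample)
for row in rows:
    print(row)
```

With the real site, the same function could be called as `parse_news(strhtml.text)`.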