Basic Web Scraping Syntax

Latest recommended article published on 2023-01-29 14:22:44

Dull_Demon_King


Tags: web scraping, python, programming language

Article link: https://blog.csdn.net/Dull_Demon_King/article/details/127390180


1. Requests

1.1 Installation

pip install requests

1.2 Usage

import requests

responses = requests.get(url)   # send a GET request to url

responses = requests.post(url)  # send a POST request to url

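1.3 Session

To keep a session alive between the client and the server, use the Session class from requests. (When scraping a site that uses a CAPTCHA, the CAPTCHA you fetch and the one you submit must belong to the same session.)

session = requests.Session()

responses = session.get(url)

responses = session.post(url)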
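As a minimal sketch of session persistence, the snippet below shows that headers and cookies stored on a Session object stay with that object across calls. The cookie name `captcha_id`, its value, the function `login_with_captcha`, and its URL arguments are hypothetical placeholders; no request is sent unless that function is called.

```python
import requests

# One Session object keeps cookies and default headers across requests.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Hypothetical cookie, standing in for one a server would set when it
# issues a CAPTCHA image.
session.cookies.set("captcha_id", "abc123")


def login_with_captcha(captcha_url, login_url, form):
    """Fetch the CAPTCHA and submit the form through the same session.

    Both URLs and the form fields are placeholders for a real site.
    """
    captcha_image = session.get(captcha_url, timeout=10).content
    # ... recognise the CAPTCHA from captcha_image and add it to `form` ...
    return session.post(login_url, data=form, timeout=10)


# The cookie and header stay on the session for later requests.
print(session.cookies.get("captcha_id"))
print(session.headers["User-Agent"])
```

Calling plain `requests.get()` twice would use two unrelated connections with no shared cookies, which is why the CAPTCHA case needs a single session object.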

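1.4 Parameters

url : the address to request

headers : request headers, typically the User-Agent (UA)

proxies : proxy IP settings

params : query parameters for a GET request

data : form data for a POST request

responses = requests.get(url, headers=headers, params=params)

responses = requests.post(url, headers=headers, data=data)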
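To see what these parameters do, the following sketch builds a GET and a POST request without sending them, using `requests.Request(...).prepare()`. The URL, header value, parameter names, and proxy address are all made-up placeholders.

```python
import requests

url = "https://example.com/search"          # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}     # sent as request headers
params = {"q": "python", "page": 1}         # appended to the URL for GET
data = {"user": "alice", "lang": "zh"}      # form-encoded body for POST

# Request(...).prepare() builds the request without sending it.
get_req = requests.Request("GET", url, headers=headers, params=params).prepare()
post_req = requests.Request("POST", url, headers=headers, data=data).prepare()

print(get_req.url)    # https://example.com/search?q=python&page=1
print(post_req.body)  # user=alice&lang=zh

# A proxies dict maps a URL scheme to a proxy address (placeholder address).
# It is passed when the request is actually sent, for example:
proxies = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}
# responses = requests.get(url, headers=headers, params=params, proxies=proxies)
```

In short, `params` ends up in the query string of the URL, while `data` ends up in the request body.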

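1.5 Response methods and attributes

responses.json() ------------> parses a JSON response body and returns the corresponding Python object (usually a dict)

responses.text -------------> the response body decoded as text

responses.content -----------> the response body as raw bytes (for images and other binary data)

1.6 Response status code

responses.status_code (200 means the request succeeded)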

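1.7 Exception handling

try:
    responses = requests.get(url, timeout=10)  # code that may fail goes here
    responses.raise_for_status()               # raise an exception for 4xx/5xx responses
except requests.exceptions.RequestException as e:
    print(e)                                   # what to print when something goes wrong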
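The try/except pattern above can be wrapped in a small helper. This is only a sketch: the function name `fetch_text` and the 10-second timeout are assumptions, not something the article prescribes.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the page text, or None if the request fails."""
    try:
        responses = requests.get(url, timeout=timeout)
        responses.raise_for_status()  # turn 4xx/5xx status codes into exceptions
        return responses.text
    except requests.exceptions.RequestException as exc:
        # Base class for connection errors, timeouts, invalid URLs, HTTP errors.
        print(f"request failed: {exc}")
        return None


# An invalid URL fails before any network traffic, so this prints the error
# and returns None.
result = fetch_text("not-a-url")
```

Catching `requests.exceptions.RequestException` rather than using a bare `except:` keeps unrelated bugs (such as a misspelled variable name) from being silently swallowed.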

1.8 Miscellaneous

responses.encoding = responses.apparent_encoding  # decode .text using the encoding detected from the page content

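2. BeautifulSoup

Use case: parsing HTML or XML

Installation: pip install beautifulsoup4

pip install lxml

Import it before use: from bs4 import BeautifulSoup

soup = BeautifulSoup(h, 'lxml') # argument 1: the content to parse; argument 2: the parser

soup.string # text content of a node; None if the node contains more than one child

soup.find_all('li', {'class':'element'}) # all <li> tags whose class is "element"

soup.select('li.element') # the same query written as a CSS selector

get_text() : returns the text of an element, for example each element obtained while looping over the results

text= : searches for strings with the given content (newer BeautifulSoup versions call this argument string=)

print(soup.find_all(text='text to search for')) # can be used for counting occurrences

print(len(soup.find_all(text='text to search for'))) # number of occurrences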
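The calls above can be tried on a small inline document. In this sketch the HTML is invented, the built-in `html.parser` is used so that it runs without lxml (passing `'lxml'` instead should behave the same here), and `string=` is used as the newer name of the `text=` argument.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration.
html = """
<ul>
  <li class="element">Foo</li>
  <li class="element">Bar</li>
  <li class="other">Foo</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Two equivalent ways of selecting <li class="element"> tags.
items = soup.find_all("li", {"class": "element"})
selected = soup.select("li.element")

# get_text() on each element found while iterating.
names = [li.get_text() for li in items]
print(names)  # ['Foo', 'Bar']

# .string works for a tag with a single text child, but is None for <ul>,
# which has several children.
print(items[0].string)  # Foo
print(soup.ul.string)   # None

# Count how many text nodes are exactly "Foo".
foo_count = len(soup.find_all(string="Foo"))
print(foo_count)  # 2
```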

3. lxml (XPath)

Import it before use: from lxml import etree

xml = etree.HTML(html) # parse the HTML string into an element tree

r = xml.xpath('//a[@href="link1.html"]/text()') # // : match at any depth; / : go down one level
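A short sketch of the difference between `//` and `/`, run against invented HTML (the link targets and text are placeholders):

```python
from lxml import etree

# Made-up HTML, only for illustration.
html = """
<div>
  <a href="link1.html">first</a>
  <p><a href="link2.html">second</a></p>
</div>
"""

tree = etree.HTML(html)

# // finds <a> at any depth; the predicate filters on the href attribute.
first = tree.xpath('//a[@href="link1.html"]/text()')
print(first)  # ['first']

# All href attributes of all <a> tags, at any depth.
hrefs = tree.xpath('//a/@href')
print(hrefs)  # ['link1.html', 'link2.html']

# / only goes down one level: <a> tags that are direct children of <div>.
direct = tree.xpath('//div/a/text()')
print(direct)  # ['first']
```

Note that `xpath()` returns a list even when there is a single match, so the result usually needs indexing or a loop.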

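4. jsonpath

Use case: parsing dictionaries (for example, decoded JSON)

Installation: pip install jsonpath

Import: from jsonpath import jsonpath

title = jsonpath(json_j, '$..title')  # json_j: the data to search; $: the root node; ..: match at any depth, going straight to the target level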

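5. Regular expressions

Use case: parsing text

Note: when writing a pattern, it is best to prefix the string with r (a raw string), so that backslashes reach the regex engine unchanged

r'chuanzhiboke\t\.\tpython'

Common methods:

findall() : finds every match and returns a list

sub() : replaces matches

split() : splits the string and returns a list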
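The three methods and the effect of the r prefix can be sketched with the standard `re` module. The sample string is invented for illustration.

```python
import re

text = "id=12, id=345; id=6"  # made-up sample text

# findall(): every match, as a list of strings.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['12', '345', '6']

# sub(): replace every match.
masked = re.sub(r"\d+", "#", text)
print(masked)  # id=#, id=#; id=#

# split(): cut the string wherever the pattern matches.
parts = re.split(r"[,;]\s*", text)
print(parts)  # ['id=12', 'id=345', 'id=6']

# Without r, "\t" is a single tab character; with r, it is a backslash
# followed by "t", which the regex engine then interprets as a tab.
print(len("\t"), len(r"\t"))  # 1 2

# The pattern from the article matches a string containing real tabs
# around a literal dot.
matched = re.search(r"chuanzhiboke\t\.\tpython", "chuanzhiboke\t.\tpython") is not None
print(matched)  # True
```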
