Web Scraping 2: requests and bs4

Lost__myself

Category: python. Tags: python, json, web scraping

Article link: https://blog.csdn.net/Lost__myself/article/details/119656053


requests and bs4

1. Using requests

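'''
1. Sending a request
requests.get(url, *, headers, params, proxies)	-	send a GET request
requests.post()		-	send a POST request

Parameters:
url			-		request address (a website URL, an API endpoint, an image URL, etc.)
headers		-		request headers (used to set the cookie and User-Agent)
params		-		query parameters
proxies		-		proxies
'''
# requests.get('address?param1=value1&param2=value2')   -  parameters appended directly to the URL
# requests.get(address, params=parameters)    -  multiple parameters can be passed as a dict
# requests.post(address, params=parameters)   -  params must be passed as a keyword argument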
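'''
2. Reading the response
respond is the object returned by requests.get() or requests.post().
Set the encoding (only needed when the text comes out garbled)
'''
# respond.encoding = 'GBK'
# respond.encoding = respond.apparent_encoding
'''Get the response headers'''
print(respond.headers)

'''Get the response body'''
# a. Get text (used when requesting a web page; returns the page source directly)
print(respond.text)

# b. Get the parsed JSON result (used for data interfaces that return JSON)
print(respond.json())

# c. Get content (the raw binary data; used for downloading images, video, and audio)
print(respond.content)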
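
The two ways of passing GET parameters described above can be compared without sending anything. The following is a minimal sketch: it uses `requests.Request(...).prepare()` to build the request offline, and the endpoint `https://example.com/search` and the parameter names `q` and `page` are made up for illustration.

```python
import requests

# Hypothetical endpoint and parameter names, used only for illustration.
BASE_URL = "https://example.com/search"
query = {"q": "python", "page": 2}

# Build the request without sending it, so the final URL can be inspected offline.
prepared = requests.Request("GET", BASE_URL, params=query).prepare()
print(prepared.url)

# The same parameters appended to the URL by hand.
manual_url = f"{BASE_URL}?q=python&page=2"
print(manual_url == prepared.url)
```

Passing a dict through `params` is usually the safer choice, because requests takes care of URL-encoding values that contain spaces or non-ASCII characters.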

2. Adding request headers

#  =================Adding a User-Agent===============
# headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
#                          '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
#
# response = requests.get('https://www.meishichina.com/', headers=headers)
# response.encoding = response.apparent_encoding
# print(response.text)


# =======================Adding a cookie===================
'''
For sites that require login: log in to the account first, then copy the cookie from the browser's Inspect (developer tools) panel
'''
# headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
#                          '(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
#            'cookie': 'cookie value'
#            }
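
As a rough sketch of how the two headers fit together, the block below puts both in one dict, wraps the request in a helper, and then inspects the headers offline. The name `fetch_page`, the `timeout=10` value, and the cookie value `sessionid=PLACEHOLDER` are assumptions added for illustration; a real cookie has to be copied from the browser after logging in.

```python
import requests

# Placeholder values: copy the real ones from the browser's developer tools.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Cookie": "sessionid=PLACEHOLDER",
}


def fetch_page(url: str) -> str:
    """Request a page with the custom headers and return its decoded text."""
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx status codes
    response.encoding = response.apparent_encoding
    return response.text


# Inspect the headers that would be sent, without touching the network.
prepared = requests.Request(
    "GET", "https://www.meishichina.com/", headers=headers
).prepare()
print(prepared.headers["User-Agent"])
```

Note that the header name is `User-Agent` with a hyphen; a key spelled `User Agent` would be sent as a different, unrecognized header.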

3. Parsing JSON

import requests

# Find the JSON interface first, then send a request to it
response = requests.get('JSON interface URL')

all_news = response.json()['data']
for news in all_news:
    print(news['Title'])
    print(news['Image']['url'])
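
To try the same traversal without a live interface, the sketch below parses a made-up response body with the standard `json` module, which produces the same kind of dict that `response.json()` returns. The sample payload and the helper name `extract_news` are hypothetical; only the `data`, `Title`, and `Image`/`url` keys come from the snippet above.

```python
import json

# Hypothetical response body shaped like the interface used above.
SAMPLE_BODY = """
{"data": [
  {"Title": "First headline", "Image": {"url": "https://example.com/a.jpg"}},
  {"Title": "Second headline", "Image": null}
]}
"""


def extract_news(payload):
    """Return (title, image_url) pairs; image_url is None when it is missing."""
    rows = []
    for news in payload.get("data", []):
        image = news.get("Image") or {}  # tolerate a missing or null Image field
        rows.append((news.get("Title"), image.get("url")))
    return rows


payload = json.loads(SAMPLE_BODY)  # response.json() yields the same kind of object
rows = extract_news(payload)
for title, image_url in rows:
    print(title, image_url)
```

Using `.get()` instead of direct indexing is a design choice here: real interfaces often omit fields, and a `KeyError` in the middle of a crawl is harder to recover from than a `None`.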

4. Downloading images

import requests

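def download_image(img_url: str):
    # Request the image data over the network
    response = requests.get(img_url)

    # Get the raw bytes
    data = response.content

    # Save the data to a local file (the files/ directory must already exist)
    with open(f'files/{img_url.split("/")[-1]}', 'wb') as f:
        f.write(data)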
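
For larger files, one possible variation is to stream the body in chunks instead of loading it all into memory. This is a sketch under assumptions, not the approach used above: the helper `filename_from_url`, the `folder` parameter, the chunk size, and the timeout are all illustrative choices.

```python
from pathlib import Path
from urllib.parse import urlsplit

import requests


def filename_from_url(img_url: str) -> str:
    """Take the last path segment of the URL and ignore any query string."""
    return Path(urlsplit(img_url).path).name


def download_image(img_url: str, folder: str = "files") -> Path:
    """Stream an image to folder/<filename> and return the saved path."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    target = target_dir / filename_from_url(img_url)
    with requests.get(img_url, stream=True, timeout=10) as response:
        response.raise_for_status()
        with open(target, "wb") as f:
            # Write the body piece by piece instead of holding it all in memory.
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    return target


print(filename_from_url("https://example.com/img/cat.jpg?size=large"))
```

Deriving the filename from the URL path rather than from `split("/")` keeps query strings such as `?size=large` out of the saved filename.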

5. Using bs4

from bs4 import BeautifulSoup

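# 1. Prepare the page data to parse (in practice it is fetched with requests or selenium)
data = 'page HTML'

# 2. Create a BeautifulSoup object (it automatically corrects malformed HTML structure in the data)
# BeautifulSoup(data, parser)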
soup = BeautifulSoup(data, 'lxml')

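# 3. Get tags and tag content through the BeautifulSoup object
'''
1) Getting tags
BeautifulSoup object.select(css selector)	-	gets all tags matched by the CSS selector; returns a list whose elements are tag objects
BeautifulSoup object.select_one(css selector)	-	gets the first tag matched by the CSS selector; returns a tag object
'''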
result1 = soup.select('p')
result2 = soup.select('#p1')
result3 = soup.select('div p')

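'''
2) Getting tag content

tag.string		-	gets the text in the tag (only works when the tag content is plain text; otherwise the result is None)
'''
p1 = soup.select_one('p')
print(p1)   # <p>Paragraph 1</p>
print(p1.string)   # Paragraph 1

'''
tag.get_text() 	-	gets all the text inside the tag, including the text of nested tags
'''
p2 = soup.select_one('span')
print(p2)  # <span>Paragraph 3<b>bold</b></span>
print(p2.get_text())  # Paragraph 3bold

'''
tag.contents		-	list of the tag's direct children (strings and nested tags)
'''

'''
Tag attributes
tag.attrs[attribute name]
'''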
a1 = soup.select_one('img')
print(a1)
print(a1.attrs['src'])

'''
Tag objects are obtained through BeautifulSoup().select()
'''
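
The calls above can be tried end to end on a small, made-up HTML fragment. This is a sketch under assumptions: the markup is invented to match the sample outputs, and it uses Python's built-in `html.parser` instead of `lxml` so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented to exercise each call described above.
html = """
<div>
  <p id="p1">Paragraph 1</p>
  <p>Paragraph 2</p>
  <span>Paragraph 3<b>bold</b></span>
  <img src="https://example.com/pic.png" alt="demo">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.select("div p")    # every <p> nested inside a <div>
first_p = soup.select_one("#p1")     # the tag whose id is "p1"
span = soup.select_one("span")
img = soup.select_one("img")

print(len(paragraphs))       # 2
print(first_p.string)        # Paragraph 1
print(span.string)           # None, because <span> holds text plus a nested <b>
print(span.get_text())       # Paragraph 3bold
print(len(span.contents))    # 2 direct children: a string and the <b> tag
print(img.attrs["src"])      # https://example.com/pic.png
```

The `span` case shows why `get_text()` is generally the more dependable choice: `.string` gives up as soon as a tag has more than one child.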
