I. Python requests crawler [data extraction]

黑日里不灭的light

Last modified: 2024-07-12 15:37:01

First published: 2023-12-13 20:55:04

Article link: https://blog.csdn.net/weixin_46765649/article/details/126132235

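I. Regular expressions

Prerequisite: familiarity with regular expressions.

1. json

Explanation: scrape the URLs stored in a JSON response.

Process:

Find the target: open the target page.
If the page requests its image data through ajax, use the browser developer tools to find the address of that request.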

import requests
import re

url = "https://www.luhuoop.cn/backgroun24d/mp/m538773p"  # placeholder address; cannot be used directly

response = requests.get(url=url, verify=False)  # verify=False skips TLS certificate verification
data = str(response.json())  # convert the parsed JSON data to a string
print(data)
# [{'id': 1, 'imgurl': './static/image/a1.jpg'}, {'id': 2, 'url': '/content?id=2', 'imgurl': './static/image/b1.jpg'}, {'id': 3, 'imgurl': './static/image/c1.jpg'}]
rule = r"'imgurl': '\.(.*?)'}"  # matching rule: capture the path after the leading "."

paths = re.findall(rule, data)
print(paths)
# ['/static/image/a1.jpg', '/static/image/b1.jpg', '/static/image/c1.jpg']
for index, path in enumerate(paths):
    image_url = "https://www.gaoh222.cn" + path
    r = requests.get(url=image_url, verify=False)
    with open(f'{index}.jpg', 'wb') as file:  # must be 'wb': images are written as binary data
        file.write(r.content)  # r.content is the response body as bytes
    print(index)
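
Because the endpoint already returns JSON, the same image paths can also be read directly from the parsed objects, without converting them to a string first. The following is a minimal offline sketch under stated assumptions: `sample` is copied from the printed response above and stands in for `response.json()`, `BASE` reuses the host from the download loop, and the helper name `image_urls` is hypothetical.

```python
from urllib.parse import urljoin

# Sample data copied from the printed response above (stands in for response.json()).
sample = [
    {'id': 1, 'imgurl': './static/image/a1.jpg'},
    {'id': 2, 'url': '/content?id=2', 'imgurl': './static/image/b1.jpg'},
    {'id': 3, 'imgurl': './static/image/c1.jpg'},
]

BASE = "https://www.gaoh222.cn/"  # host used in the download loop above


def image_urls(items, base=BASE):
    """Hypothetical helper: build absolute image URLs from the 'imgurl' fields."""
    # urljoin resolves the leading "./" against the base address.
    return [urljoin(base, item['imgurl']) for item in items if 'imgurl' in item]


urls = image_urls(sample)
print(urls)
```

Reading the field by key does not depend on `imgurl` being the last key in each object, which the regex above does rely on.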

2. html

Process:

Find the target: open the target page.
If the image data is embedded in the page's HTML, send the request directly to the page address.

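import requests
import re

url = "https://huopi.com/favoe/uty"  # placeholder address; may not work
headers = {
    "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/603.1"
}

response = requests.get(url=url, headers=headers)
data = response.text
print(data)
rule = r'<a class="main-img".*?><img src="(.*?)".*?></a>'

# re.S lets "." match newlines, so a tag that spans several lines still matches
srcs = re.findall(rule, data, re.S)  # renamed from "list" to avoid shadowing the built-in
for i, src in enumerate(srcs):
    image_url = "https:" + src  # the src values are assumed to be protocol-relative (starting with //)
    r = requests.get(url=image_url)
    with open(f'{i}.webp', 'wb') as file:  # must be 'wb': images are written as binary data
        file.write(r.content)  # r.content is the response body as bytes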
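
The pattern above depends on two details: the non-greedy `.*?` and the `re.S` flag. The sketch below shows their effect on a small, made-up HTML fragment (the markup and the `img.example.com` host are assumptions for illustration, not taken from the target site).

```python
import re

# Hypothetical fragment: the first <a> tag is split across lines on purpose.
html = '''<a class="main-img"
   href="/p/1"><img src="//img.example.com/1.webp"
   alt="one"></a>
<a class="main-img" href="/p/2"><img src="//img.example.com/2.webp" alt="two"></a>'''

rule = r'<a class="main-img".*?><img src="(.*?)".*?></a>'

# Without re.S, "." stops at a newline, so the multi-line tag is missed.
without_flag = re.findall(rule, html)

# With re.S, "." also matches newlines; ".*?" still takes as little as possible.
with_flag = re.findall(rule, html, re.S)

print(without_flag)
print(with_flag)
```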

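II. XPath

1. lxml

Install: pip install lxml

Syntax:

from lxml import etree

c = etree.parse('1.html', etree.HTMLParser())  # parse a local file; the HTML parser tolerates markup that is not well-formed XML
c = etree.HTML(a)  # parse an HTML document already fetched as a string (a is that string)

c.xpath('XPath expression')

2. xpath

Explanation: this part introduces the syntax rules.

Note: XPath indexes start at 1.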

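2.1 Node location syntax

Expression	Description
nodename	Selects the elements with that name.
/	Selects from the root node, or separates one element from the next in a path.
//	Selects matching nodes anywhere in the document below the current node, regardless of their position.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.
text()	Selects text.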

Examples:

Select the text of every h2
- //h2/text()
Get the href of every a tag
- //a/@href
Get the text of the title under html > head
- /html/head/title/text()
Get the href of the link tag under html > head
- /html/head/link/@href
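
The four expressions above can be tried with lxml on a small document. This is a minimal sketch; the HTML string is made up for illustration and only the XPath expressions come from the list above.

```python
from lxml import etree

# Hypothetical page used only for illustration.
html = """
<html>
  <head>
    <title>Demo</title>
    <link rel="stylesheet" href="/static/site.css">
  </head>
  <body>
    <h2>First</h2>
    <a href="/a">A</a>
    <h2>Second</h2>
    <a href="/b">B</a>
  </body>
</html>
"""

doc = etree.HTML(html)

h2_texts = doc.xpath('//h2/text()')              # text of every h2
hrefs = doc.xpath('//a/@href')                   # href of every a tag
title = doc.xpath('/html/head/title/text()')     # text of the title
css = doc.xpath('/html/head/link/@href')         # href of the link tag

print(h2_texts, hrefs, title, css)
```

Each call returns a list, even when only one node matches.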

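2.2 Node predicate syntax

Path expression	Result
//title[@class="eng"]	Selects all title elements whose class attribute is eng.
/books/book[1]	Selects the first book element that is a child of books.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/books/book[last()-1]	Selects the second-to-last book element that is a child of books.
/books/book[position()>1]	Selects the book elements under books, starting from the second one.
//book/title[text()='Harry Potter']	Selects the title elements under any book, keeping only those whose text is Harry Potter.
/books/book[price>3]/title	Selects the title elements of the book elements under books whose price element has a value greater than 3.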

Examples:

Link of the first book
- //div[@class="nav_txt"]/ul/li[1]/a/@href
Link of the last book
- //div[@class="nav_txt"]/ul/li[last()]/a/@href
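
To see how the predicates in the table behave, they can be run against a small XML document. This is a sketch under stated assumptions: the `books` data below is invented for illustration, and `/text()` is appended to some expressions only to make the results easy to print.

```python
from lxml import etree

# Hypothetical data set with three books.
xml = """
<books>
  <book><title class="eng">Harry Potter</title><price>29.99</price></book>
  <book><title class="eng">Learning XML</title><price>2.50</price></book>
  <book><title class="cn">Python</title><price>39.95</price></book>
</books>
""".strip()

root = etree.fromstring(xml)

first = root.xpath('/books/book[1]/title/text()')              # indexes start at 1
last = root.xpath('/books/book[last()]/title/text()')
second_last = root.xpath('/books/book[last()-1]/title/text()')
from_second = root.xpath('/books/book[position()>1]/title/text()')
eng = root.xpath('//title[@class="eng"]/text()')
pricey = root.xpath('/books/book[price>3]/title/text()')
exact = root.xpath('//book/title[text()="Harry Potter"]')      # list of matching elements

print(first, last, second_last, from_second, eng, pricey, len(exact))
```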

3. Practice

from lxml import etree
import requests

url = 'http://pic.netbian.cm/4kmeinv/'  # may not work
response = requests.get(url).text
do = etree.HTML(response)
srcs = do.xpath('//*[@id="main"]/div[3]/ul/li/a/img/@src')  # renamed from "list" to avoid shadowing the built-in
print(srcs)
# Output:
# ['/uploads/allimg/220804/003031-16595442318b56.jpg', '/uploads/allimg/220802/234002-1659454802afa3.jpg', '/uploads/allimg/220727/004202-1658853722bdd3.jpg', '/uploads/allimg/220707/233455-1657208095aec4.jpg', '/uploads/allimg/220715/153854-16578707348791.jpg', '/uploads/allimg/210831/102129-16303764895142.jpg', '/uploads/allimg/220717/002302-1657988582ec9f.jpg', '/uploads/allimg/220712/235655-16576414159641.jpg', '/uploads/allimg/211219/114328-1639885408db64.jpg', '/uploads/allimg/220131/012219-16435633391d32.jpg', '/uploads/allimg/210827/235918-1630079958392e.jpg', '/uploads/allimg/220722/162924-16584785647228.jpg', '/uploads/allimg/220716/222754-1657981674a9a5.jpg', '/uploads/allimg/220715/171100-1657876260243c.jpg', '/uploads/allimg/220716/222533-1657981533ff59.jpg', '/uploads/allimg/220205/002942-1643992182534d.jpg', '/uploads/allimg/210718/001826-16265387066216.jpg', '/uploads/allimg/210922/191729-16323094499dcf.jpg', '/uploads/allimg/210718/000805-16265380858e92.jpg', '/uploads/allimg/220702/222125-1656771685f559.jpg']
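
The extracted `src` values are relative paths, so they need to be resolved against the page address before they can be downloaded. The sketch below is one possible way to do that with the standard library; it reuses the page address and two of the paths from the output above, and the `downloads` list is a hypothetical name introduced here.

```python
import os
from urllib.parse import urljoin, urlparse

page_url = 'http://pic.netbian.cm/4kmeinv/'  # page address from the example above

# Two of the relative paths from the printed output above.
srcs = [
    '/uploads/allimg/220804/003031-16595442318b56.jpg',
    '/uploads/allimg/220802/234002-1659454802afa3.jpg',
]

downloads = []
for src in srcs:
    full_url = urljoin(page_url, src)                     # resolve against the page address
    filename = os.path.basename(urlparse(full_url).path)  # last path segment as the file name
    downloads.append((full_url, filename))

for full_url, filename in downloads:
    print(filename, full_url)
```

Each `full_url` can then be fetched with `requests.get(full_url).content` and written to `filename` in `'wb'` mode, as in the regular-expression examples.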

III. BeautifulSoup

1. Finding by ID

import requests
from bs4 import BeautifulSoup

# Send a GET request and read the page content
url = 'https:'  # placeholder; replace with the target address
headers = {
    "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/603.1"
}
response = requests.get(url, headers=headers)
html_content = response.text

# Parse the page content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find the element whose id is fieldset1
element_with_id = soup.find(id='fieldset1')

# Get every div nested under that element (find_all searches all descendants, not only direct children)
if element_with_id:
    div_elements = element_with_id.find_all('div')
    num_div_elements = len(div_elements)
    print(f"The element with id fieldset1 contains {num_div_elements} div elements")
else:
    print("No element with id fieldset1 was found")

2. Finding by class

import requests
from bs4 import BeautifulSoup

# Send a GET request and read the page content
url = 'https://'  # placeholder; replace with the target address
headers = {
    "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/603.1"
}
response = requests.get(url, headers=headers)
html_content = response.text

# Parse the page content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Find every element whose class is ui-controlgroup
elements_with_class = soup.find_all(class_='ui-controlgroup')

if elements_with_class:
    for element in elements_with_class:
        # recursive=False keeps only the direct child div elements
        direct_child_div_elements = element.find_all("div", recursive=False)
        num_direct_child_div_elements = len(direct_child_div_elements)
        print(f"An element with class ui-controlgroup has {num_direct_child_div_elements} direct child div elements")
else:
    print("No element with class ui-controlgroup was found")
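
The two examples above differ in how deep `find_all` searches. As a minimal illustration (the HTML fragment is made up; only the class name `ui-controlgroup` comes from the example), the following compares the default recursive search with `recursive=False`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two direct child divs, one of which contains a nested div.
html = """
<div class="ui-controlgroup">
  <div>child 1<div>grandchild</div></div>
  <div>child 2</div>
  <span>not a div</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
group = soup.find(class_='ui-controlgroup')

all_divs = group.find_all('div')                      # every descendant div
direct_divs = group.find_all('div', recursive=False)  # direct children only

print(len(all_divs), len(direct_divs))
```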

3. Getting attribute values

import requests
from bs4 import BeautifulSoup

url = 'https://'  # placeholder; replace with the target address
headers = {
    "user-agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/603.1"
}
response = requests.get(url, headers=headers)
html_content = response.text

soup = BeautifulSoup(html_content, 'html.parser')

# Find the element whose id is fieldset1
element_with_id = soup.find(id='fieldset1')

if element_with_id:
    # Find every div under fieldset1 that has a type attribute
    div_elements = element_with_id.find_all('div', attrs={'type': True})

    # Extract the value of each type attribute
    type_values = [div['type'] for div in div_elements]

    print(f"Values of the type attribute on the div elements under fieldset1: {type_values}")
else:
    print("No element with id fieldset1 was found")
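
The `attrs={'type': True}` filter matters because indexing a tag with `div['type']` raises `KeyError` when the attribute is missing. The sketch below, on an invented fragment, contrasts the filtered lookup with `Tag.get`, which returns `None` instead of raising.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the middle div has no type attribute.
html = (
    '<div id="fieldset1">'
    '<div type="radio">a</div>'
    '<div>b</div>'
    '<div type="checkbox">c</div>'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')
box = soup.find(id='fieldset1')

# Only divs that carry a type attribute, so indexing is safe.
typed = box.find_all('div', attrs={'type': True})
type_values = [div['type'] for div in typed]

# All divs; .get() returns None where the attribute is absent.
all_values = [div.get('type') for div in box.find_all('div')]

print(type_values, all_values)
```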

4. Getting the text content

# soup is the BeautifulSoup object built in the previous examples
element_with_id = soup.find(id='fieldset1')

if element_with_id:
    print(element_with_id.text)
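
`.text` returns the concatenated text of all descendants with the original whitespace kept. When cleaner output is wanted, `get_text` accepts a separator and a `strip` flag. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with extra whitespace inside the first paragraph.
html = '<div id="fieldset1"><p> Hello </p><p>world</p></div>'

soup = BeautifulSoup(html, 'html.parser')
element = soup.find(id='fieldset1')

raw = element.text                            # whitespace preserved as-is
clean = element.get_text(' ', strip=True)     # strip each piece, join with one space

print(repr(raw))
print(repr(clean))
```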
