2-1. Data Requests

青竹·忆

Last modified: 2022-04-06 16:17:46

Tags: python

First published: 2022-03-22 10:05:01

Article link: https://blog.csdn.net/weixin_38477850/article/details/123652264

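HTTP Basics

Breaking down a URL: https://www.itjuzi.com/search?data=小米&key=貂蝉

https://
Protocol (scheme)

www.itjuzi.com
Domain name

/search
Path

?data=小米&key=貂蝉
"?" starts the query string, "&" joins multiple conditions, and "data=小米" is one condition (a key=value pair)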

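The highest port number on a computer is 65535.

In the browser address bar:
If "Not secure" is shown, the site is probably on port 80 (HTTP)
If nothing is shown, it is probably on port 443 (HTTPS)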
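The URL breakdown and the default ports above can be checked with the standard library. This is a minimal sketch: `describe_url` and `DEFAULT_PORTS` are names made up for illustration, and it only covers the two schemes mentioned here.

```python
from urllib.parse import urlsplit, parse_qs

# Default ports per scheme, as noted above (80 for HTTP, 443 for HTTPS).
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Split a URL into scheme, host, port, path and query conditions."""
    parts = urlsplit(url)
    # parts.port is None when the URL has no explicit ":port" part,
    # so fall back to the default port for the scheme.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": port,
        "path": parts.path,
        "query": parse_qs(parts.query),  # each key maps to a list of values
    }


info = describe_url("https://www.itjuzi.com/search?data=小米&key=貂蝉")
for name, value in info.items():
    print(name, "=", value)
```

`parse_qs` returns a list per key because the same key may appear more than once in a query string.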

Panel layout

Developer tools features commonly used for crawling:
1. Clear
2. Filter: click "All" and "Fetch/XHR"

What you need to understand to write a crawler
1. Cookie
2. User-Agent
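Both items are ordinary request headers. As a rough sketch, here is how they could be attached to a request with the standard library, without sending anything. The cookie value `sessionid=abc123` is a made-up placeholder; in practice you would copy the real value from the developer tools.

```python
from urllib.request import Request

headers = {
    # Browser-like User-Agent string (the same one used later in this post).
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"
    ),
    # Hypothetical cookie, for illustration only.
    "Cookie": "sessionid=abc123",
}

# Building the Request does not touch the network; it only stores the headers.
req = Request("https://www.itjuzi.com/search", headers=headers)

# urllib normalizes header names with str.capitalize(), hence "User-agent".
for name, value in req.header_items():
    print(name, ":", value)
```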

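The most common request methods
GET (important): requests a page and returns its content
POST (important): mostly used for forms or file uploads; the data is carried in the request body
PUT: data sent from the client replaces the content of the specified document
DELETE: asks the server to delete the specified page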
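The practical difference between GET and POST is where the data travels. The sketch below builds both kinds of request with the standard library and inspects them without sending them. The parameter values are placeholders, and whether this site accepts a POST at that path is not something the notes say.

```python
from urllib.parse import urlencode
from urllib.request import Request

base = "https://www.itjuzi.com/search"
params = {"data": "xiaomi", "key": "diaochan"}  # placeholder values

# GET: the conditions are appended to the URL as a query string.
get_req = Request(base + "?" + urlencode(params))

# POST: the same conditions are encoded into the request body instead.
post_req = Request(base, data=urlencode(params).encode(), method="POST")

print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```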

Response
Focus on:
1. Headers
2. Payload
3. Preview
4. Initiator

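Response status codes (status_code)
Ones to remember:
400 Bad Request
403 Forbidden
404 Not Found
405 Method Not Allowed
500 Internal Server Error
501 Not Implemented
502 Bad Gateway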
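If you forget one of these, Python's standard library already knows the standard names. A small sketch (`describe_status` is a helper name invented here):

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code, its standard reason phrase, and who is at fault."""
    status = HTTPStatus(code)
    if 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {status.phrase} ({kind})"


for code in (400, 403, 404, 405, 500, 501, 502):
    print(describe_status(code))
```

For a crawler, a 4xx code usually means the request itself needs fixing (URL, headers, method), while a 5xx code means the server failed and a retry later may succeed.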

Introduction to the socket module

TCP: three-way handshake to connect, four-way wave to disconnect
Request message structure
url1 = 'https://img2.baidu.com/it/u=2988487386,263436585&fm=253&fmt=auto&app=120&f=JPEG?w=500&h=733'

# 1. Import the packages: socket opens the connection, re extracts data with a regular expression
import socket
import re

# 2. Open the connection
client = socket.socket()
client.connect(("img2.baidu.com", 80))

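# 3. Build the request message, then send it
# Request message format:
#     method, space, request path (everything after the domain), space, protocol version,
#     then (no space) CR LF, (no space) request headers, (no space) CR LF CR LF
#     (note: the message ends with these four characters)
#       GET         /it/u=261242...        HTTP/1.0\r\n                   Host:img2.baidu.com    \r\n\r\n

resq = "GET /it/u=2612425813,3297972932&fm=253&fmt=auto&app=138&f=JPEG?w=667&h=500 HTTP/1.0\r\nHost:img2.baidu.com\r\n\r\n"
# Send (sendall keeps sending until the whole message is out)
client.sendall(resq.encode())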

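# 4. Create a bytes object to hold the data we receive
result = b''
# Receive
data = client.recv(1024)

# The total length is unknown, so keep looping and read up to 1024 bytes each time;
# recv() returns b'' once the server closes the connection, which ends the loop
while data:
    result += data
    data = client.recv(1024)

# 5. Look at the data we received
print(result)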

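# 6. Extract the data we need with a regular expression, dropping the response headers.
#    re.S makes "." match every character, including newlines.
images = re.findall(b'\r\n\r\n(.*)', result, re.S)
print(len(images[0]))  # 32778 bytes

# 7. Save to the target location
with open("王xx.jpg", "wb") as f:
    f.write(images[0])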
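The same send/receive/strip-headers flow can be tried without depending on img2.baidu.com. The sketch below is one way to do it, not the method used above: it starts a throwaway server on the loopback address, and it strips the headers with `bytes.partition` instead of a regular expression. `serve_once`, `demo`, the `/demo.jpg` path and the fake payload are all invented for the example.

```python
import socket
import threading


def build_request(host, path):
    """Build a minimal HTTP/1.0 GET request message as bytes."""
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()


def recv_all(sock, chunk_size=1024):
    """Read until the peer closes the connection (recv returns b'')."""
    chunks = []
    while True:
        data = sock.recv(chunk_size)
        if not data:
            break
        chunks.append(data)
    return b"".join(chunks)


def split_response(raw):
    """Split a raw response at the first blank line into (headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    return head, body


def serve_once(server_sock, payload):
    """Hypothetical stand-in server: answer one request, then close."""
    conn, _ = server_sock.accept()
    with conn:
        request = b""
        # Read until the blank line that ends the request headers.
        while b"\r\n\r\n" not in request:
            chunk = conn.recv(1024)
            if not chunk:
                break
            request += chunk
        header = f"HTTP/1.0 200 OK\r\nContent-Length: {len(payload)}\r\n\r\n"
        conn.sendall(header.encode() + payload)


def demo():
    # Fake "image" bytes; they contain \r\n\r\n on purpose to show that
    # only the first blank line separates headers from body.
    payload = b"\xff\xd8fake-image-bytes\r\n\r\nmore-bytes"
    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(server, payload))
    worker.start()
    with socket.socket() as client:
        client.settimeout(5)
        client.connect(("127.0.0.1", port))
        client.sendall(build_request("127.0.0.1", "/demo.jpg"))
        raw = recv_all(client)
    worker.join()
    server.close()
    head, body = split_response(raw)
    return head, body, payload


head, body, payload = demo()
print(head.decode())
print(len(body), "body bytes")
```

Collecting chunks in a list and joining once at the end avoids copying the growing bytes object on every loop iteration, which matters for larger images.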

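The httpx request module: a newer network request library

Features:
Can send both synchronous and asynchronous requests
Supports HTTP/1.1 and HTTP/2
Can send requests directly to WSGI or ASGI applications
Well encapsulated, so little code is needed

res = httpx.get("http://baidu.com")
res.status_code  # the response status code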

Key steps of a crawler
1. Find the task: the URL
2. Send the request: connect to the server
3. Parse the data: extract the content
4. Save the data: store it

import httpx
import os

url_list = [
    'https://pic.netbian.com/uploads/allimg/220211/004115-1644511275bc26.jpg',
    'https://pic.netbian.com/uploads/allimg/220215/233510-16449393101c46.jpg',
    'https://pic.netbian.com/uploads/allimg/211120/005250-1637340770807b.jpg'
]


class My_Http(object):
    def __init__(self):
        self.headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36'
        }

    # Step 1: find the task (the URLs)
    def get_url_list(self, url_list):
        _url_list = url_list
        return _url_list

    # Step 4: save the data
    def save_data(self, filename, img):
        with open(filename, 'wb') as f:
            f.write(img.content)
            print('Image saved')

    # Step 2: send the request; returns None when the status is not 200
    def request(self, url):
        res = httpx.request('get', url, headers=self.headers)
        if res.status_code == 200:
            return res

    def run(self, url_list):
        _url_list = self.get_url_list(url_list)
        for index, url in enumerate(_url_list):
            file_name = './img/%s.jpg' % index
            data = self.request(url)
            # Skip failed requests instead of crashing on None.content
            if data is not None:
                self.save_data(file_name, data)

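if __name__ == '__main__':
    s = My_Http()
    # Create the output folder if it is missing, then download either way
    # (the earlier version skipped the download when the folder already existed)
    os.makedirs('./img', exist_ok=True)
    s.run(url_list)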

Using the requests and requests_cache modules

requests

requests.get('http://www.dict.baidu.com/s', params={'wd': 'python'})    # GET with parameters
requests.post('http://www.itwhy.org/wp-comments-post.php', data={'comment': '测试POST'})    # POST with parameters

import requests

crawl_urls = [
    'https://36kr.com/p/1328468833360133',
    'https://36kr.com/p/1328528129988866',
    'https://36kr.com/p/1328512085344642'
]


session = requests.Session()

for url in crawl_urls:
    session.get(url)
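To see what `params`, `data` and a `Session` actually do to a request, you can prepare requests without sending them. A small sketch: the `demo-crawler/0.1` User-Agent and the comment text are placeholders.

```python
import requests

session = requests.Session()
# Headers set on the session are merged into every request it prepares.
session.headers.update({"User-Agent": "demo-crawler/0.1"})  # placeholder value

# GET: params are encoded into the query string.
get_prepared = session.prepare_request(
    requests.Request("GET", "http://www.dict.baidu.com/s", params={"wd": "python"})
)

# POST: data is form-encoded into the request body.
post_prepared = session.prepare_request(
    requests.Request(
        "POST",
        "http://www.itwhy.org/wp-comments-post.php",
        data={"comment": "test POST"},
    )
)

print(get_prepared.url)
print(get_prepared.headers["User-Agent"])
print(post_prepared.body)
print(post_prepared.headers["Content-Type"])
```

Nothing is sent here; `session.send(get_prepared)` would be the step that goes out to the network.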

requests_cache
# session = requests_cache.CachedSession('demo_cache')
# Cache only the specified request methods: allowable_methods=['POST']
# session = requests_cache.CachedSession('demo_cache', allowable_methods=['POST'])

import requests
import requests_cache
import time

start = time.time()

# Initialize the cache; add use_cache_dir=True to store it in the system cache directory
requests_cache.install_cache('demo_cache', backend='filesystem', use_cache_dir=True)

xl = requests.Session()
for i in range(10):
    xl.get('http://httpbin.org/delay/1')
    print(f'Finished {i + 1} requests')
end = time.time()
print('Cost time', end - start)
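Why should ten requests to a one-second-delay endpoint finish in roughly one second once caching is on? Only the first call does the slow work. The sketch below shows that idea with the standard library's `functools.lru_cache`. It is a stand-in to illustrate caching in general, not how requests_cache is implemented, and `slow_fetch` is a fake function that makes no network call.

```python
import functools

calls = {"count": 0}


def slow_fetch(url):
    """Pretend to do an expensive request; just counts how often it runs."""
    calls["count"] += 1
    return f"response for {url}"


@functools.lru_cache(maxsize=None)
def cached_fetch(url):
    # The first call for a given URL runs slow_fetch; later calls with the
    # same URL return the stored result without calling it again.
    return slow_fetch(url)


for i in range(10):
    cached_fetch("http://httpbin.org/delay/1")

print("real fetches:", calls["count"])
```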