A Simple Beginner's Tutorial on Web Scraping (requests)

Latest recommended article published on 2024-07-25 17:17:18

毛克鹏

Article link: https://blog.csdn.net/weixin_62044613/article/details/134127061

The requests library retrieves data by sending HTTP requests to a web server, much as a browser does.

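Requests is an HTTP client library for Python.

Supported HTTP features:

Keep-alive and connection pooling
International domains and URLs
Sessions with cookie persistence
Browser-style SSL verification
Automatic content decoding
Basic / Digest authentication
Elegant key / value cookies
Automatic decompression
Unicode response bodies
HTTP(S) proxy support
Multipart file uploads
Streaming downloads
Connection timeouts
Chunked requests
.netrc support
Thread safety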

I. Installation

Run the following command in a terminal (the system shell, not the Python interpreter prompt):

pip install requests

II. Usage

Example code:

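import requests

# Send a GET request; the timeout keeps the call from hanging indefinitely
response = requests.get('https://www.baidu.com', timeout=10)
# response.content holds the raw bytes; decode() assumes UTF-8
result = response.content.decode()
print(result)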

Note: whether to send a GET or a POST request depends on the page; use the same method that the page itself uses.

GET requests

Pass the target URL to the get method. It returns a response object of type requests.models.Response.
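As a minimal sketch of what the returned object offers, the code below collects a few commonly used attributes of a Response. The names summarize and fetch_page are hypothetical helpers, not part of requests, and the 10-second timeout is an assumption chosen for illustration.

```python
import requests


def summarize(response):
    """Collect a few commonly used attributes of a Response (hypothetical helper)."""
    return {
        "type": type(response).__name__,     # class name of the response object
        "status_code": response.status_code,  # HTTP status, e.g. 200
        "encoding": response.encoding,        # encoding used to decode .text
        "text": response.text,                # body decoded to str
    }


def fetch_page(url, timeout=10):
    """Send a GET request and summarize the result (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return summarize(response)
```

Calling fetch_page('https://www.baidu.com') performs the actual network request; summarize can be applied to any Response you already hold.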

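POST requests

Pass the target URL to the post method together with the request parameters.

The request parameters appear only in the form data (the request body), not in the URL query string. This is the main difference between GET and POST.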
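The difference can be checked without contacting any server by preparing both kinds of request and inspecting them. This is only a sketch: the httpbin.org URLs and the form fields are made-up values for illustration.

```python
import requests

form = {"name": "alice", "page": 1}  # made-up form fields

# Prepare (but do not send) a GET and a POST carrying the same parameters
get_req = requests.Request("GET", "https://httpbin.org/get", params=form).prepare()
post_req = requests.Request("POST", "https://httpbin.org/post", data=form).prepare()

print(get_req.url)    # parameters are appended to the URL as a query string
print(get_req.body)   # a GET prepared this way has no body
print(post_req.url)   # the URL is left unchanged
print(post_req.body)  # parameters travel in the form-encoded body
```

To actually send the POST, the equivalent call would be requests.post(url, data=form, timeout=10).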

2. User agent
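By default requests identifies itself with a python-requests User-Agent header, which some sites treat differently from a browser. A minimal sketch of supplying a browser-like value through the headers parameter follows; the User-Agent string below is an illustrative assumption, not a value taken from this article.

```python
import requests

# Illustrative browser-like User-Agent (an assumption; the real value can be
# copied from the request headers shown in your browser's developer tools)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
}

# Value that requests sends when no User-Agent header is supplied
print(requests.utils.default_user_agent())

# Prepare the request without sending it, to inspect the outgoing header
prepared = requests.Request("GET", "https://www.baidu.com", headers=headers).prepare()
print(prepared.headers["User-Agent"])
```

To send it for real, pass the same dictionary: requests.get(url, headers=headers, timeout=10).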

Setting a proxy

When a site has to be crawled frequently or at scale, requests should be routed through a proxy. The relevant parameter is proxies.

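import json
from time import sleep

import requests


def get_proxy_ip():
    '''Fetch a proxy address ("host:port") from the proxy provider.'''
    xdl_url = "<proxy API URL>"  # placeholder: URL of your proxy provider
    response = requests.get(xdl_url, timeout=10)
    result = response.content.decode().replace("\r\n", "").strip()
    print('Proxy IP:', result)
    # Both schemes are sent through the same HTTP proxy
    proxies = {
        'http': f'http://{result}',
        'https': f'http://{result}',
    }
    return proxies


proxies = get_proxy_ip()

# Combinations of the score and sortType query parameters to iterate over
sort_type = [
    {"score": 0, "sortType": 5},
    {"score": 7, "sortType": 5},
    {"score": 5, "sortType": 5},
    {"score": 3, "sortType": 6},
    {"score": 2, "sortType": 6},
    {"score": 1, "sortType": 6},
]

url = "https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments&client=pc&clientVersion=1.0.0&t=1697678976363&loginType=3&uuid=122270672.1694479982215119690462.1694479982.1697622044.1697676765.11&productId=100026761878&isShadowSku=0&fold=1&bbtf=&shield="

count = 1
page_size = 30
max_retries = 3  # give up on a page after this many consecutive failures
for score in sort_type:
    page = 0
    retries = 0
    while True:
        params = {
            "page": page,
            "pageSize": page_size,
            "score": score["score"],
            "sortType": score["sortType"],
        }
        try:
            # Route the request through the proxy obtained above
            response = requests.get(url=url, params=params, proxies=proxies, timeout=10)
            result = json.loads(response.content.decode())
        except (requests.RequestException, ValueError):
            # Network or decoding error: wait, then retry the same page
            retries += 1
            if retries > max_retries:
                break
            sleep(1)
            continue
        retries = 0
        # An empty or missing comment list means there are no more pages
        if not result.get('comments'):
            break
        for data in result['comments']:
            print(count, data)
            count += 1
        page += 1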

The data here is fetched over the network from the URL above.
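As a small sketch of how a proxies dictionary like the one returned by get_proxy_ip is shaped and attached to requests, the code below builds one from a made-up address and stores it on a Session. The helper name build_proxies and the address 127.0.0.1:8080 are assumptions for illustration only.

```python
import requests


def build_proxies(address):
    """Map both URL schemes to one HTTP proxy given "host:port" (hypothetical helper)."""
    return {
        "http": f"http://{address}",
        "https": f"http://{address}",
    }


proxies = build_proxies("127.0.0.1:8080")  # made-up address for illustration

# A Session carries its proxies into every request made through it
session = requests.Session()
session.proxies.update(proxies)
print(session.proxies)
```

A single call can also take the dictionary directly, as in requests.get(url, proxies=proxies, timeout=10).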

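3. Data sent with the request

In the browser's developer tools (F12), the Payload tab shows the parameters sent along with the request. These parameters control which data is returned.

For example: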

    params = {
        "page": page,
        "pageSize": page_size,
        "score": score["score"],
        "sortType": score["sortType"],
    }
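To see how such a dictionary ends up in the request, the sketch below prepares a request without sending it and prints the final URL. The base URL https://example.com/comments and the concrete values are placeholders chosen for illustration.

```python
import requests

params = {
    "page": 0,
    "pageSize": 30,
    "score": 3,
    "sortType": 6,
}

# Query parameters already present in the URL are kept; params are appended
prepared = requests.Request(
    "GET", "https://example.com/comments?appid=item-v3", params=params
).prepare()
print(prepared.url)
```

Changing page between requests is therefore enough to walk through the result pages, which is what the loop in the proxy example does.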

4. Complete example code

import json

import requests

url = 'https://datacenter-web.eastmoney.com/api/data/v1/get?callback=jQuery11230021459067746777638_1697627529533&sortColumns=NOTICE_DATE%2CSUM%2CRECEIVE_START_DATE%2CSECURITY_CODE&sortTypes=-1%2C-1%2C-1%2C1&reportName=RPT_ORG_SURVEYNEW&columns=SECUCODE%2CSECURITY_CODE%2CSECURITY_NAME_ABBR%2CNOTICE_DATE%2CRECEIVE_START_DATE%2CRECEIVE_PLACE%2CRECEIVE_WAY_EXPLAIN%2CRECEPTIONIST%2CSUM&quoteColumns=f2~01~SECURITY_CODE~CLOSE_PRICE%2Cf3~01~SECURITY_CODE~CHANGE_RATE&quoteType=0&source=WEB&client=WEB&filter=(NUMBERNEW%3D%221%22)(IS_SOURCE%3D%221%22)(RECEIVE_START_DATE%3E%272020-10-18%27)'
for i in range(1, 498):
    params = {
        'pageSize': 50,
        'pageNumber': i,
    }
    response = requests.get(url=url, params=params, timeout=10)
    result = response.content.decode()
    # The reply is JSONP, i.e. callback({...}); keep only the JSON between
    # the first '(' and the last ')' so that the data itself is not altered
    result = result[result.index('(') + 1:result.rindex(')')]
    result = json.loads(result)
    # Stop when the server returns no more results
    if not result["result"]:
        break
    # Iterate over the records on this page
    for item in result['result']['data']:
        SECURITY_CODE = item['SECURITY_CODE']
        SECURITY_NAME_ABBR = item['SECURITY_NAME_ABBR']
        CLOSE_PRICE = item['CLOSE_PRICE']
        CHANGE_RATE = item['CHANGE_RATE']
        print(f'Code: {SECURITY_CODE}, Name: {SECURITY_NAME_ABBR}, Latest price: {CLOSE_PRICE}, Change rate: {CHANGE_RATE}')
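The endpoint above returns JSONP, that is, JSON wrapped in a call to the function named by the callback parameter. A minimal sketch of unwrapping such a response with the standard library only is shown below; strip_jsonp is a hypothetical helper name and the sample string is made up.

```python
import json
import re


def strip_jsonp(text):
    """Return the parsed payload of a JSONP string such as 'cb({...});' (hypothetical helper)."""
    # Capture everything between the first '(' and the last ')'
    match = re.fullmatch(r"\s*[\w$.]+\s*\((.*)\)\s*;?\s*", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP response")
    return json.loads(match.group(1))


# Made-up sample whose values contain parentheses and a semicolon
sample = 'jQuery123_456({"result": {"data": [{"SECURITY_CODE": "000001", "note": "a (b); c"}]}});'
data = strip_jsonp(sample)
print(data["result"]["data"][0]["note"])
```

Because only the outer wrapper is removed, parentheses and semicolons inside the data are preserved, which a chain of replace calls would not guarantee.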