Static Web Page Scraping

Latest recommended article published on 2024-11-08 05:53:46

石雕冰

Tags: web crawler

Article link: https://blog.csdn.net/fhiceng/article/details/135629855

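This article describes how to scrape static web pages with Python's Requests library: fetching HTML content, passing URL parameters, controlling the encoding, sending POST requests, setting timeouts, going through a proxy, and spoofing request headers.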

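A static web page is a page in plain HTML: it has no backend database, contains no server-side programs, and is not interactive. Such pages are relatively cumbersome to update, but their data is comparatively easy to scrape.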

Library used: Requests

Requests usage example

import requests

r=requests.get("https://mp.csdn.net/mp_blog/creation/editor?spm=1011.2124.3001.6192")
print(r.status_code)
print(r.headers['content-type'])
print(r.encoding)
print(r.content)  # response body as bytes; gzip- and deflate-encoded response data is decoded automatically

'''
200
text/html; charset=utf-8
utf-8
b'<!DOCTYPE html><html><head><meta charset="utf-8"><meta http-equiv=...
'''

# The encoding used by Requests can be changed; r.text is then decoded with the new encoding
r.encoding='ISO-8859-1'
print(r.encoding)
print(r.text)
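
What changing r.encoding affects is how the raw bytes are decoded into r.text; the bytes in r.content stay the same. As a minimal offline sketch of that effect, the snippet below decodes one byte string under two encodings. The sample string "café" is an assumption chosen for illustration, not something taken from the page fetched above.

```python
# Illustrative only: the same bytes read under two different encodings.
raw = "café".encode("utf-8")          # bytes as a server might send them

as_utf8 = raw.decode("utf-8")         # decoded with the matching encoding
as_latin1 = raw.decode("ISO-8859-1")  # decoded with a mismatched encoding

print(raw)
print(as_utf8)
print(as_latin1)
```

With the mismatched encoding the two bytes of "é" are read as two separate characters, which is the kind of garbling that a wrong r.encoding produces in r.text.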

JSON (JavaScript Object Notation) represents data as combinations of objects and arrays.
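
A short sketch of that object-and-array structure, using Python's standard json module. The sample document is made up for illustration. For a Requests response whose body is JSON, r.json() performs the same kind of parsing.

```python
import json

# Hypothetical JSON text: an object holding a string and an array of objects.
text = '{"site": "httpbin", "args": [{"key": "key1", "value": "value1"}]}'

data = json.loads(text)        # JSON object -> dict, JSON array -> list
first_arg = data["args"][0]

print(type(data).__name__, type(data["args"]).__name__)
print(first_arg["key"], first_arg["value"])
```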

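Passing URL parameters

The URL can be built in two ways.

Building it directly: http://httpbin.org/get?key1=value1

Passing a dictionary to the params argument, which Requests encodes into the query string: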

import requests

key_dict={'key1':'value1','key2':'value2'}
r=requests.get('http://httpbin.org/get',params=key_dict)
print(r.url)
print(r.text)

'''
http://httpbin.org/get?key1=value1&key2=value2
{
  "args": {
    "key1": "value1", 
    "key2": "value2"
  }, 
   "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.31.0", 
    "X-Amzn-Trace-Id": "Root=1-65a6782e-23eda628341abb5738e9de51"
  }, 
  "origin": "111.33.117.248", 
  "url": "http://httpbin.org/get?key1=value1&key2=value2"
}
'''
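
To see how the params dictionary ends up in the query string without sending anything over the network, one option is to build a prepared request and inspect its URL. This is a sketch rather than part of the original walkthrough, and it assumes the requests package is installed.

```python
import requests

key_dict = {'key1': 'value1', 'key2': 'value2'}

# Build the request object and prepare it; nothing is sent at this point.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=key_dict).prepare()

print(prepared.url)
```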

Web page encoding

Passing allow_redirects=False disables redirect following, so the status code of the redirect response itself is shown.
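
A minimal sketch of that option. The helper name first_hop and the URL in the closing comment are assumptions for illustration; the function returns the status code of the first response together with its Location header, if there is one.

```python
import requests

def first_hop(url, timeout=10):
    """Fetch url without following redirects.

    Returns (status_code, location), where location is the value of the
    Location header, or None when the response carries no such header.
    """
    r = requests.get(url, allow_redirects=False, timeout=timeout)
    return r.status_code, r.headers.get('Location')

# Example call (needs network access; the URL is illustrative):
# status, location = first_hop('http://github.com')
```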

Customizing request headers

# Get the response headers, returned as a dictionary-like object
r=requests.get("https://mp.csdn.net/mp_blog/creation/editor?spm=1011.2124.3001.6192")
print(r.headers)

'''
{'Server': 'openresty', 'Date': 'Tue, 16 Jan 2024 12:52:37 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Last-Modified': 'Tue, 09 Jan 2024 06:48:35 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Keep-Alive': 'timeout=20', 'Vary': 'Accept-Encoding', 'ETag': 'W/"659cec43-1478"', 'Strict-Transport-Security': 'max-age=864000', 'Content-Encoding': 'gzip'}
'''

# Get the headers of the request that was sent
print(r.request.headers)
'''
{'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
'''

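Sending a POST request

Pass a dictionary to the data parameter of requests.post; it is automatically form-encoded when the request is sent.

import requests

key_dict={'key1':'value1','key2':'value2'}
r=requests.post('http://httpbin.org/post',data=key_dict)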
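
The form encoding can be inspected locally by preparing the request instead of sending it. The following is a sketch under the assumption that the requests package is installed; it prints the Content-Type header and the encoded body that would go out.

```python
import requests

key_dict = {'key1': 'value1', 'key2': 'value2'}

# Prepare the POST request locally to see what would be sent.
prepared = requests.Request('POST', 'http://httpbin.org/post', data=key_dict).prepare()

print(prepared.headers['Content-Type'])
print(prepared.body)
```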

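Setting a timeout

Set the timeout to 20 seconds; if the server does not respond within that time, an exception (requests.exceptions.Timeout) is raised.

import requests

requests.get('http://httpbin.org',timeout=20)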
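
A scraper usually wants to handle that exception rather than crash. The wrapper below is one possible sketch; the name fetch_or_none is hypothetical and returning None on timeout is a design choice made for illustration. Note that the timeout applies to connecting and to each wait for data from the server, not to the total download time.

```python
import requests

def fetch_or_none(url, timeout=20):
    """Return the response, or None if the request timed out."""
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Raised when connecting, or waiting for data from the server,
        # takes longer than `timeout` seconds.
        return None

# Example call (needs network access):
# r = fetch_or_none('http://httpbin.org', timeout=20)
```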

Accessing through a proxy

import requests

proxies={
    "http":"http://10.10.1.10:3128",
    "https":"http://10.10.1.10:1080",
}
# If the proxy requires a username and password
proxies={
    "http":"http://user:pass@10.10.1.10:3128/",
}
requests.get("http://httpbin.org",proxies=proxies)
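
When proxy credentials contain characters such as @ or :, they generally need to be percent-encoded before being placed in the proxy URL. The helper below is a hypothetical sketch (build_proxy_url is not part of Requests) that uses only the standard library to assemble such URLs for the proxies dictionary; the addresses are the placeholder ones from the example above.

```python
from urllib.parse import quote

def build_proxy_url(host, port, user=None, password=None, scheme='http'):
    """Assemble a proxy URL, percent-encoding the credentials if given."""
    if user is None:
        return f'{scheme}://{host}:{port}'
    auth = quote(user, safe='')
    if password is not None:
        auth += ':' + quote(password, safe='')
    return f'{scheme}://{auth}@{host}:{port}'

proxies = {
    'http': build_proxy_url('10.10.1.10', 3128, 'user', 'pass'),
    'https': build_proxy_url('10.10.1.10', 1080),
}
print(proxies)
```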

Spoofing the request headers

import requests

headers={'User-Agent':'alexkh'}
r=requests.get('http://httpbin.org',headers=headers)
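
To check that the custom User-Agent replaces the library default, the request can be prepared through a Session and inspected without being sent. This is a sketch assuming the requests package is installed; the /get path is chosen for illustration.

```python
import requests

headers = {'User-Agent': 'alexkh'}

session = requests.Session()
req = requests.Request('GET', 'http://httpbin.org/get', headers=headers)

# prepare_request merges the session's default headers with ours;
# values passed for the individual request take precedence. Nothing is sent.
prepared = session.prepare_request(req)

print(requests.utils.default_headers()['User-Agent'])  # library default
print(prepared.headers['User-Agent'])                  # overridden value
```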
