19. Web Crawlers
A crawler (also called a web crawler or spider) is a program that fetches information from the World Wide Web according to certain rules.
Purpose of crawling: data collection
Types of crawlers:
General-purpose web crawlers (search engines such as Baidu); these follow the robots protocol
Focused web crawlers
Incremental web crawlers
Accumulative web crawlers
Deep web crawlers (the deep/dark web)
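Since general-purpose crawlers are expected to follow the robots protocol, compliance can be checked in code. A minimal sketch using the standard-library `urllib.robotparser`, with made-up example rules (not any real site's robots.txt):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Normally: rp.set_url("http://example.com/robots.txt"); rp.read()
# Here we parse hypothetical rules directly so the sketch works offline.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Check whether a given user agent may fetch a URL.
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))   # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/data"))  # False
```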
19.1 The First Crawler Program
# import the networking module from the standard library
import urllib.request

url = 'http://www.baidu.com'
# send the request; urlopen returns a response object
response = urllib.request.urlopen(url)
# read the response body as bytes
data = response.read()
print(data)
# import the networking module
import urllib.request

url = "http://www.sina.com.cn"
# send the request; urlopen returns a response object
response = urllib.request.urlopen(url)
# read the response body as bytes
data = response.read()
# print(data)
# decode the bytes to a str, then save to a file
html = data.decode("utf-8")
with open("sina1.html", "w", encoding="utf-8") as f:
    f.write(html)
print("Sina page saved")
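The key step above is decoding: `response.read()` returns raw bytes, which must be decoded with the page's charset before being written to a text file. A small offline illustration of that bytes-to-str step:

```python
# Stand-in for response.read(): UTF-8 encoded bytes.
raw = "新浪新闻".encode("utf-8")
print(type(raw))   # <class 'bytes'>

# Decode with the correct charset to get a str.
html = raw.decode("utf-8")
print(type(html))  # <class 'str'>

# Decoding with the wrong charset raises an error (or garbles the text):
try:
    raw.decode("ascii")
except UnicodeDecodeError as e:
    print("wrong charset:", e.reason)
```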
19.2 Using Fiddler
Fiddler is a packet-capture (HTTP debugging) tool.
https://www.telerik.com/download/fiddler
Installation: choose "I Agree", pick an install path, click "Install", then "Close" when it finishes.
Open Fiddler, then open a browser and visit the Baidu page; many captured requests will appear.
Use "Remove all" to clear the session list.
Run the crawler code in PyCharm, then switch to Fiddler; the captured request headers look like this:
Accept-Encoding: identity    (the encoding the client accepts)
User-Agent: Python-urllib/3.9    (the client identification; this exposes the script as a Python crawler)
Connection: close
Host: www.sina.com.cn
In the browser, open the Baidu page and use "View source" to compare with the captured traffic.
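The default `Python-urllib/3.x` User-Agent seen in Fiddler is what exposes the script as a crawler. Without a proxy tool, the headers a `Request` will carry can also be inspected in code (note that urllib normalizes header names to capitalized form, e.g. "User-agent"):

```python
from urllib import request

req = request.Request("http://www.sina.com.cn")
req.add_header("Accept-Encoding", "identity")
req.add_header("User-Agent", "Python-urllib/3.9")

# urllib stores header names capitalized, e.g. "User-agent"
print(req.headers)
print(req.get_header("User-agent"))
```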
19.3 Faking Request Headers
Sites deploy anti-crawling measures; common counter-measures are:
1. Fake the request headers (send a browser User-Agent)
2. Space out requests, e.g. time.sleep(random.uniform(1, 3))
3. Use IP proxies (recommended)
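Techniques 2 and 3 above can be sketched with the standard library alone. The proxy address below is a placeholder, not a working proxy:

```python
import random
import time
from urllib import request

def polite_delay(lo=1.0, hi=3.0):
    """Sleep a random interval so requests are not evenly spaced."""
    delay = random.uniform(lo, hi)
    time.sleep(delay)
    return delay

# Route traffic through an HTTP proxy (placeholder address).
proxy_handler = request.ProxyHandler({"http": "http://127.0.0.1:8888"})
opener = request.build_opener(proxy_handler)
# opener.open(url) would now send the request via the proxy.
print(type(opener).__name__)  # OpenerDirector
```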
from urllib import request

# pretend to be a Chrome browser via the User-Agent header
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"}
url = "https://www.baidu.com"
# build a Request object that carries the custom headers
req = request.Request(url=url, headers=headers)
resp = request.urlopen(req)
data = resp.read()
print(data)
with open("baidu.html", "wb") as f:
    f.write(data)
from urllib import request
import random

# a pool of User-Agent strings; one is picked at random per request
# (note: each entry must end with a comma, or Python concatenates
# adjacent string literals into a single string)
us = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
]
headers = {
    "User-Agent": random.choice(us)
}
print(headers)
url = "https://www.baidu.com"
req = request.Request(url=url, headers=headers)
resp = request.urlopen(req)
data = resp.read()
print(data)
# with open("qq.html", "wb") as f:
#     f.write(data)
import random
from urllib import request
import chardet  # third-party: pip install chardet

us = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
]
headers = {
    "User-Agent": random.choice(us)
}
url = "http://www.sina.com.cn"
# the real request object, carrying the faked headers
req = request.Request(url=url, headers=headers)
resp = request.urlopen(req)
data = resp.read()
# chardet.detect returns a dict such as {'encoding': 'utf-8', ...}
res = chardet.detect(data)
char = res.get("encoding")
print(char)
# print(res)
# decode the raw bytes into a string with the detected charset
html = data.decode(char)
# html = data.decode("gb2312", errors="ignore")
# print(html)
# with open("qq.html", "wb") as f:
#     f.write(data)