19. Web Crawlers
A crawler (also called a web crawler or spider) is a program that fetches information from the World Wide Web according to certain rules.
Purpose of crawling: data collection
Types of crawlers:
General-purpose web crawlers (search engines such as Baidu); these follow the robots protocol
Focused web crawlers
Incremental web crawlers
Accumulative web crawlers
Deep web crawlers (the deep/dark web)
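Since general-purpose crawlers are expected to follow the robots protocol, compliance can be checked in code. A minimal sketch using the standard-library `urllib.robotparser`, with made-up example rules (not any real site's robots.txt):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Normally: rp.set_url("http://example.com/robots.txt"); rp.read()
# Here we parse hypothetical rules directly so the sketch works offline.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Check whether a given user agent may fetch a URL.
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))   # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/data"))  # False
```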
19.1 The First Crawler Program
# import the networking module from the standard library
import urllib.request

url = 'http://www.baidu.com'
# send the request; urlopen returns a response object
response = urllib.request.urlopen(url)
# read the response body as bytes
data = response.read()
print(data)
# import the networking module
import urllib.request

url = "http://www.sina.com.cn"
# send the request; urlopen returns a response object
response = urllib.request.urlopen(url)
# read the response body as bytes
data = response.read()
# print(data)
# decode the bytes to a str, then save to a file
html = data.decode("utf-8")
with open("sina1.html", "w", encoding="utf-8") as f:
    f.write(html)
print("Sina page saved")
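The key step above is decoding: `response.read()` returns raw bytes, which must be decoded with the page's charset before being written to a text file. A small offline illustration of that bytes-to-str step:

```python
# Stand-in for response.read(): UTF-8 encoded bytes.
raw = "新浪新闻".encode("utf-8")
print(type(raw))   # <class 'bytes'>

# Decode with the correct charset to get a str.
html = raw.decode("utf-8")
print(type(html))  # <class 'str'>

# Decoding with the wrong charset raises an error (or garbles the text):
try:
    raw.decode("ascii")
except UnicodeDecodeError as e:
    print("wrong charset:", e.reason)
```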
19.2 Using Fiddler
Fiddler is a packet-capture (HTTP debugging) tool.
https://www.telerik.com/download/fiddler
Installation: choose "I Agree", pick an install path, click "Install", then "Close" when it finishes.
Open Fiddler, then open a browser and visit the Baidu page; many captured requests will appear.
Use "Remove all" to clear the session list.
Run the crawler code in PyCharm, then switch to Fiddler; the captured request headers look like this:
Accept-Encoding: identity    (the encoding the client accepts)
User-Agent: Python-urllib/3.9    (the client identification; this exposes the script as a Python crawler)
Connection: close
Host: www.sina.com.cn
In the browser, open the Baidu page and use "View source" to compare with the captured traffic.
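The default `Python-urllib/3.x` User-Agent seen in Fiddler is what exposes the script as a crawler. Without a proxy tool, the headers a `Request` will carry can also be inspected in code (note that urllib normalizes header names to capitalized form, e.g. "User-agent"):

```python
from urllib import request

req = request.Request("http://www.sina.com.cn")
req.add_header("Accept-Encoding", "identity")
req.add_header("User-Agent", "Python-urllib/3.9")

# urllib stores header names capitalized, e.g. "User-agent"
print(req.headers)
print(req.get_header("User-agent"))
```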
19.3 Faking Request Headers
Sites deploy anti-crawling measures; common counter-measures are:
1. Fake the request headers (send a browser User-Agent)
2. Space out requests, e.g. time.sleep(random.uniform(1, 3))
3. Use IP proxies (recommended)
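Techniques 2 and 3 above can be sketched with the standard library alone. The proxy address below is a placeholder, not a working proxy:

```python
import random
import time
from urllib import request

def polite_delay(lo=1.0, hi=3.0):
    """Sleep a random interval so requests are not evenly spaced."""
    delay = random.uniform(lo, hi)
    time.sleep(delay)
    return delay

# Route traffic through an HTTP proxy (placeholder address).
proxy_handler = request.ProxyHandler({"http": "http://127.0.0.1:8888"})
opener = request.build_opener(proxy_handler)
# opener.open(url) would now send the request via the proxy.
print(type(opener).__name__)  # OpenerDirector
```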
from urllib import request

# pretend to be a Chrome browser via the User-Agent header
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36"}
url = "https://www.baidu.com"
# build a Request object that carries the custom headers
req = request.Request(url=url, headers=headers)
resp = request.urlopen(req)
data = resp.read()
print(data)
with open("baidu.html", "wb") as f:
    f.write(data)
from urllib import request
import random

# a pool of User-Agent strings; one is picked at random per request
# (note: each entry must end with a comma, or Python concatenates
# adjacent string literals into a single string)
us = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
]
headers = {
    "User-Agent": random.choice(us)
}
print(headers)
url = "https://www.baidu.com"
req = request.Request(url=url, headers=headers)
resp = request.urlopen(req)
data = resp.read()
print(data)
# with open("qq.html", "wb") as f:
#     f.write(data)
import random
from urllib import request
import chardet  # third-party: pip install chardet

us = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
]
headers = {
    "User-Agent": random.choice(us)
}
url = "http://www.sina.com.cn"
# the real request object, carrying the faked headers
req = request.Request(url=url, headers=headers)
resp = request.urlopen(req)
data = resp.read()
# chardet.detect returns a dict such as {'encoding': 'utf-8', ...}
res = chardet.detect(data)
char = res.get("encoding")
print(char)
# print(res)
# decode the raw bytes into a string with the detected charset
html = data.decode(char)
# html = data.decode("gb2312", errors="ignore")
# print(html)
# with open("qq.html", "wb") as f:
#     f.write(data)