Introduction
Python has a reputation for making web scraping easy; that productivity comes mainly from two modules: urllib and requests.
01 Web Data Collection: the urllib Library
Official documentation
The urllib library is Python's built-in HTTP request library, made up of the following modules:
(1) urllib.request: issues requests
(2) urllib.error: exception handling
(3) urllib.parse: URL parsing
(4) urllib.robotparser: robots.txt parsing
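Before moving to urllib.request, a minimal sketch of what urllib.parse does: splitting a URL into its components and percent-encoding unsafe characters (the example URL is illustrative).

```python
from urllib.parse import urlparse, quote, unquote

# Split a URL into its components
parts = urlparse('http://www.python.org/path?q=python#frag')
print(parts.scheme)    # http
print(parts.netloc)    # www.python.org
print(parts.path)      # /path
print(parts.query)     # q=python
print(parts.fragment)  # frag

# quote percent-encodes characters that are unsafe in URLs; unquote reverses it
print(quote('hello world'))      # hello%20world
print(unquote('hello%20world'))  # hello world
```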
from urllib.request import urlopen, Request

# Method 1: request the URL with a plain GET
with urlopen('http://www.python.org/') as f:
    # The response body is bytes by default; use decode() to convert it to str.
    print(f.read(300).decode('utf-8'))

# Method 2: issue the request through a Request object
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
# Wrap header information into the request to make it look like it comes from a browser.
request = Request('http://www.python.org/', headers={'User-Agent': user_agent})
with urlopen(request) as f:
    # The response body is bytes by default; use decode() to convert it to str.
    print(f.read(300).decode('utf-8'))
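The module list above also mentions urllib.robotparser; a minimal offline sketch, feeding it robots.txt lines directly instead of fetching a real file (the rules here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules without a network request by passing lines directly
rp = RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

# can_fetch(useragent, url) answers whether a crawler may fetch that URL
print(rp.can_fetch('*', 'http://example.com/public/page.html'))   # True
print(rp.can_fetch('*', 'http://example.com/private/page.html'))  # False
```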
02 Web Data Collection: the requests Library
requests official site
Summary of request methods
Summary of the response object
Application:
To avoid triggering a real site's anti-scraping defenses, we first write our own small server to test against:
from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def index():
    # Read data submitted via GET
    print(request.args)
    print('Client IP:', request.remote_addr)
    print(request.user_agent)
    return 'index: %s' % (request.args)

@app.route('/post/', methods=['POST'])
def post():
    # Read data submitted via POST
    print(request.form)
    # return 'post info: %s' % (request.form)
    username = request.form.get('username')
    password = request.form.get('password')
    if username == 'admin' and password == 'westos':
        return 'login success'
    else:
        return 'login failed'

if __name__ == '__main__':
    app.run()
Now use requests.get() and requests.post() to simulate a browser talking to the server:
import requests
# requests raises its own exceptions; urllib.error.HTTPError does not apply here
from requests.exceptions import RequestException

def get():
    url = 'http://127.0.0.1:5000'
    # By default, requests sends a User-Agent such as python-requests/2.22.0
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36'}
    response = requests.get(url, headers=headers)
    # print(response)              # the Response object, e.g. <Response [200]>
    # print(response.status_code)  # HTTP status code
    # print(response.text)         # body decoded to str
    # print(response.content)      # raw body as bytes
    # print(response.encoding)     # encoding used for .text

def post():
    url = 'http://127.0.0.1:5000/post/'
    data = {
        'username': 'admin',
        'password': 'westos',
    }
    try:
        response = requests.post(url, data=data)
        print(response.text)
    except RequestException as e:
        print('Failed to fetch %s: %s' % (url, e))

if __name__ == '__main__':
    post()
    # get()
Advanced Usage
Adding headers
Some sites insist on browser-like headers and return an error if you omit them. Adding header information makes the crawler look more like a browser; by default, the crawler's user agent is python-requests/<version>.
headers = {'User-Agent': useragent}
response = requests.get(url, headers=headers)
The User-Agent is a string identifying the browser, effectively its ID card. When scraping a site, rotating the User-Agent frequently helps avoid triggering the corresponding anti-scraping mechanisms. The fake-useragent package provides good support for rotating User-Agents, making it a handy counter-anti-scraping tool.
User-agent data: https://fake-useragent.herokuapp.com/browsers/0.1.11
user_agent = UserAgent().random
import requests
from fake_useragent import UserAgent

def add_headers():
    url = 'http://127.0.0.1:5000'
    ua = UserAgent()
    # print(ua.random)  # produce a random User-Agent string
    headers = {'User-Agent': ua.random}
    # By default, requests sends a User-Agent such as python-requests/2.22.0
    response = requests.get(url, headers=headers)
    print(response)

if __name__ == '__main__':
    add_headers()
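fake-useragent loads its data from a remote endpoint, which is sometimes unavailable. A simple fallback is to rotate through a hand-picked local pool with the standard library; the pool entries and helper name below are illustrative, not part of any package.

```python
import random

# A small hand-picked User-Agent pool as a fallback when fake-useragent
# cannot reach its remote data source. The entries here are illustrative.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36',
]

def random_user_agent():
    # Pick one entry at random, mimicking UserAgent().random
    return random.choice(USER_AGENTS)

print(random_user_agent())
```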
Proxy settings
When scraping, the crawler sometimes gets blocked by the server. The main remedies are to slow down the request rate or to go through proxy IPs. Proxy IPs can be scraped from the web or bought (e.g. on Taobao); free proxies are listed on sites such as Xici.
proxies = {'http': 'http://127.0.0.1:9743', 'https': 'https://127.0.0.1:9743'}
response = requests.get(url, proxies=proxies)
import requests
from fake_useragent import UserAgent

ua = UserAgent()
proxies = {
    'http': 'http://223.100.166.3:36945',
    'https': 'https://115.219.171.35:8118',
}
response = requests.get('http://47.92.255.98:8000',
                        headers={'User-Agent': ua.random},
                        proxies=proxies)
print(response)
print(response.text)
To test this you need to deploy the Flask project to a remote server, because by default it only serves the local host.
Project Cases
Case 1: scraping JD product pages
Case 2: submitting search keywords to Baidu/360
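Both engines take the keyword as a query-string parameter (Baidu reads `wd`, 360's so.com reads `q`); building the search URLs can be sketched offline with the standard library. The helper names are ours, chosen for illustration:

```python
from urllib.parse import urlencode

# Build search URLs for Baidu and 360 (so.com). Baidu reads the keyword
# from the 'wd' parameter, so.com from 'q'.
def baidu_url(keyword):
    return 'https://www.baidu.com/s?' + urlencode({'wd': keyword})

def so_url(keyword):
    return 'https://www.so.com/s?' + urlencode({'q': keyword})

print(baidu_url('python'))  # https://www.baidu.com/s?wd=python
print(so_url('python'))     # https://www.so.com/s?q=python
```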
import requests
from requests.exceptions import RequestException
from colorama import Fore
from fake_useragent import UserAgent

def download_page(url, params=None):
    ua = UserAgent()
    try:
        response = requests.get(url, headers={'User-Agent': ua.random}, params=params)
    except RequestException as e:
        print(Fore.RED + '[-] Failed to fetch %s: %s' % (url, e))
        return None
    else:
        return response.content

def download_file(content=b'', filename='hello.html'):
    with open(filename, 'wb') as f:
        f.write(content)
    print(Fore.GREEN + '[+] Wrote file %s successfully' % (filename))

if __name__ == '__main__':
    # url = 'https://item.jd.com/100012014970.html#crumb-wrap'
    # html = download_page(url)
    # print(html)
    # download_file(content=html)
    url = 'https://www.so.com/s'
    params = {
        'q': 'python',
    }
    content = download_page(url, params)
    if content is not None:
        download_file(content)
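requests URL-encodes the params dict before appending it to the URL, so `params={'q': 'python'}` becomes `?q=python`; non-ASCII keywords (common for Chinese search terms) are percent-encoded as UTF-8 bytes. This can be checked offline with urllib.parse:

```python
from urllib.parse import urlencode, parse_qs

# What requests does with params={'q': 'python'}: encode it as a query string
query = urlencode({'q': 'python'})
print(query)  # q=python

# Non-ASCII keywords are percent-encoded (UTF-8 bytes)
query_cn = urlencode({'q': '爬虫'})
print(query_cn)  # q=%E7%88%AC%E8%99%AB

# parse_qs reverses the encoding
print(parse_qs(query_cn))  # {'q': ['爬虫']}
```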