Python Notes -- Python Web Crawlers
1. What Is a Web Crawler
Example: scraping the Baidu homepage
import urllib.request  # import the network request module
response = urllib.request.urlopen('http://www.baidu.com')  # send the network request
print(response.read().decode('utf-8'))  # print the page content
2. Common Web Crawling Techniques
2.1. Network Requests
Three commonly used modules:
urllib module
request: defines the methods and classes for opening URLs
error: contains the exceptions raised by urllib.request
parse: parses URLs and handles URL quoting/encoding
robotparser: parses robots.txt files (the crawler rules file)
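The robotparser submodule can also be fed rules directly, which makes it easy to check crawl permissions offline. A minimal sketch (the rules and URLs below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Feed robots.txt rules as a list of lines instead of fetching them over the network
rp = RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])
# Check whether a given user agent may fetch a given URL
print(rp.can_fetch('*', 'http://example.com/index.html'))    # True
print(rp.can_fetch('*', 'http://example.com/private/data'))  # False
```

In real use, `rp.set_url(...)` plus `rp.read()` downloads a site's actual robots.txt before checking.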
Simple example:
import urllib.request
import urllib.parse  # import the URL parsing module
# build the request parameters
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
# send the network request
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
html = response.read()
print(html)
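The error submodule listed above pairs naturally with urllib.request. A sketch of defensive fetching (the fetch helper and the .invalid test host are illustrative, not from the original notes):

```python
import urllib.request
import urllib.error

def fetch(url):
    """Return the page bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read()
    except urllib.error.HTTPError as e:   # server answered with an error status (4xx/5xx)
        print('HTTP error:', e.code)
    except urllib.error.URLError as e:    # network-level failure (DNS, connection refused, ...)
        print('URL error:', e.reason)
    return None

# The .invalid TLD never resolves, so this reliably exercises the URLError branch
fetch('http://nonexistent.invalid/')
```

Note that HTTPError must be caught before URLError, since it is a subclass of URLError.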
urllib3 module
Official site: http://urllib3.readthedocs.io/en/latest/
Simple example:
import urllib3
# create a PoolManager object, which handles connection pooling and thread safety
http = urllib3.PoolManager()
response = http.request('GET', 'http://www.baidu.com')
print(response.data.decode())  # print the response body
Simple example 2:
import urllib3
# create a PoolManager object, which handles connection pooling and thread safety
http = urllib3.PoolManager()
response = http.request('POST', 'http://httpbin.org/post', fields={'word': 'hello'})
print(response.data.decode())  # print the response body
Requests module
Official site: http://www.python-requests.org/en/master/
Simple example:
import requests
response = requests.get('http://www.baidu.com')
print(response.status_code)  # print the status code
print(response.url)  # print the request URL
print(response.headers)  # print the response headers
print(response.cookies)  # print the cookies
print(response.text)  # print the page source as text
print(response.content)  # print the page source as bytes
Simple example (POST request):
import requests
data = {'word': 'hello'}  # form parameters
response = requests.post('http://httpbin.org/post', data=data)
print(response.status_code)  # print the status code
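For GET requests with query-string parameters, requests can also build a request without sending it, which makes the final URL easy to inspect offline. A small sketch (the httpbin URL is reused from the examples above):

```python
import requests

# Build a request object and prepare it; nothing is sent over the network
req = requests.Request('GET', 'http://httpbin.org/get', params={'word': 'hello'})
prepared = req.prepare()
print(prepared.url)  # http://httpbin.org/get?word=hello
```

A Session object can send the prepared request later with session.send(prepared).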
2.2. Handling Request Headers
Test site: https://www.whatismyip.com
import requests
url = 'https://www.whatismyip.com/'  # request URL
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
response = requests.get(url, headers=headers)  # send the network request
print(response.content.decode('utf-8'))
2.3. Network Timeouts
import requests
# import three exception classes from the requests module
from requests.exceptions import ReadTimeout, HTTPError, RequestException
# send the request 50 times in a loop
for a in range(0, 50):
    try:
        response = requests.get('https://www.whatismyip.com/', timeout=0.5)
        print(response.status_code)
    except ReadTimeout:
        print('timeout')
    except HTTPError:
        print('httperror')
    except RequestException:
        print('reqerror')
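The loop above simply reports each failure; wrapping the retry-and-wait logic in a helper makes it reusable. A generic sketch (an illustrative addition, not part of the original notes):

```python
import time

def retry(func, attempts=3, delay=0.1):
    """Call func(); on failure, wait and retry, re-raising after the last attempt."""
    for i in range(attempts):
        try:
            return func()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay * (2 ** i))  # exponential back-off between attempts

# Example: a flaky function that fails twice before succeeding
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('temporary failure')
    return 'ok'

print(retry(flaky))  # ok
```

In a crawler, func would be a lambda wrapping requests.get with a timeout.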
2.4. Proxy Service
Free proxy list: www.xicidaili.com
import requests
# set the proxy IPs
proxy = {'http': '221.6.32.214:50514',
         'https': '120.78.225.5:3128'}
response = requests.get('https://www.baidu.com', proxies=proxy)
print(response.content.decode('utf-8'))
2.5. Parsing HTML
LXML module
Requests-HTML module
HTMLParser module
BeautifulSoup module
Install: easy_install beautifulsoup4
1. pip install bs4
2. pip install beautifulsoup4
Source download: https://www.crummy.com/software/BeautifulSoup/bs4/download/
Install command: python setup.py install
Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Simple example:
import requests
from bs4 import BeautifulSoup
response = requests.get('http://news.baidu.com')
soup = BeautifulSoup(response.text, features='lxml')
print(soup.find('title').text)
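BeautifulSoup can also parse an HTML string directly; the snippet below uses the built-in html.parser backend so no lxml install is required (the markup is a made-up sample):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a href="/a">First</a></li>
  <li><a href="/b">Second</a></li>
</ul>
"""
soup = BeautifulSoup(html, features='html.parser')  # stdlib parser backend
for a in soup.find_all('a'):  # find_all returns every matching tag in document order
    print(a.get('href'), a.text)
```

find() returns only the first match, while find_all() returns a list of all matches.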
3. Common Crawler Frameworks
3.1. Scrapy
Official site: https://scrapy.org/
Install: pip install scrapy
3.2. Crawley Project
Official site: http://project.crawley-cloud.com/
Install: pip install crawley
3.3. pyspider
Source: https://github.com/binux/pyspider/releases
Documentation: http://docs.pyspider.org/
Install: pip install pyspider
4. Hands-On Project: Quick Train-Ticket Scraper
4.1. Overview
4.2. Setting Up the Qt Environment
Official site: https://www.qt.io/download
download.qt.io/archive/qt
4.3. Designing the Main Window
4.4. Analyzing the Request Parameters
4.5. Downloading the Station-Name File
4.6. Fetching Ticket Information
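The station-name file used in ticket projects like this is typically a single JavaScript string of '|'-separated fields joined by '@'. A parsing sketch over a hard-coded excerpt, assuming the commonly seen field layout (the two sample entries are illustrative; the real file contains thousands):

```python
# Excerpt in the station-file format: @abbrev|name|telecode|pinyin|short|index
raw = '@bjb|北京北|VAP|beijingbei|bjb|0@sha|上海|SHH|shanghai|sha|1'

stations = {}
for entry in raw.split('@')[1:]:       # the text before the first '@' is empty
    fields = entry.split('|')
    name, code = fields[1], fields[2]  # Chinese station name and 3-letter telecode
    stations[name] = code

print(stations)  # {'北京北': 'VAP', '上海': 'SHH'}
```

The resulting name-to-telecode mapping is what a ticket query request needs for its from_station and to_station parameters.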