The Data Road - Python Web Scraping: Basic Libraries and Parsing Libraries

weixin_33842328

Original article: http://www.cnblogs.com/Iceredtea/p/11050660.html


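I. Basic library: urllib

urllib is Python's built-in HTTP request library. It contains four modules:

request: the most basic HTTP request module, used to simulate sending requests.
error: the exception-handling module. If a request fails, we can catch the exception and then retry or take other action so that the program does not terminate unexpectedly.
parse: a utility module that provides URL-handling functions such as splitting, parsing, and joining.
robotparser: parses a site's robots.txt file to decide which pages may be crawled and which may not. It is used relatively rarely.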

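1. The urllib.request module

The main job of the request module is to construct HTTP requests; with it we can simulate the way a browser issues a request.

The request module also handles authentication, redirection, browser cookies, and more.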

- The urlopen function

 urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

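urlopen parameters:

url: the URL to request.
data: if omitted, the request is a GET; if supplied (as bytes), the request is a POST.
timeout: the timeout in seconds. If no response arrives within this time, an exception is raised. If the parameter is omitted, the global default timeout is used. It applies to HTTP, HTTPS, and FTP connections.
context: must be an ssl.SSLContext instance; it specifies the SSL settings.
cafile: a CA certificate file.
capath: a directory of CA certificates; useful when requesting HTTPS URLs. Both cafile and capath were deprecated in Python 3.6 and removed in Python 3.13; use context instead.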
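
The effect of the data and timeout parameters can be tried without touching the public internet. The following is a minimal sketch, assuming a throwaway local server: EchoHandler and the 127.0.0.1 address are illustrative names that do not appear in the original article, and the sketch assumes no HTTP proxy is configured in the environment.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler: replies with the HTTP method and body it received."""

    def _reply(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length) if length else b""
        payload = self.command.encode() + b" " + body
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    do_GET = _reply
    do_POST = _reply

    def log_message(self, *args):
        pass  # keep the console quiet


server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = "http://127.0.0.1:%d/" % server.server_port

# No data argument: urlopen sends a GET request.
with urllib.request.urlopen(base_url, timeout=5) as resp:
    get_status = resp.status
    get_text = resp.read().decode("utf-8")

# data must be bytes; passing it turns the request into a POST.
form = urllib.parse.urlencode({"name": "germey"}).encode("utf-8")
with urllib.request.urlopen(base_url, data=form, timeout=5) as resp:
    post_text = resp.read().decode("utf-8")

server.shutdown()
server.server_close()
print(get_status, repr(get_text), repr(post_text))
```

Using the response as a context manager closes the connection once the body has been read.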

- The Request class

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

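Request parameters:

url: the URL to request. This is the only required parameter; all others are optional.
data: if supplied, it must be of type bytes. If the data is a dictionary, encode it first with urlencode() from the urllib.parse module.
headers: a dictionary of request headers. Headers can be passed through the headers parameter when the request is constructed, or added later by calling add_header() on the request instance. The most common use is to change User-Agent so that the request looks like it comes from a browser.
origin_req_host: the host name or IP address of the requester.
unverifiable: whether the request is unverifiable. The default is False. An unverifiable request is one whose result the user had no option to approve. For example, if we request an image embedded in an HTML document and the user had no chance to approve the automatic fetching of that image, unverifiable should be True.
method: a string naming the HTTP method to use, such as GET, POST, or PUT.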

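from urllib import request, parse
 
url = 'http://httpbin.org/post'
# Request headers; a custom User-Agent makes the request look like a browser
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form rather than dict, to avoid shadowing the built-in dict type
form = {
    'name': 'Germey'
}
# urlencode() returns a str, but the data argument must be bytes
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))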
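
The add_header() route mentioned in the parameter list can be checked without sending anything. This is a small sketch under stated assumptions: the User-Agent string is made up for illustration, and the request object is only inspected, never opened.

```python
from urllib import parse, request

form = {"name": "Germey"}
data = parse.urlencode(form).encode("utf-8")  # str -> bytes

req = request.Request("http://httpbin.org/post", data=data)
# Add a header after construction instead of passing headers=...
req.add_header("User-Agent", "Mozilla/5.0 (compatible; demo-crawler)")  # illustrative value

# With data present and no explicit method, the request defaults to POST.
method = req.get_method()
# Request stores header names capitalized, so look the header up as "User-agent".
user_agent = req.get_header("User-agent")

print(method, user_agent, req.data)
```

Note the capitalization detail: Request normalizes header names with str.capitalize(), so get_header("User-Agent") would return None here.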

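- Handlers

The BaseHandler class in urllib.request is the parent class of all other handlers.

Common handlers:

HTTPDefaultErrorHandler: handles HTTP error responses; every error is raised as an HTTPError exception.
HTTPRedirectHandler: handles redirects.
HTTPCookieProcessor: handles cookies.
ProxyHandler: sets proxies; by default the proxy setting is empty.
HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords.
HTTPBasicAuthHandler: manages authentication; if opening a link requires authentication, this handler can take care of it.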
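
As one way to combine the last two entries, the sketch below builds an opener that can answer HTTP Basic authentication challenges. It assumes a hypothetical service at http://localhost:5000/ with the placeholder credentials username/password, and it uses HTTPPasswordMgrWithDefaultRealm, a standard-library subclass of HTTPPasswordMgr. Nothing is sent over the network.

```python
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)

# Hypothetical URL and credentials, for illustration only.
url = "http://localhost:5000/"
password_mgr = HTTPPasswordMgrWithDefaultRealm()
# realm=None means: use these credentials for any realm at this URL.
password_mgr.add_password(None, url, "username", "password")

auth_handler = HTTPBasicAuthHandler(password_mgr)
opener = build_opener(auth_handler)

# The opener now carries the auth handler alongside the default handlers.
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
# opener.open(url) would send the request and respond to a 401 challenge.
```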

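- Proxies

ProxyHandler takes a dictionary whose keys are protocol types (such as http or https) and whose values are proxy URLs; several proxies can be added.

We then build an opener from this handler with build_opener() and use it to send the request.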

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener
 
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

- Cookies

# Fetch cookies from a page and print them one per line
import http.cookiejar, urllib.request
 
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)

# Fetch cookies from a page and save them to a file
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)  # or: http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

Note: MozillaCookieJar is a subclass of CookieJar. Both LWPCookieJar and MozillaCookieJar can load and save cookies, but they use different file formats, so a file must be loaded with the same class that saved it.

Call load() to read a local cookies file and recover its contents.

# cookies.txt was saved above with MozillaCookieJar, so it is loaded with the same class
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

2. The urllib.error module

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)
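
HTTPError is a subclass of URLError, so when both are handled the more specific one has to be caught first. The sketch below illustrates this under stated assumptions: QuietHandler, describe, and the local 127.0.0.1 server are hypothetical names used only so that the example runs without internet access, and no HTTP proxy is assumed to be configured.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class QuietHandler(BaseHTTPRequestHandler):
    """Defines no do_GET, so the base class answers every GET with 501."""

    def log_message(self, *args):
        pass  # keep the console quiet


def describe(url):
    """Return a short description of what happened when opening url."""
    try:
        with request.urlopen(url, timeout=5) as resp:
            return "ok %d" % resp.status
    except error.HTTPError as e:   # subclass of URLError: must come first
        return "HTTPError %d" % e.code
    except error.URLError as e:    # connection-level failures
        return "URLError %s" % type(e.reason).__name__


server = HTTPServer(("127.0.0.1", 0), QuietHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_port

# The server responds, but with an error status: HTTPError.
result_http = describe("http://127.0.0.1:%d/" % port)

server.shutdown()
server.server_close()

# Nothing is listening on the port any more: URLError.
result_url = describe("http://127.0.0.1:%d/" % port)
print(result_http, "|", result_url)
```

HTTPError carries code, reason, and headers attributes, while a plain URLError only carries reason.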

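3. The urllib.parse module

urlparse(): splits a URL into six components (scheme, netloc, path, params, query, fragment).
urlunparse(): builds a URL from six components.
urlsplit(): like urlparse(), but returns five components; params stays inside path.
urlunsplit(): builds a URL from five components.
urljoin(): resolves a link against a base URL.
urlencode(): serializes a dictionary into a query string.
parse_qs(): parses a query string into a dictionary.
parse_qsl(): parses a query string into a list of tuples.
quote(): percent-encodes a string for use in a URL.
unquote(): decodes a percent-encoded string.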
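
These functions are pure string operations, so they are easy to try out. The following sketch runs most of them on a made-up example.com URL (an assumption for illustration, not a URL from the original article).

```python
from urllib.parse import (
    parse_qs, parse_qsl, quote, unquote,
    urlencode, urljoin, urlparse, urlsplit, urlunparse,
)

url = "http://www.example.com/index.html;user?id=5#comment"

parts = urlparse(url)          # six components; params is split out of the path
rebuilt = urlunparse(parts)    # the inverse of urlparse()
split = urlsplit(url)          # five components; ";user" stays in the path

# Resolve a relative link against a base URL.
joined = urljoin("http://www.example.com/a/b.html", "c.html")

query = urlencode({"name": "germey", "age": 22})  # dict -> query string
as_dict = parse_qs(query)      # query string -> dict of lists
as_pairs = parse_qsl(query)    # query string -> list of (key, value) tuples

encoded = quote("web scraping")  # percent-encode unsafe characters
decoded = unquote(encoded)

print(parts)
print(split.path, joined, query, as_dict, as_pairs, encoded, decoded, sep="\n")
```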

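4. The urllib.robotparser module

The Robots protocol, also called the crawler protocol or robot protocol, is formally the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt, normally placed in the site's root directory, for example www.taobao.com/robots.txt.

The robotparser module provides the RobotFileParser class, which uses a site's robots.txt file to decide whether a crawler is allowed to fetch a given page.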

urllib.robotparser.RobotFileParser(url='')

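# set_url(): sets the URL of the robots.txt file.
# read(): fetches the robots.txt file and analyzes it.
# parse(): parses the lines of a robots.txt file.
# can_fetch(): takes two arguments, the User-agent and the URL to fetch, and returns whether fetching is allowed.
# mtime(): returns the time robots.txt was last fetched and analyzed.
# modified(): sets the time robots.txt was last fetched and analyzed to the current time.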

from urllib.robotparser import RobotFileParser
 
rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))
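
The parse() method listed above also accepts rules that are already in memory, which is convenient for testing. The sketch below uses a made-up robots.txt and example.com URLs (assumptions for illustration) rather than the real rules of any site.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, held in memory instead of fetched with read().
robots_txt = """\
User-agent: *
Disallow: /search
Allow: /p/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() expects an iterable of lines

allowed = rp.can_fetch("*", "http://www.example.com/p/b67554025d7d")
blocked = rp.can_fetch("*", "http://www.example.com/search?q=python")
print(allowed, blocked)
```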

II. Basic library: requests

The get(), post(), put(), and delete() functions send GET, POST, PUT, and DELETE requests respectively.

1. Basic usage

- GET requests

import requests
 
data = {
    'name': 'germey',
    'age': 22
}
r = requests.get("http://httpbin.org/get", params=data)
print(r.text)

- POST requests

import requests
 
data = {'name': 'germey', 'age': '22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)

- Responses

import requests
 
r = requests.get('http://www.jianshu.com')
print(type(r.status_code), r.status_code)    # status_code: the HTTP status code
print(type(r.headers), r.headers)    # headers: the response headers
print(type(r.cookies), r.cookies)    # cookies: the cookies set by the response
print(type(r.url), r.url)    # url: the final URL
print(type(r.history), r.history)    # history: the redirect history of the request

2. Advanced usage

- File upload

import requests
 
# Open the file in binary mode; the with block closes it after the upload
with open('favicon.ico', 'rb') as f:
    r = requests.post("http://httpbin.org/post", files={'file': f})
print(r.text)

- Cookies

# Get cookies
import requests
 
r = requests.get("https://www.baidu.com")
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)

- Session persistence

import requests
 
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)

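- SSL certificate verification

requests also verifies certificates. When it sends an HTTPS request, it checks the server's SSL certificate; the verify parameter controls whether this check is performed. If verify is omitted, it defaults to True and the certificate is verified automatically.

# Skip verification with verify=False and suppress the resulting warning
import requests
import urllib3
 
urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Alternatively, silence the warning by capturing warnings into the logging system
import logging
import requests
logging.captureWarnings(True)
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Specify a local certificate to use as the client certificate: either a single file (containing both key and certificate) or a tuple of two file paths
import requests
 
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)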

- Proxies

import requests
 
proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}
 
requests.get("https://www.taobao.com", proxies=proxies)

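- Timeouts

import requests

# Raise an exception if no response arrives within 1 second
r = requests.get("https://www.taobao.com", timeout=1)
print(r.status_code)

# A request has two phases, connect and read; pass a (connect, read) tuple to set them separately
r = requests.get('https://www.taobao.com', timeout=(5, 30))

# Wait forever: pass timeout=None or omit the argument (None is the default)
r = requests.get('https://www.taobao.com', timeout=None)
r = requests.get('https://www.taobao.com')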

- Authentication

# Use the authentication support built into requests
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://localhost:5000', auth=HTTPBasicAuth('username', 'password'))
print(r.status_code)

# Pass a tuple; requests then uses the HTTPBasicAuth class by default
import requests
 
r = requests.get('http://localhost:5000', auth=('username', 'password'))
print(r.status_code)

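III. Regular expressions

1. Common matching rules

Pattern	Description
\w	Matches a letter, digit, or underscore
\W	Matches any character that is not a letter, digit, or underscore
\s	Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]
\S	Matches any non-whitespace character
\d	Matches any digit; for ASCII text, equivalent to [0-9]
\D	Matches any non-digit character
\A	Matches the start of the string
\Z	Matches the end of the string; in Python's re module it matches only at the very end, even when the string ends with a newline
\z	Matches the end of the string in Perl-style engines; in Python's re module use \Z instead
\G	Matches the position where the last match finished (Perl-style engines; not supported by Python's re module)
\n	Matches a newline
\t	Matches a tab
^	Matches the start of a line (the start of the string by default; the start of every line under re.M)
$	Matches the end of a line (the end of the string by default; the end of every line under re.M)
.	Matches any character except a newline; with re.DOTALL it matches any character including a newline
[...]	Matches any one character from the set; for example, [amk] matches a, m, or k
[^...]	Matches any character not in the set; for example, [^abc] matches any character other than a, b, or c
*	Matches zero or more repetitions of the preceding expression
+	Matches one or more repetitions of the preceding expression
?	Matches zero or one occurrence of the preceding expression (greedy); placed after another quantifier, as in *? or +?, it makes that quantifier non-greedy
{n}	Matches exactly n repetitions of the preceding expression
{n,m}	Matches n to m repetitions of the preceding expression, greedily
a|b	Matches a or b
( )	Matches the expression inside the parentheses and also defines a group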
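
The difference between greedy and non-greedy quantifiers in the table is easiest to see side by side. The sketch below uses a made-up sample string (an assumption for illustration).

```python
import re

content = "Hello 1234567 World_This is a Regex Demo"

# Greedy: .* swallows as much as it can, leaving only one digit for (\d+).
greedy = re.match(r"^He.*(\d+).*Demo$", content)

# Non-greedy: .*? stops as early as it can, so (\d+) captures the whole number.
lazy = re.match(r"^He.*?(\d+).*Demo$", content)

print(greedy.group(1))
print(lazy.group(1))
```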

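2. Flags

Flag	Description
re.I	Makes matching case-insensitive
re.L	Performs locale-aware matching
re.M	Multi-line matching; affects ^ and $
re.S	Makes . match every character, including a newline
re.U	Interprets characters according to the Unicode character set; affects \w, \W, \b, and \B
re.X	Allows more flexible formatting, so that a regular expression can be written in a more readable way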
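
A short sketch of three of these flags in action, using a made-up three-line string (an assumption for illustration):

```python
import re

text = "Hello\nWORLD line\nhello again"

# re.I: case-insensitive matching.
case_insensitive = re.findall(r"hello", text, re.I)

# Without re.S, "." stops at a newline, so this cannot span the lines.
no_dotall = re.match(r"Hello.*again", text)
# With re.S, "." also matches newlines.
with_dotall = re.match(r"Hello.*again", text, re.S)

# re.M: ^ matches at the start of every line, not only the start of the string.
first_words_default = re.findall(r"^\w+", text)
first_words_multiline = re.findall(r"^\w+", text, re.M)

print(case_insensitive, no_dotall, bool(with_dotall))
print(first_words_default, first_words_multiline)
```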

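3. Common regex functions

match() tries to match the regular expression starting at the beginning of the string. Its first argument is the regular expression and its second is the string to match. On the resulting match object, group() returns the matched content and span() returns the range of the match.
search() scans the whole string and returns the first successful match.
findall() searches the whole string and returns everything that matches the regular expression.
sub() replaces every match with a given string; for example, it can remove all digits from a piece of text.
compile() compiles a regular expression string into a pattern object so that it can be reused in later matches.
split() splits a string at every match of the given regular expression and returns the pieces as a list.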
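
The functions above can be compared on a single input. This is a minimal sketch with a made-up sample sentence (an assumption for illustration):

```python
import re

content = "Order 54321 shipped to room 88 on floor 3"

# match(): anchored at the start of the string.
m = re.match(r"Order\s(\d+)", content)
matched_text, first_group, match_span = m.group(), m.group(1), m.span()

# search(): first match anywhere in the string.
room = re.search(r"room\s(\d+)", content).group(1)

# findall(): every match, returned as a list of strings.
numbers = re.findall(r"\d+", content)

# sub(): replace every match, here deleting all digits.
no_digits = re.sub(r"\d+", "", content)

# compile(): build a reusable pattern object; split(): cut at every match.
whitespace = re.compile(r"\s+")
pieces = whitespace.split("a  b \t c")

print(matched_text, first_group, match_span, room, numbers, pieces)
print(repr(no_digits))
```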

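IV. Parsing library: XPath

XPath stands for XML Path Language, a language for locating information in XML documents.

1. Basic XPath usage

To parse a web page with XPath, first import the etree module from the lxml library, then declare a piece of HTML text and initialize it with the HTML class. This constructs an object that can be queried with XPath. The etree module automatically repairs malformed HTML text.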

from lxml import etree
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

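# Extract information with an XPath rule
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

# Matching an attribute with several values: use the contains() function
# (assumes an li such as <li class="li li-first">; the sample text above has no such class, so the result is [])
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

# Matching on several attributes: combine the conditions with the and operator
# (assumes an li that also has name="item")
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

# Selecting nodes by position: put an index (starting at 1) in square brackets
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)

# Node axis selection: to be continued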

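2. Common XPath rules

Expression	Description
nodename	Selects all child nodes of the named node
/	Selects direct children of the current node
//	Selects descendants of the current node
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes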

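V. Parsing library: Beautiful Soup

Beautiful Soup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages. It automatically converts input documents to Unicode and output documents to UTF-8.

1. Basic Beautiful Soup usage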

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

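2. Parsers supported by Beautiful Soup

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses documents the way a browser does; produces HTML5 documents	Slow; requires an external Python package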

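3. Node selectors

Calling a node by its name selects that element; its string attribute then gives the text inside the node.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

# Select an element; when several nodes match, only the first is selected and the later ones are ignored
print(soup.title)

# Get the text content
print(soup.title.string)

# Get the node's name
print(soup.title.name)

# Get attributes
print(soup.p.attrs)
print(soup.p.attrs['name'])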

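4. Method selectors

find_all() returns every element that matches the given conditions.
find_all(name, attrs, recursive, text, **kwargs)

# The queries below assume a document containing ul elements with id and name attributes, unlike the sample html above
import re

# Query elements by node name
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

# Query elements by attribute
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

# Match text by passing a string or a compiled regular expression
print(soup.find_all(text=re.compile('link')))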

5. CSS selectors

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Get an attribute, either by indexing the node or through its attrs dictionary
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

# Get text with the string attribute or the get_text() method
for li in soup.select('li'):
    print('Get Text:', li.get_text())
    print('String:', li.string)

VI. Parsing library: pyquery

1. Initialization

# Initialize from a string
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))

# Initialize from a URL
from pyquery import PyQuery as pq
doc = pq(url='https://cuiqingcai.com')
print(doc('title'))

# Initialize from a file
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))

Reposted from: https://www.cnblogs.com/Iceredtea/p/11050660.html
