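Python web scraping: fetching, parsing, and storing web page content, using Boohee (薄荷网) as an example

Article link: https://blog.csdn.net/zzq900503/article/details/89188025

If you repost this article, please credit the source: Python web scraping: fetching, parsing, and storing web page content, using Boohee as an example

Earlier articles covered how to capture and intercept traffic and how to analyze the requests a page makes.

For example:
App data capture tutorial: intercepting traffic with Fiddler, using the Boohee app as an example

This chapter covers how to fetch the content behind the URLs found in that analysis, parse it, and store it.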

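Analyzing the request URLs of a web page or mobile app

By watching the requests in Fiddler, we can find the URLs we need to scrape.

For details, see:

Fiddler: introduction, installation, and basic usage

App data capture tutorial: intercepting traffic with Fiddler, using the Boohee app as an example

A captured request looks like this: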

GET https://dali.anjuke.com/sale/rd1/?from=zjsr&kw=%E9%87%91%E5%87%A4%E9%82%91 HTTP/1.1
Host: dali.anjuke.com
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: https://dali.anjuke.com/sale/
Accept-Encoding: gzip, deflate, sdch, br
Accept-Language: zh-CN,zh;q=0.8,en;q=0.6
Cookie: sessid=96EF681F-1388-E3CD-B31F-378CE5016A19; lps=http%3A%2F%2Fwww.anjuke.com%2F%7Chttps%3A%2F%2Fwww.google.com%2F; als=0; wmda_uuid=85761127cbefa6fdb36bb244a28cd1c6; wmda_new_uuid=1; wmda_visited_projects=%3B6289197098934; lp_lt_ut=5f22eca517494baf49d4dda18cbf137b; isp=true; Hm_lvt_c5899c8768ebee272710c9c5f365a6d8=1554866616; Hm_lpvt_c5899c8768ebee272710c9c5f365a6d8=1554866689; ctid=102; propertys=bd0o79-ppq6uq_ao4ql9-ppq6u3_; search_words=%E9%87%91%E5%87%A4%E9%82%91; wmda_session_id_6289197098934=1554866585809-f7d858b2-a679-a926; __xsptplusUT_8=1; _ga=GA1.2.552888709.1554866175; _gid=GA1.2.2042714575.1554866175; _gat=1; __xsptplus8=8.1.1554866175.1554867222.12%233%7Cwww.google.com%7C%7C%7C%7C%23%23xKiLY2OWxJKdV1vJqk3U0hdWFTow95Ul%23; 58tj_uuid=e3af93b7-3825-4f27-ba60-39e23a2ee0ed; new_session=0; init_refer=https%253A%252F%252Fwww.google.com%252F; new_uv=1; aQQ_ajkguid=1139DD34-732E-2C0B-A461-07CFE6B14ADD; twe=2

From the capture we can see which request method the page needs. Page fetches are usually GET, while login submissions are usually POST.
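As a rough illustration of the GET/POST distinction, the sketch below builds two requests with the standard library's urllib.request.Request without sending them. The URL and form fields are placeholders, not values from the capture above.

```python
import urllib.parse
import urllib.request

# Placeholder URL and headers; swap in the values seen in your own capture.
url = "http://example.com/sale/"
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "http://example.com/",
}

# No body: urllib treats the request as a GET.
get_req = urllib.request.Request(url, headers=headers)

# With a body (bytes): urllib treats the request as a POST.
form = urllib.parse.urlencode({"username": "demo", "password": "demo"}).encode("utf-8")
post_req = urllib.request.Request(url, data=form, headers=headers)

print(get_req.get_method())
print(post_req.get_method())
```

Nothing is sent until the request object is passed to urllib.request.urlopen, so this is a safe way to check what a request would look like.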

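Fetching page content with GET

Overview of the options

There are four main ways to request a web page from Python: the standard-library modules urllib, urllib2, and httplib, and third-party modules such as Requests.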

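1. urllib
urllib is simple to use, but its feature set is comparatively limited.

2. httplib
httplib is simple yet powerful, and its usage resembles Java's HttpClient. It is a relatively low-level HTTP module, and higher-level wrappers are built on top of it, such as the built-in urllib and third-party modules like goto. The higher the level of wrapping, though, the less flexible it is. For example, when a request fails in the urllib module, you get only the header information and not the body of the result page, which does not suit cases where you need to inspect what a failed request returned. In those cases you have to use httplib.
httplib implements the client side of HTTP and HTTPS. It is usually not used directly; Python's higher-level modules (urllib, urllib2) use its HTTP and HTTPS implementations.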

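3. urllib2
urllib2 is a standard-library module for accessing web pages and local files.

4. Requests
Requests covers what today's web needs. Its features include:
International domains and URLs
Keep-Alive and connection pooling
Sessions with persistent cookies
Browser-style SSL verification
Basic/Digest authentication
Elegant key/value cookies
Automatic decompression
Unicode response bodies
Multipart file uploads
Connection timeouts
.netrc support
Works with Python 2.6 to 3.4
Thread safety
The official Chinese documentation is very detailed: the Requests website

The Quickstart page is especially good: Quickstart guide

As the introduction puts it: Requests is an HTTP library released under the Apache2 license, written in Python, for human beings.

Pick whichever fits your needs; Requests is strongly recommended.

Because the packages available in Python 2 and Python 3 differ, we try them separately.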

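urllib and urllib2 in Python 2

Using urllib2

urllib2 accepts an instance of the Request class, which lets you set the headers of the request. That way you can send login information such as cookies and disguise information such as the User-Agent.
For example: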

req = urllib2.Request(
        url=url,
        data=postdata,
        headers=headers
)
result = urllib2.urlopen(req)

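Using urllib

urllib accepts only a URL.
This means you cannot disguise your User-Agent string and so on.
However, urllib provides the urlencode method for building GET query strings, and urllib2 does not. This is why urllib and urllib2 are often used together, as below:

postdata = urllib.urlencode(postdata)

This encodes postdata, which is a dictionary.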
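For readers on Python 3, the same step might look like the following sketch, where urlencode lives in urllib.parse. The dictionary contents are made up for illustration.

```python
from urllib.parse import urlencode

# Made-up form fields; non-ASCII values are percent-encoded as UTF-8.
postdata = {"wd": "测试", "page": 1}

query = urlencode(postdata)   # str, suitable for a GET query string
body = query.encode("utf-8")  # bytes, suitable for a POST body in Python 3
print(query)
```

The string form is appended to a URL after "?", while the bytes form is what urllib.request.Request expects as its data argument.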

urllib and urllib2 in Python 3

Using urllib

1. Simplest form
import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()

2. Using Request
import urllib.request
req = urllib.request.Request('http://python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()


3. Sending data
import urllib.parse
import urllib.request
url = ''  # target URL (left blank in the original)
values = {
    'act': 'login',
    'login[email]': '',
    'login[password]': ''
}
# In Python 3 the request body must be bytes, so encode the query string
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data)
req.add_header('Referer', 'http://www.python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))

4. Sending data and headers
import urllib.parse
import urllib.request
url = ''  # target URL (left blank in the original)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act': 'login',
    'login[email]': '',
    'login[password]': ''
}
headers = {'User-Agent': user_agent}
# In Python 3 the request body must be bytes, so encode the query string
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))



5. HTTP errors
import urllib.error
import urllib.request
req = urllib.request.Request(' ')  # target URL (left blank in the original)
try:
    urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    print(e.code)
    print(e.read().decode("utf8"))

6. Exception handling, variant 1
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request("http://www..net /")
try:
    response = urlopen(req)
except HTTPError as e:
    # HTTPError must come first because it is a subclass of URLError
    print("The server couldn't fulfill the request.")
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))


7. Exception handling, variant 2
from urllib.request import Request, urlopen
from urllib.error import URLError
req = Request("http://www.Python.org/")
try:
    response = urlopen(req)
except URLError as e:
    # HTTPError is a subclass of URLError and carries both attributes,
    # so test for 'code' first.
    if hasattr(e, 'code'):
        print("The server couldn't fulfill the request.")
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
else:
    print("good!")
    print(response.read().decode("utf8"))


8. HTTP authentication
import urllib.request
# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = ""
password_mgr.add_password(None, top_level_url, 'rekfan', 'xxxxxx')
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
# use the opener to fetch a URL
a_url = ""
x = opener.open(a_url)
print(x.read())
# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)



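9. Using a proxy
import urllib.request
# ProxyHandler maps a URL scheme to a proxy address. urllib has no built-in
# SOCKS support, so an HTTP proxy on localhost:1080 is assumed here.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
a = urllib.request.urlopen("").read().decode("utf8")
print(a)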


10. Timeouts
import socket
import urllib.request
# timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)
# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('')
a = urllib.request.urlopen(req).read()
print(a)

Using urllib2 (Python 2)

1. The simplest urlopen

#coding:utf-8
import urllib, urllib2

# Base part of the URL (note: http, not https)
url_pre = 'http://www.baidu.com/s'

# GET parameters
params = {}
params['wd'] = u'测试'.encode('utf-8')
url_params = urllib.urlencode(params)

# Full URL for the GET request
url = '%s?%s' % (url_pre, url_params)

# Open the URL and get the response
response = urllib2.urlopen(url)

# Read the HTML of the response
html = response.read()

# Save the HTML to a file
with open('test.txt', 'w') as f:
    f.write(html)

2. Using Request

#coding:utf-8
import urllib, urllib2

# Base part of the URL (note: http, not https)
url_pre = 'http://www.baidu.com/s'

# GET parameters
params = {}
params['wd'] = u'测试'.encode('utf-8')
url_params = urllib.urlencode(params)

# Full URL for the GET request
url = '%s?%s' % (url_pre, url_params)

# Build the request and get the response
request = urllib2.Request(url)
response = urllib2.urlopen(request)

# Read the HTML of the response
html = response.read()

with open('test.txt', 'w') as f:
    f.write(html)

3. POST requests

#coding:utf-8
import urllib, urllib2

# Build the form data; it takes the same form as GET parameters
values = {}
values['username'] = "aaaaaa"
values['password'] = "bbbbbb"
data = urllib.urlencode(values)

# Build the request
url = "http://xxxxxxxxxxx"
request = urllib2.Request(url, data)

# Read the response
response = urllib2.urlopen(request)
html = response.read()
print(html)

4. Handling cookies

#coding:utf-8
import urllib2
import cookielib

# Create the cookie jar
cookie = cookielib.CookieJar()
handler = urllib2.HTTPCookieProcessor(cookie)

# Build a custom opener from the handler
opener = urllib2.build_opener(handler)

# opener.open works the same way as urllib2.urlopen
request = urllib2.Request('http://www.baidu.com')
response = opener.open(request)
for item in cookie:
    print('%s = %s' % (item.name, item.value))

5. Setting headers to get past anti-scraping checks

#coding:utf-8
import urllib, urllib2

# Set the headers
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}

# Build the Request; the second argument is data
url = 'http://www.server.com/login'
request = urllib2.Request(url, None, headers)

# Read the response
response = urllib2.urlopen(request)
html = response.read()

6. Reading a local file

import urllib2

f=urllib2.urlopen('file:./a.txt')
buf=f.read()

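7. URLs containing Chinese characters

h4 = u'http://www.baidu.com?w=测试'
h4 = h4.encode('utf-8')
response = urllib2.urlopen(h4)
html = response.read()

It is best to convert the URL with the correct encoding. In the example above, urlopen fails if the URL is not converted first.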
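In Python 3, one way to handle such URLs is to percent-encode the non-ASCII part with urllib.parse.quote. The helper name encode_url and the choice of characters to leave untouched are assumptions for this sketch.

```python
from urllib.parse import quote


def encode_url(url):
    """Percent-encode non-ASCII characters while keeping URL delimiters intact."""
    # '%' is kept safe so an already-encoded URL is not encoded twice.
    return quote(url, safe=":/?&=#%")


encoded = encode_url("http://www.baidu.com?w=测试")
print(encoded)
```

The result contains only ASCII characters and can be passed to urllib.request.urlopen.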

8. Protocol-specific handlers
FTP

handler = urllib2.FTPHandler()
request = urllib2.Request(url='ftp://ftp.ubuntu.com/')
opener = urllib2.build_opener(handler)
f = opener.open(request)
print f.read()

If a username and password are required:

urllib2.Request(url='ftp://username:password@ftp-host/')

HTTP

handler = urllib2.HTTPHandler()
request = urllib2.Request(url='http://ftp.ubuntu.com/')
opener = urllib2.build_opener(handler)
f = opener.open(request)
print f.read()

9. Using a proxy

proxy_support = urllib2.ProxyHandler({"http":"http://proxy.****.com/"})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
res = urllib2.urlopen('http://www.taobao.com/')
print res.read()  # reads the entire HTML page

Possible problem: No module named 'urllib2'

Note:

urllib2 does not exist in Python 3; use urllib.request instead. Running code that calls urllib2.urlopen('http://www.baidu.com') under Python 3 fails like this:

  File "b.py", line 1, in <module>
ImportError: No module named 'urllib2'

Calling urllib.urlopen('http://www.baidu.com') directly does not work either.

Changing urllib2 to urllib.request makes the code run normally:

import urllib.request
print(urllib.request.__file__)
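If the same script has to run on both Python 2 and Python 3, a common pattern is to try the Python 3 import first and fall back to the Python 2 names. This is only a sketch of that pattern; the query parameter is a placeholder.

```python
try:
    # Python 3 locations
    from urllib.request import Request, urlopen
    from urllib.parse import urlencode
except ImportError:
    # Python 2 locations
    from urllib2 import Request, urlopen
    from urllib import urlencode

# Build (but do not send) a request; the keyword is a placeholder.
req = Request("http://www.baidu.com/s?" + urlencode({"wd": "test"}))
print(req.get_full_url())
```

The rest of the script can then use Request, urlopen, and urlencode without caring which interpreter is running.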

The third-party module Requests

Import

import requests

1. A simple GET

r = requests.get('https://api.github.com/events')
print(r.text)

2. Passing parameters with POST

r = requests.post('http://httpbin.org/post', data = {'key':'value'})
print(r.text)

3. Other methods

r = requests.put('http://httpbin.org/put', data = {'key':'value'})
r = requests.delete('http://httpbin.org/delete')
r = requests.head('http://httpbin.org/get')
r = requests.options('http://httpbin.org/get')
print(r.text)

4. Passing parameters with GET

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get("http://httpbin.org/get", params=payload)
print(r.text)
or
payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
r = requests.get('http://httpbin.org/get', params=payload)
print(r.text)

Printing the URL shows that it has been encoded correctly:

print(r.url)

Output:

http://httpbin.org/get?key2=value2&key1=value1
or
http://httpbin.org/get?key1=value1&key2=value2&key2=value3

5. Encoding
Requests automatically decodes content from the server. Most Unicode charsets are decoded seamlessly.
To see which encoding is in use:

print(r.encoding)
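To see why the chosen encoding matters, the sketch below decodes the same bytes with a wrong and a right encoding using only built-in string methods. The sample text is arbitrary; with Requests, the analogous step would be setting r.encoding before reading r.text.

```python
raw = "薄荷网".encode("utf-8")      # bytes as a server might send them

wrong = raw.decode("iso-8859-1")   # what a mistaken ISO-8859-1 guess yields
right = raw.decode("utf-8")        # decoding with the real encoding

# ISO-8859-1 maps every byte to one character, so the damage is reversible.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
print(wrong != right, repaired == right)
```

The wrong decoding does not raise an error; it silently produces unreadable text, which is why garbled output is usually the first sign of an encoding mismatch.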

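6. Getting binary content, such as images and files

Use r.content, as follows:

from PIL import Image
from io import BytesIO

i = Image.open(BytesIO(r.content))
print(i.mode, i.size, i.format)
i.show()
i.save("/image/1.png")
i.save(outfile, "JPEG")  # outfile: an output path defined elsewhere

PIL (Python Imaging Library) is the de facto standard for image processing in Python, combining powerful features with a concise API.
Calling i.show() displays the current image object in an image viewer.
The standard implementation of show is not very efficient, because it first saves the image to a temporary file and then calls the xv utility to display it. If xv is not installed, it will not work. When it does work, it is very handy for debugging and testing.

save(filename) writes the in-memory image object to disk.

Creating a thumbnail

try:
    im = Image.open(infilepath)
    x, y = im.size
    im.thumbnail((x // 2, y // 2))
    im.save(outfilepath, "JPEG")
except IOError:
    print("cannot create thumbnail for", infilepath)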
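7. Getting a JSON response

import requests

r = requests.get('https://api.github.com/events')
print(r.json())

If JSON decoding fails, r.json() raises an exception. For example, if the response is a 401 (Unauthorized) whose body is not JSON, calling r.json() raises ValueError: No JSON object could be decoded.

Note that a successful call to r.json() does not mean the request succeeded. Some servers include a JSON object in a failed response (for example, the error details of an HTTP 500), and that JSON is decoded and returned. To check whether the request succeeded, use r.raise_for_status() or check that r.status_code is what you expect.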
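The two checks described above can be combined in a small helper. The sketch below is one possible shape; json_or_none and FakeResponse are hypothetical names, and FakeResponse only imitates the status_code attribute and json() method of a Requests response so the example runs offline.

```python
import json


def json_or_none(resp, expected_status=200):
    """Return the decoded JSON body, or None if the status or body is unusable."""
    if resp.status_code != expected_status:
        return None
    try:
        return resp.json()
    except ValueError:  # JSON decoding errors are ValueError subclasses
        return None


class FakeResponse:
    """Stand-in for a Requests response, used only to exercise the helper."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return json.loads(self._payload)


good = json_or_none(FakeResponse(200, '{"ok": true}'))
failed = json_or_none(FakeResponse(500, '{"error": "x"}'))
broken = json_or_none(FakeResponse(200, "not json"))
print(good, failed, broken)
```

Checking the status before decoding means an error body that happens to be valid JSON is not mistaken for real data.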

8. Getting a file stream, for example to download video
Raw response content
In the rare case that you want the raw socket response from the server, you can access r.raw. If you do, make sure you set stream=True in the initial request:

r = requests.get('https://api.github.com/events', stream=True)
r.raw
r.raw.read(10)

Output:

<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>

In general, though, you should save the streamed content to a file with a pattern like this:

with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)

9. Adding request headers

url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
r = requests.get(url, headers=headers)

10. POSTing a multipart-encoded file

url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}

r = requests.post(url, files=files)
r.text

Setting the filename, content type, and headers explicitly:

url = 'http://httpbin.org/post'
files = {'file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel', {'Expires': '0'})}
r = requests.post(url, files=files)
r.text

11. Error handling

r = requests.get('http://httpbin.org/get')
r.status_code
if r.status_code == requests.codes.ok:
    print("ok")
else:
    r.raise_for_status()

Response.raise_for_status() raises an exception for error responses. If r's status_code is 200, calling raise_for_status() gives:

>>> r.raise_for_status()
None

12. Redirects

>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]
or
>>> r = requests.head('http://github.com', allow_redirects=True)
>>> r.url
'https://github.com/'
>>> r.history
[<Response [301]>]

13. Timeouts

>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)

14. Cookies and simulated login
See
http://cn.python-requests.org/en/latest/user/advanced.html#advanced

15. Proxies

import requests

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)

Proxy with HTTP Basic Auth

proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}

16. SOCKS proxies

proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}

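17. Retries
By default, simple calls such as requests.get do not retry failed connections. To retry, mount an HTTPAdapter with max_retries set on a Session, as below.
Retries apply only to failures such as DNS resolution errors, connection errors, and connection timeouts. Read timeouts, write timeouts, HTTP protocol errors, and the like are not retried.
With retries enabled, the error you get back is a MaxRetryError rather than the specific underlying exception.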

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=3))
s.mount('https://', HTTPAdapter(max_retries=3))

s.get('http://example.com', timeout=1)
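For failures that adapter-level retries do not cover, such as read timeouts, one option is a small hand-rolled wrapper. The sketch below is not part of Requests; with_retries and the flaky stand-in function are hypothetical names used to show the idea offline.

```python
import time


def with_retries(func, attempts=3, delay=0.5, exceptions=(Exception,)):
    """Call func() up to `attempts` times, sleeping `delay` seconds between tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


# Demo: a stand-in operation that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated read timeout")
    return "ok"


result = with_retries(flaky, attempts=3, delay=0)
print(result, calls["n"])
```

In a real scraper, func would typically be a lambda wrapping a call like s.get(url, timeout=1), and exceptions would be narrowed to the errors worth retrying.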

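Requests examples

Example 1: scraping a JD.com product page

Here we use the requests library to scrape product information from JD.com.

First import requests:

import requests

Then fetch the page:

r = requests.get("https://item.jd.com/4645290.html")

Then check the status code, the encoding, and the content:

r.status_code
r.encoding
r.text[:1000]

This prints the first 1000 characters of the page.

At this point we have successfully fetched the product page with requests.

Complete scraping code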

import requests
url = "https://item.jd.com/4645290.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except requests.RequestException:
    print("Scrape failed")

Example 2: scraping an Amazon product page

We start with the same steps as before:
import requests, send a GET, and check status_code.

r = requests.get("https://www.amazon.cn/dp/B0011F7WU4/ref=s9_acss_bw_cg_JAVA_1a1_w?m=A1AJ19PSB66TGU&pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-6&pf_rd_r=D9MK8AMFACZGHMFJGRXP&pf_rd_t=101&pf_rd_p=f411a6d2-b2c5-4105-abd9-69dbe8c05f1c&pf_rd_i=1899860071")
r.status_code

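This shows 503, a server-side error:
503 (Service Unavailable) means the server is currently unable to handle the request (because of overload or maintenance). Usually this is a temporary state.

Checking the encoding, we find:

r.encoding
'ISO-8859-1'

We need to switch the encoding: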

r.encoding = r.apparent_encoding
r.text

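Displaying the content now reveals the cause of the error.

The page tells us an error occurred, but since we did receive the page content, the network itself is fine. This suggests Amazon restricts crawlers. Sites typically state such restrictions in the robots protocol; they can also restrict by client, allowing access only from recognized browsers and rejecting crawlers.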
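As a side note on the robots protocol, the standard library can evaluate robots.txt rules before a crawler requests a page. The sketch below parses a made-up robots.txt from memory, so the rules and URLs are illustrative only and say nothing about any real site.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from memory so the sketch needs no network access.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("*", "http://example.com/public/page.html")
blocked = parser.can_fetch("*", "http://example.com/private/page.html")
print(allowed, blocked)
```

For a live site, set_url() followed by read() would load the real file instead of the in-memory text.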

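We can inspect r.request.headers to see exactly what request headers we sent to Amazon.

The user-agent field identifies the client as python. In other words, our program honestly told Amazon that the request came from Python's requests library.
Amazon's server saw a crawler request and returned an error page.

So how do we get access?

requests lets us change the request headers, so we can imitate a browser request.

We build a key-value pair: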
kv = {'user-agent':'Mozilla/5.0'}
url = "https://www.amazon.cn/dp/B0011F7WU4/ref=s9_acss_bw_cg_JAVA_1a1_w?m=A1AJ19PSB66TGU&pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-6&pf_rd_r=D9MK8AMFACZGHMFJGRXP&pf_rd_t=101&pf_rd_p=f411a6d2-b2c5-4105-abd9-69dbe8c05f1c&pf_rd_i=1899860071"
r = requests.get(url, headers = kv)

Checking the status code now gives 200, so this time we fetched the page content successfully.

Complete scraping code

import requests
url = "https://www.amazon.cn/dp/B0011F7WU4/ref=s9_acss_bw_cg_JAVA_1a1_w?m=A1AJ19PSB66TGU&pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-6&pf_rd_r=D9MK8AMFACZGHMFJGRXP&pf_rd_t=101&pf_rd_p=f411a6d2-b2c5-4105-abd9-69dbe8c05f1c&pf_rd_i=1899860071"
try:
    kv = {'user-agent':'Mozilla/5.0'}
    r = requests.get(url, headers = kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except requests.RequestException:
    print("Scrape failed")

Example 3: submitting search keywords to Baidu/360

The keyword search endpoint:

https://www.baidu.com/s?ie=UTF-8&wd=keyword

Use the params argument of requests to build the query parameters.
Complete code

import requests

keyword = "张学友"
url = "http://www.baidu.com/s?ie=UTF-8"

try:
    kv = {"wd":keyword}
    r = requests.get(url, params = kv)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
    print(r.text)
except requests.RequestException:
    print("Scrape failed")

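Example 4: downloading and saving images

The format of an image URL on the web:

http://www.example.com/picture.jpg

Suppose we want to scrape this image site:

http://www.nationalgeographic.com.cn/

Image URL:

http://image.nationalgeographic.com.cn/2015/0121/20150121033625957.jpg

Complete scraping code: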

import requests
import os

url = "http://image.nationalgeographic.com.cn/2015/0121/20150121033625957.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]

try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:  # the with block closes the file automatically
            f.write(r.content)
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Scrape failed")
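Taking the last segment of the URL as the file name works for the URL above, but it can break when a URL carries a query string. The sketch below shows a slightly more defensive variant; local_path_for is a hypothetical helper, the "?size=large" suffix is invented for the demonstration, and a temporary directory stands in for the real download folder.

```python
import os
import tempfile
from urllib.parse import urlparse


def local_path_for(url, root):
    """Build a local file path from the last path segment of an image URL."""
    filename = os.path.basename(urlparse(url).path)  # drops any ?query part
    return os.path.join(root, filename)


# The query string is made up to show that it does not end up in the file name.
url = "http://image.nationalgeographic.com.cn/2015/0121/20150121033625957.jpg?size=large"

root = os.path.join(tempfile.mkdtemp(), "pics", "2015")
os.makedirs(root, exist_ok=True)  # creates intermediate folders; no error if present
path = local_path_for(url, root)
print(os.path.basename(path))
```

os.makedirs with exist_ok=True also replaces the separate "check, then mkdir" step and handles nested folders.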

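Example 5: looking up the location of an IP address

This site looks up where an IP address is registered:

http://m.ip138.com/ip.asp

Analyzing its requests shows that the interface simply appends a parameter to the URL, much like a Baidu search:

http://m.ip138.com/ip.asp?ip=125.220.159.160

So we can build the query parameter, send it to the server, and read the result.

Complete code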

import requests
url = "http://m.ip138.com/ip.asp?"
ip = "125.220.159.160"
kv = {"ip":ip}

try:
    r = requests.get(url, params = kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text)
except requests.RequestException:
    print("Scrape failed")

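Parsing page content with BeautifulSoup

Everything so far has been about getting the page's HTML. Once we have it, the next step is to parse it and extract the data we need.

The HTML we fetched is essentially a string. The most basic way to process it is with string functions, but that is inefficient and error-prone.

Regular expressions are another option. That is a large topic in its own right, which you can explore separately.

The approach recommended here is BeautifulSoup.

BeautifulSoup is a library for parsing HTML and XML. It is not part of the standard library; install it as follows:

pip install beautifulsoup4
pip install lxml

lxml is installed to speed up HTML parsing.

Basic usage
1. Creating a BeautifulSoup object

import bs4
from bs4 import BeautifulSoup

Next, use BeautifulSoup to pick out a specific div from the HTML:

from bs4 import *
soup = BeautifulSoup(res.read())  # res: the response object from an earlier urlopen call
print(soup.find(id="div1"))  # the element with id="div1"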

2. Accessing nodes

soup = BeautifulSoup(res.read())
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)
print(soup.p)
print(soup.p['class'])

3. Selecting by tag, class, or id

print(soup.find_all('a'))
print(soup.find('a'))
print(soup.find(class_='title'))
print(soup.find(id="link3"))
print(soup.find('p', class_='title'))

4. Finding the links of all a tags in the document

for link in soup.find_all('a'):
    print(link.get('href'))

5. Specifying a parser

If you create the object without naming a parser, BeautifulSoup prints a warning. As the warning suggests, pass a parser explicitly when creating the object.

soup = BeautifulSoup(html_doc, 'html.parser')

6. Getting all the text in the document

print(soup.get_text())

7. Matching with a regular expression

import re
link_node = soup.find('a', href=re.compile(r"til"))
print(link_node)

8. Parsing HTML with lxml

soup = BeautifulSoup(html_doc, 'lxml')

soup is the parsed document object.
We can reach nodes by following the structure of the HTML. For example, to get a p tag:

p = soup.body.p

This only returns the first matching node. If body contains many p nodes, it cannot return them all.
For that, use find_all or select. select is recommended because it works much like jQuery selectors. For example:

p1 = soup.select('p')  # p tags
p2 = soup.select('#test_p')  # the tag with id test_p
p3 = soup.select('.test')  # tags with class test
p4 = soup.select('body .test')  # tags with class test inside body

A complete example with its output:

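#coding:utf-8
from bs4 import BeautifulSoup

# A made-up HTML document for the example
html = '''<html>
<head></head>
<body>
    <p id="test_p" class="test">test1</p>
    <p class="test">test2</p>
</body>
</html>'''

# Parse the HTML with lxml
soup = BeautifulSoup(html, 'lxml')

# Get all p tags
for p in soup.select('p'):
    print(p)

This prints every p tag.
What if you want a p tag's attributes and content? Like this:

for p in soup.select('p'):
    print(p.name)  # tag name

    # Tag attribute. p['id'] also works, but raises an error if the attribute is missing, just like a dictionary lookup
    print(p.get('id'))
    print(p.string)  # tag content

If a tag contains many child tags, you can call select again on it to drill down further.
To get the text of all descendants of a tag, use the strings attribute, which yields a generator; the result may contain many newlines and spaces. To filter those out, use stripped_strings instead:

print(''.join(soup.body.strings))
print(''.join(soup.body.stripped_strings))

These give, respectively:
u'\ntest1\ntest2\n'
u'test1test2'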

For more on BeautifulSoup, see

https://beautifulsoup.readthedocs.io/zh_CN/latest/#id18

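Parallel processing: multithreaded scraping

Fetching and parsing in a single thread is slow; multiple threads can speed it up.
For how to use threads in Python, see
Python notes: examples of using multithreading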
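As a minimal sketch of the idea, the standard library's ThreadPoolExecutor can run several fetches at once. The fetch function here is a stand-in that sleeps instead of touching the network, and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(link):
    """Stand-in for a real download; sleeps briefly to mimic network latency."""
    time.sleep(0.05)
    return "content of " + link


links = ["http://example.com/page/%d" % n for n in range(1, 9)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in input order even though the calls run concurrently.
    pages = list(pool.map(fetch, links))

print(len(pages), pages[0])
```

Threads suit scraping because most of the time is spent waiting on the network; max_workers should stay modest so the target site is not flooded.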

Storage

Writing to Excel

import xlwt

class ToutiaoPipeline(object):
    def __init__(self):
        self.book=xlwt.Workbook()
        self.sheet=self.book.add_sheet('sheet', cell_overwrite_ok=True)
        head=[u'名字', u'点赞', u'回复', u'评论']
        i=0
        for h in head:
            self.sheet.write(0, i, h)
            i += 1

    def process_item(self,item,spider):
        self.sheet.write(item['Num'],0,item['name'])
        self.sheet.write(item['Num'],1,item['like'])
        self.sheet.write(item['Num'],2,item['reply'])
        self.sheet.write(item['Num'],3,item['text'])
        self.book.save('TouTiao.xls')

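Storing in MongoDB

See: Reading and writing MongoDB from Python with pymongo

Storing in MySQL

Importing a pandas table into a MySQL database
pandas provides a convenient way to write data to a relational database. Recent versions of pandas connect to the database mainly through sqlalchemy, which supports mainstream databases such as MySQL, PostgreSQL, Oracle, MS SQL Server, and SQLite. This example uses MySQL to show how to store fetched stock data in a database. For other database types, see the create_engine section of the sqlalchemy documentation.

Common parameters:

name: table name; pandas creates the table structure automatically
con: database connection; preferably an engine created with sqlalchemy
flavor: database type, {'sqlite', 'mysql'}, default 'sqlite'; can be ignored when con is an engine
schema: the database schema to use; the default is fine
if_exists: what to do if the table already exists, {'fail', 'replace', 'append'}, default 'fail'
index: write the pandas Index as a column, default True
index_label: column name for the Index
chunksize: write rows in batches; default None, meaning all rows are written at once
dtype: data types for the columns in the database, default None

Usage: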

from sqlalchemy import create_engine
import tushare as ts

df = ts.get_tick_data('600848', date='2014-12-22')
engine = create_engine('mysql://user:passwd@127.0.0.1/db_name?charset=utf8')

# Write to the database
df.to_sql('tick_data',engine)

# Append to an existing table
#df.to_sql('tick_data',engine,if_exists='append')
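For trying out the storage step without a MySQL server or pandas, the standard library's sqlite3 module can serve as a stand-in. The sketch below swaps in an in-memory SQLite database; the table layout and sample rows are made up.

```python
import sqlite3

# Made-up sample records standing in for scraped data.
rows = [
    ("apple", 52),
    ("rice", 116),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS food (name TEXT PRIMARY KEY, calories INTEGER)")

# INSERT OR REPLACE keeps the table free of duplicates when a page is scraped twice.
conn.executemany("INSERT OR REPLACE INTO food (name, calories) VALUES (?, ?)", rows)
conn.executemany("INSERT OR REPLACE INTO food (name, calories) VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM food").fetchone()[0]
print(count)
```

The parameter placeholders (?) let the driver handle quoting, which matters when scraped text contains quotes or other special characters.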

Alternatively, see

Python and MySQL: an introduction to the drivers and how to use them

Complete Boohee example: fetch, parse, and store

Scraping the list links

def fetchcategoryview(value, source):
    headlink="http://www.boohee.com/food/view_group/"
    pagelink="?page="
    content=fetchraw(headlink+str(value)+pagelink+str(1))
    soup = BeautifulSoup(content)
    div=soup.find("div", class_="widget-food-list")
    #print(div)
    h3s=div.h3
    #print(h3s)
    ss=h3s.stripped_strings
    for inx,val in enumerate(ss):
        if inx==0:
           type = val.replace(" ","").replace("：","").strip('\n')
           print(type)
    span = div.find("span", class_="pagination-sum")
    nums = span.string
    recordnum = re.findall(r"\d+\.?\d*",nums)[0]
    pagelimit = int(recordnum) // 10 + 1
    print(pagelimit)
    pagelimit = pagelimit + 1
    order = 0
    asc = 0
    if pagelimit == 12:
        pagelimit = 10
    for page in range(1, pagelimit):
        print("page:%s order_by:%s  order_asc%s" % (page, order, asc))
        link = headlink + str(value) + pagelink + str(page)
        print(link)
        insertCategoryPageLink(page, order, asc, link, type, source)



def fetchcategorygroup(value, source):
    headlink="http://www.boohee.com/food/group/"
    pagelink="?page="
    content=fetchraw(headlink+str(value)+pagelink+str(1))
    soup = BeautifulSoup(content)
    div=soup.find("div", class_="widget-food-list")
    #print(div)
    h3s=div.h3
    #print(h3s)
    ss=h3s.stripped_strings
    for inx,val in enumerate(ss):
        if inx==0:
           type = val.replace(" ","").replace("：","").strip('\n')
           print(type)
    span = div.find("span", class_="pagination-sum")
    nums = span.string
    recordnum = re.findall(r"\d+\.?\d*",nums)[0]
    print(recordnum)
    pagelimit=int(recordnum)//10+1
    print(pagelimit)
    pagelimit = pagelimit + 1
    order=0
    asc=0
    if pagelimit==12:
       pagelimit = 10
    for page in range(1, pagelimit):
        print("page:%s order_by:%s  order_asc%s" % (page, order, asc))
        link = headlink + str(value) + pagelink + str(page)
        print(link)
        insertCategoryPageLink(page, order, asc, link, type, source)



for value in range(1, 41):
    fetchcategorygroup(value, "薄荷web-group")

for value in range(1, 132):
    fetchcategoryview(value, "薄荷web-view")
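The page-count arithmetic in the two functions above can be isolated and checked on its own. The sketch below is one reading of it, assuming 10 records per page and a cap of 10 pages; both numbers are guesses based on the constants in the code, and the sample summary string is invented.

```python
import math
import re


def extract_record_count(text):
    """Pull the first integer out of a pagination summary string."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0


def page_count(records, per_page=10, max_pages=10):
    """Number of pages to request, rounded up and capped at max_pages."""
    return min(math.ceil(records / per_page), max_pages)


# "共 95 条" is a hypothetical summary string, not copied from the site.
pages = page_count(extract_record_count("共 95 条"))
print(pages)
```

Rounding up with math.ceil avoids requesting an empty extra page when the record count is an exact multiple of the page size, which integer division plus one would do.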

Fetching the raw page content

import urllib
import urllib.request
import json
from mgap_spider.dao.categoryPageLinkDao import *
from bs4 import *
import string
import time
import re

def fetchraw(link):
    f = urllib.request.Request(link)
    response = urllib.request.urlopen(f)
    the_page = response.read()
    content = the_page.decode("utf8")
    time.sleep(1)
    #print(content)
    return content
    #contentjson = json.loads(content)
    #print(contentjson)
    #print(contentjson['total_pages'])


for i in range(0, 30000, 100):
    links=findNoDealedLimit(i,100)
    for x in links:
        content=fetchraw(x['link'])
        insertCategoryPageRaw(x['link'],content,x['type'],x['source'])
        dealCategoryPagelink(x['link'])
        print("dealed %s %s %s" % (x['source'], x['type'], x['link']))

Parsing

from mgap_spider.dao.itemLinkDao import *
from mgap_spider.dao.categoryPageLinkDao import *
import json
from bs4 import *
import string
import time
import re
import _thread




def parserawauto(begin,size):
    linkhead = "http://food.boohee.com/fb/v1/foods/"
    linkend = "/mode_show?token=&user_key=&app_version=2.6.2.1&app_device=Android&os_version=7.1.2&phone_model=M6+Note&channel=meizu"
    while 1:
        try:
            count=countNoDealedPageRaw()
            if count==0:
                break
            raws=findNoDealedRawLimit(begin,size)
            for raw in raws:
                if raw['source'] == '食物库app':
                    content=raw['content']
                    contentjson = json.loads(content)
                    foods = contentjson['foods']
                    for food in foods:
                        link=linkhead+food['code']+linkend
                        print("dealed %s %s" % (food['code'], food['name']))
                        insertItemLink(food['code'],food['name'],raw['link'],link,raw['type'],raw['source'])
                        dealCategoryPageRaw(raw['link'])
                    print("dealed %s %s %s" % (raw['source'], raw['type'], raw['link']))
                else:
                    content=raw['content']
                    soup = BeautifulSoup(content)
                    div = soup.find("div", class_="widget-food-list")
                    ul = div.find("ul", class_="food-list")
                    boxs = ul.find_all("div", class_="text-box")
                    for box in boxs:
                        node = box.find('a', href=re.compile(r'/shiwu/\w+'))
                        code=node['href'].replace("/shiwu/","")
                        #code=ahref.replace("/shiwu/","")
                        name=node['title']
                        link=linkhead+code+linkend
                        print("dealed %s %s" % (code, name))
                        insertItemLink(code,name,raw['link'],link,raw['type'],raw['source'])
                        dealCategoryPageRaw(raw['link'])
                    print("dealed %s %s %s" % (raw['source'], raw['type'], raw['link']))
        except Exception as e:
            print(e)
    return "begin " + str(begin) + " finish " + str(datetime.now())

def run():
    # Start two threads
    try:
        _thread.start_new_thread(parserawauto, (0, 100))
        _thread.start_new_thread(parserawauto, (100, 100))
    except Exception as e:
       print(e)
       print("Error: unable to start thread")


run()

# Keep the main thread alive; sleep instead of spinning so the loop does not burn CPU.
while True:
    time.sleep(1)

Storage

import pymongo
from pymongo import MongoClient
from mgap_spider.settings import config
from datetime import datetime


def initMongoClient():
    uri = "mongodb://"+config['mongo.username']+":"+config['mongo.password']+"@"+config['mongo.host']+":"+config['mongo.port']+"/admin"
    print(uri)
    client = MongoClient(uri)
    return client


def insertCategoryPageLink(page,order,asc, link, type , source):
    insert_record = {'page': page,'order': order,'asc': asc, 'link': link, 'dealed': 0, 'type': type, 'source': source, 'date':datetime.now()}
    client = initMongoClient()
    db = client['mydb_food']
    collection = db['categorypagelink']
    queryArgs = {'link': link}
    link_count = collection.count_documents(queryArgs)  # Collection.count() was removed in PyMongo 4
    if link_count == 0:
        collection.insert_one(insert_record)

def findNoDealed():
    familys = []
    client = initMongoClient()
    db = client['mydb_food']
    collection = db['categorypagelink']
    queryArgs = {'dealed': 0}
    searchRes = collection.find(queryArgs)
    for x in searchRes:
        familys.append(x)
    return familys


def findNoDealedLimit(skip,limit):
    familys = []
    client = initMongoClient()
    db = client['mydb_food']
    collection = db['categorypagelink']
    queryArgs = {'dealed': 0}
    searchRes = collection.find(queryArgs).skip(skip).limit(limit)
    for x in searchRes:
        familys.append(x)
    return familys


def findNoDealedRawLimit(skip,limit):
    familys = []
    client = initMongoClient()
    db = client['mydb_food']
    collection = db['categorypageraw']
    queryArgs = {'dealed': 0}
    searchRes = collection.find(queryArgs).skip(skip).limit(limit)
    for x in searchRes:
        familys.append(x)
    return familys


def countNoDealedPageRaw():
    client = initMongoClient()
    db = client['mydb_food']
    collection = db['categorypageraw']
    queryArgs = {'dealed': 0}
    return collection.count_documents(queryArgs)  # Collection.count() was removed in PyMongo 4


def findNoDealedLinks():
    barcodes = []
    searchList = findNoDealed()
    for x in searchList:
        barcodes.append(x['link'])
    print(barcodes)
    return barcodes

def insertCategoryPageRaw(link,content,type,source):
    insert_record = {'content': content, 'link': link, 'dealed': 0, 'type': type, 'source': source, 'date':datetime.now()}
    client = initMongoClient()
    db = client['mydb_food']
    collection = db['categorypageraw']
    collection.insert_one(insert_record)

def dealCategoryPagelink(link):
    filterArgs = {'link': link}
    updateArgs = {'$set': {'dealed': 1}}
    client = initMongoClient()
    db = client['mydb_food']
    collection = db['categorypagelink']
    updateRes = collection.update_many(filter=filterArgs, update=updateArgs)


def dealCategoryPageRaw(link):
    filterArgs = {'link': link}
    updateArgs = {'$set': {'dealed': 1}}
    client = initMongoClient()
    db = client['mydb_food']
    collection = db['categorypageraw']
    updateRes = collection.update_many(filter=filterArgs, update=updateArgs)

案例：爬取妹纸图

import requests
from bs4 import BeautifulSoup
import os
#导入所需要的模块
class mzitu():
    def all_url(self, url):
        html = self.request(url)##
        all_a = BeautifulSoup(html.text, 'lxml').find('div', class_='all').find_all('a')
        for a in all_a:
            title = a.get_text()
            print('------开始保存：', title) 
            path = str(title).replace("?", '_') ##替换掉带有的？
            self.mkdir(path) ##调用mkdir函数创建文件夹！这儿path代表的是标题title
            href = a['href']
            self.html(href) 

    def html(self, href):   ##获得图片的页面地址
        html = self.request(href)
        max_span = BeautifulSoup(html.text, 'lxml').find('div', class_='pagenavi').find_all('span')[-2].get_text()
        #这个上面有提到
        for page in range(1, int(max_span) + 1):
            page_url = href + '/' + str(page)
            self.img(page_url) ##调用img函数

    def img(self, page_url): ##处理图片页面地址获得图片的实际地址
        img_html = self.request(page_url)
        img_url = BeautifulSoup(img_html.text, 'lxml').find('div', class_='main-image').find('img')['src']
        self.save(img_url)

    def save(self, img_url): ##保存图片
        name = img_url[-9:-4]
        img = self.request(img_url)
        f = open(name + '.jpg', 'ab')
        f.write(img.content)
        f.close()

    def mkdir(self, path): ##创建文件夹
        path = path.strip()
        isExists = os.path.exists(os.path.join("E:\mzitu2", path))
        if not isExists:
            print('建了一个名字叫做', path, '的文件夹！')
            os.makedirs(os.path.join("E:\mzitu2", path))
            os.chdir(os.path.join("E:\mzitu2", path)) ##切换到目录
            return True
        else:
            print( path, '文件夹已经存在了！')
            return False

    def request(self, url):  # fetch the URL and return the response
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
            'referer': 'http://www.mzitu.com/100260/2',  # fake a referring page
        }
        content = requests.get(url, headers=headers)
        return content
# Entry point
def main():
    Mzitu = mzitu()  # instantiate the crawler
    Mzitu.all_url('http://www.mzitu.com/all')  # pass the index URL to all_url

main()
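
The crawler above replaces only "?" in gallery titles and takes the file name from fixed character positions in the image URL. Below is a rough sketch of two more defensive helpers. `safe_dirname` and `filename_from_url` are hypothetical names that are not part of the original code, and the set of illegal characters assumes a Windows target, as the `E:\mzitu2` path suggests.

```python
import posixpath
import re
from urllib.parse import urlparse

# Characters Windows rejects in file and folder names (assumption: Windows target)
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def safe_dirname(title):
    """Replace characters that are illegal in Windows folder names with '_'."""
    return _ILLEGAL.sub('_', title).strip()


def filename_from_url(img_url):
    """Use the last path segment of the URL as the file name (query string ignored)."""
    return posixpath.basename(urlparse(img_url).path)
```

In the class above, `self.mkdir(safe_dirname(title))` and `open(filename_from_url(img_url), 'wb')` would be the natural places to use them.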

Example: crawling a CSDN blog


# Crawl a CSDN blog and collect each post's title, link, date, view count, and comment count
import re
from urllib import request
 
from bs4 import BeautifulSoup
 
 
class CSDNSpider:
 
    # Initialise the page index, the URL, and the request header
    def __init__(self, pageIndex=1, url="http://blog.csdn.net/u012050154/article/list/1"):
        self.pageIndex = pageIndex
        self.url = url
        self.header = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
        }
 
    # Request a page and return it as a BeautifulSoup object
    def getBeautifulSoup(self, url):
        # Request the page
        req = request.Request(url, headers=self.header)
        res = request.urlopen(req)
        # Parse with the html5lib parser to get a BeautifulSoup object
        # Available parsers include html.parser / lxml / lxml-xml / xml / html5lib
        soup = BeautifulSoup(res, 'html5lib')
        return soup
 
    # Get the total number of list pages for the blog
    def getTotalPages(self):
        soup = self.getBeautifulSoup(self.url)
        # The pager text looks like "209条  共14页" (209 posts, 14 pages in total)
        pageNumText = soup.find('div', 'pagelist').span.get_text()
        # Extract the page count with a regular expression
        pageNum = re.findall(re.compile(pattern=r'共(.*?)页'), pageNumText)[0]
        return int(pageNum)
 
    # Read the title, link, date, view count, and comment count of every post on a page
    def getBlogInfo(self, pageIndx):
        res = []
        # Each list page has a URL such as http://blog.csdn.net/u012050154/article/list/1
        # so rebuild the URL from the page index
        url = self.url[0:self.url.rfind('/')+1] + str(pageIndx)
        # Fetch and parse that URL into a BeautifulSoup object
        soup = self.getBeautifulSoup(url)
        # Locate the post entries
        blog_items = soup.find_all('div', 'list_item article_item')
        for item in blog_items:
            # Post title
            title = item.find('span', 'link_title').a.get_text()
            blog = 'Title:' + title
            # Post link
            link = item.find('span', 'link_title').a.get('href')
            blog += '\tLink:' + link
            # Publication date
            postdate = item.find('span', 'link_postdate').get_text()
            blog += '\tDate:' + postdate
            # View count
            views_text = item.find('span', 'link_view').get_text()  # e.g. 阅读(38)
            views = re.findall(re.compile(r'(\d+)'), views_text)[0]
            blog += '\tViews:' + views
            # Comment count
            comments_text = item.find('span', 'link_comments').get_text()
            comments = re.findall(re.compile(r'(\d+)'), comments_text)[0]
            blog += '\tComments:' + comments + '\n'
 
            print(blog)
            res.append(blog)
        return res
 
def saveFile(datas, pageIndex):
    path = "D:\\Program\\PythonCrawler\\CSDN\\Data\\page_" + str(pageIndex + 1) + ".txt"
    with open(path, 'w', encoding='gbk') as file:
        file.write('Current page: ' + str(pageIndex + 1) + '\n')
        for data in datas:
            file.write(data)
 
 
 
if __name__=="__main__":
    spider = CSDNSpider()
 
    pageNum = spider.getTotalPages()
    print("Total pages:", pageNum)
 
    for index in range(pageNum):
        print("Processing page %s…" % (index+1))
        blogsInfo = spider.getBlogInfo(index+1)
        saveFile(blogsInfo, index)
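
The text-extraction steps in the spider can be tried offline, without requesting any page. The sketch below isolates them into small functions. `parse_total_pages`, `parse_count`, and `page_url` are hypothetical names introduced here for illustration, and the page-count pattern uses `\d+` where the spider uses `.*?`, so that non-numeric text is rejected instead of passed to `int()`.

```python
import re

PAGE_COUNT = re.compile(r'共(\d+)页')  # "共14页" means "14 pages in total"
FIRST_NUMBER = re.compile(r'(\d+)')


def parse_total_pages(text):
    """Return the page count from pager text such as '209条  共14页'."""
    match = PAGE_COUNT.search(text)
    if match is None:
        raise ValueError('no page count found in %r' % text)
    return int(match.group(1))


def parse_count(text):
    """Return the first integer in text such as '阅读(38)', or 0 if there is none."""
    match = FIRST_NUMBER.search(text)
    return int(match.group(1)) if match else 0


def page_url(base_url, page_index):
    """Swap the last path segment of a list URL for the given page index."""
    return base_url[:base_url.rfind('/') + 1] + str(page_index)
```

Returning 0 from `parse_count` when no number is present is a design choice of this sketch; the spider above would raise an `IndexError` in the same situation.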

A problem you may run into: urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1045)>

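This error appears because Python 2.7.9 introduced a new behaviour (Python 3 has had it since 3.4.3): when urlopen is used on an https URL, the server's SSL certificate is verified. If the target site uses a self-signed certificate, urllib.error.URLError is raised.

Importing the ssl module and replacing the default HTTPS context with an unverified one turns the check off. Note that this switches off certificate verification for every HTTPS request that urllib makes in the process, not just the failing one.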

Solution

Add the following:

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

Example

import urllib.request
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))
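
The global patch above affects every later call. As an alternative sketch, verification can be relaxed for a single request by passing an explicit context to `urlopen`. `make_unverified_context` and `fetch` are hypothetical helper names, not part of the original article; the `ssl` and `urllib.request` calls they use are standard-library APIs.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Build an SSL context that skips certificate and host name checks."""
    context = ssl.create_default_context()
    context.check_hostname = False       # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE  # do not validate the certificate chain
    return context


def fetch(url, context=None, timeout=10):
    """Fetch a URL; pass a context to override verification for this call only."""
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read().decode('utf-8')
```

A call such as `fetch('https://www.python.org', context=make_unverified_context())` skips the check for that one request, while other requests keep the default verification.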


References:
https://juejin.im/post/5a3b3c086fb9a044ff31a0d6
https://segmentfault.com/a/1190000011192866