Summary
Today I'll introduce a crawler built with requests + re (regular expressions).
First, a quick introduction to the requests library. It is a very easy-to-use library for crawling, and paired with BS4 it can handle most scraping work. If you run into JS-rendered pages, though, you'll need another approach; take a look at the Scrapy framework, which is built specifically for crawling. Back to requests: the two methods I use most are get and post. GET asks the specified resource for data, similar to a query; POST submits data to be processed to the specified resource, similar to an update.

Next, request headers. When crawling some sites, we need the crawler to imitate a browser's request, and that is what request headers are for (details can be found on Baidu Baike). For most requests, adding a User-Agent is enough to imitate a browser; if a site requires login, we can add cookies to complete that step.
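As a sketch of the idea above, here is how a User-Agent header and a login cookie can be attached to every request via a Session. The User-Agent string and the cookie name/value below are placeholders for illustration, not taken from any real site or session:

```python
import requests

# Hypothetical example: a browser-like User-Agent plus a login cookie.
# Both values are placeholders, not real credentials.
headers = {
    "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"),
}
cookies = {"sessionid": "placeholder-session-id"}

# A Session carries the headers and cookies on every request it makes.
session = requests.Session()
session.headers.update(headers)
session.cookies.update(cookies)

# session.get("https://example.com/") would now send both;
# no request is actually made in this sketch.
print("User-Agent" in session.headers)   # True
print(session.cookies.get("sessionid"))  # placeholder-session-id
```

Passing `headers=headers, cookies=cookies` to a single `requests.get()` call works just as well; the Session form simply saves repeating them on every page of a multi-page crawl.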
Now for the re regular-expression module (see the official documentation first). Regular expressions are a staple of crawling, but I admit I'm a bit dull-witted and often can't work out how to write them; in most cases I just use .*? for non-greedy matching, and for anything special I search online. Racking your brains is rarely as quick and convenient as just looking it up. There are plenty of crawler-oriented regex references online worth consulting.
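To make the .*? point concrete, here is a minimal comparison of greedy .* and non-greedy .*? on a snippet shaped like the Douban markup used later (the titles are placeholders):

```python
import re

# Two title spans in one string, mimicking the page markup.
html = '<span class="title">A</span><span class="title">B</span>'

greedy = re.findall(r'<span class="title">(.*)</span>', html)
lazy = re.findall(r'<span class="title">(.*?)</span>', html)

print(greedy)  # ['A</span><span class="title">B']  - .* grabs as much as possible
print(lazy)    # ['A', 'B']                         - .*? stops at the first </span>
```

This is exactly why the scraping regex below strings together .*? between landmarks: each capture stops at the nearest closing tag instead of swallowing the rest of the page.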
Introduction
The requests and re modules were briefly introduced above; now let's look at some code to see them in action:
```python
import requests
import re


class ReqTest:

    def get_baidu(self):
        """Request Baidu with GET.

        :return: status code and final URL
        """
        req = requests.get('https://www.baidu.com/')
        return req.status_code, req.url

    def post_baidu(self):
        """Request Baidu with POST.

        :return: status code
        """
        data = {
            'name': 'test',
            'tags': '1234'
        }
        req = requests.post('https://www.baidu.com/', data=data)
        return req.status_code

    def get_movie_detail(self):
        """Scrape the Douban Top 250 with requests + re."""
        tmp_list = []
        for start in range(0, 250, 25):  # 10 pages, 25 movies each
            url = 'https://movie.douban.com/top250?start=%s&filter=' % start
            headers = {
                'User-Agent': "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"
            }
            res = requests.get(url, headers=headers)
            retag = re.compile(
                r'.*?<em class="">(.*?)</em>'              # rank
                r'.*?<span class="title">(.*?)</span>'     # title
                r'.*?<span class="title"> / (.*?)</span>'  # English title
                r'.*?<span class="other"> / (.*?)</span>'  # other titles
                r'.*?<p class="">.*?: (.*?) .*? (\d+).*?</p>'  # director, year
                r'.*?<span class="rating_num" property="v:average">(.*?)</span>'  # rating
                r'.*?<span class="inq">(.*?)</span>',      # one-line review
                re.S)
            for item in retag.findall(res.text):
                (rank, title, title_en, title_other,
                 author, year, scores, paragraph) = item
                tmp_list.append({
                    'rank': rank,
                    'title': title + '/' + title_en + '/' + title_other,
                    'author': author,
                    'year': year,
                    'score': scores,
                    'paragraph': paragraph
                })
        return tmp_list


a = ReqTest()
get = a.get_baidu()    # 200
post = a.post_baidu()  # 302
movie = a.get_movie_detail()
for i in movie:
    print(i)
```
Run results:
First I send a GET and then a POST request to Baidu; in the third function I use re and requests to scrape the details of Douban's Top 250 movies. The first two calls return status codes, 200 and 302 respectively. 200 means the request succeeded; 302 means the resource was temporarily moved (similar to 301, but since the move is only temporary, the client should keep using the original URI). A full list of HTTP status codes is easy to find online; when crawling, the one you'll see most of the time is 200. You might wonder what happens if you send a request with no network connection. Without further ado, here is the result:
/home/chen/Desktop/chne_lean/venv/bin/python /home/chen/Desktop/Exercise/reqtest.py
Traceback (most recent call last):
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connection.py", line 159, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/util/connection.py", line 57, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 839, in _validate_conn
conn.connect()
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connection.py", line 301, in connect
conn = self._new_conn()
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connection.py", line 168, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f10d7bda470>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/urllib3/util/retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.baidu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f10d7bda470>: Failed to establish a new connection: [Errno -2] Name or service not known',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/chen/Desktop/Exercise/reqtest.py", line 71, in <module>
get = a.get_baidu() # 200
File "/home/chen/Desktop/Exercise/reqtest.py", line 15, in get_baidu
req = requests.get('https://www.baidu.com/')
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/home/chen/Desktop/chne_lean/venv/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.baidu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f10d7bda470>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Process finished with exit code 1
The error output is long, but we only need to look at the last line: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.baidu.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f10d7bda470>: Failed to establish a new connection: [Errno -2] Name or service not known',)). Roughly, it means there is no network connection.
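Rather than letting the whole script crash like this, the request can be wrapped in a try/except. Here is a minimal sketch; the `safe_get` helper is my own name, and the `.invalid` hostname is a reserved name that is guaranteed never to resolve, so the except branch fires even with a working network:

```python
import requests


def safe_get(url, timeout=5):
    # Return the status code, or None when the connection fails
    # (no network, DNS failure, connection refused, ...).
    try:
        return requests.get(url, timeout=timeout).status_code
    except requests.exceptions.ConnectionError:
        return None


print(safe_get("https://nonexistent.invalid/"))  # None - the name never resolves
```

In a real crawl you might also want to catch `requests.exceptions.Timeout` separately and retry, but the pattern is the same.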
That wraps up today's crawler introduction. Suggestions for improvement are welcome.