Getting-started example:
Goal: scrape the news list from Autohome (汽车之家)
Code:
Modules to install:
shell> pip install requests
shell> pip install beautifulsoup4
import requests
from bs4 import BeautifulSoup
import uuid

# Download the HTML source with requests.get
response = requests.get(url='https://www.autohome.com.cn/news/')
# Decode the page with the encoding detected from its own content
response.encoding = response.apparent_encoding
# Build the BeautifulSoup object
soup = BeautifulSoup(response.text, features='html.parser')
# find returns the first element in the whole document with id=xxx (here a div)
div_list = soup.find(id='auto-channel-lazyload-article')
# find_all returns every li inside that div; find would return only the first li
li_list = div_list.find_all('li')
# find_all returns a list of tags rather than a single tag object, so loop over it
for i in li_list:
    # Find the a tag inside the li
    a_list = i.find('a')
    # Skip li elements that contain no a tag
    if a_list:
        # Get the a tag's href; the 'http:' prefix assumes a protocol-relative link (//host/path)
        a_link = 'http:' + a_list.attrs.get('href')
        print(a_link)
        # Text of the h3 tag under the a tag
        h3_text = a_list.find('h3').text
        print(h3_text)
        # Find the img tag inside the a tag and read its image link
        img_list = a_list.find('img')
        img_link = 'http:' + img_list.attrs.get('src')
        # print(img_link)
        # Send a second request to download the image
        img_response = requests.get(url=img_link)
        # Build the file name; the UUID must be converted to str before concatenation
        file_name = str(uuid.uuid4()) + '.jpg'
        # Save the image
        with open(file_name, 'wb') as f:
            f.write(img_response.content)
        # Find the p tag inside the a tag, i.e. the news summary
        p_text = a_list.find('p').text
        print(p_text)
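The example builds absolute links by prefixing 'http:' to each href and src, which only works while the page uses protocol-relative links. A minimal sketch of a more tolerant alternative, using urljoin from the standard library; the helper name absolute_url and the sample paths in the usage lines are made up for illustration and do not come from the real page:

```python
from urllib.parse import urljoin

# The page the links were scraped from; relative links are resolved against it.
PAGE_URL = 'https://www.autohome.com.cn/news/'


def absolute_url(href, base=PAGE_URL):
    """Return an absolute URL for href, or None when the attribute is missing."""
    if not href:
        return None
    # urljoin handles //host/path, /path, relative paths and full URLs alike.
    return urljoin(base, href)


# Hypothetical hrefs, only to show the different link forms.
print(absolute_url('//www.autohome.com.cn/news/1.html'))
print(absolute_url('/news/2.html'))
print(absolute_url('http://img.example.com/a.jpg'))
```

Returning None for a missing attribute also avoids the TypeError that 'http:' + None would raise when a tag has no href or src.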
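Output:
Getting-started example 2:
Goal 1: log in to Chouti (抽屉) automatically
Goal 2: upvote an article
Explanation: when you open Chouti in Chrome and click "Log in", the developer tools show that the request goes to 'https://dig.chouti.com/login' as a POST, and that the form fields are phone, password, and oneMonth: the phone number, the password, and whether to stay logged in for one month.
Auto-login code: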
import requests
# Form data to submit
post_dict = {
    'phone': 'XXXXXX',
    'password': 'XXXXXXX',
    'oneMonth': 1,
}
# Add request headers
header = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
# Send the login data with a POST request
r2 = requests.post(
    url="https://dig.chouti.com/login",
    data=post_dict,
    headers=header,
)
print(r2.text)
Output:
Successful login: {"result":{"code":"9999", "message":"", "data":{"complateReg":"0","destJid":"cdu_53613419358"}}}
Failed login (the message means "the phone number format is wrong"): {"result":{"code":"8887", "message":"手机号格式不对", "data":""}}
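Rather than reading the printed text by eye, a script can parse the response and branch on the result code. A minimal sketch, assuming that code "9999" means success; that is inferred only from the two sample responses above, not from any Chouti documentation, and the helper name login_succeeded is made up:

```python
import json

# Assumption: inferred from the sample responses, not a documented constant.
SUCCESS_CODE = '9999'


def login_succeeded(body):
    """Return (ok, message) for a login response body given as a JSON string."""
    result = json.loads(body).get('result', {})
    return result.get('code') == SUCCESS_CODE, result.get('message', '')


ok_body = '{"result":{"code":"9999", "message":"", "data":{"complateReg":"0","destJid":"cdu_53613419358"}}}'
bad_body = '{"result":{"code":"8887", "message":"手机号格式不对", "data":""}}'

print(login_succeeded(ok_body))
print(login_succeeded(bad_body))
```

With a live response, r2.json() would give the same dict as json.loads(r2.text).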
Auto-login and upvote code:
import requests
# Form data to submit
post_dict = {
    'phone': '86xxxx',
    'password': 'xxxxx',
    'oneMonth': 1,
}
# Add request headers
header = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
# Send a GET request first to obtain the initial cookies
r1 = requests.get('http://dig.chouti.com/', headers=header)
r1_cookies = r1.cookies.get_dict()
# POST the login data together with the cookies from the first request, so the server authorizes them
r2 = requests.post(
    url="https://dig.chouti.com/login",
    data=post_dict,
    headers=header,
    cookies=r1_cookies,
)
# Upvote, sending the gpsd cookie obtained from the first GET
r3 = requests.post(
    url='https://dig.chouti.com/link/vote?linksId=21901416',
    headers=header,
    cookies={'gpsd': r1_cookies.get('gpsd')}
)
# Print the response to the upvote request
print(r3.text)
Output: {"result":{"code":"30010", "message":"你已经推荐过了", "data":""}}
The message means "you have already upvoted this", so both the automatic login and the upvote work.
Requests in detail:
GET request without parameters:
import requests
ret = requests.get(url='http://www.baidu.com')
print(ret.text)
GET request with parameters:
import requests
values = {'k1': 'v1', 'k2': 'v2'}
ret = requests.get(url="http://httpbin.org/get", params=values)
# The resulting URL is http://httpbin.org/get?k1=v1&k2=v2
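To check how params ends up in the URL without sending anything, requests can build the request offline with Request(...).prepare(). A small sketch; the non-ASCII value is borrowed from the parameter examples later in this post:

```python
import requests
from urllib.parse import urlsplit, parse_qs

values = {'k1': 'v1', 'k2': '水电费'}

# prepare() builds the final URL, headers and body but does not contact the server.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=values).prepare()
print(prepared.url)

# Parsing the query string back shows that nothing was lost in the encoding.
query = parse_qs(urlsplit(prepared.url).query)
print(query)
```

Non-ASCII values are percent-encoded as UTF-8 in the URL, so they round-trip cleanly when parsed back.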
POST request:
import requests
payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.post(url="http://httpbin.org/post", data=payload)
Other requests:
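requests provides one helper function per HTTP verb, and each of them is a thin wrapper around requests.request(method, url, **kwargs). The sketch below only looks the helpers up and prepares a request for each verb offline; the httpbin URL is reused from the examples above and nothing is sent:

```python
import requests

URL = 'http://httpbin.org/get'
HELPERS = ('get', 'post', 'put', 'patch', 'delete', 'head', 'options')

prepared_methods = []
for name in HELPERS:
    helper = getattr(requests, name)  # e.g. requests.put
    assert callable(helper)
    # Equivalent in spirit to requests.request(method=name, url=URL), minus the sending.
    prepared = requests.Request(method=name.upper(), url=URL).prepare()
    prepared_methods.append(prepared.method)

print(prepared_methods)
```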
More parameters:
def request(method, url, **kwargs):
"""Constructs and sends a :class:`Request <Request>`.
:param method: method for the new :class:`Request` object.
:param url: URL for the new :class:`Request` object.
:param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
:param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
:param json: (optional) json data to send in the body of the :class:`Request`.
:param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
:param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
:param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
to add for the file.
:param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
:param timeout: (optional) How long to wait for the server to send data
before giving up, as a float, or a :ref:`(connect timeout, read
timeout) <timeouts>` tuple.
:type timeout: float or tuple
:param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
:type allow_redirects: bool
:param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
:param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
:param stream: (optional) if ``False``, the response content will be immediately downloaded.
:param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
:return: :class:`Response <Response>` object
:rtype: requests.Response
Usage::
>>> import requests
>>> req = requests.request('GET', 'http://httpbin.org/get')
<Response [200]>
"""
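Parameter notes:
method: the HTTP method, e.g. POST, GET, PUT;
url: the request URL;
params: parameters passed in the URL query string, typically with GET;
data: the request body; can be a dict, string, bytes, or file object;
json: the request body sent as JSON; unlike data, which form-encodes a dict, json serializes it with json.dumps, so use it when the dict contains nested dicts;
headers: request headers to send;
cookies: cookies to send, for example: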
import requests
# Form data to submit
post_dict = {
    'phone': '86xxxx',
    'password': 'xxxxx',
    'oneMonth': 1,
}
# Add request headers
header = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
# Send a GET request first to obtain the initial cookies
r1 = requests.get('http://dig.chouti.com/', headers=header)
r1_cookies = r1.cookies.get_dict()
# POST the login data together with the cookies from the first request, so the server authorizes them
r2 = requests.request(
    method='post',
    url="https://dig.chouti.com/login",
    data=post_dict,
    headers=header,
    cookies=r1_cookies,
)
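files: files to upload
auth: authentication (Basic auth puts the base64-encoded user name and password in the headers)
timeout: how long to wait for the connection and for the response
allow_redirects: whether to follow redirects
proxies: proxies to use
verify: whether to verify the SSL certificate (False skips verification)
cert: client certificate file
stream: fetch the response body in chunks instead of downloading it all at once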
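The difference between data and json is easiest to see by preparing the same kind of request both ways and looking at the body and Content-Type that would be sent. A minimal offline sketch; the URL is the local test address used in the parameter examples and is never contacted:

```python
import json
import requests

url = 'http://127.0.0.1:8000/test/'
nested = {'k1': 'v1', 'nested': {'k2': 'v2'}}

# data= form-encodes a flat dict.
as_form = requests.Request('POST', url, data={'k1': 'v1', 'k2': 'v2'}).prepare()
print(as_form.headers['Content-Type'], as_form.body)

# data= with a nested dict silently loses the inner values.
lossy = requests.Request('POST', url, data=nested).prepare()
print(lossy.body)

# json= serializes the whole structure and sets the JSON content type.
as_json = requests.Request('POST', url, json=nested).prepare()
print(as_json.headers['Content-Type'], as_json.body)
```

This is why nested dicts call for json: form encoding has no standard way to represent them.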
Parameter examples:
Examples reproduced from Wu Peiqi (武沛齐):
import requests


def param_method_url():
    # requests.request(method='get', url='http://127.0.0.1:8000/test/')
    # requests.request(method='post', url='http://127.0.0.1:8000/test/')
    pass


def param_param():
    # params can be a dict,
    # a string,
    # or bytes (ASCII only)

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params={'k1': 'v1', 'k2': '水电费'})

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params="k1=v1&k2=水电费&k3=v3&k3=vv3")

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=k2&k3=v3&k3=vv3", encoding='utf8'))

    # Error: bytes that contain non-ASCII characters are rejected
    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=水电费&k3=v3&k3=vv3", encoding='utf8'))
    pass
def param_data():
    # data can be a dict,
    # a string,
    # bytes,
    # or a file object

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data={'k1': 'v1', 'k2': '水电费'})

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data="k1=v1&k2=v2&k3=v3&k3=v4"
    #                  )

    # A string body is sent as-is, so set the Content-Type yourself
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data="k1=v1&k2=v2&k3=v3&k3=v4",
    #                  headers={'Content-Type': 'application/x-www-form-urlencoded'}
    #                  )

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data=open('data_file.py', mode='r', encoding='utf-8'),  # file content: k1=v1&k2=v2&k3=v3&k3=v4
    #                  headers={'Content-Type': 'application/x-www-form-urlencoded'}
    #                  )
    pass
def param_json():
    # The json value is serialized to a string with json.dumps(...)
    # and sent in the request body with the header {'Content-Type': 'application/json'}
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'})


def param_headers():
    # Send request headers to the server; an explicit Content-Type overrides the default
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水电费'},
                     headers={'Content-Type': 'application/x-www-form-urlencoded'}
                     )


def param_cookies():
    # Send cookies to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies={'cook1': 'value1'},
                     )
    # A CookieJar also works (the dict form is a wrapper built on top of it)
    from http.cookiejar import CookieJar
    from http.cookiejar import Cookie

    obj = CookieJar()
    obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
                          discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
                          port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False)
                   )
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies=obj)
def param_files():
    # Upload a file
    # file_dict = {
    #     'f1': open('readme', 'rb')
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload a file with a custom file name
    # file_dict = {
    #     'f1': ('test.txt', open('readme', 'rb'))
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Custom file name, with the content given as a string
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Custom file name, content type, and extra headers for the file part
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)
    pass


def param_auth():
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)

    # ret = requests.get('http://192.168.1.1',
    #                    auth=HTTPBasicAuth('admin', 'admin'))
    # ret.encoding = 'gbk'
    # print(ret.text)

    # ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
    # print(ret)


def param_timeout():
    # A single number applies to both the connect and the read timeout
    # ret = requests.get('http://google.com/', timeout=1)
    # print(ret)

    # A tuple is (connect timeout, read timeout)
    # ret = requests.get('http://google.com/', timeout=(5, 1))
    # print(ret)
    pass
def param_allow_redirects():
    ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
    print(ret.text)


def param_proxies():
    # proxies = {
    #     "http": "http://61.172.249.96:80",
    #     "https": "http://61.185.219.126:3128",
    # }

    # A key may also name a specific host
    # proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

    # ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
    # print(ret.headers)

    # from requests.auth import HTTPProxyAuth
    #
    # proxyDict = {
    #     'http': 'http://77.75.105.165',
    #     'https': 'http://77.75.105.165'
    # }
    # auth = HTTPProxyAuth('username', 'mypassword')
    #
    # r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
    # print(r.text)
    pass


def param_stream():
    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    print(ret.content)
    ret.close()

    # from contextlib import closing
    # with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    #     # Process the response here, chunk by chunk
    #     for i in r.iter_content():
    #         print(i)
def requests_session():
    import requests

    session = requests.Session()

    # 1. Visit any page first to obtain a cookie
    i1 = session.get(url="http://dig.chouti.com/help/service")

    # 2. Log in; the session sends the cookie from step 1, and the server authorizes its gpsd value
    i2 = session.post(
        url="http://dig.chouti.com/login",
        data={
            'phone': "86xxxxxxxxxxx",
            'password': "xxxxxx",
            'oneMonth': ""
        }
    )

    # 3. Upvote; the session sends the authorized cookie automatically
    i3 = session.post(
        url="http://dig.chouti.com/link/vote?linksId=8589623",
    )
    print(i3.text)
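What requests_session relies on is that a Session keeps a cookie jar and attaches its cookies to every later request, which is why no cookies= argument is needed there. A minimal offline sketch of that behaviour; the cookie value 'abc' is made up, and prepare_request only builds the request without sending it:

```python
import requests

session = requests.Session()

# Pretend an earlier response set this cookie; 'abc' is a made-up value.
session.cookies.set('gpsd', 'abc')

# prepare_request merges the session's cookies into the outgoing request.
request = requests.Request('POST', 'http://dig.chouti.com/link/vote?linksId=8589623')
prepared = session.prepare_request(request)

print(prepared.headers.get('Cookie'))
```

In the earlier scripts that used plain requests.get and requests.post, the same effect needed r1.cookies.get_dict() and an explicit cookies= argument on each call.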
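The requests response in detail:
text: the downloaded HTML source or returned data, decoded to a string;
content: the returned data as bytes, e.g. images or video;
encoding: the encoding used to decode text;
apparent_encoding: the encoding detected from the content itself;
status_code: the HTTP status code;
cookies.get_dict(): the response cookies as a dict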
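The relationship between content, encoding and text can be shown without a network call: text is simply content decoded with encoding. The sketch below imitates that with plain bytes; the helper decode_body is made up for illustration, and the sample string stands in for a GBK-encoded page:

```python
# Bytes as a server might send them for a GBK-encoded page;
# this plays the role of response.content.
content = '汽车之家新闻'.encode('gbk')


def decode_body(content, encoding):
    """Imitate response.text: decode the raw bytes with the chosen encoding."""
    return content.decode(encoding, errors='replace')


# Decoding with the wrong encoding gives garbled text ...
wrong = decode_body(content, 'iso-8859-1')
# ... while the page's real encoding gives the original string back.
right = decode_body(content, 'gbk')

print(repr(wrong))
print(right)
```

This is the reason the first example sets response.encoding = response.apparent_encoding before reading response.text.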
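BeautifulSoup in detail:
BeautifulSoup is a module that takes an HTML or XML string and parses it into a tree; you can then use its methods to look up elements quickly, which makes finding specific elements in HTML or XML simple.
Install: pip install beautifulsoup4
Create the object by passing in the HTML or XML and naming the parser with features. html.parser ships with Python; lxml also works but must be installed separately.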
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, features='html.parser')
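BeautifulSoup object methods:
find: returns the first match as a tag object; it can search by id, by a custom attribute such as nid, by class, and so on;
find_all: returns all matches as a list; loop over the list to work with each tag object;
attrs: the tag's attributes, e.g. href or src; attrs.get reads a value but cannot set one
string: the tag's text content
name: the tag name
children: all child tags
clear: remove all of the tag's children (the tag itself stays)
decompose: recursively delete the tag and everything inside it
decode: convert to a string (including the current tag)
decode_contents: convert to a string (excluding the current tag)
encode: convert to bytes (including the current tag)
has_attr: check whether the tag has the given attribute
get_text: get the text inside the tag
index: the position of a tag among the children of another tag
is_empty_element: whether the tag is an empty (void) or self-closing element, i.e. one of 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'
select: CSS selector lookup;
append: append a tag inside the current tag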
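A few of these methods can be tried on a small HTML string, which is easier to follow than a live page. The markup below is invented for illustration and only loosely mirrors the list-of-articles structure from the first example:

```python
from bs4 import BeautifulSoup

# Made-up markup: a container div holding a list of article links.
html = """
<div id="articles">
  <ul>
    <li><a href="//example.com/news/1.html"><h3>First title</h3><p>First summary</p></a></li>
    <li><a href="//example.com/news/2.html"><h3>Second title</h3><p>Second summary</p></a></li>
    <li>no link here</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, features='html.parser')

container = soup.find(id='articles')   # first match, a single tag object
items = container.find_all('li')       # every match, as a list
first_link = container.find('a')

print(container.name)                  # tag name
print(first_link.attrs.get('href'))    # attribute lookup
print(first_link.has_attr('href'))

# Skip the li that has no link, as the scraping example does with "if a_list:".
titles = [li.find('h3').get_text() for li in items if li.find('a')]
print(titles)
```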
# For these methods in use, see the Autohome scraping code in the first example.
More usage examples:
Examples reproduced from Wu Peiqi (武沛齐):
Usage examples:
1. name: the tag name
2. attrs: the tag's attributes
3. children: all direct child tags
4. descendants: all descendant tags, at any depth
5. clear: remove all of the tag's children (the tag itself stays)
6. decompose: recursively delete the tag and everything inside it
7. extract: delete the tag and everything inside it, and return what was removed
8. decode: convert to a string (including the current tag); decode_contents (excluding the current tag)
9. encode: convert to bytes (including the current tag); encode_contents (excluding the current tag)
10. find: get the first matching tag
11. find_all: get all matching tags
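find and find_all accept the same kinds of filters. The sketch below shows some common ones on made-up markup: a tag name, a list of names, a class, an attrs dict, limit, recursive=False, and an attribute-presence test:

```python
from bs4 import BeautifulSoup

# Made-up markup with links at two nesting levels.
html = (
    '<div>'
    '<a class="c1 big" id="x" href="/a">A</a>'
    '<p class="c1">P</p>'
    '<a class="c2" href="/b">B</a>'
    '<span><a href="/c">C</a></span>'
    '</div>'
)
soup = BeautifulSoup(html, features='html.parser')

all_links = soup.find_all('a')                        # by tag name
a_or_p = soup.find_all(['a', 'p'])                    # any of several names
c1_tags = soup.find_all(class_='c1')                  # by class (class is a Python keyword)
c1_links = soup.find_all('a', attrs={'class': 'c1'})  # name plus attribute dict
first_two = soup.find_all('a', limit=2)               # stop after two matches
direct = soup.div.find_all('a', recursive=False)      # direct children only
with_id = soup.find_all(id=True)                      # any tag that has an id

print(len(all_links), len(a_or_p), len(c1_tags), len(c1_links))
print([a.get_text() for a in direct])
```

find takes the same arguments and returns only the first match, or None when nothing matches.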
12. has_attr: check whether the tag has the given attribute
13. get_text: get the text inside the tag
14. index: the position of a tag among the children of another tag
15. is_empty_element: whether the tag is an empty (void) or self-closing element, i.e. one of 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'
16. Tags related to the current tag
17. Searching for tags related to a given tag
18. select, select_one: CSS selectors
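select returns a list of every tag matching a CSS selector, and select_one returns only the first match or None. A short sketch on made-up markup; it assumes a reasonably recent beautifulsoup4, where CSS selector support is installed with the package:

```python
from bs4 import BeautifulSoup

# Made-up markup: a navigation list plus one paragraph outside it.
html = (
    '<div id="nav"><ul>'
    '<li class="item"><a href="/one">One</a></li>'
    '<li class="item active"><a href="/two">Two</a></li>'
    '</ul></div>'
    '<p class="item">not in nav</p>'
)
soup = BeautifulSoup(html, features='html.parser')

nav_items = soup.select('#nav li.item')       # descendants of the element with id="nav"
active_links = soup.select('li.active > a')   # direct child of the active li
one = soup.select_one('a[href="/one"]')       # attribute selector, first match only
all_items = soup.select('.item')              # every tag with class "item"
missing = soup.select_one('table')            # no match

print(len(nav_items), len(all_items))
print(active_links[0].get_text(), one.get_text(), missing)
```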
19. The tag's content
20. append: append a tag inside the current tag
21. insert: insert a tag at a given position inside the current tag
22. insert_after, insert_before: insert after or before the current tag
23. replace_with: replace the current tag with the given tag
24. Creating relationships between tags
25. wrap: wrap the current tag in the given tag
26. unwrap: remove the current tag but keep what it wraps
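The tree-editing methods are easiest to follow by watching one small document change step by step. A minimal sketch on made-up markup, covering append, wrap, unwrap and replace_with; new tags are created with soup.new_tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="box"><a href="/x">link</a></div>', features='html.parser')
box = soup.find(id='box')
steps = []

# append: add a new tag as the last child of the current tag
span = soup.new_tag('span')
span.string = 'new'
box.append(span)
steps.append(str(box))

# wrap: put the a tag inside a new p tag
box.a.wrap(soup.new_tag('p'))
steps.append(str(box))

# unwrap: remove the a tag itself but keep its text
box.a.unwrap()
steps.append(str(box))

# replace_with: swap the span for a different tag
em = soup.new_tag('em')
em.string = 'replaced'
box.span.replace_with(em)
steps.append(str(box))

for step in steps:
    print(step)
```

Each entry in steps is the serialized div after one operation, so the effect of every call can be compared with the one before it.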