Commonly Used Python Crawler Modules (1): The Python Road -- Crawlers -- Common Modules

1.requests

Requests is an HTTP library written in Python, built on top of urllib and released under the Apache2 Licensed open-source license. It is more convenient than urllib, saves us a great deal of work, and fully meets the needs of HTTP testing.

Parameters of the requests module

1.1  get  # send a GET request

The parameters of requests.get() include: url, params, headers, cookies

requests.get(
    url="http://www.oldboyedu.com",
    params={"nid": 1, "name": "xx"},  # parameters passed in the URL; the URL actually requested becomes http://www.oldboyedu.com?nid=1&name=xx
    headers={...},
    cookies={...}
)

1.2  post  # send a POST request

The parameters of requests.post() include: url, params, headers, data, cookies

The parameters of post are used in the same way as in get, so they will not be repeated one by one here; a small sketch follows.
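A minimal sketch of a POST request; the httpbin.org endpoint and the form values below are placeholders used only for illustration:

import requests

# httpbin.org/post simply echoes back whatever it receives
response = requests.post(
    url="http://httpbin.org/post",
    data={"user": "alex", "pwd": "123"},   # form data sent in the request body
    headers={"User-Agent": "Mozilla/5.0"},
    cookies={"session": "xxx"},
)
print(response.text)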

1.3  proxies

proxies -- route the request through a proxy server
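A minimal sketch of using a proxy; the proxy address below is a placeholder and should be replaced with a real proxy server:

import requests

# placeholder proxy address -- replace with a real proxy
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
response = requests.get("http://www.oldboyedu.com", proxies=proxies)
print(response.status_code)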

# upload a file with a custom filename
# file_dict = {
#     'f1': ('test.txt', open('readme', 'rb'))
# }
# requests.request(method='POST',
#                  url='http://127.0.0.1:8000/test/',
#                  files=file_dict)

1.4  json

Used when the data submitted in the request is not Form Data but a JSON payload; import the json module and use json.dumps(data).

Sending JSON data with POST

import requests
import json

r = requests.post('https://api.github.com/some/endpoint', data=json.dumps({'some': 'data'}))
print(r.json())
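Note that newer versions of requests (2.4.2 and later) also accept a json= keyword argument, which serializes the dict and sets the Content-Type: application/json header automatically, so the json.dumps call can be dropped:

r = requests.post('https://api.github.com/some/endpoint', json={'some': 'data'})
print(r.json())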

1.5  auth

Performs basic HTTP authentication.
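A minimal sketch using HTTP Basic Auth; the httpbin.org endpoint and credentials are placeholders:

import requests
from requests.auth import HTTPBasicAuth

# httpbin.org/basic-auth/<user>/<passwd> accepts exactly these credentials
r = requests.get('http://httpbin.org/basic-auth/user/passwd',
                 auth=HTTPBasicAuth('user', 'passwd'))
print(r.status_code)  # 200 if the credentials are accepted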

1.6  timeout

# request timeout

timeout=(m, n)  # m is the connection timeout in seconds; n is the read timeout, i.e. how long to wait for the server to send the response
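A minimal sketch; the URL and timeout values are arbitrary:

import requests

try:
    # wait at most 3 seconds to connect and 7 seconds for the server's response
    r = requests.get("http://www.oldboyedu.com", timeout=(3, 7))
    print(r.status_code)
except requests.exceptions.Timeout:
    print("request timed out")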

1.7  allow_redirects

Whether to follow redirects; defaults to True.
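A minimal sketch; http://github.com is used here only because it redirects to https://github.com/:

import requests

r = requests.get("http://github.com", allow_redirects=False)
print(r.status_code)               # 301 -- the redirect is returned instead of being followed
print(r.headers.get("Location"))   # https://github.com/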

1.8  stream

Used when downloading large files; the content is downloaded piece by piece.

ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
for i in ret.iter_content():
    print(i)

from contextlib import closing
with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    # process the response here
    for i in r.iter_content():
        print(i)
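In practice the chunks are usually written to a file rather than printed; a minimal sketch, where the URL and filename are placeholders:

import requests
from contextlib import closing

# placeholder URL and filename
with closing(requests.get("http://httpbin.org/bytes/102400", stream=True)) as r:
    with open("download.bin", "wb") as f:
        for chunk in r.iter_content(chunk_size=1024):  # read 1 KB at a time
            if chunk:
                f.write(chunk)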


1.9  cert: client-side SSL certificate

1.10  verify: whether to verify the server's SSL certificate (defaults to True)
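A minimal sketch of both parameters; the certificate paths are placeholders:

import requests

# verify the server against a custom CA bundle (placeholder path)
r = requests.get("https://example.com", verify="/path/to/ca-bundle.crt")

# present a client-side certificate (placeholder paths)
r = requests.get("https://example.com", cert=("/path/client.crt", "/path/client.key"))

# skip certificate verification entirely (not recommended in production)
r = requests.get("https://example.com", verify=False)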

2.BeautifulSoup

Beautiful Soup is a Python library for extracting data from HTML or XML files. It lets you navigate, search, and modify documents in the usual ways through your favorite parser, and can save you hours or even days of work.

2.1  Installing bs4

pip install beautifulsoup4

2.2  Parsing

import requests
from bs4 import BeautifulSoup

ret = requests.get("http://www.baidu.com")
soup = BeautifulSoup(ret.text, 'html.parser')
print(soup)  # print the parsed HTML

2.3  The find and find_all methods

div = soup.find(name="div", attrs={"id": "content-list"})
# find the div tag whose id attribute is "content-list"; returns that div with everything inside it

items = div.find_all(name="div", attrs={"class": "item"})
# find every div tag whose class attribute is "item"; returns all matching div tags
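Once a tag has been found, its text and attributes can be read off directly; a minimal sketch continuing from the soup, div and items above:

# continuing from the items found above
for item in items:
    a = item.find(name="a")          # first <a> tag inside the item
    if not a:
        continue
    print(a.text)                    # the tag's text content
    print(a.get("href"))             # the value of the href attribute
    print(a.attrs)                   # all attributes as a dict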

A batch of practice examples using these two most commonly used crawler modules:

1. Log in to Chouti automatically and upvote posts in bulk

import requests
from bs4 import BeautifulSoup

# build the URL for each page
for page in range(5, 6):
    pageurl = "https://dig.chouti.com/all/hot/recent/%s" % page

    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
    }
    # visit each page and collect its cookies
    response = requests.get(
        url=pageurl,
        headers=header
    )
    cookie1_dict = response.cookies.get_dict()
    # response.encoding = response.apparent_encoding
    # print(response.text)

    # send a POST request to log in
    data = {
        "phone": "********",
        "password": "*******",
        "oneMonth": 1
    }
    response1 = requests.post(
        url="https://dig.chouti.com/login",
        data=data,
        headers=header,
        cookies=cookie1_dict
    )

    # find the ID of every news item on the page
    soup = BeautifulSoup(response.text, "html.parser")
    div = soup.find(name="div", attrs={"id": "content-list"})
    # print(div)
    items = div.find_all(name="div", attrs={"class": "item"})
    for item in items:
        id = item.find(name="div", attrs={"class": "part2"}).get("share-linkid")

        # upvote the item
        response2 = requests.post(
            url="https://dig.chouti.com/link/vote?linksId=%s" % id,
            headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
            },
            cookies=cookie1_dict
        )
        print(response2.text)


2. Log in to GitHub automatically and fetch profile information

import requests
from bs4 import BeautifulSoup

res = requests.get(url="https://github.com/login")
soup1 = BeautifulSoup(res.text, "html.parser")
tag = soup1.find(name='input', attrs={'name': 'authenticity_token'})
authenticity_token = tag.get('value')
cookie1 = res.cookies.get_dict()

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
}

res_login = requests.post(
    url="https://github.com/session",
    headers=header,
    data={
        "commit": "Sign in",
        "utf8": "✓",
        "authenticity_token": authenticity_token,
        "login": "******",
        "password": "**********"
    },
    cookies=cookie1
)
cookie2 = res_login.cookies.get_dict()
# print(res_login.text)

res_message = requests.get(
    url="https://github.com/Aberwang",
    headers=header,
    cookies=cookie2,
)
# print(res_message.text)
soup2 = BeautifulSoup(res_message.text, "html.parser")
div = soup2.find(name="div", attrs={"id": "js-pjax-container"})
h1 = div.find(name="h1", attrs={"class": "vcard-names"})
span = h1.find(name="span", attrs={"class": "p-nickname vcard-username d-block"})
username = span.get_text()
print("Username:", username)
a = div.find(name="a", attrs={"class": "u-photo d-block tooltipped tooltipped-s"})
img = a.find(name="img", attrs={"class": "avatar width-full rounded-2"})
src = img.get("src")
print("Avatar URL:", src)


3. Scrape news from Autohome (汽车之家)

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.autohome.com.cn/news/")  # fetch the page's HTML
res.encoding = "gbk"

soup = BeautifulSoup(res.text, "html.parser")
# parse the HTML page we just fetched
li_list = soup.find(id="auto-channel-lazyload-article").find_all(name="li")
for li in li_list:
    title = li.find("h3")
    if not title:
        continue
    summary = li.find("p")
    url = li.find("a").get("href")
    img = li.find('img').get('src')
    print(title.text, url, summary.text, img)


4. Log in to Gitee (码云) automatically and fetch profile information

import requests
from bs4 import BeautifulSoup

# get the token
r1 = requests.get("https://gitee.com/login")
r1.encoding = "utf-8"
soup = BeautifulSoup(r1.text, "html.parser")
token = soup.find(name="input", attrs={"name": "authenticity_token"}).get("value")

# POST the username, password and token to the server
data = {
    "utf8": "✓",
    "authenticity_token": token,
    "redirect_to_url": "",
    "user[login]": "***account***",
    "user[password]": "***password***",
    "captcha": "",
    "user[remember_me]": "0",
    "commit": "登录"
}
r2 = requests.post("https://gitee.com/login", data)

cookie_dict = r2.cookies.get_dict()
r3 = requests.get("https://gitee.com/aberwang/projects", cookies=cookie_dict)
print(r3.text)

