Introduction to BeautifulSoup4 and a Zhihu Login Example

gxh_apologize

Column: Python学习笔记 (Python Study Notes). Tags: python-爬虫 (Python web scraping)

Article link: https://blog.csdn.net/GXH_APOLOGIZE/article/details/78948277

I. Introduction to BeautifulSoup4

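Like lxml, Beautiful Soup is an HTML/XML parser; its main job is likewise to parse HTML/XML and extract data from it.
lxml traverses only part of the document, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree, so its time and memory overhead are much higher and its performance is lower than lxml's.
Beautiful Soup makes parsing HTML fairly simple, and its API is very user-friendly. It supports CSS selectors, and it can use the HTML parser in the Python standard library as well as lxml's XML parser.
Beautiful Soup 3 is no longer being developed; Beautiful Soup 4 is recommended for current projects. Install it with pip: pip install beautifulsoup4 (the "lxml" parser used in the examples is a separate package: pip install lxml)
Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/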
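As a quick check that the installation works, here is a minimal sketch of parsing a fragment. The snippet string and variable names are made up for illustration, and the example deliberately uses the standard-library "html.parser" backend so that it runs even when lxml is not installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, not taken from the article.
snippet = "<p class='intro'>Hello, <b>soup</b></p>"

# The second argument names the parser backend.
# "html.parser" ships with Python; "lxml" needs the lxml package.
soup = BeautifulSoup(snippet, "html.parser")

first_p = soup.p                 # first <p> tag in the document
print(first_p.get_text())        # all text inside the tag
print(first_p["class"])          # class is multi-valued, so this is a list
print(soup.b.string)             # the single string inside <b>
```

Swapping "html.parser" for "lxml" should give the same results for a fragment this simple; the parsers mainly differ in speed and in how they repair malformed markup.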

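II. Usage

Sample code (the IPython sessions that follow were recorded under Python 2, hence the u'...' string prefixes and the print statements):

from bs4 import BeautifulSoup

html = """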
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Insert title here</title>
    </head>

    <frameset rows="70%,*">
        <frame bordercolor="1" src="04_table.html" noresize="noresize" />
        <frameset cols="20%,*">
            <frame bordercolor="1" src="layout/b.html" noresize="noresize" />
            <frame bordercolor="1" noresize="noresize" name="content" />
        </frameset>
    </frameset>

    <body>
        <a href="http://www.baidu.com" class="hehe" id="link1"><!-- hehehe --></a>

    </body>
</html>
"""
bs=BeautifulSoup(html,"lxml")

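1. The four kinds of objects. Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:

Tag: accessing a tag by name (for example bs.head) returns the first matching tag in the whole document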

In [22]: bs.head
Out[22]: <head>\n<meta content="text/html; charset=unicode-escape" http-equiv="Content-Type"/>\n<title>Insert title here</title>\n</head>

In [23]: print type(bs.head)
<class 'bs4.element.Tag'>

In [24]: bs.name
Out[24]: u'[document]'

In [25]: bs.head.name
Out[25]: 'head'

In [26]: bs.a.attrs
Out[26]: {'class': ['hehe'], 'href': 'http://www.baidu.com', 'id': 'link1'}

In [27]: bs.a['class']
Out[27]: ['hehe']

In [31]: bs.a['class']="haha" # modify an attribute

In [32]: bs.a['class']
Out[32]: 'haha'

In [33]: del bs.a['class']

In [34]: bs.a.attrs
Out[34]: {'href': 'http://www.baidu.com', 'id': 'link1'}

NavigableString


In [45]: bs.title.string   # .string returns the tag's content
Out[45]: u'Insert title here'

In [46]: print type(bs.title.string)
<class 'bs4.element.NavigableString'>

BeautifulSoup

In [38]: bs.name
Out[38]: u'[document]'

In [39]: print type(bs.name)
<type 'unicode'>

In [40]: bs.attrs
Out[40]: {}

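In [41]: # the document object itself has no attributes

Comment is a special type of NavigableString; its output does not include the comment delimiters

In [35]: bs.a.string # get the content of the a tag (here, a comment)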
Out[35]: u' hehehe '

In [36]: print type(bs.a.string)
<class 'bs4.element.Comment'>

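2. Traversing the document tree

Direct children: the .contents and .children attributes
.contents returns a tag's direct children as a list
.children returns an iterator over the same children, so it has to be looped over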

In [52]: bs.frameset.contents
Out[52]: 
[u'\n',
 <frame bordercolor="1" noresize="noresize" src="04_table.html"/>,
 u'\n',
 <frameset cols="20%,*">\n<frame bordercolor="1" noresize="noresize" src="layout/b.html"/>\n<frame bordercolor="1" name="content" noresize="noresize"/>\n</frameset>,
 u'\n']

In [53]: bs.frameset.contents[3]
Out[53]: <frameset cols="20%,*">\n<frame bordercolor="1" noresize="noresize" src="layout/b.html"/>\n<frame bordercolor="1" name="content" noresize="noresize"/>\n</frameset>

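All descendants: the .descendants attribute, which also has to be iterated over
Node content: the .string attribute
If a tag has exactly one child and that child is a NavigableString, .string returns that child. If the single child is itself a tag, .string returns the same result as that child's .string. If a tag has more than one child, .string returns None.

Put simply: if a tag contains no other tags, .string returns its content; if it contains exactly one tag, .string also returns the innermost content.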
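The traversal attributes and the .string rules above can be sketched as follows. The markup is a made-up fragment without whitespace between tags, so that no '\n' strings show up among the children, and "html.parser" is an assumption chosen to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: the first <p> wraps a single <b>,
# the second <p> has two children (a string and an <i> tag).
doc = "<div><p><b>bold</b></p><p>one<i>two</i></p></div>"
soup = BeautifulSoup(doc, "html.parser")
div = soup.div

# .contents is a list of direct children.
print(len(div.contents))

# .children iterates over the same direct children.
child_names = [child.name for child in div.children]
print(child_names)

# .descendants walks the whole subtree in document order
# and yields strings as well as tags.
descendant_tags = [n.name for n in div.descendants if isinstance(n, Tag)]
print(descendant_tags)

first_p, second_p = div.find_all("p")
print(first_p.string)       # single child chain, so the inner string
print(second_p.string)      # more than one child, so None
print(second_p.get_text())  # get_text() concatenates all strings instead
```

When .string comes back as None, get_text() is usually the attribute to reach for.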

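3. Searching the document tree

find_all(name, attrs, recursive, text, **kwargs)
The name argument accepts a string, a regular expression, or a list.
The text argument searches the string content of the document; like name, it accepts a string, a regular expression, or a list. (Since Beautiful Soup 4.4.0 this argument is called string; text is the older name.)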

In [93]: bs.find_all('a')
Out[93]: [<a class="hehe" href="http://www.baidu.com" id="link1"><!-- hehehe --></a>]

In [94]: bs.find_all(['a','title'])
Out[94]: 
[<title>Insert title here</title>,
 <a class="hehe" href="http://www.baidu.com" id="link1"><!-- hehehe --></a>]

In [95]: bs.find_all(id="link1")
Out[95]: [<a class="hehe" href="http://www.baidu.com" id="link1"><!-- hehehe --></a>]

In [96]: bs.find_all(text="hehehe")  # a plain string must match exactly; the comment's text is ' hehehe ', spaces included
Out[96]: []

In [97]: bs.find_all(text="Insert title here")
Out[97]: [u'Insert title here']

In [98]: import re

In [99]: bs.find_all(text=re.compile("Insert title"))
Out[99]: [u'Insert title here']
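To put the different filter types side by side, here is a small Python 3 sketch on a made-up list. It assumes Beautiful Soup 4.4.0 or later, where the string-search argument is spelled string, and it uses "html.parser" so that lxml is not required.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, not from the article.
doc = """<ul>
<li id="first" class="item">apple</li>
<li class="item special">banana</li>
<li class="item">cherry pie</li>
</ul><p>apple</p>"""
soup = BeautifulSoup(doc, "html.parser")

by_name = soup.find_all("li")                        # tag name as a string
by_list = soup.find_all(["li", "p"])                 # any name in the list
by_regex = soup.find_all(re.compile("^l"))           # names starting with "l"
by_id = soup.find_all(id="first")                    # keyword = attribute filter
by_class = soup.find_all("li", class_="special")     # class_ avoids the keyword
exact = soup.find_all(string="apple")                # exact string match
partial = soup.find_all(string=re.compile("an"))     # regex is searched, not matched whole
limited = soup.find_all("li", limit=2)               # stop after two results

print(len(by_name), len(by_list), len(by_regex))
print(by_id[0].get_text(), by_class[0].get_text())
print(exact, partial, len(limited))
```

Note that a plain string filter has to equal the whole string, while a compiled regular expression only needs to be found somewhere inside it.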

4. The select method

In [83]: bs.select('title')  # find by tag name
Out[83]: [<title>Insert title here</title>]

In [84]: bs.select('.hehe')  # find by class name
Out[84]: [<a class="hehe" href="http://www.baidu.com" id="link1"><!-- hehehe --></a>]

In [85]: bs.select('#link1')  # find by id
Out[85]: [<a class="hehe" href="http://www.baidu.com" id="link1"><!-- hehehe --></a>]

In [86]: bs.select('head > title') # find title tags that are direct children of head
Out[86]: [<title>Insert title here</title>]

In [87]: bs.select('a #link1') # the space means "descendant": an element with id link1 inside an a tag, so nothing matches; write 'a#link1' for the a tag itself
Out[87]: []

In [88]: bs.select('a  #link1') # extra whitespace makes no difference
Out[88]: []

In [89]: bs.select('a[herf="http://www.baidu.com"])  # find by attribute
  File "<ipython-input-89-d90676467fe4>", line 1
    bs.select('a[herf="http://www.baidu.com"])  # find by attribute
                                                                   ^
SyntaxError: EOL while scanning string literal


In [90]: bs.select('a[herf="http://www.baidu.com"]')  # attribute name misspelled (herf), so nothing matches
Out[90]: []

In [91]: bs.select('a[href="http://www.baidu.com"]')  # find by attribute
Out[91]: [<a class="hehe" href="http://www.baidu.com" id="link1"><!-- hehehe --></a>]

In [92]: # every select call above returns a list; iterate over it and call get_text() on each element to get its content
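The difference between 'a#link1' and 'a #link1', and the get_text() step mentioned in the last comment, can be sketched like this. The markup is a made-up variant of the sample link with visible text added, and "html.parser" is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment based on the sample link.
doc = (
    '<div id="nav">'
    '<a href="http://www.baidu.com" class="hehe" id="link1">Baidu</a>'
    '<span id="note">hi</span>'
    '</div>'
)
soup = BeautifulSoup(doc, "html.parser")

same_element = soup.select("a#link1")     # an <a> whose own id is link1
inside_a = soup.select("a #link1")        # an element with id link1 *inside* an <a>
inside_div = soup.select("div #link1")    # an element with id link1 inside a <div>
by_attr = soup.select('a[href="http://www.baidu.com"]')

# select() always returns a list; select_one() returns the first match or None.
note_text = soup.select_one("#note").get_text()

# Iterate over the result list and pull the text out of each element.
texts = [tag.get_text() for tag in soup.select("div > *")]

print(len(same_element), len(inside_a), len(inside_div), len(by_attr))
print(note_text, texts)
```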

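III. Zhihu login example

This example logs in to Zhihu and saves the home page as an HTML file.
There are two kinds of captcha. The first is a string of letters and digits: the image is saved locally and you type in what you see, for example jk12 (see the commented-out code). The second asks you to click the characters that are upside down: the image is also saved locally and you type in the coordinates by hand. The coordinates follow a regular pattern (see the code comments for details); for example, if the first and second characters are upside down, enter: 2,23,46
To test the example, change three things: your account, your password, and your home page URL.
Alternatively, you can comment out lines 55-57 of the code.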

  1 # Python 3
  2 
  3 from bs4 import BeautifulSoup
  4 import requests
  5 import time
  6 
  7 def captcha(captcha_data):
  8     with open("captcha.jpg","wb") as f:
  9         f.write(captcha_data)
 10     '''
 11     text=input("Enter the captcha: ")
 12     return text
 13     '''
 14     text=input("Enter the number of inverted characters and their coordinates: ")
 15     # first coordinate is [23,23], second is [46,23], and so on
 16     arr=text.split(",")
 17     if "1"==arr[0]:
 18         result='{"img_size":[200,44],"input_points":[[%s,23]]}' % int(arr[1])
 19     else:
 20         result='{"img_size":[200,44],"input_points":[[%s,23],[%s,23]]}' % (int(arr[1]),int(arr[2]))
 21 
 22     return result
 23 
 24 def zhihuLogin():
 25     # build a Session object, which keeps the cookies between requests
 26     session=requests.Session()
 27 
 28     headers={"User-Agent":"Mozilla/5.0"}  # this line is missing from the original listing; the value is an assumption
 29     # GET the login page to find the _xsrf value; the session records the cookies
 30     html=session.get("https://www.zhihu.com/#signin",headers=headers).text
 31 
 32     # parse with the lxml parser
 33     bs=BeautifulSoup(html,"lxml")
 34     # _xsrf protects against CSRF (cross-site request forgery)
 35     # a CSRF attack typically rides on the Cookie to pose as a user the site trusts, stealing user data or deceiving the web server
 36     # so the site stores this MD5 string in a hidden field and uses it to check the user's Cookie against the server-side session
 37     _xsrf=bs.find("input",attrs={"name":"_xsrf"}).get("value")
 38 
 39     #captcha_url="http://www.zhihu.com/captcha.gif?r=%d&type=login"%(time.time()*1000)
 40     captcha_url="https://www.zhihu.com/captcha.gif?r=%d&type=login&lang=cn"%(time.time()*1000)
 41     captcha_data=session.get(captcha_url,headers=headers).content
 42     text=captcha(captcha_data)
 43 
 44     data={
 45         "_xsrf":_xsrf,
 46         "phone_num":"**your account**",
 47         "password":"**your password**",
 48         "captcha_type":"cn",
 49         "captcha":text
 50     }
 51 
 52     response=session.post("https://www.zhihu.com/login/phone_num",data=data,headers=headers)
 53     print(response.text)
 54 
 55     response=session.get("**your home page URL after login**",headers=headers)
 56     with open("my.html","w",encoding="utf-8") as f:
 57         f.write(response.text)
 58 
 59 if __name__=="__main__":
 60     zhihuLogin()
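The two steps of the script that do not need the network can be tried offline. The sketch below is a hedged illustration, not the script's exact code: SAMPLE_LOGIN_PAGE is an invented form, `extract_xsrf` and `build_captcha_payload` are hypothetical helper names, "html.parser" replaces lxml to avoid the extra dependency, and the payload builder generalizes the script's one-or-two-point branches to any number of points by using json.dumps.

```python
import json

from bs4 import BeautifulSoup

# Invented stand-in for a login page with a hidden anti-CSRF field.
SAMPLE_LOGIN_PAGE = """
<form action="/login/phone_num" method="post">
  <input type="hidden" name="_xsrf" value="0123456789abcdef">
  <input type="text" name="phone_num">
</form>
"""


def extract_xsrf(html):
    """Return the value of the hidden _xsrf input, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    field = soup.find("input", attrs={"name": "_xsrf"})
    return field.get("value") if field is not None else None


def build_captcha_payload(text, y=23, img_size=(200, 44)):
    """Turn input such as "2,23,46" into the JSON string the script builds by hand."""
    parts = [int(p) for p in text.split(",")]
    count, xs = parts[0], parts[1:]
    if count != len(xs):
        raise ValueError("expected %d x coordinates, got %d" % (count, len(xs)))
    payload = {"img_size": list(img_size), "input_points": [[x, y] for x in xs]}
    # Compact separators reproduce the spacing of the hand-written string.
    return json.dumps(payload, separators=(",", ":"))


print(extract_xsrf(SAMPLE_LOGIN_PAGE))
print(build_captcha_payload("2,23,46"))
```

Returning None when the field is missing avoids the AttributeError that a chained .find(...).get(...) raises if the page layout changes.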
