BeautifulSoup
BeautifulSoup is mainly used to extract data from web pages. Beautiful Soup automatically converts the input document to Unicode and converts the output document to UTF-8.
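As a rough illustration of the encoding behaviour described above, the sketch below parses a tiny GBK-encoded byte string and serialises it back out. The sample markup is hypothetical, and the encoding is passed explicitly through from_encoding so the example does not depend on auto-detection.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a short fragment supplied as GBK-encoded bytes.
raw = "<p>浏览器</p>".encode("gbk")

# from_encoding tells Beautiful Soup how to decode the bytes;
# without it the library tries to detect the encoding on its own.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

text = soup.p.string              # inside the tree, text is a Unicode str
as_bytes = soup.encode("utf-8")   # serialise the tree to UTF-8 bytes
as_str = soup.prettify()          # prettify() with no argument returns a str

print(type(text).__name__, text)
print(as_bytes)
```

If the encoding is not known in advance, soup.original_encoding reports what Beautiful Soup settled on after parsing.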
html = """
<html><head><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><title>QQ浏览器</title><link href="/favicon.ico" rel="shortcut icon"><link rel="dns-prefetch" href="//stdl.qq.com"><link rel="dns-prefetch" href="//skeyword.browser.qq.com"><link rel="dns-prefetch"href="//searchsuggest.browser.qq.com"><link rel="dns-prefetch" href="//wis.qq.com"> <link rel="stylesheet"href="//stdl.qq.com/stdl/qb/assets/navigate/gameandlive/gameandlive.c1cec064.css"> <link rel="stylesheet"href="//stdl.qq.com/stdl/qb/search/1.0.4/search-box.css"> <link rel="stylesheet"href="//stdl.qq.com/stdl/qb/assets/ad/0.1.5/qbad_sdk.css?v=2"> </head>
<body qbl="1301">
<div id="qb-bg"></div>
<div class="header-wrapper">
<div class="header">
<div class="doodle">
<div class="ad-banner" data-qbad="doodle"></div>
<div class="doodle-default"></div>
</div>
<div class="search-area">
<div class="search"></div>
<div class="hotword">
<a rel="noopener noreferrer" target="_blank" href="https://www.sogou.com/tx?ie=utf-8&pid=sogou-site-1f2b8183cd1e469a&query=肺炎疫情" data-log="RC1.1" data-xlog="肺炎疫情"> 肺炎疫情 </a>
<a rel="noopener noreferrer" target="_blank" href="https://cloud.tencent.com/act/event/tencentmeeting_free?fromSource=gwzcw.3211865.3211865.3211865&utm_medium=cpc&utm_id=gwzcw.3211865.3211865.3211865" data-log="RC1.2" data-xlog="腾讯会议"> 腾讯会议 </a>
<a rel="noopener noreferrer" target="_blank" href="https://now.qq.com/pcweb/topic.html?topic=%E6%96%B0%E4%BA%BA&_wv=16778245&from=98002&ADTAG=gdh-kz" data-log="RC1.3" data-xlog="高颜值美女"> 高颜值美女 </a>
<a rel="noopener noreferrer" target="_blank" href="......">...</a>
"""
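from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
print(soup.prettify())  # pretty-print the parsed document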
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>QQ浏览器</title>
<link href="/favicon.ico" rel="shortcut icon">
<link rel="dns-prefetch" href="//stdl.qq.com">
<link rel="dns-prefetch" href="//skeyword.browser.qq.com">
<link rel="dns-prefetch"href="//searchsuggest.browser.qq.com">
<link rel="dns-prefetch" href="//wis.qq.com">
<link rel="stylesheet"href="//stdl.qq.com/stdl/qb/assets/navigate/gameandlive/gameandlive.c1cec064.css">
<link rel="stylesheet"href="//stdl.qq.com/stdl/qb/search/1.0.4/search-box.css">
<link rel="stylesheet"href="//stdl.qq.com/stdl/qb/assets/ad/0.1.5/qbad_sdk.css?v=2">
</head>
<body qbl="1301">
<div id="qb-bg"></div>
<div class="header-wrapper">
<div class="header">
<div class="doodle">
<div class="ad-banner" data-qbad="doodle"></div>
<div class="doodle-default"></div>
</div>
<div class="search-area">
<div class="search"></div>
<div class="hotword">
<a rel="noopener noreferrer" target="_blank" href="https://www.sogou.com/tx?ie=utf-8&pid=sogou-site-1f2b8183cd1e469a&query=肺炎疫情" data-log="RC1.1" data-xlog="肺炎疫情"> 肺炎疫情 </a>
<a rel="noopener noreferrer" target="_blank" href="https://cloud.tencent.com/act/event/tencentmeeting_free?fromSource=gwzcw.3211865.3211865.3211865&utm_medium=cpc&utm_id=gwzcw.3211865.3211865.3211865" data-log="RC1.2" data-xlog="腾讯会议"> 腾讯会议 </a>
<a rel="noopener noreferrer" target="_blank" href="https://now.qq.com/pcweb/topic.html?topic=%E6%96%B0%E4%BA%BA&_wv=16778245&from=98002&ADTAG=gdh-kz" data-log="RC1.3" data-xlog="高颜值美女"> 高颜值美女 </a>
<a rel="noopener noreferrer" target="_blank" href="......">...</a>
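(1) Tag
A Tag has two important attributes: name and attrs.
The soup object itself is a special case: its name is [document]. For any other tag, name is the tag's own name.
attrs returns all of the tag's attributes as a dictionary.
Writing soup followed by a tag name returns the first matching tag in the whole document.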
soup.title
<title>QQ浏览器</title>
soup.head
<head><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><title>QQ浏览器</title><link href="/favicon.ico" rel="shortcut icon"><link rel="dns-prefetch" href="//stdl.qq.com"><link rel="dns-prefetch" href="//skeyword.browser.qq.com"><link rel="dns-prefetch"href="//searchsuggest.browser.qq.com"><link rel="dns-prefetch" href="//wis.qq.com"> <link rel="stylesheet"href="//stdl.qq.com/stdl/qb/assets/navigate/gameandlive/gameandlive.c1cec064.css"> <link rel="stylesheet"href="//stdl.qq.com/stdl/qb/search/1.0.4/search-box.css"> <link rel="stylesheet"href="//stdl.qq.com/stdl/qb/assets/ad/0.1.5/qbad_sdk.css?v=2"></head>
soup.a
<a rel="noopener noreferrer" target="_blank" href="https://www.sogou.com/tx?ie=utf-8&pid=sogou-site-1f2b8183cd1e469a&query=肺炎疫情" data-log="RC1.1" data-xlog="肺炎疫情"> 肺炎疫情 </a>
soup.a['href']
'https://www.sogou.com/tx?ie=utf-8&pid=sogou-site-1f2b8183cd1e469a&query=肺炎疫情'
soup.a.attrs
{'rel': ['noopener', 'noreferrer'], 'target': '_blank', 'href': 'https://www.sogou.com/tx?ie=utf-8&pid=sogou-site-1f2b8183cd1e469a&query=肺炎疫情', 'data-log': 'RC1.1', 'data-xlog': '肺炎疫情'}
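The name and attrs behaviour can be checked with a short script. This is a minimal sketch using a cut-down, hypothetical version of the hotword markup (the href is shortened for readability).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the hotword links in the sample page.
html = (
    '<div class="hotword">'
    '<a rel="noopener noreferrer" target="_blank" '
    'href="https://www.sogou.com/" data-log="RC1.1"> 肺炎疫情 </a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
tag = soup.a                      # first <a> in the document

print(soup.name)                  # the soup object reports [document]
print(tag.name)                   # an ordinary tag reports its own name
print(tag["href"])                # single-valued attributes come back as str
print(tag["rel"])                 # rel is multi-valued, so it comes back as a list
print(tag.get("id"))              # .get() returns None for a missing attribute
print(tag.attrs)                  # every attribute, as a dict
```

Indexing with tag["id"] would raise KeyError here, so tag.get() is the safer choice when an attribute may be absent.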
(2)NavigableString
soup.a.string
肺炎疫情
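A small sketch of how .string behaves, assuming a one-line hypothetical fragment. It shows that the value is a NavigableString (a str subclass), that surrounding whitespace is preserved, and that .string is None when a tag has more than one child.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<a href="#"> 肺炎疫情 </a>', "html.parser")
s = soup.a.string

print(isinstance(s, NavigableString), isinstance(s, str))
print(repr(s))                        # whitespace inside the tag is kept
print(s.strip())                      # strip it like any other str
print(soup.a.get_text(strip=True))    # or let get_text() do the stripping

# With several children, .string cannot pick one and returns None.
mixed = BeautifulSoup("<p>a<b>b</b></p>", "html.parser")
print(mixed.p.string)
print(mixed.p.get_text())             # get_text() concatenates all text nodes
```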
(3) Search methods
soup.find(name, attrs, recursive, text, **kwargs) returns only the first matching object, or None if nothing matches
soup.find_all(name, attrs, recursive, text, **kwargs) returns a list of all matching results
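The difference between find and find_all is easiest to see side by side. The sketch below assumes a hypothetical three-link fragment with example.com URLs standing in for the real hotword links.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three links inside a hotword container.
html = """
<div class="hotword">
<a href="https://example.com/1" data-log="RC1.1"> one </a>
<a href="https://example.com/2" data-log="RC1.2"> two </a>
<a href="https://example.com/3" data-log="RC1.3"> three </a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                                   # first match only
all_links = soup.find_all("a")                           # every match, as a list
by_attr = soup.find_all("a", attrs={"data-log": "RC1.2"})  # filter on an attribute
container = soup.find("div", class_="hotword")           # class_ avoids the Python keyword
missing = soup.find("span")                              # no match gives None
top_level = soup.find_all("a", recursive=False)          # direct children of soup only

print(first["href"])
print([a["href"] for a in all_links])
print(by_attr)
print(missing, top_level)
```

Because find returns None on a miss, chaining such as soup.find("span").string raises AttributeError; checking the result first avoids that.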
soup.select()
In CSS selectors, tag names are written bare, class names are prefixed with a dot, and id names are prefixed with #. select() always returns a list of matching tags.
(1) Tag name
soup.select('title')
[<title>QQ浏览器</title>]
(2) Class name
soup.select('.hotword')
[<div class="hotword">...</div>]
(3) id
soup.select('#qb-bg')
[<div id="qb-bg"></div>]
(4) Attribute
soup.select('a[href="https://www.sogou.com/tx?ie=utf-8&pid=sogou-site-1f2b8183cd1e469a&query=肺炎疫情"]')
[<a rel="noopener noreferrer" target="_blank" href="https://www.sogou.com/tx?ie=utf-8&pid=sogou-site-1f2b8183cd1e469a&query=肺炎疫情" data-log="RC1.1" data-xlog="肺炎疫情"> 肺炎疫情 </a>]
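The four selector forms above can be tried together in one script. This is a sketch over a hypothetical, trimmed-down page; the example.com URL is a placeholder, and select_one is shown as a convenience when only the first match is wanted.

```python
from bs4 import BeautifulSoup

# Hypothetical page containing one element for each selector form.
html = (
    "<html><head><title>QQ浏览器</title></head><body>"
    '<div id="qb-bg"></div>'
    '<div class="hotword"><a href="https://example.com/1">one</a></div>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

titles = soup.select("title")                            # by tag name
by_class = soup.select(".hotword")                       # by class
by_id = soup.select("#qb-bg")                            # by id
by_attr = soup.select('a[href="https://example.com/1"]')  # by attribute value
first_title = soup.select_one("title")                   # first match or None

print(titles)
print(by_class)
print(by_id)
print(by_attr)
print(first_title.get_text())
```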
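(5) Combination
To combine tag names, class names, and id names across nested elements, separate the parts with a space; the space means "descendant of".
Note: an attribute selector belongs to the same node as its tag, so there must be no space between them (a[href="..."], not a [href="..."]); otherwise nothing will match.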
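To make the spacing rule concrete, here is a minimal sketch over a hypothetical nested fragment. It contrasts a descendant selector (with a space) against an attribute selector attached to its tag (no space), and shows what happens when the space is added by mistake.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: links nested inside two levels of div.
html = """
<div id="qb-bg"></div>
<div class="search-area">
<div class="hotword">
<a target="_blank" href="https://example.com/1" data-log="RC1.1"> one </a>
<a target="_blank" href="https://example.com/2" data-log="RC1.2"> two </a>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Space = descendant: every <a> somewhere inside div.hotword.
nested = soup.select("div.hotword a")

# No space: the attribute condition applies to the <a> itself.
same_node = soup.select('a[data-log="RC1.2"]')

# With a space this asks for a descendant of <a> carrying the attribute,
# and the links have no element children, so the result is empty.
wrong = soup.select('a [data-log="RC1.2"]')

# #qb-bg is a sibling of .search-area, not a descendant, so this is empty too.
not_nested = soup.select(".search-area #qb-bg")

print(len(nested), same_node, wrong, not_nested)
```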