Web Scraping Basics: BeautifulSoup

BeautifulSoup

BeautifulSoup is mainly used to extract data from web pages. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
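The encoding round trip can be sketched as follows. This is a minimal illustration, assuming the page arrives as raw bytes; the `raw` document is a hypothetical cut-down version of the sample page, and the `html.parser` backend is one choice among several.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a tiny page supplied as raw UTF-8 bytes,
# with the same charset declaration as the sample page.
raw = (
    '<html><head><meta charset="utf-8">'
    "<title>QQ浏览器</title></head></html>"
).encode("utf-8")

soup = BeautifulSoup(raw, "html.parser")

title_text = soup.title.string   # decoded text (a str subclass)
encoded = soup.encode()          # serialized document as bytes, UTF-8 by default

print(type(title_text).__name__, title_text)
print(type(encoded).__name__)
```

Inside the tree everything is text; bytes only appear again when the document is serialized with `encode()`.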

html = """
<html><head><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><title>QQ浏览器</title><link href="/favicon.ico" rel="shortcut icon"><link rel="dns-prefetch" href="//stdl.qq.com"><link rel="dns-prefetch" href="//skeyword.browser.qq.com"><link rel="dns-prefetch"href="//searchsuggest.browser.qq.com"><link rel="dns-prefetch" href="//wis.qq.com"> <link rel="stylesheet"href="//stdl.qq.com/stdl/qb/assets/navigate/gameandlive/gameandlive.c1cec064.css"> <link rel="stylesheet"href="//stdl.qq.com/stdl/qb/search/1.0.4/search-box.css"> <link rel="stylesheet"href="//stdl.qq.com/stdl/qb/assets/ad/0.1.5/qbad_sdk.css?v=2">  </head>
<body qbl="1301">
<div id="qb-bg"></div>
<div class="header-wrapper">
<div class="header">
<div class="doodle">
<div class="ad-banner" data-qbad="doodle"></div>
<div class="doodle-default"></div>
</div>
<div class="search-area">
<div class="search"></div>
<div class="hotword">  
<a rel="noopener noreferrer" target="_blank"  href="https://www.sogou.com/tx?ie=utf-8&amp;pid=sogou-site-1f2b8183cd1e469a&amp;query=肺炎疫情" data-log="RC1.1" data-xlog="肺炎疫情"> 肺炎疫情 </a>  
<a rel="noopener noreferrer" target="_blank"  href="https://cloud.tencent.com/act/event/tencentmeeting_free?fromSource=gwzcw.3211865.3211865.3211865&amp;utm_medium=cpc&amp;utm_id=gwzcw.3211865.3211865.3211865" data-log="RC1.2" data-xlog="腾讯会议"> 腾讯会议 </a>  
<a rel="noopener noreferrer" target="_blank"  href="https://now.qq.com/pcweb/topic.html?topic=%E6%96%B0%E4%BA%BA&amp;_wv=16778245&amp;from=98002&amp;ADTAG=gdh-kz" data-log="RC1.3" data-xlog="高颜值美女"> 高颜值美女 </a>  
<a rel="noopener noreferrer" target="_blank"  href="......">...</a>							 	
"""

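from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
print(soup.prettify())  # pretty-print the parsed document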

<html>
<head>
	<meta charset="utf-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<title>QQ浏览器</title>
	<link href="/favicon.ico" rel="shortcut icon">
	<link rel="dns-prefetch" href="//stdl.qq.com">
	<link rel="dns-prefetch" href="//skeyword.browser.qq.com">
	<link rel="dns-prefetch"href="//searchsuggest.browser.qq.com">
	<link rel="dns-prefetch" href="//wis.qq.com"> 
	<link rel="stylesheet"href="//stdl.qq.com/stdl/qb/assets/navigate/gameandlive/gameandlive.c1cec064.css"> 
	<link rel="stylesheet"href="//stdl.qq.com/stdl/qb/search/1.0.4/search-box.css"> 
	<link rel="stylesheet"href="//stdl.qq.com/stdl/qb/assets/ad/0.1.5/qbad_sdk.css?v=2">  
</head>
<body qbl="1301">
	<div id="qb-bg"></div>
	<div class="header-wrapper">
		<div class="header">
			<div class="doodle">
				<div class="ad-banner" data-qbad="doodle"></div>
				<div class="doodle-default"></div>
			</div>
		<div class="search-area">
		<div class="search"></div>
			<div class="hotword">  
				<a rel="noopener noreferrer" target="_blank"  href="https://www.sogou.com/tx?ie=utf-8&amp;pid=sogou-site-1f2b8183cd1e469a&amp;query=肺炎疫情" data-log="RC1.1" data-xlog="肺炎疫情"> 肺炎疫情 </a>  
				<a rel="noopener noreferrer" target="_blank"  href="https://cloud.tencent.com/act/event/tencentmeeting_free?fromSource=gwzcw.3211865.3211865.3211865&amp;utm_medium=cpc&amp;utm_id=gwzcw.3211865.3211865.3211865" data-log="RC1.2" data-xlog="腾讯会议"> 腾讯会议 </a>  
				<a rel="noopener noreferrer" target="_blank"  href="https://now.qq.com/pcweb/topic.html?topic=%E6%96%B0%E4%BA%BA&amp;_wv=16778245&amp;from=98002&amp;ADTAG=gdh-kz" data-log="RC1.3" data-xlog="高颜值美女"> 高颜值美女 </a>  
				<a rel="noopener noreferrer" target="_blank"  href="......">...</a>

(1) Tag

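A Tag has two important attributes: name and attrs.
The soup object itself is a special case: its name is [document]. For every other tag, name is simply the tag's own name.
attrs holds all of a tag's attributes as a dictionary.

Writing soup followed by a tag name (for example soup.title) returns the first matching tag in the whole document.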

soup.title

<title>QQ浏览器</title>

soup.head

<head><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><title>QQ浏览器</title><link href="/favicon.ico" rel="shortcut icon"><link rel="dns-prefetch" href="//stdl.qq.com"><link rel="dns-prefetch" href="//skeyword.browser.qq.com"><link rel="dns-prefetch"href="//searchsuggest.browser.qq.com"><link rel="dns-prefetch" href="//wis.qq.com"> <link rel="stylesheet"href="//stdl.qq.com/stdl/qb/assets/navigate/gameandlive/gameandlive.c1cec064.css"> <link rel="stylesheet"href="//stdl.qq.com/stdl/qb/search/1.0.4/search-box.css"> <link rel="stylesheet"href="//stdl.qq.com/stdl/qb/assets/ad/0.1.5/qbad_sdk.css?v=2"></head>

soup.a

<a rel="noopener noreferrer" target="_blank"  href="https://www.sogou.com/tx?ie=utf-8&amp;pid=sogou-site-1f2b8183cd1e469a&amp;query=肺炎疫情" data-log="RC1.1" data-xlog="肺炎疫情"> 肺炎疫情 </a>

soup.a['href']

'https://www.sogou.com/tx?ie=utf-8&pid=sogou-site-1f2b8183cd1e469a&query=肺炎疫情'

soup.a.attrs

{'rel': ['noopener', 'noreferrer'], 'target': '_blank', 'href': 'https://www.sogou.com/tx?ie=utf-8&pid=sogou-site-1f2b8183cd1e469a&query=肺炎疫情', 'data-log': 'RC1.1', 'data-xlog': '肺炎疫情'}
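The Tag behaviour described in this section can be tried end to end with a small script. It is only a sketch: `doc` is a trimmed, hypothetical version of the sample page (the URLs are shortened), and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Trimmed-down, hypothetical version of the sample page.
doc = """
<html><head><title>QQ浏览器</title></head>
<body>
<div class="hotword">
<a rel="noopener noreferrer" target="_blank" href="https://www.sogou.com/tx?ie=utf-8&amp;query=肺炎疫情" data-log="RC1.1"> 肺炎疫情 </a>
<a rel="noopener noreferrer" target="_blank" href="https://cloud.tencent.com/act/event/tencentmeeting_free" data-log="RC1.2"> 腾讯会议 </a>
</div>
</body></html>
"""

soup = BeautifulSoup(doc, "html.parser")

print(soup.name)          # the soup object reports the special name [document]
print(soup.title.name)    # an ordinary tag reports its own name

first_link = soup.a       # attribute access returns the first <a> in document order
print(first_link["data-log"])
print(first_link["href"])        # the &amp; entity is decoded to a plain &
print(first_link.attrs["rel"])   # rel is multi-valued, so it comes back as a list
print(first_link.get("id"))      # .get() returns None instead of raising KeyError
```

Indexing with `tag["name"]` raises `KeyError` when the attribute is absent, so `tag.get("name")` is the safer choice for optional attributes.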

(2) NavigableString

soup.a.string

肺炎疫情
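A short sketch of how `.string` behaves, assuming `html.parser`. The markup here is a hypothetical fragment; the `<p>` element is added only to show the case where a tag has more than one child.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical fragment: one link from the sample plus a tag with mixed children.
soup = BeautifulSoup(
    '<a href="#"> 肺炎疫情 </a><p>hot <b>word</b></p>', "html.parser"
)

text = soup.a.string
print(isinstance(text, NavigableString))  # the value is a NavigableString
print(repr(text))                         # surrounding whitespace is preserved
print(text.strip())                       # str methods work because it subclasses str

mixed = soup.p.string        # None: <p> has more than one child
joined = soup.p.get_text()   # concatenates every text node under <p>
print(mixed, joined)
```

When a tag may contain nested markup, `get_text()` is usually the more robust way to read its text.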

(3) Search methods

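soup.find(name, attrs, recursive, string, **kwargs) returns only the first matching object, or None if nothing matches. (Older versions call the string argument text.)

soup.find_all(name, attrs, recursive, string, **kwargs) returns a list of all matching results.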
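The difference between the two methods is easiest to see side by side. The snippet below is a sketch on a hypothetical fragment modelled on the sample page (the link URLs are shortened for readability), with `html.parser` assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the sample page; URLs are shortened.
doc = """
<div id="qb-bg"></div>
<div class="hotword">
<a href="https://www.sogou.com/" data-log="RC1.1"> 肺炎疫情 </a>
<a href="https://cloud.tencent.com/" data-log="RC1.2"> 腾讯会议 </a>
<a href="https://now.qq.com/" data-log="RC1.3"> 高颜值美女 </a>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

first = soup.find("a")                 # first match only
all_links = soup.find_all("a")         # every match, as a list
by_attr = soup.find_all("a", attrs={"data-log": "RC1.2"})  # filter on an attribute
by_class = soup.find("div", class_="hotword")  # class_ avoids the Python keyword
by_id = soup.find(id="qb-bg")          # keyword arguments filter on attributes
top_level = soup.find_all("a", recursive=False)  # direct children of soup only
limited = soup.find_all("a", limit=2)  # stop after two matches
missing = soup.find("table")           # None when nothing matches

print(first["data-log"], len(all_links), len(limited), missing)
```

Because `find` can return `None`, it is worth checking the result before chaining further attribute access onto it.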

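soup.select()
Takes a CSS selector and returns a list of every matching tag (the results below are shown without the list brackets). A tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #.
(1) Tag name
soup.select('title')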

<title>QQ浏览器</title>

(2) Class name
soup.select('.hotword')

<div class="hotword">...</div>

(3) id
soup.select('#qb-bg')

<div id="qb-bg">...</div>

(4) Attribute
soup.select('a[href="https://www.sogou.com/tx?ie=utf-8&pid=sogou-site-1f2b8183cd1e469a&query=肺炎疫情"]')

<a rel="noopener noreferrer" target="_blank"  href="https://www.sogou.com/tx?ie=utf-8&amp;pid=sogou-site-1f2b8183cd1e469a&amp;query=肺炎疫情" data-log="RC1.1" data-xlog="肺炎疫情"> 肺炎疫情 </a>
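The four basic selector forms can be checked together in one script. This is a sketch on a hypothetical, shortened copy of the sample page; it assumes a Beautiful Soup 4 release recent enough to ship CSS selector support, and uses `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened copy of the sample page.
doc = """
<html><head><title>QQ浏览器</title></head>
<body>
<div id="qb-bg"></div>
<div class="hotword">
<a href="https://www.sogou.com/tx?ie=utf-8&amp;query=肺炎疫情" data-log="RC1.1"> 肺炎疫情 </a>
</div>
</body></html>
"""
soup = BeautifulSoup(doc, "html.parser")

titles = soup.select("title")      # by tag name
hot = soup.select(".hotword")      # by class
bg = soup.select("#qb-bg")         # by id
# By attribute: the selector uses the decoded value (& rather than &amp;).
link = soup.select('a[href="https://www.sogou.com/tx?ie=utf-8&query=肺炎疫情"]')

first = soup.select_one("a")       # single element instead of a list
nothing = soup.select("table")     # empty list when nothing matches

print(len(titles), len(hot), len(bg), len(link), nothing)
```

`select` always returns a list, even for an id selector, so use `select_one` when exactly one element is expected.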

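(5) Combination
To combine a tag name with a class name or id when searching nested elements, separate them with a space.
Note: an attribute selector and its tag refer to the same node, so there must be no space between them; otherwise nothing will match.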
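The effect of the space is easy to verify. In the sketch below, `doc` is a hypothetical fragment based on the sample page and `html.parser` is assumed; a space means "descendant of", while no space means "the same node".

```python
from bs4 import BeautifulSoup

# Hypothetical fragment based on the sample page.
doc = """
<div class="search-area">
<div class="hotword">
<a href="#" data-log="RC1.1"> 肺炎疫情 </a>
<a href="#" data-log="RC1.2"> 腾讯会议 </a>
</div>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

# Space-separated parts: each part must be a descendant of the previous one.
nested = soup.select("div .hotword a")

# No space: the tag name and the attribute test apply to the same node.
same_node = soup.select('a[data-log="RC1.2"]')

# With a space, this looks for an element carrying data-log INSIDE an <a>,
# which does not exist in this fragment.
with_space = soup.select('a [data-log="RC1.2"]')

print(len(nested), len(same_node), len(with_space))
```

The same rule applies to classes: `div.hotword` matches a div that has the class itself, whereas `div .hotword` matches any element with that class nested inside a div.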
