Web Scraping for Beginners: Study Notes on the BeautifulSoup Library

weixin_42150990

Category: beautifulsoup. Tags: python, beautifulsoup

Article link: https://blog.csdn.net/weixin_42150990/article/details/86437114

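I have been writing scrapers lately and found that my grasp of BeautifulSoup was not solid enough, so here are my summary notes:

Start with the official BeautifulSoup documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document.

So BeautifulSoup (btfs for short below) can be understood as a library for extracting data from HTML files; btfs looks up document content by tag.

To understand BeautifulSoup, you first need to understand HTML.

HTML is a hypertext markup language that uses tags to describe a web page. Most tags come in pairs, an opening tag and a closing tag, such as the <html>, <body>, and <p> tags in the example below: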

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""

An HTML document holds a lot of useful information that we want to extract, and here we use BeautifulSoup to parse it.

The basic syntax for parsing is:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Printing type(soup) here gives: <class 'bs4.BeautifulSoup'>
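As a quick check of the parsing step, here is a minimal self-contained sketch. It uses a shortened copy of the document above (the shortening is my own, only to keep the example small) and confirms that the parsed object is a BeautifulSoup instance from which the title text can be read.

```python
from bs4 import BeautifulSoup

# Shortened version of the example document, for illustration only
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body>
</html>
"""

# Parse with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, 'html.parser')

soup_type = type(soup)           # the class of the parsed document object
title_text = soup.title.string   # the text inside the <title> tag

print(soup_type)
print(title_text)
```

'html.parser' ships with Python, so no extra parser has to be installed for this to run.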

Common operations for extracting elements from soup:

Finding elements by tag and attribute:

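## Get the first tag named p:
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
## Get the value of that p tag's class attribute:
soup.p['class']
## Get the class values of all p tags (assuming the HTML has several p tags with several class values)
# Find the first p tag:
soup.find('p')
# Find all p tags:
paragraphs = soup.find_all('p')
# Print the class value of each p tag:
for paragraph in paragraphs:
    print(paragraph.get('class'))
## Find p tags whose class is title, whose id attribute is 001, and whose string is alice; the id, class_, and string filters can each be omitted:
tags = soup.find_all('p', class_='title', id='001', string='alice')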
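The lookups above can be tried on a small document. The following is a sketch, not the author's code: the HTML snippet is made up so that one p tag actually has class title, id 001, and the text alice, which the original example document does not contain.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, written so that every filter below has something to match
html_doc = """
<p class="title" id="001">alice</p>
<p class="story">Once upon a time</p>
<p class="story">...</p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# find returns the first match, find_all returns a list of every match
first_p = soup.find('p')
all_p = soup.find_all('p')

# class is a multi-valued attribute, so each value comes back as a list
classes = [p.get('class') for p in all_p]

# Filter by class only (class_ has a trailing underscore because class is a Python keyword)
story_p = soup.find_all('p', class_='story')

# Combine tag name, class, id, and string filters
exact = soup.find_all('p', class_='title', id='001', string='alice')

# Attributes are read with dictionary-style access
link_href = soup.find('a', id='link1')['href']

print(classes)
print(exact)
print(link_href)
```

Note that the id filter is written id, without an underscore; only class_ needs one.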

Finding all direct children of a tag, returned as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>

title_tag.contents
# ["The Dormouse's story"]
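A short sketch of the same idea in runnable form. It assumes a compact one-line document of my own so that no whitespace text nodes appear between tags, and it also shows .children, which iterates over the same direct children as .contents without building a list.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Compact document without line breaks, so tags have no whitespace text nodes between them
html_doc = ("<html><head><title>The Dormouse's story</title></head>"
            "<body><p>one</p><p>two</p></body></html>")
soup = BeautifulSoup(html_doc, 'html.parser')

# .contents is a list of a tag's direct children
head_children = soup.head.contents

# A tag that holds only text has a single child: the text node
title_text = str(soup.title.contents[0])

# .children is an iterator over the same direct children;
# keep only Tag objects to skip any text nodes
body_tag_names = [child.name for child in soup.body.children
                  if isinstance(child, Tag)]

print(head_children)
print(title_text)
print(body_tag_names)
```

In a document with line breaks between tags, .contents also includes the newline strings, which is why the isinstance check is useful in practice.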

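Next, the related code is walked through using a real problem met while scraping, with JD.com (京东) product listings as the example.

1. First, fetch the JD.com product page: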

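import random
import time

import requests
from fake_useragent import UserAgent  # assumed source of UserAgent

# Reuse one session that retries failed connections up to 5 times
session = requests.Session()
session.mount('https://', requests.adapters.HTTPAdapter(max_retries=5))

def get_data(url):
    ua = UserAgent()
    headers = {
        'cookie': 'your cookie',   # replace with your own cookie string
        'user-agent': ua.random,   # random User-Agent for each request
        # ... other parameters
    }
    res = session.get(url, headers=headers)

    # Sleep for 0 or 1 second so requests are not sent back to back
    time.sleep(random.randint(0, 1))

    if res.status_code == 200:
        res.encoding = 'gbk'
        return res.text
    if res.status_code == 502:
        # Bad gateway: request the same URL again
        return get_data(url)
    # Any other status: no usable page
    return None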

url = 'https://search.jd.com/Search?keyword=%E5%8F%A3%E7%BA%A2&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E5%8F%A3%E7%BA%A2&page=1' 
data = get_data(url)
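Retrying on a 502 response by recursion has no upper bound, so a server that keeps answering 502 would keep the function calling itself. One possible alternative is a bounded retry loop. The sketch below is an illustration under stated assumptions, not part of the original code: fetch_with_retry, max_attempts, delay, and fake_fetch are hypothetical names, and the HTTP call is replaced by a stand-in function so that the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, delay=0.0):
    """Call fetch(url) until it reports status 200 or the attempts run out.

    fetch is any callable that returns a (status_code, text) pair.
    Returns the page text, or None if no attempt succeeded.
    """
    for _ in range(max_attempts):
        status_code, text = fetch(url)
        if status_code == 200:
            return text
        if status_code != 502:
            # Only 502 is treated as worth retrying, as in the text above
            break
        time.sleep(delay)
    return None


# Stand-in for a real HTTP call: two 502 responses, then a success
_responses = iter([(502, ''), (502, ''), (200, '<html>ok</html>')])


def fake_fetch(url):
    return next(_responses)


page = fetch_with_retry(fake_fetch, 'https://example.com/search')
print(page)
```

With requests, the fetch argument could be a small wrapper that returns (res.status_code, res.text) for a given URL.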

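Here data is the response body returned for url, that is, the HTML document for that URL, so it can be parsed with BeautifulSoup. First convert the HTML document into a soup object: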

soup = BeautifulSoup(data,'html.parser')

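2. Scrape the price, review count, name, and other fields from the JD.com product page.

All of the product information turns out to sit inside li tags whose class is gl-item, so collect every tag that meets this condition, as follows: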

results = soup.find_all('li', class_='gl-item')

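Then extract the price and the sku-id from each li tag. The price is inside a div tag whose class is p-price and is found with the method described above; the sku-id is stored in the li tag's data-sku attribute, so it is read with dictionary-style access: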

for result in results:
    # .text returns the text content; .strip() removes whitespace at both ends of the string
    p_price = result.find('div', class_='p-price').text.strip()
    p_sku_id = result['data-sku'].strip()
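The extraction step can be tried offline. The sketch below assumes a made-up HTML fragment that only imitates the structure described above (li tags with class gl-item, a data-sku attribute, and a div with class p-price); the real page markup may differ. It also guards against items that have no price div, where calling .text on the result of find would otherwise raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure described in the text
sample_html = """
<ul>
  <li class="gl-item" data-sku="1001">
    <div class="p-price"><strong><i>59.00</i></strong></div>
  </li>
  <li class="gl-item" data-sku="1002">
  </li>
</ul>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

items = []
for result in soup.find_all('li', class_='gl-item'):
    # find returns None when the div is missing, so check before reading .text
    price_tag = result.find('div', class_='p-price')
    price = price_tag.text.strip() if price_tag is not None else None

    # .get avoids a KeyError if the attribute is absent
    sku_id = result.get('data-sku', '').strip()

    items.append({'sku_id': sku_id, 'price': price})

print(items)
```

Collecting each item as a dictionary keeps the price and sku-id together, which makes it easy to write the results to a file or table later.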
