[Beginner] Scraping web page data with requests and Beautiful Soup

Article link: https://blog.csdn.net/jgku/article/details/128225095

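There are already plenty of articles on this topic, but here is a brief record anyway.

By my rough estimate, requests + BS4 covers about 60% of everyday scraping work.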

The golden pair: requests and BS4 (Beautiful Soup 4)

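requests is the most basic HTTP library and needs little introduction.
Beautiful Soup is a Python library for extracting data from HTML or XML files.
Beautiful Soup supports the following parsers: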

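Python standard library: BeautifulSoup(markup, 'html.parser'). Pros: built into Python, moderate speed. Cons: poor fault tolerance in versions before Python 3.2.2.
lxml HTML parser: BeautifulSoup(markup, 'lxml'). Pros: fast, tolerant of malformed documents. Cons: requires a C library to be installed.
lxml XML parser: BeautifulSoup(markup, 'xml'). Pros: fast, the only parser that supports XML. Cons: requires a C library to be installed.
html5lib: BeautifulSoup(markup, 'html5lib'). Pros: best fault tolerance, parses the document the way a browser does, produces HTML5-format documents, does not depend on an external C extension. Cons: slow.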
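
The difference between parsers is easiest to see on malformed input. The following is a minimal sketch, assuming bs4 and lxml are both installed; the broken snippet `<a></p>` and the helper name `parse_with` are illustrative choices, not part of the original article.

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup: a closing </p> with no opening tag.
BROKEN_HTML = "<a></p>"


def parse_with(parser_name):
    """Parse BROKEN_HTML with the given parser and return the repaired markup."""
    soup = BeautifulSoup(BROKEN_HTML, parser_name)
    return str(soup)


# html.parser ships with Python; lxml needs `pip install lxml`.
for name in ("html.parser", "lxml"):
    print(f"{name:12} -> {parse_with(name)}")
```

lxml wraps the fragment in html and body elements, while html.parser leaves it as a bare fragment, so a selector that starts from html or body can behave differently depending on the parser.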

Conclusion: the lxml parser handles both HTML and XML documents, is fast, and tolerates malformed markup well, so it is the recommended choice.

from bs4 import BeautifulSoup
from lxml import etree
import requests
  
URL = "https://en.wikipedia.org/wiki/Nike,_Inc."
  
HEADERS = {
    # Adjacent string literals are concatenated, so no stray whitespace ends up in the header.
    'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'),
    'Accept-Language': 'en-US, en;q=0.5',
}
  
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "lxml")
dom = etree.HTML(str(soup))
print(dom.xpath('//*[@id="firstHeading"]')[0].text)
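
One thing to watch with the snippet above: an lxml element's `.text` only holds the text that appears before its first child element, so it can be `None` when the heading text is wrapped in a nested tag. The sketch below uses hypothetical markup (the nested `<span>` is an assumption, not a copy of the real page) and also shows that `etree.HTML` can parse the raw bytes directly, without the round trip through BeautifulSoup.

```python
from lxml import etree

# Hypothetical markup: the heading text sits inside a nested <span>.
SAMPLE_HTML = b"""
<html><body>
  <h1 id="firstHeading"><span class="title">Nike, Inc.</span></h1>
</body></html>
"""

# etree.HTML accepts raw bytes (for example webpage.content) directly.
dom = etree.HTML(SAMPLE_HTML)
heading = dom.xpath('//*[@id="firstHeading"]')[0]

# .text only holds text that appears before the first child element.
direct_text = heading.text
# itertext() walks the whole subtree, so nested text is included.
full_text = "".join(heading.itertext()).strip()

print(repr(direct_text))
print(repr(full_text))
```

If `.text` unexpectedly comes back empty on a live page, joining `itertext()` is usually the safer way to read the visible text.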

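CSS selectors in BS4

The syntax is almost identical to Selenium's selectors. Once a document node is selected, a CSS selector can locate any element beneath it, which is how the data gets extracted.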

# Find by tag name
soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

# Find by ID
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# Find by class name
soup.select(".sister")

# Find by attribute
soup.select('a[href="http://example.com/elsie"]')

select() and find() are the two most important methods on a node.
str(node) returns the node's HTML markup, and node.text returns its plain text.
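
To make these calls concrete, here is a small self-contained sketch. The HTML is a shortened version of the "three sisters" document that the selector examples above appear to be based on; the built-in `html.parser` is used only to keep the sketch free of extra dependencies.

```python
from bs4 import BeautifulSoup

HTML_DOC = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(HTML_DOC, "html.parser")

# select() always returns a list of every match (possibly empty).
sisters = soup.select("a.sister")
names = [a.text for a in sisters]

# find() returns only the first match, or None when nothing matches.
first_link = soup.find("a", class_="sister")
missing = soup.find("table")

# select_one() is the CSS-selector counterpart of find().
lacie = soup.select_one("a#link2")

print(names)
print(str(first_link))   # HTML markup of the node
print(first_link.text)   # plain text of the node
print(lacie["href"], missing)
```

Because select() returns a list, indexing it with `[0]` raises IndexError on a page that lacks the element; checking the result of find() or select_one() for None is the gentler option.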

Example 1: getting CSDN user information

def get_blog_user(self, user_id):
    """Fetch a CSDN user's profile page and extract basic statistics."""
    url = f'https://blog.csdn.net/{user_id}'
    # verify=False disables TLS certificate verification; use with care.
    response = self.session.get(url, headers=HEADERS, verify=False)
    soup = BeautifulSoup(response.content, 'lxml')
    dom = etree.HTML(str(soup))
    # Display name (alias)
    alias = soup.select("div.user-profile-head-name>div:first-child")[0].text
    # Number of original articles
    innovates = dom.xpath('//div[@class="user-profile-head-info-r-c"]//div[text()="原创"]/preceding-sibling::div[1]')[0].text
    # Ranking
    ranking = dom.xpath('//div[@class="user-profile-head-info-r-c"]//div[text()="排名"]/preceding-sibling::div[1]')[0].text
    # Number of followers
    fans = dom.xpath('//div[@class="user-profile-head-info-r-c"]//div[text()="粉丝"]/preceding-sibling::div[1]')[0].text
    return {'alias': alias, 'innovates': innovates, 'ranking': ranking, 'fans': fans}

The etree library can extract data with XPath, which makes it a useful complement to CSS selectors.
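
The `preceding-sibling` axis is what CSS selectors cannot express: it finds a value by the label that follows it. Below is a minimal offline sketch. The markup is hypothetical, modeled only on the XPath expressions in the example (the real CSDN page may differ and can change at any time), and `stat_by_label` is an illustrative helper name.

```python
from lxml import etree

# Hypothetical markup: each number is followed by a sibling <div> holding its label.
PROFILE_HTML = """
<div class="user-profile-head-info-r-c">
  <ul>
    <li><div>120</div><div>原创</div></li>
    <li><div>3456</div><div>排名</div></li>
    <li><div>789</div><div>粉丝</div></li>
  </ul>
</div>
"""

dom = etree.HTML(PROFILE_HTML)


def stat_by_label(dom, label):
    """Return the text of the <div> just before the <div> whose text equals label."""
    nodes = dom.xpath(
        '//div[@class="user-profile-head-info-r-c"]'
        '//div[text()=$label]/preceding-sibling::div[1]',
        label=label,  # passed as an XPath variable, so no manual quoting is needed
    )
    # Return None instead of raising IndexError when the label is absent.
    return nodes[0].text if nodes else None


stats = {label: stat_by_label(dom, label) for label in ("原创", "排名", "粉丝")}
print(stats)
```

Wrapping the lookup in one helper removes the three near-identical XPath strings and gives a single place to handle a missing label.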

Example 2: getting Alibaba Cloud blog articles

The article list of the official Alibaba Cloud blog is served at https://developer.aliyun.com/group/alitech/article/?spm=a2c6h.12873581.technical-group.166.4c6a36e4XYs70g&pageNum=2 and the corresponding HTML fragment looks like this:

<div id="article">
    <div class="content-tab-list show">
        <ul class="content-tab-list all-list show">
            <li class="all-list-box">
                <div class="news-message">
                    <a target="_blank" data-spm-click="gostr=/developer.group.timeline;locaid=dArticle985959" href="/article/985959" title="以“升舱”之名，谈谈云原生数据仓库AnalyticDB的核心技术">
                        <p>以“升舱”之名，谈谈云原生数据仓库AnalyticDB的核心技术</p>
                    </a>
                    <p>
                        <span class="tag">文章</span>
                        <a target="_blank" class="user" href="/profile/ksfmygqjnd3mg">开发者小助手_LS</a>
                        <span class="time">2022-07-15</span>
                        <span class="browse">526浏览量</span>
                    </p>
                </div>
                <a target="_blank" href="/article/985959" title="以“升舱”之名，谈谈云原生数据仓库AnalyticDB的核心技术">
                    <img src="https://ucc.alicdn.com/pic/developer-ecology/d4796cd1a8124e6c8f1f7e5949ecbd91.png?x-oss-process=image/resize,h_118" />
                </a>
            </li>
            ...
      </ul>
    </div>
 </div>

Extracting the article information:

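import requests
from bs4 import BeautifulSoup

# Assumed request headers; _headers is not defined anywhere in this snippet.
_headers = {'User-Agent': 'Mozilla/5.0'}

url = "https://developer.aliyun.com/group/alitech/article/?spm=a2c6h.12873581.technical-group.166.4c6a36e4XYs70g&pageNum=2"
session = requests.Session()
# verify=False disables TLS certificate verification; use with care.
response = session.get(url, headers=_headers, verify=False)
response.encoding = 'utf-8'

soup = BeautifulSoup(response.text, 'lxml')
articles = soup.select("#article ul.content-tab-list.all-list.show>li")
for article in articles:
    # The third <a> in each <li> wraps the thumbnail and carries href and title.
    link = article.select('a')[2]
    print(link.attrs['href'])
    print(link.attrs['title'])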
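
Picking the link by position (`[2]`) breaks as soon as the page adds or removes an `<a>`, and the `href` values are relative. The following sketch works on a trimmed copy of the fragment above, selects elements by class and attribute instead, and resolves the links with `urljoin`. The names `parse_articles` and `BASE_URL` are illustrative, and the built-in `html.parser` is used only to keep the sketch dependency-light.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://developer.aliyun.com"

# Trimmed copy of the HTML fragment shown earlier.
LIST_HTML = """
<div id="article">
  <ul class="content-tab-list all-list show">
    <li class="all-list-box">
      <div class="news-message">
        <a href="/article/985959" title="以“升舱”之名，谈谈云原生数据仓库AnalyticDB的核心技术">
          <p>以“升舱”之名，谈谈云原生数据仓库AnalyticDB的核心技术</p>
        </a>
        <p>
          <a class="user" href="/profile/ksfmygqjnd3mg">开发者小助手_LS</a>
          <span class="time">2022-07-15</span>
          <span class="browse">526浏览量</span>
        </p>
      </div>
    </li>
  </ul>
</div>
"""


def parse_articles(html, base_url=BASE_URL):
    """Return one dict per article with title, absolute URL, author and date."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select("#article ul.all-list > li"):
        # The title link is the direct child of div.news-message that has a title attribute.
        link = item.select_one("div.news-message > a[title]")
        if link is None:
            continue  # skip list items that do not look like articles
        author = item.select_one("a.user")
        date = item.select_one("span.time")
        rows.append({
            "title": link["title"],
            "url": urljoin(base_url, link["href"]),
            "author": author.text if author else None,
            "date": date.text if date else None,
        })
    return rows


for row in parse_articles(LIST_HTML):
    print(row)
```

On a live page the same function could be fed `response.text` from the request above, assuming the markup still matches this fragment.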