30 Days of Python: Day 22 -- Web Scraping

舍不得，放不下

Last modified: 2023-03-24 13:09:18

First published: 2023-03-23 16:44:51

Article link: https://blog.csdn.net/qq_62599387/article/details/129734148

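Day 22

Python Web Scraping

What Is Web Scraping

The internet is full of huge amounts of data that can be used for different purposes. To collect this data, we need to know how to scrape data from a website.

Web scraping is the process of extracting and collecting data from websites and storing it on a local machine or in a database.

In this section, we will use the beautifulsoup and requests packages to scrape data. The package version we are using is beautifulsoup 4.

To start scraping websites, you need requests, beautifulsoup4, and a website.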

pip install requests
pip install beautifulsoup4
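
To confirm that both packages installed correctly, a quick check like the following may help. It is only a sketch: it imports each package and prints its version string, and the versions you see will depend on your environment.

```python
import bs4
import requests

# Both packages expose a __version__ string; the values depend on your install
print(requests.__version__)
print(bs4.__version__)
```

If either import fails with ModuleNotFoundError, the corresponding pip install step did not succeed for the interpreter you are running.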

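To scrape data from a website, you need a basic understanding of HTML tags and CSS selectors. We target content on a website using an HTML tag, a class, an id, or a combination of these. Let's import the requests and BeautifulSoup modules: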

import requests
from bs4 import BeautifulSoup
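
Before touching a real website, it can help to see how tags, classes, and ids are used to locate content. The sketch below parses a small HTML string that was made up for this illustration (the id page-title and the class intro are hypothetical names, not taken from any real page), so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration only
html = """
<html>
  <body>
    <h1 id="page-title">Data Sets</h1>
    <p class="intro">First paragraph</p>
    <p class="intro">Second paragraph</p>
    <p>Unclassified paragraph</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Locate by HTML tag: find_all returns every matching element
all_paragraphs = soup.find_all('p')

# Locate by class: the keyword is class_ because class is reserved in Python
intro_paragraphs = soup.find_all('p', class_='intro')

# Locate by id: find returns the first (and normally only) match
title = soup.find(id='page-title')

# The same lookups written as CSS selectors
intro_by_selector = soup.select('p.intro')
title_by_selector = soup.select_one('#page-title')

print(len(all_paragraphs))                        # 3
print([p.get_text() for p in intro_paragraphs])   # ['First paragraph', 'Second paragraph']
print(title.get_text())                           # Data Sets
```

Both styles find the same elements here; CSS selectors tend to be shorter once you need to combine a tag with a class or nest several levels.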

Let's declare a url variable for the website we are going to scrape.


import requests
from bs4 import BeautifulSoup
url = 'https://archive.ics.uci.edu/ml/datasets.php'

# Let's use the requests get method to fetch the data from the url

response = requests.get(url)
# Let's check the status code of the response
status = response.status_code
print(status) # 200 means the fetching was successful
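
Printing the status code works for a quick check, but a request can also fail before any status code comes back (for example, a malformed url or no connection). The sketch below shows one possible way to handle both cases. The function name fetch_html and the 10 second timeout are assumptions chosen for this example, not part of the original lesson.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper written for this example.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status raises requests.HTTPError for 4xx and 5xx responses
        response.raise_for_status()
    except requests.RequestException as error:
        # RequestException is the base class for the errors requests raises
        print(f'Request failed: {error}')
        return None
    return response.text

# A url without a scheme is rejected before any network traffic happens
print(fetch_html('not-a-url'))  # prints the error message, then None
```

Calling fetch_html with the url variable declared above would return the page HTML as a string when the server answers with a success status.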

Using BeautifulSoup to parse the page content

import requests
from bs4 import BeautifulSoup
url = 'https://archive.ics.uci.edu/ml/datasets.php'

response = requests.get(url)
content = response.content # we get all the content from the website as bytes
soup = BeautifulSoup(content, 'html.parser') # BeautifulSoup parses the HTML into a searchable tree
print(soup.title) # <title>UCI Machine Learning Repository: Data Sets</title>
print(soup.title.get_text()) # UCI Machine Learning Repository: Data Sets
print(soup.body) # gives the whole page on the website
print(response.status_code)

tables = soup.find_all('table', {'cellpadding':'3'})
# We are targeting the table whose cellpadding attribute has the value 3
# We can select using an id, a class or an HTML tag; for more information check the beautifulsoup doc
table = tables[0] # the result is a list, so we take the first matching table from it
for td in table.find('tr').find_all('td'):
    print(td.text)

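If you run this code, you will see that the extraction is only half done. You can continue from here, since it is part of exercise 1. For reference, check the beautifulsoup documentation.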
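
As a hint for continuing, here is a minimal sketch that walks every row of a table instead of only the first one and turns the cells into a list of dictionaries. It uses a small table written for this example, not the live page, so the column names and values are illustrative assumptions, and table_to_records is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical table, invented for illustration; the real page has a different layout
html = """
<table cellpadding="3">
  <tr><td>Name</td><td>Instances</td></tr>
  <tr><td>Iris</td><td>150</td></tr>
  <tr><td>Wine</td><td>178</td></tr>
</table>
"""

def table_to_records(table):
    """Treat the first row as the header and return the rest as dictionaries."""
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:  # skip rows that contain no td cells
            rows.append(cells)
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]

soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table', {'cellpadding': '3'})

# find_all returns an empty list when nothing matches, so check before indexing
datasets = table_to_records(tables[0]) if tables else []
print(datasets)
# [{'Name': 'Iris', 'Instances': '150'}, {'Name': 'Wine', 'Instances': '178'}]
```

A list of dictionaries like this is easy to store afterwards, for example with the json or csv modules from the standard library.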

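🌕 You are so special, and you are progressing every day. You are left with only eight days on your way to greatness. Now do some exercises for your brain and muscles.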