How to Scrape Wikipedia Articles with Python

Latest recommended article published on 2025-01-18 21:51:49

源代码杀手

Category: Python. Tags: web scraping, python

Article link: https://blog.csdn.net/weixin_41194129/article/details/108250804

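This article shows how to build a web scraper with Python and BeautifulSoup. The scraper extracts the title of a Wikipedia page, then jumps to a randomly chosen linked page, repeating the process in an endless loop.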

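In this article, I will use Python to build a web scraper that crawls Wikipedia pages.

The scraper will visit a Wikipedia page, extract its title, and then follow a random link to another Wikipedia page.

I think it will be fun to see which random Wikipedia pages this scraper ends up visiting!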

Setting up the scraper

First, I'll create a new Python file named scraper.py:

touch scraper.py

To make HTTP requests, I'll use the requests library. You can install it with the following command:

pip install requests

Let's use the Wikipedia page on web scraping as the starting point:

import requests

response = requests.get(
	url="https://en.wikipedia.org/wiki/Web_scraping",
)
print(response.status_code)

When you run the scraper, it should print the status code 200:

python3 scraper.py
200

OK, so far so good! 🙌

Extracting data from the page

Let's extract the title from the HTML page. To make my life easier, I'll use the BeautifulSoup package for this.

pip install beautifulsoup4

Inspecting the Wikipedia page, I see that the title element has the ID firstHeading.

Beautiful Soup lets you look up an element by its ID.

title = soup.find(id="firstHeading")

Putting it all together, the program now looks like this:

import requests
from bs4 import BeautifulSoup

response = requests.get(
	url="https://en.wikipedia.org/wiki/Web_scraping",
)
soup = BeautifulSoup(response.content, 'html.parser')

title = soup.find(id="firstHeading")
print(title.string)

When run, it prints the title of the wiki article: 🚀

python3 scraper.py
Web scraping
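
One fragile spot in the script above is that soup.find returns None when no element has the requested ID (for example on an error page), and .string is None when the heading contains more than one child node. The following is a minimal sketch of a more defensive variant. The helper name extract_title and the inline sample HTML are assumptions made for illustration; they are not part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded Wikipedia page.
SAMPLE_HTML = """
<html><body>
  <h1 id="firstHeading"><i>Web</i> scraping</h1>
</body></html>
"""


def extract_title(html):
    """Return the text of the #firstHeading element, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find(id="firstHeading")
    if heading is None:
        return None
    # get_text() concatenates the text of all nested tags, whereas .string
    # is None as soon as the element has more than one child node.
    return heading.get_text().strip()


print(extract_title(SAMPLE_HTML))
print(extract_title("<html><body><p>No heading here</p></body></html>"))
```

Parsing an inline string keeps the sketch independent of the network; in the scraper itself, response.content would be passed in instead.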

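Scraping other links

Now I'll dive deeper into Wikipedia. I'll grab a random <a> tag that points to another Wikipedia article and then scrape that page.

To do this, I'll use Beautiful Soup to find all the <a> tags in the wiki article, then shuffle the list to randomize the order.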

import requests
from bs4 import BeautifulSoup
import random

response = requests.get(
	url="https://en.wikipedia.org/wiki/Web_scraping",
)
soup = BeautifulSoup(response.content, 'html.parser')

title = soup.find(id="firstHeading")
print(title.string)

# Get all the links
allLinks = soup.find(id="bodyContent").find_all("a")
random.shuffle(allLinks)
linkToScrape = 0

for link in allLinks:
	# We are only interested in other wiki articles
	# .get() avoids a KeyError for <a> tags that have no href attribute
	if link.get('href', '').find("/wiki/") == -1:
		continue

	# Use this link to scrape
	linkToScrape = link
	break

print(linkToScrape)

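As you can see, I use soup.find(id="bodyContent").find_all("a") to find all the <a> tags in the main article body.

Since I'm only interested in links to other Wikipedia articles, I make sure the link contains the /wiki/ prefix.

Now when I run the program, it prints a link to another Wikipedia article. Great!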

python3 scraper.py
<a href="/wiki/Link_farm" title="Link farm">Link farm</a>
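
Checking only whether the href contains "/wiki/" also lets through links the scraper cannot use as written, such as absolute URLs to other language editions or namespaced pages like File: and Help:. The sketch below shows one possible stricter filter using only the standard library. The function name is_article_link and the rule of skipping every path that contains a colon are assumptions for illustration, not something the original article prescribes.

```python
from urllib.parse import urlsplit


def is_article_link(href):
    """Return True for relative links that look like ordinary Wikipedia articles."""
    if not href:
        return False
    parts = urlsplit(href)
    # Absolute or protocol-relative URLs (other sites, other language wikis)
    # carry a scheme or a host; the scraper only prepends en.wikipedia.org
    # to relative paths, so these are skipped.
    if parts.scheme or parts.netloc:
        return False
    if not parts.path.startswith("/wiki/"):
        return False
    page = parts.path[len("/wiki/"):]
    # Assumption: a colon marks a namespace such as File:, Help: or Special:.
    return bool(page) and ":" not in page


for href in ["/wiki/Link_farm", "/wiki/File:Example.png",
             "https://de.wikipedia.org/wiki/Screen_Scraping", "#cite_note-1"]:
    print(href, is_article_link(href))
```

The colon rule is deliberately blunt: it also rejects the few genuine articles whose titles contain a colon, which seems an acceptable trade-off for a random walk.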

Creating an endless scraper

OK, let's make the scraper actually scrape the new link.

To do this, I'll move everything into a scrapeWikiArticle function.

import requests
from bs4 import BeautifulSoup
import random

def scrapeWikiArticle(url):
	response = requests.get(
		url=url,
	)

	soup = BeautifulSoup(response.content, 'html.parser')

	title = soup.find(id="firstHeading")
	print(title.text)

	allLinks = soup.find(id="bodyContent").find_all("a")
	random.shuffle(allLinks)
	linkToScrape = 0

	for link in allLinks:
		# We are only interested in other wiki articles
		# .get() avoids a KeyError for <a> tags that have no href attribute
		if link.get('href', '').find("/wiki/") == -1:
			continue

		# Use this link to scrape
		linkToScrape = link
		break

	scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape['href'])

scrapeWikiArticle("https://en.wikipedia.org/wiki/Web_scraping")

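The scrapeWikiArticle function fetches the wiki article, extracts the title, and finds a random link.

It then calls scrapeWikiArticle again with this new link. The result is a scraper that keeps hopping from one Wikipedia page to the next in an endless loop.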
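
Because the function calls itself and never returns, a long enough run will eventually hit Python's recursion limit and stop with a RecursionError. A loop with an explicit page budget avoids that. The following is a rough sketch of that idea, not the author's implementation: the names crawl, fetch_page, max_pages, and the in-memory FAKE_PAGES table are all hypothetical, and fetch_page stands in for the requests and BeautifulSoup code shown above.

```python
import random


def crawl(start_url, fetch_page, max_pages=20, rng=random):
    """Follow random links from start_url, visiting at most max_pages pages.

    fetch_page is a hypothetical callable that takes a URL and returns
    (title, list_of_article_urls). In the real scraper it would wrap the
    requests and BeautifulSoup logic from scrapeWikiArticle.
    """
    visited = set()
    titles = []
    url = start_url
    while url is not None and len(titles) < max_pages:
        visited.add(url)
        title, links = fetch_page(url)
        titles.append(title)
        # Only consider pages not seen yet, so the walk cannot bounce
        # forever between two articles.
        candidates = [link for link in links if link not in visited]
        url = rng.choice(candidates) if candidates else None
    return titles


# Tiny in-memory stand-in for Wikipedia, used only to exercise the loop.
FAKE_PAGES = {
    "/wiki/A": ("A", ["/wiki/B"]),
    "/wiki/B": ("B", ["/wiki/A", "/wiki/C"]),
    "/wiki/C": ("C", []),
}

print(crawl("/wiki/A", FAKE_PAGES.get))  # prints ['A', 'B', 'C']
```

Passing the fetch function in as a parameter keeps the walk logic testable without network access; the walk also ends cleanly when a page offers no unvisited link.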

Let's run the program and see what we get:

python3 scraper.py
Web scraping
Digital object identifier
ISO 8178
STEP-NC
ISO/IEC 2022
EBCDIC 277
Code page 867
Code page 1021
EBCDIC 423
Code page 950
G
R
Mole (unit)
Gram
Remmius Palaemon
Encyclopædia Britannica Eleventh Edition
Geography
Gender studies
Feminism in Brazil

Awesome: in 18 steps, we went from "Web scraping" to "Feminism in Brazil". Amazing!