Python: how to get plain text from Wikipedia

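This article presents three ways to get the plain text of a Wikipedia page with Python: 1) use the extracts prop; 2) fetch the HTML through the parse endpoint and parse it; 3) parse the wikitext. Each method comes with a code example and suits different needs.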

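There are a few different approaches here; use whichever one works for you. All of the code examples below use requests for HTTP requests to the API; if you have pip, you can install it with pip install requests. They also all use the MediaWiki API, and two of them use the query endpoint; consult the documentation for each if you need it.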

1. Use the extracts prop

Note that this approach only works on MediaWiki sites that have the TextExtracts extension installed. That notably includes Wikipedia, but not some smaller MediaWiki sites such as http://www.wikia.com/

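You want to hit a URL on the API endpoint whose query string carries the following parameters:

action=query, format=json, and titles=Bla_Bla_Bla are all standard MediaWiki API parameters

prop=extracts makes us use the TextExtracts extension

exintro limits the response to the content before the first section heading

explaintext makes the extract in the response plain text instead of HTML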
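To make the parameter list above concrete, here is a minimal sketch that assembles the request URL with only the standard library. The helper name build_extract_url is made up for illustration, and passing the two flag parameters as 1 is an assumption: the MediaWiki API treats a boolean parameter as true whenever it is present, so the exact value should not matter.

```python
from urllib.parse import urlencode

API_URL = 'https://en.wikipedia.org/w/api.php'


def build_extract_url(title):
    """Return a URL asking TextExtracts for the plain-text intro of `title`.

    Hypothetical helper for illustration; not part of any library.
    """
    params = {
        'action': 'query',      # standard MediaWiki API parameters
        'format': 'json',
        'titles': title,
        'prop': 'extracts',     # use the TextExtracts extension
        'exintro': '1',         # only content before the first heading
        'explaintext': '1',     # plain text instead of HTML
    }
    # urlencode takes care of escaping spaces and other special characters.
    return API_URL + '?' + urlencode(params)


print(build_extract_url('Bla Bla Bla'))
```

Building the URL by hand like this is mostly useful for checking a query in the browser; requests does the same encoding when given a params dictionary.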

Then parse the JSON response and pull out the extract:

>>> import requests
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'extracts',
...         'exintro': True,
...         'explaintext': True,
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> print(page['extract'])
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
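The lookup page['extract'] raises a KeyError when the title does not exist, because the API then returns a page entry marked as missing. A minimal sketch of a more defensive lookup follows; first_extract is a hypothetical helper, it assumes the default JSON layout used in the example (pages keyed by page ID), and the two payloads are hand-written to mimic that layout rather than real API output.

```python
def first_extract(response):
    """Return the extract of the first existing page, or None.

    Hypothetical helper; assumes pages are keyed by page ID under
    response['query']['pages'], as in the default JSON format.
    """
    pages = response.get('query', {}).get('pages', {})
    for page in pages.values():
        # Pages that do not exist carry a 'missing' key and no 'extract'.
        if 'missing' in page:
            continue
        return page.get('extract')
    return None


# Hand-written payloads that mimic the response shape (not real API output).
found = {'query': {'pages': {'123': {
    'pageid': 123, 'title': 'Bla Bla Bla', 'extract': 'Some intro text.'}}}}
absent = {'query': {'pages': {'-1': {
    'title': 'No such page', 'missing': ''}}}}

print(first_extract(found))
print(first_extract(absent))
```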

2. Use the parse endpoint to get the page's full HTML, parse it, and extract the first paragraph

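MediaWiki has a parse endpoint that you can hit with a URL like https://en.wikipedia.org/w/api.php?action=parse&page=Bla_Bla_Bla to get the HTML of a page. You can then parse it with an HTML parser such as lxml (install it first with pip install lxml) to extract the first paragraph.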

For example:

>>> import requests
>>> from lxml import html
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'parse',
...         'page': 'Bla Bla Bla',
...         'format': 'json',
...     }
... ).json()
>>> raw_html = response['parse']['text']['*']
>>> document = html.document_fromstring(raw_html)
>>> first_p = document.xpath('//p')[0]
>>> intro_text = first_p.text_content()
>>> print(intro_text)
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
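Taking the very first p element assumes it holds the intro. If the rendered HTML happens to begin with an empty placeholder paragraph, that assumption fails, so it may be safer to take the first paragraph that actually contains text. The sketch below shows the idea using the standard library's html.parser instead of lxml, purely to keep it free of dependencies; with lxml the same effect could be had by looping over document.xpath('//p'). The class and function names, and the sample HTML, are made up for illustration.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first <p> element that contains visible text."""

    def __init__(self):
        super().__init__()
        self.in_p = False     # True while inside a <p> being collected
        self.chunks = []      # text pieces of the paragraph being read
        self.result = None    # first non-empty paragraph, once found

    def handle_starttag(self, tag, attrs):
        # Stop collecting new paragraphs once a result has been found.
        if tag == 'p' and self.result is None:
            self.in_p = True
            self.chunks = []

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            self.in_p = False
            text = ''.join(self.chunks).strip()
            if text:
                self.result = text


def first_paragraph_text(raw_html):
    """Return the text of the first non-empty paragraph, or None."""
    parser = FirstParagraph()
    parser.feed(raw_html)
    parser.close()
    return parser.result


# Hand-written sample, not real API output.
sample = ('<div><p class="mw-empty-elt">\n</p>'
          '<p>"Bla Bla Bla" is a <b>song</b>.</p><p>Second.</p></div>')
print(first_paragraph_text(sample))
```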

3. Parse the wikitext yourself

You can use the query API to get the page's wikitext, parse it with mwparserfromhell (install it first with pip install mwparserfromhell), and then reduce it to human-readable text with strip_code. At the time of writing, strip_code does not work perfectly (as the example below shows), but it will hopefully improve.

>>> import requests
>>> import mwparserfromhell
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'revisions',
...         'rvprop': 'content',
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> wikicode = page['revisions'][0]['*']
>>> parsed_wikicode = mwparserfromhell.parse(wikicode)
>>> print(parsed_wikicode.strip_code())
{{dablink|For Ke$ha's song, see Blah Blah Blah (song). For other uses, see Blah (disambiguation)}}

"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

Background and writing

He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"''.

Music video

The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.

Chart performance

Chart (1999-00)PeakpositionIreland (IRMA)Search for Irish peaks23

References

External links

Category:1999 singles

Category:Gigi D'Agostino songs

Category:1999 songs

Category:ZYX Music singles

Category:Songs written by Gigi D'Agostino
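The output above still contains a leftover template line and a run of Category: lines. One rough way to tidy it is a line-based filter applied after strip_code. This is only a heuristic sketch, not a feature of mwparserfromhell: tidy_stripped_text is a hypothetical helper, it only drops whole lines, and it will not repair flattened tables such as the chart row.

```python
import re


def tidy_stripped_text(text):
    """Drop lines that are leftover templates or category links.

    Heuristic, hypothetical helper: it only looks at whole lines.
    """
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r'\{\{.*\}\}', stripped):
            continue  # a whole-line template such as {{dablink|...}}
        if stripped.startswith('Category:'):
            continue  # a category link reduced to plain text
        kept.append(line)
    # Collapse runs of blank lines left behind by the removed lines.
    return re.sub(r'\n{3,}', '\n\n', '\n'.join(kept)).strip()


# Shortened, hand-written sample in the shape of the output above.
sample = (
    '{{dablink|For other uses, see Blah}}\n\n'
    '"Bla Bla Bla" is a song.\n\n'
    'References\n\n'
    'Category:1999 singles\n\n'
    'Category:1999 songs'
)
print(tidy_stripped_text(sample))
```

Empty section headings such as References survive this filter; removing them would need knowledge of the page structure that plain text no longer carries.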
