Scraping News and Articles from Public APIs with Python
Whether you are a data scientist, programmer or AI specialist, you can surely put a huge number of news articles to good use. Getting those articles can be challenging though, as you will have to jump through quite a few hoops to get to the actual data: finding the right news sources, exploring their APIs, figuring out how to authenticate against them and finally scraping the data. That’s a lot of work and no fun.
So, to save you some time and get you started, here’s a list of public news APIs that I was able to find, with explanations of how to authenticate against them, how to query them and, most importantly, examples of how to get all the data you need from them!
New York Times
The first and, in my opinion, the best source of data is the New York Times. To start using its API you need to create an account at https://developer.nytimes.com/accounts/create and an application at https://developer.nytimes.com/my-apps/new-app. When creating the application you get to choose which APIs to activate; I recommend activating at least the Most Popular, Article Search, Top Stories and Archive APIs. When your application is created you will be presented with the key that you will use to interact with all the selected APIs, so copy it and let’s start querying!
The simplest query we can do with the NY Times API is to look up the current top stories:
import requests
import os
from pprint import pprint
apikey = os.getenv('NYTIMES_APIKEY', '...')
# Top Stories:
# https://developer.nytimes.com/docs/top-stories-product/1/overview
section = "science"
query_url = f"https://api.nytimes.com/svc/topstories/v2/{section}.json?api-key={apikey}"
r = requests.get(query_url)
pprint(r.json())
The snippet above is very straightforward. We run a GET request against the topstories/v2 endpoint, supplying the section name and our API key. The section in this case is science, but the NY Times provides a lot of other options here, e.g. fashion, health, sports or theater. The full list can be found here. This specific request would produce a response that looks something like this:
{ 'last_updated': '2020-08-09T08:07:44-04:00',
'num_results': 25,
'results': [{'abstract': 'New Zealand marked 100 days with no new reported '
'cases of local coronavirus transmission. France '
'will require people to wear masks in crowded '
'outdoor areas.',
'byline': '',
'created_date': '2020-08-09T08:00:12-04:00',
'item_type': 'Article',
'multimedia': [{'caption': '',
'copyright': 'The New York Times',
'format': 'superJumbo',
'height': 1080,
'subtype': 'photo',
'type': 'image',
'url': 'https://static01.nyt.com/images/2020/08/03/us/us-briefing-promo-image-print/us-briefing-promo-image-superJumbo.jpg',
'width': 1920},
],
'published_date': '2020-08-09T08:00:12-04:00',
'section': 'world',
'short_url': 'https://nyti.ms/3gH9NXP',
'title': 'Coronavirus Live Updates: DeWine Stresses Tests’ '
'Value, Even After His False Positive',
'uri': 'nyt://article/27dd9f30-ad63-52fe-95ab-1eba3d6a553b',
'url': 'https://www.nytimes.com/2020/08/09/world/coronavirus-covid-19.html'},
]
}
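If you only need a few fields rather than the whole payload, you can pull them straight out of the parsed JSON. Here is a minimal sketch working with the response above; the field names (section, title, url) are taken from the sample output:
data = r.json()
for story in data["results"][:5]:  # first 5 stories
    # Each result carries the section, title and canonical URL, among other fields
    print(f"[{story['section']}] {story['title']}")
    print(story["url"])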
Next up, and probably the most useful endpoint when you are trying to get a specific set of data, is the article search endpoint:
# Article Search:
# https://api.nytimes.com/svc/search/v2/articlesearch.json?q=<QUERY>&api-key=<APIKEY>
# Use - https://developer.nytimes.com/docs/articlesearch-product/1/routes/articlesearch.json/get to explore API
query = "politics"
begin_date = "20200701" # YYYYMMDD
filter_query = "\"body:(\"Trump\") AND glocations:(\"WASHINGTON\")\"" # http://www.lucenetutorial.com/lucene-query-syntax.html
page = "0" # <0-100>
sort = "relevance" # newest, oldest
query_url = f"https://api.nytimes.com/svc/search/v2/articlesearch.json?" \
f"q={query}" \
f"&api-key={apikey}" \
f"&begin_date={begin_date}" \
f"&fq={filter_query}" \
f"&page={page}" \
f"&sort={sort}"
r = requests.get(query_url)
pprint(r.json())
This endpoint features lots of filtering options. The only mandatory field is q (query), which is the search term. Beyond that, you can mix and match the filter query, date range (begin_date, end_date), page number, sort order and facet fields. The filter query (fq) is an interesting one, as it allows use of the Lucene query syntax, which can be used to create complex filters with logical operators (AND, OR), negations or wildcards. A nice tutorial can be found here, and there’s a small sketch of building such filters after the example response below.
An example response for the above query might look like this (some fields were removed for clarity):
{'response': {'docs': [{'_id': 'nyt://article/0bf06be1-6699-527f-acb0-09fdd8abb6f6',
'abstract': 'The president sidestepped Congress when it became clear that his nominee for a '
'top Defense Department position would not win Senate approval.',
'byline': {'original': 'By Helene Cooper'},
'document_type': 'article',
'headline': {'main': 'Trump Puts Pentagon in Political Crossfire With Tata Appointment',
'print_headline': 'Bypassing Congress to Appoint Ally, Trump Puts Pentagon in Political Crossfire'},
'keywords': [{'major': 'N', 'name': 'subject', 'rank': 1,
'value': 'United States Politics and Government'},
{'major': 'N', 'name': 'subject', 'rank': 2,
'value': 'Appointments and Executive Changes'},
{'major': 'N', 'name': 'subject', 'rank': 3,
'value': 'Presidential Election of 2020'}],
'lead_paragraph': 'WASHINGTON — In making an end run around Congress to appoint Anthony J. Tata, a retired brigadier '
'general with a history of Islamophobic and other inflammatory views, to a top Defense Department '
'post, President Trump has once again put the military exactly where it does not want to be: in '
'the middle of a political battle that could hurt bipartisan support for the Pentagon.',
'multimedia': [],
'news_desk': 'Washington',
'pub_date': '2020-08-03T21:19:00+0000',
'section_name': 'U.S.',
'source': 'The New York Times',
'subsection_name': 'Politics',
'type_of_material': 'News',
'uri': 'nyt://article/0bf06be1-6699-527f-acb0-09fdd8abb6f6',
'web_url': 'https://www.nytimes.com/2020/08/03/us/politics/tata-pentagon.html',
'word_count': 927}]}}
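To get a feel for the fq parameter mentioned above, here is a small sketch that builds a Lucene-style filter and URL-encodes it before sending. The field names (news_desk, type_of_material, headline) follow the Article Search documentation, but treat the specific filter values as illustrative only:
from urllib.parse import quote

# Multiple values inside parentheses, negation with "-";
# wildcards also work, e.g. headline:(elect*)
fq = 'news_desk:("Politics" "Washington") AND -type_of_material:("Obituary")'

query_url = f"https://api.nytimes.com/svc/search/v2/articlesearch.json?" \
            f"q=politics&fq={quote(fq)}&api-key={apikey}"
r = requests.get(query_url)
pprint(r.json())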
The last NY Times endpoint I will show here is their Archive API, which returns a list of articles for a given month, going back all the way to 1851! This can be very useful if you need bulk data and don’t really need to search for specific terms.
# Archive Search
# https://developer.nytimes.com/docs/archive-product/1/overview
year = "1852" # <1851 - 2020>
month = "6" # <1 - 12>
query_url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={apikey}"
r = requests.get(query_url)
pprint(r.json())
The query above searches for all articles from June of 1852, and from the result below we can see that even though we searched for really old articles, we still got 1888 hits. That said, most of these lack useful data like keywords, word counts or author, so you are probably better off searching for somewhat more recent articles, or filtering the low-information records out, as sketched after the response below.
{'response': {
'meta': {'hits': 1888},
'docs': [{'_id': 'nyt://article/fada2905-0108-54a9-8729-ae9cda8b9528',
'byline': {'organization': None, 'original': None, 'person': []},
'document_type': 'article',
'headline': {'content_kicker': None, 'kicker': '1',
'main': 'Sentence for Manslaughter.',
'name': None,
'print_headline': 'Sentence for Manslaughter.'},
'keywords': [], 'news_desk': 'None',
'print_page': '3',
'pub_date': '1852-06-29T05:00:00+0000',
'section_name': 'Archives',
'source': 'The New York Times',
'type_of_material': 'Archives',
'uri': 'nyt://article/fada2905-0108-54a9-8729-ae9cda8b9528',
'web_url': 'https://www.nytimes.com/1852/06/29/archives/sentence-for-manslaughter.html',
'word_count': 0},
...]}
}
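Since many of the old records carry little metadata, a quick way to separate the useful ones is to filter on fields like word_count and keywords right after parsing. A minimal sketch, assuming the archive request from the snippet above has already run:
data = r.json()["response"]
print(f"{data['meta']['hits']} articles in this month")

# Keep only records that carry some useful metadata:
useful = [d for d in data["docs"] if d.get("word_count", 0) > 0 and d.get("keywords")]
for doc in useful[:5]:
    print(doc["pub_date"], "-", doc["headline"]["main"])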
These were just some of the (in my opinion) more useful APIs provided by the NY Times. Besides these, there are a bunch more available at https://developer.nytimes.com/apis. To explore each API, I would also recommend playing with a query builder like the one for article search, which lets you build and execute your test query right on the website without any coding.
The Guardian
Next up is another great source of news and articles: The Guardian. As with the NY Times, we first need to sign up for an API key. You can do so at https://bonobo.capi.gutools.co.uk/register/developer and you will receive your key in an email. With that out of the way, we can navigate to the API documentation and start querying the API.
Let’s start simply, by querying the content sections of The Guardian:
# https://open-platform.theguardian.com/documentation/section
query = "science"
query_url = f"https://content.guardianapis.com/sections?" \
f"api-key={apikey}" \
r = requests.get(query_url)
pprint(r.json())
{'response': {'results': [{'apiUrl': 'https://content.guardianapis.com/science',
'editions': [{'apiUrl': 'https://content.guardianapis.com/science',
'code': 'default',
'id': 'science',
'webTitle': 'Science',
'webUrl': 'https://www.theguardian.com/science'}],
'id': 'science',
'webTitle': 'Science',
'webUrl': 'https://www.theguardian.com/science'}],
'status': 'ok',
'total': 1,
'userTier': 'developer'}}
These sections group content into topics, which can be useful if you are looking for a specific type of content, e.g. science or technology. If we omit the query (q) parameter, we instead receive the full list of sections, which is about 75 records.
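For example, a tiny sketch that dumps that full list by simply leaving out q:
query_url = f"https://content.guardianapis.com/sections?api-key={apikey}"
r = requests.get(query_url)
for section in r.json()["response"]["results"]:
    print(section["id"], "-", section["webTitle"])  # e.g. "science - Science"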
Moving on to something a little more interesting: searching by tags:
# https://open-platform.theguardian.com/documentation/tag
query = "weather"
section = "news"
page = "1"
query_url = f"http://content.guardianapis.com/tags?" \
f"api-key={apikey}" \
f"&q={query}" \
f"&page={page}"
r = requests.get(query_url)
pprint(r.json())
{'response': {'currentPage': 1,
'pageSize': 10,
'pages': 139,
'results': [
{'apiUrl': 'https://content.guardianapis.com/australia-news/australia-weather',
'id': 'australia-news/australia-weather',
'sectionId': 'australia-news',
'sectionName': 'Australia news',
'type': 'keyword',
'webTitle': 'Australia weather',
'webUrl': 'https://www.theguardian.com/australia-news/australia-weather'},
{'apiUrl': 'https://content.guardianapis.com/world/extreme-weather',
'id': 'world/extreme-weather',
'sectionId': 'world',
'sectionName': 'World news',
'type': 'keyword',
'webTitle': 'Extreme weather',
'webUrl': 'https://www.theguardian.com/world/extreme-weather'},
],
'startIndex': 1,
'status': 'ok',
'total': 1385,
'userTier': 'developer'}}
This query looks quite similar to the previous one and also returns similar kinds of data. Tags also group content into categories, but there are a lot more tags (around 50,000) than sections. Each of these tags has a structure like, for example, world/extreme-weather. They are very useful when searching for actual articles, which is what we will do next.
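With tens of thousands of tags and only 10 results per page by default, you will usually want to page through the results. A minimal sketch, capped at a few pages and assuming the developer-tier rate limits leave room for a short pause between requests:
import time

all_tags = []
page = 1
while page <= 5:  # cap at 5 pages for this sketch
    query_url = f"http://content.guardianapis.com/tags?" \
                f"api-key={apikey}&q=weather&page={page}"
    body = requests.get(query_url).json()["response"]
    all_tags.extend(tag["id"] for tag in body["results"])
    if page >= body["pages"]:  # no more pages left
        break
    page += 1
    time.sleep(0.5)  # be gentle to the API
print(len(all_tags), "tag ids collected")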
The one thing you really came here for is article search, and for that we will use https://open-platform.theguardian.com/documentation/search:
query = "(hurricane OR storm)"
query_fields = "body"
section = "news" # https://open-platform.theguardian.com/documentation/section
tag = "world/extreme-weather" # https://open-platform.theguardian.com/documentation/tag
from_date = "2019-01-01"
query_url = f"https://content.guardianapis.com/search?" \
f"api-key={apikey}" \
f"&q={query}" \
f"&query-fields={query_fields}" \
f"§ion={section}" \
f"&tag={tag}" \
f"&from-date={from_date}" \
f"&show-fields=headline,byline,starRating,shortUrl"
r = requests.get(query_url)
pprint(r.json())
The reason I first showed you section and tag search is that those can be used in the article search. Above you can see that we used the section and tag parameters to narrow down our search, whose values can be found using the previously shown queries. Apart from these parameters, we also included the obvious q parameter for our search query, a starting date using from-date, as well as the show-fields parameter, which allows us to request extra fields related to the content - in this case headline, byline, rating and shortened URL. There are a bunch more of those, with the full list available here.
And as with all the previous ones, here is an example response:
{'response': {'currentPage': 1, 'orderBy': 'relevance', 'pageSize': 10, 'pages': 1,
'results': [{'apiUrl': 'https://content.guardianapis.com/news/2019/dec/19/weatherwatch-storms-hit-france-and-iceland-as-australia-overheats',
'fields': {'byline': 'Daniel Gardner (MetDesk)',
'headline': 'Weatherwatch: storms hit France and Iceland as Australia overheats',
'shortUrl': 'https://gu.com/p/dv4dq'},
'id': 'news/2019/dec/19/weatherwatch-storms-hit-france-and-iceland-as-australia-overheats',
'pillarId': 'pillar/news',
'sectionId': 'news',
'type': 'article',
'webPublicationDate': '2019-12-19T11:33:52Z',
'webTitle': 'Weatherwatch: storms hit France and '
'Iceland as Australia overheats',
'webUrl': 'https://www.theguardian.com/news/2019/dec/19/weatherwatch-storms-hit-france-and-iceland-as-australia-overheats'},
{'apiUrl': 'https://content.guardianapis.com/news/2020/jan/31/weatherwatch-how-repeated-flooding-can-shift-levees',
'fields': {'byline': 'David Hambling',
'headline': 'Weatherwatch: how repeated '
'flooding can shift levees',
'shortUrl': 'https://gu.com/p/d755m'},
'id': 'news/2020/jan/31/weatherwatch-how-repeated-flooding-can-shift-levees',
'pillarId': 'pillar/news',
'sectionId': 'news',
'type': 'article',
'webPublicationDate': '2020-01-31T21:30:00Z',
'webTitle': 'Weatherwatch: how repeated flooding can shift levees',
'webUrl': 'https://www.theguardian.com/news/2020/jan/31/weatherwatch-how-repeated-flooding-can-shift-levees'}],
'startIndex': 1, 'status': 'ok', 'total': 7, 'userTier': 'developer'}}
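The extra fields requested via show-fields land under the fields key of each result, so collecting just what you need is straightforward. A small sketch working with the response above:
articles = [
    {
        "headline": item["fields"]["headline"],
        "byline": item["fields"].get("byline", ""),
        "url": item["fields"]["shortUrl"],
        "published": item["webPublicationDate"],
    }
    for item in r.json()["response"]["results"]
]
pprint(articles)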
HackerNews
For a more tech-oriented source of news, one might turn to HackerNews, which also has a public REST API. It’s documented at https://github.com/HackerNews/API. This API, as you will see, is at version v0 and is currently very bare-bones, meaning it doesn’t really provide specific endpoints to, for example, query articles, comments or users.
But even though it’s very basic, it still provides all that’s necessary to, for example, get the top stories:
query_type = "top" # top/best/new, also ask/show/job
query_url = f"https://hacker-news.firebaseio.com/v0/{query_type}stories.json?print=pretty" # Top Stories
r = requests.get(query_url)
ids = r.json()
top = ids[:10]
for story in top:
    query_url = f"https://hacker-news.firebaseio.com/v0/item/{story}.json?print=pretty"
    r = requests.get(query_url)
    pprint(r.json())
The snippet above is not nearly as obvious as the previous ones, so let’s look at it more closely. We first send a request to the API endpoint (v0/topstories), which doesn’t return the top stories as you might expect, but really just their IDs. To get the actual stories we take these IDs (the first 10 of them) and send requests to the v0/item/<ID> endpoint, which returns data for each of these individual items, which in this case happen to be stories.
You surely noticed that the query URL was parametrized with query_type. That’s because the HackerNews API also has similar endpoints for all the top sections of the website, namely ask, show, job and new.
One nice thing about this API is that it doesn’t require authentication, so you don’t need to request an API key and don’t need to worry about rate limiting like with the other ones.
Running this code would produce a response that looks something like this:
{'by': 'rkwz',
'descendants': 217,
'id': 24120311,
'kids': [24122571,
...,
24121481],
'score': 412,
'time': 1597154451,
'title': 'Single Page Applications using Rust',
'type': 'story',
'url': 'http://www.sheshbabu.com/posts/rust-wasm-yew-single-page-application/'}
{'by': 'bmgoss',
'descendants': 5,
'id': 24123372,
'kids': [24123579, 24124181, 24123545, 24123929],
'score': 55,
'time': 1597168165,
'title': 'Classic Books for Tech Leads (or those aspiring to be)',
'type': 'story',
'url': 'https://sourcelevel.io/blog/3-classic-books-for-tech-leads-or-those-aspiring-to-be'}
{'by': 'adamnemecek',
'descendants': 7,
'id': 24123283,
'kids': [24123803, 24123774, 24124106, 24123609],
'score': 69,
'time': 1597167845,
'title': 'Bevy: Simple, data-driven, wgpu-based game engine in Rust',
'type': 'story',
'url': 'https://bevyengine.org'}
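One practical note: because every story is a separate request, fetching items one by one gets slow quickly. There is nothing HackerNews-specific about the fix, but here is a sketch that parallelizes the lookups with a standard-library thread pool, reusing the ids list from the snippet above:
from concurrent.futures import ThreadPoolExecutor

def fetch_item(item_id):
    # Fetch a single item (story, comment, etc.) by its ID
    url = f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json"
    return requests.get(url).json()

with ThreadPoolExecutor(max_workers=10) as pool:
    stories = list(pool.map(fetch_item, ids[:10]))
for s in stories:
    print(s["score"], "-", s["title"])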
If you find an interesting article and want to dig a little deeper, the HackerNews API can help with that too. You can find the comments of each submission by traversing the kids field of said story. Code that does just that looks like this:
first = 24120311 # Top story
query_url = f"https://hacker-news.firebaseio.com/v0/item/{first}.json?print=pretty"
r = requests.get(query_url)
comment_ids = r.json()["kids"] # IDs of top level comments of first story
for i in comment_ids[:10]:  # Print first 10 comments of the story
    query_url = f"https://hacker-news.firebaseio.com/v0/item/{i}.json?print=pretty"
    r = requests.get(query_url)
    pprint(r.json())
First, we look up a story (item) by ID like we did in the previous example. We then iterate over its kids and run the same query with the respective IDs, retrieving items that in this case refer to story comments. We could also go through these recursively if we wanted to build the whole tree/thread of comments for a specific story, as sketched after the sample response below.
As always, here is a sample response:
{'by': 'Naac',
'id': 24123455,
'kids': [24123485],
'parent': 24120311,
'text': 'So as I understand it Rust is compelling because it is a safer '
'alternative to C++ ( and sometimes C but mainly a C++ replacement '
').<p>We wouldn't usually create a single page app in C++ right? '
'So why would we want to do that in Rust ( other than, "just '
'because" ). Right tool for the right job and all that.',
'time': 1597168558,
'type': 'comment'}
{'by': 'intelleak',
'id': 24123860,
'parent': 24120311,
'text': 'I've been hearing good things about zig, and someone mentioned '
'that zig has better wasm support than rust, is it true? I wish rust '
'had a js ecosystem too ...',
'time': 1597170320,
'type': 'comment'}
{'by': 'praveenperera',
'id': 24120642,
'kids': [24120867, 24120738, 24120940, 24120721],
'parent': 24120311,
'text': 'Great post.<p>I'd love to see one talking about building a full '
'stack app using Yew and Actix (or Rocket). And good ways of sharing '
'types between the frontend and the backend.',
'time': 1597156315,
'type': 'comment'}
{'by': 'devxpy',
'id': 24122583,
'kids': [24122721, 24122756, 24122723],
'parent': 24120311,
'text': 'Can anyone please tell me how the author able to use html syntax in '
'rust?<p>I get that there are macros, but how are html tags valid '
'syntax? Is rust just interpreting the html content as '
'strings?<p>I've only ever seen C macros, and I don't '
'remember seeing\n'
' this kind of wizardry happening there.',
'time': 1597165060,
'type': 'comment'}
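And picking up the recursion idea from above, here is a minimal depth-limited sketch for walking a whole comment thread; the depth cap and per-level reply limit are just there to keep the example small and polite to the API:
def print_thread(item_id, depth=0, max_depth=2):
    # Fetch one item and recursively descend into its replies
    url = f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json"
    item = requests.get(url).json()
    if not item:  # deleted items come back as null
        return
    if item.get("type") == "comment":
        print("  " * depth + f"{item.get('by', '[deleted]')}: {item.get('text', '')[:80]}")
    if depth < max_depth:
        for kid in item.get("kids", [])[:3]:  # only first 3 replies per level
            print_thread(kid, depth + 1, max_depth)

print_thread(comment_ids[0])  # thread under the first top-level comment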
Currents
Finding a popular, good-quality news API is quite difficult, as most classic newspapers don’t have a free public API. There are, however, sources of aggregated news data that can be used to get articles and news from newspapers like the Financial Times and Bloomberg, which only provide paid API services, or CNN, which doesn’t expose any API at all.
One of these aggregators is called Currents API. It aggregates data from thousands of sources in 18 languages and over 70 countries, and it’s also free.
It’s similar to the APIs shown before: we again need to get an API key first. To do so, you need to register at https://currentsapi.services/en/register. After that, go to your profile at https://currentsapi.services/en/profile and retrieve your API token.
With the key (token) ready, we can request some data. There’s really just one interesting endpoint, and that’s https://api.currentsapi.services/v1/search:
# https://currentsapi.services/en/docs/search
apikey = os.getenv('CURRENTS_APIKEY', '...')
category = "business"
language = languages['English'] # Mapping from Language to Code, e.g.: "English": "en"
country = regions["Canada"] # Mapping from Country to Code, e.g.: "Canada": "CA",
keywords = "bitcoin"
t = "1" # 1 for news, 2 for article and 3 for discussion content
domain = "financialpost.com" # website primary domain name (without www or blog prefix)
start_date = "2020-06-01T14:30" # YYYY-MM-DDTHH:MM:SS+00:00
query_url = f"https://api.currentsapi.services/v1/search?" \
f"apiKey={apikey}" \
f"&language={language}" \
f"&category={category}" \
f"&country={country}" \
f"&type={t}" \
f"&domain={domain}" \
f"&keywords={keywords}" \
f"&start_date={start_date}"
r = requests.get(query_url)
pprint(r.json())
This endpoint includes lots of filtering options, including language, category, country and more, as shown in the snippet above. All of those are pretty self-explanatory, but for the first three I mentioned you will need some extra information, as their possible values aren’t really obvious. These values come from API endpoints available here; for languages and regions they are really just mappings of value to code (e.g. "English": "en"), and for categories just a list of possible values. The lookup code is omitted above to keep it a bit shorter - I just copied these mappings into Python dicts to avoid calling the API every time (a hypothetical helper for fetching them is sketched after the response below).
The response to the above request looks like the following:
{'news': [{'author': 'Bloomberg News',
'category': ['business'],
'description': '(Bloomberg) — Bitcoin is notoriously volatile, prone to sudden price surges and swift reversals '
'that can wipe out millions of dollars of value in a matter of minutes. Those changes are often...',
'id': 'cb50963e-73d6-4a21-bb76-ec8bc8b9c201',
'image': 'https://financialpostcom.files.wordpress.com/2017/11/fp-512x512.png',
'language': 'ru',
'published': '2020-04-25 05:02:50 +0000',
'title': 'Get Set for Bitcoin ‘Halving’! Here’s What That Means',
'url': 'https://business.financialpost.com/pmn/business-pmn/get-set-for-bitcoin-halving-heres-what-that-means'},
{'author': 'Reuters',
'category': ['business'],
'description': 'NEW YORK — Crushing asset sell-offs ranging from bitcoin to precious metals and European stocks '
'accompanied Wall Street’s slide into bear market territory on Thursday, as investors liqu…',
'id': '3c75b090-ec7d-423e-9487-85becd92d10c',
'image': 'https://financialpostcom.files.wordpress.com/2017/11/fp-512x512.png',
'language': 'en',
'published': '2020-03-12 23:14:18 +0000',
'title': 'Wall Street sell-off batters bitcoin, pounds palladium as '
'investors go to cash',
'url': 'https://business.financialpost.com/pmn/business-pmn/wall-street-sell-off-batters-bitcoin-pounds-palladium-as-investors-go-to-cash'}],
'page': 1,
'status': 'ok'}
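As for the languages and regions mappings used in the snippet above, I fetched them once and pasted them into plain Python dicts. A hypothetical helper for doing the same is sketched below; the /v1/available/* paths follow the Currents documentation, but verify them there before relying on this:
def get_mapping(kind):
    # kind is "languages", "regions" or "categories" - hypothetical helper,
    # check https://currentsapi.services/en/docs/ for the exact endpoint paths
    url = f"https://api.currentsapi.services/v1/available/{kind}?apiKey={apikey}"
    return requests.get(url).json()[kind]

languages = get_mapping("languages")  # e.g. {"English": "en", ...}
regions = get_mapping("regions")      # e.g. {"Canada": "CA", ...}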
If you aren’t searching for a specific topic or for historical data, then there’s one other option that the Currents API provides: the latest news endpoint:
language = languages['English']
query_url = f"https://api.currentsapi.services/v1/latest-news?" \
f"apiKey={apikey}" \
f"&language={language}"
r = requests.get(query_url)
pprint(r.json())
It is very similar to the search endpoint; this one, however, only provides the language parameter and produces results like this:
{'news': [{'author': 'Isaac Chotiner',
'category': ['funny'],
'description': 'The former U.S. Poet Laureate discusses her decision to tell her mother\'s story in prose, in '
'her new book, "Memorial Drive," and her feelings about the destruction of Confederate monuments...',
'id': '3ded3ed1-ecb8-41db-96d3-dc284f4a61de',
'image': 'https://media.newyorker.com/photos/5f330eba567fa2363b1a19c3/16:9/w_1280,c_limit/Chotiner-NatashaTrethewey.jpg',
'language': 'en',
'published': '2020-08-12 19:15:03 +0000',
'title': 'How Natasha Trethewey Remembers Her Mother',
'url': 'https://www.newyorker.com/culture/q-and-a/how-natasha-trethewey-remembers-her-mother'},
{'author': '@BBCNews',
'category': ['regional'],
'description': 'Firefighters are tackling the blaze that broke out in the engineering department at the university...',
'id': '9e1f1ee2-8041-4864-8cca-0ffaedf9ae2b',
'image': 'https://ichef.bbci.co.uk/images/ic/1024x576/p08ngy6g.jpg',
'language': 'en',
'published': '2020-08-12 18:37:48 +0000',
'title': "Fire at Swansea University's Bay campus",
'url': 'https://www.bbc.co.uk/news/uk-wales-53759352'}],
'page': 1,
'status': 'ok'}
Conclusion
There are many great news sites and online newspapers out there on the internet, but in most cases you won’t be able to scrape their data or access them programmatically. The ones shown in this article are the rare few with a nice API and free access that you can use for your next project, whether it’s some data science, machine learning or a simple news aggregator. If you don’t mind paying some money for a news API, you might also consider using the Financial Times or Bloomberg. Apart from APIs, you can also try scraping the HTML and parsing the content yourself with something like BeautifulSoup, as sketched below. If you happen to find any other good source of news data, please let me know, so that I can add it to this list. 🙂
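If you do end up falling back to plain HTML scraping, a minimal sketch with requests and BeautifulSoup could look like the following; the URL is just a placeholder, and any real site will need its own selectors:
import requests
from bs4 import BeautifulSoup

url = "https://example.com/some-article"  # placeholder URL
soup = BeautifulSoup(requests.get(url).text, "html.parser")
title = soup.find("title").get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print(title)
print("\n".join(paragraphs[:3]))  # first few paragraphs of body text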
Translated from: https://towardsdatascience.com/scraping-news-and-articles-from-public-apis-with-python-be84521d85b9