Quick Tip: Consuming Google Search Results to Use for Web Scraping


While working on a project recently, I needed to grab some Google search results for specific search phrases and then scrape the content from the pages in those results.

For example, when searching for a Sony 16-35mm f2.8 GM lens on Google, I wanted to grab some content (reviews, text, etc.) from the results. While this isn’t hard to build from scratch, I ran across a couple of libraries that are easy to use and make things much easier.

The first is ‘Google Search’ (install via pip install google). This library lets you consume Google search results with just one line of code. An example is below; it imports the search function, runs a search for Sony 16-35mm f2.8 GM lens and prints the URLs of the results.

from googlesearch import search

for url in search('Sony 16-35mm f2.8 GM lens', tld='com', stop=1):
    print(url)

For the above, I’m using google.com for the search and have told it to stop after the first set of results.


The output:


https://www.bhphotovideo.com/c/product/1338516-REG/sony_sel1635gm_fe_16_35mm_f_2_8_gm.html
https://www.amazon.com/Sony-SEL1635GM-16-35mm-2-8-22-Camera/dp/B071LHLS11
https://www.sony.com/electronics/camera-lenses/sel1635gm

   
   
Sony FE 16-35mm f/2.8 GM lens review: Highest-rated wide-angle zoom
Review: Sony 16-35mm f2.8 G Master FE (Sony E Mount, Full Frame)

That’s pretty easy.


Now we can use those URLs to scrape the websites that are returned.
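Before scraping, it can help to trim the list of URLs. The sketch below is one possible way to drop duplicates and skip domains you are not interested in, using only the standard library. The names SKIP_DOMAINS, domain_of and filter_urls are hypothetical, and the skip list (the retailer and manufacturer domains from the sample output) is an assumption for illustration only.

```python
from urllib.parse import urlparse

# Hypothetical skip list: retailer and manufacturer domains seen in the sample output.
SKIP_DOMAINS = {"amazon.com", "bhphotovideo.com", "sony.com"}


def domain_of(url):
    """Return the host part of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def filter_urls(urls, skip_domains=SKIP_DOMAINS):
    """Drop duplicate URLs and URLs on skipped domains, keeping the original order."""
    seen = set()
    kept = []
    for url in urls:
        if url in seen or domain_of(url) in skip_domains:
            continue
        seen.add(url)
        kept.append(url)
    return kept


urls = [
    "https://www.bhphotovideo.com/c/product/1338516-REG/sony_sel1635gm_fe_16_35mm_f_2_8_gm.html",
    "https://www.the-digital-picture.com/Reviews/Sony-FE-16-35mm-f-2.8-GM-Lens.aspx",
    "https://www.the-digital-picture.com/Reviews/Sony-FE-16-35mm-f-2.8-GM-Lens.aspx",
]
print(filter_urls(urls))
```

With the sample list above, only the single review URL is kept. Whether to skip retailer pages at all depends on what content you are after.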

To scrape these sites, you could run a fairly complex scraping system or build your own. But if you just need some basic content and aren’t going to be doing a LOT of scraping, you could use the ‘Newspaper’ library. Of course, there are plenty of other libraries, but Newspaper really simplifies things for those ‘quick and dirty’ projects. Note: This is best used in Python 3.

To get started, install newspaper with pip3 install newspaper3k (for python3).


Now, to scrape the URLs returned from the Google search, you can simply do the following:

from newspaper import Article
article = Article(url)  # url is one of the URLs returned by the search above
article.download()
article.parse()

This will grab the URL, download it and parse it so you can access the content. Here’s an example using the URL https://www.the-digital-picture.com/Reviews/Sony-FE-16-35mm-f-2.8-GM-Lens.aspx:

from newspaper import Article
article = Article('https://www.the-digital-picture.com/Reviews/Sony-FE-16-35mm-f-2.8-GM-Lens.aspx')
article.download()
article.parse()
print(article.text)

The output of print(article.text) is below (I’ve only included an excerpt for this example, but this will grab the entire text):

‘Those putting together the ultimate Sony E-mount lens kit are going to want this lens included. The Sony FE 16-35mm f/2.8 GM Lens covers a key focal length range in wide aperture with high quality. In this case, the term high quality applies both to the lens’ physical attributes and to the image quality delivered by it.\n\nMany are first-attracted to the Alpha MILC (Mirrorless Interchangeable Lens Camera) system for Sony’s high-performing full frame imaging sensors, but lenses are as important as cameras and Sony’s lens lineup was initially viewed by many as deficient. Adapting Canon brand lenses for use on Sony cameras was prevalent. The introduction of Sony’s flagship Grand Master line (the “GM” in the name) was very welcomed by Sony owners and this line is proving attractive to those considering a switch to the Sony camp. The 16-35mm f/2.8 GM is one more reason to stay entirely within the Sony brand.\n\nFocal Length Range\n\nWhen starting a kit, most will first select a general purpose lens (Sony system owners should seriously consider the Sony FE 24-70mm f/2.8 GM Lens) and one of the next-most-needed lenses is typically a wide-angle zoom. This 16-35mm range ideally covers that need.\n\nThe 107° angle of view provided by a 16mm focal length is ultra-wide and all of the narrower angles of view down to 63°, just modestly-wide, are included. To explore what this focal length range looks like, we head to RB Rickett’s falls in Ricketts Glen State Park.\n\nOne of the most popular uses for this range is, as illustrated above, landscape photography.
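To apply the same download-and-parse steps to every URL from the search, a loop with some error handling is usually enough. The following is a minimal sketch, not a definitive implementation: scrape_articles and its make_article parameter are hypothetical names, and the function assumes only the Article calls already shown above (download(), parse() and the text attribute). The import is done lazily so the function can also be exercised with a stand-in class.

```python
def scrape_articles(urls, make_article=None):
    """Download and parse each URL; return {url: text} for the pages that succeed."""
    if make_article is None:
        # Imported lazily so the loop can be tried out with a stand-in class.
        from newspaper import Article
        make_article = Article

    texts = {}
    for url in urls:
        article = make_article(url)
        try:
            article.download()
            article.parse()
        except Exception as exc:  # broad on purpose: skip any page that fails
            print(f"Skipping {url}: {exc}")
            continue
        texts[url] = article.text
    return texts
```

With both libraries installed, something like scrape_articles(search('Sony 16-35mm f2.8 GM lens', tld='com', stop=1)) should return a dictionary mapping each result URL to its extracted text, skipping any page that fails to download or parse.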


Now, one of the really cool features of the newspaper library is that it has built-in natural language processing capabilities and can return keywords, summaries and other interesting tidbits. To get this to work, you must have the Natural Language Toolkit (NLTK) installed (install with pip install nltk) and have the punkt package downloaded through nltk. Here’s an example using the previous URL (assuming you’ve already done the above steps).

import nltk

# Download the punkt tokenizer data.
# You can skip this step if punkt is already installed.
nltk.download('punkt')

article.nlp()  # run the natural language processing step
print(article.keywords)

The result:


['focal', '1635mm', 'review', 'gm',
 'lens', 'sony', 'focus', 'aperture', 
'f28', 'fe', 'lenses']

That’s quite nice (and easy!). Of course, if I were doing this as a serious NLP project, I’d write my own NLP functions, but for a quick look at the keywords of an article, this is a fast way to do it.
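For a rough idea of what a hand-rolled keyword function might look like, here is a minimal frequency-based sketch using only the standard library. It is not how newspaper computes its keywords; top_keywords and the tiny STOPWORDS set are hypothetical, and a real project would use a fuller stopword list and a better scoring method.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real project would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "this", "that", "for", "by", "as", "are", "be", "with",
}


def top_keywords(text, n=5):
    """Return the n most frequent words, ignoring stopwords and single characters."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 1)
    return [word for word, _ in counts.most_common(n)]


sample = (
    "The Sony FE 16-35mm f/2.8 GM Lens covers a key focal length range. "
    "This lens is a wide-angle zoom lens for Sony cameras."
)
print(top_keywords(sample, n=2))  # → ['lens', 'sony']
```

Plain word counts like this favor long articles and repeated brand names, which is one reason a serious project would go further.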



If you want to learn more about Natural Language Processing using NLTK, the definitive book is Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.




Translated from: https://www.pybloggers.com/2019/01/quick-tip-consuming-google-search-results-to-use-for-web-scraping/
