Web scraping is an awesome tool for analysts to sift through and collect large amounts of public data. Using keywords relevant to the topic in question, a good web scraper can gather large amounts of data very quickly and aggregate it into a dataset. There are several libraries in Python that make this extremely easy to accomplish. In this article, I will illustrate an architecture that I have been using for web scraping and summarizing search engine data. The article will be broken up into the following sections…

  • Link Scraping

  • Content Scraping

  • Content Summarizing

  • Building a Pipeline


All of the code will be provided herein.


Link Scraping

First, we need a way to gather URLs relevant to the topic we are scraping. Fortunately, the Python library googlesearch makes it easy to collect the URLs returned by an initial Google search. Let’s build a class that uses this library to search our keywords and append a fixed number of URLs to a list for further analysis…

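import googlesearch

class LinkScraper:
    """
    LinkScraper class is used to gather the highest ranking websites
    for the arg:search based on googlesearch (google's search algorithm)
    """

    def __init__(self, search, n):
        # List of returned urls
        self.urls = []
        # For each url returned, append it to the list of urls.
        # Assumes the `google` package, whose googlesearch.search()
        # accepts a `stop` argument that caps the number of results.
        for url in googlesearch.search(search, stop=n):
            self.urls.append(url)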
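
To make the "fixed number of URLs" idea concrete without touching the network, here is a minimal sketch. The names fake_search and collect_urls are hypothetical stand-ins introduced only for illustration; they are not part of googlesearch. The sketch simply shows how a capped list can be drawn from any stream of results.

```python
from itertools import islice


def fake_search(query):
    """Hypothetical stand-in for a search call: yields made-up URLs forever."""
    rank = 1
    while True:
        yield f"https://example.com/{query}/result-{rank}"
        rank += 1


def collect_urls(results, n):
    """Take at most n URLs from any iterable of search results."""
    return list(islice(results, n))


urls = collect_urls(fake_search("aapl"), 3)
print(urls)
```

Capping the list up front keeps the later scraping stage bounded, since each collected URL turns into one page request.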

Content Scraping

This is arguably the most important part of the web scraper, as it determines what data on a webpage will be gathered. Using a combination of urllib and Beautiful Soup (bs4), we are able to retrieve and parse the HTML for each URL collected by our LinkScraper class. Beautiful Soup lets us specify the tags we want to extract data from. In the case below I am establishing a URL request, parsing the HTML response with bs4, and storing all the text found in the paragraph (<p></p>) tags…

import urllib.request
import bs4 as bs

class ContentScraper:
    """
    ContentScraper class is used to parse the HTML of a url to extract data from
    certain tags
    """

    # TODO: Extract text from a pdf
    def __init__(self, url):
        # Adds a User-Agent Header to the url Request
        req = urllib.request.Request(
            url,
            headers={
                'User-Agent':
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/'
                '537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
            }
        )
        # Opens a url request for initialized urllib.request.Request (req)
        scraped_data = urllib.request.urlopen(req)
        # Raw Scraped Text
        article = scraped_data.read()
        # Parses with bs4 using the lxml parser
        parsed_article = bs.BeautifulSoup(article, 'lxml')
        # Find all paragraphs
        paragraphs = parsed_article.find_all('p')
        # Article text string to append parsed HTML
        self.article_text = ""
        # Append all paragraphs to article_text variable
        for p in paragraphs:
            # self.article_text += '\n' # For readability of raw data
            self.article_text += p.text
        # print(self.article_text, '\n\n') # For readability of raw data
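
To see what "keep only the text inside paragraph tags" means on a concrete input, here is a small offline sketch. Note that it swaps bs4 for the standard library's html.parser so it runs without extra installs. ParagraphExtractor, extract_paragraph_text, and the sample HTML are illustrative names and data, not part of the scraper above.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects text that appears inside <p>...</p> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # number of <p> tags we are currently inside
        self.chunks = []  # text fragments found inside paragraphs

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.chunks.append(data)


def extract_paragraph_text(html):
    """Return the concatenated text of every paragraph in the HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>Skipped text</div>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)
article_text = extract_paragraph_text(sample_html)
print(article_text)  # First bold paragraph.Second paragraph.
```

As in the class above, paragraphs are joined with no separator, and headings and other tags are dropped. This sketch assumes every <p> is explicitly closed; real pages often omit the closing tag, which is one reason to prefer a forgiving parser such as Beautiful Soup.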

Content Summarizing

This is where we create a summary of the text that our Content Scraper extracted from each page’s HTML. To do this we will be using a combination of libraries, mainly NLTK. The way in which we generate the summary is relatively elementary, and there are many ways to improve this method, but it's a great start. After some formatting and removal of stop words, the words are tokenized and ranked by frequency, and the highest-scoring sentences are selected as a short summary of the article…

import heapq
import re
import nltk

class Summarizer:
    """
    Summarizer class is used to summarize tags parsed by the ContentScraper based
    on frequency and relevance to search
    """

    def __init__(self, article_text, search, n):
        # Requires NLTK's tokenizer and stopwords data to have been
        # downloaded beforehand with nltk.download()
        # Preprocessing
        # Removing Square Brackets and Extra Spaces
        article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
        article_text = re.sub(r'\s+', ' ', article_text)
        # Removing special characters and digits
        formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
        formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
        sentence_list = nltk.sent_tokenize(article_text)
        stopwords = nltk.corpus.stopwords.words('english')
        # Find Weighted Frequency of Occurrence
        word_frequencies = {}
        for word in nltk.word_tokenize(formatted_article_text):
            if word not in stopwords:
                if word not in word_frequencies.keys():
                    word_frequencies[word] = 1
                # TODO: Analyze frequency against relevance scores
                # TODO: If its quantitative (number) rank it higher, if its relevant to the
                # keyword argument (add it) make it more relevant
                elif word in search.split():  # if relevant rank higher
                    word_frequencies[word] += 5
                else:
                    word_frequencies[word] += 1
        # Normalize every frequency by the largest one
        maximum_frequency = max(word_frequencies.values())
        for word in word_frequencies.keys():
            word_frequencies[word] = (
                word_frequencies[word] / maximum_frequency
            )
        # Score each sentence (shorter than 30 words) by the words it contains
        sentence_scores = {}
        for sent in sentence_list:
            for word in nltk.word_tokenize(sent.lower()):
                if word in word_frequencies.keys():
                    if len(sent.split(' ')) < 30:
                        if sent not in sentence_scores.keys():
                            sentence_scores[sent] = word_frequencies[word]
                        else:
                            sentence_scores[sent] += word_frequencies[word]
        # Keep the n highest scoring sentences
        summary_sentences = heapq.nlargest(
            n, sentence_scores, key=sentence_scores.get
        )
        # New line for summaries
        self.summary = '\n'.join(summary_sentences)
        # TODO: Get Excel plugin for training_data extraction
        # Custom Delimiter for extracting training_data
        self.summary += '***'
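
The scoring idea can be tried on a toy text without installing NLTK. The sketch below is a simplified approximation rather than the class above: it swaps NLTK's tokenizers for regular expressions, uses a tiny made-up stop-word list, boosts every occurrence of a search keyword, lowercases everything, and skips the 30-word sentence limit. The names summarize, STOPWORDS, and boost are assumptions for illustration.

```python
import heapq
import re

# Tiny illustrative stop-word list (an assumption; NLTK ships a far larger one)
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it", "on"}


def summarize(text, search, n, boost=5):
    """Return the n highest scoring sentences, best first."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    keywords = set(search.lower().split())
    # Count word frequencies, weighting search keywords more heavily
    frequencies = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOPWORDS:
            continue
        weight = boost if word in keywords else 1
        frequencies[word] = frequencies.get(word, 0) + weight
    if not frequencies:
        return []
    maximum = max(frequencies.values())
    # Score each sentence as the sum of its normalized word frequencies
    scores = {}
    for sentence in sentences:
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word in frequencies:
                scores[sentence] = scores.get(sentence, 0) + frequencies[word] / maximum
    return heapq.nlargest(n, scores, key=scores.get)


text = (
    "Apple reported record revenue. "
    "The weather was mild on Tuesday. "
    "Analysts expect Apple revenue to grow. "
    "The stock rose after the report."
)
summary = summarize(text, "apple", 2)
print(summary)
# ['Analysts expect Apple revenue to grow.', 'Apple reported record revenue.']
```

The two sentences that mention the search keyword win because "apple" dominates the frequency table once boosted, which is the same effect the `+= 5` weighting aims for.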

Building a Pipeline

This is the part where we put everything together. One class will create an instance of each of the other components as needed to build and run our web scraper. The WebScraper class takes a few parameters…

  • search: String, Search engine query

  • n: Integer, Number of URL sources to analyze

  • sl: Integer, Sentence length of the summary

  • fall_through: Boolean, Return to the main thread without waiting for the worker threads to finish

  • write_file: Boolean, Write the summaries to a file

import threading
import heapq
import urllib.error
import urllib.request
import re
import bs4 as bs
import googlesearch
import nltk

class WebScraper:
    """
    WebScraper class is used to pipeline the LinkScraper, ContentScraper, and Summarizer
    classes to better understand the subject of the query
    The comprehension comes from the desired Nx1 output vector for each summary
    """

    # TODO: Give the analyst proper hierarchical purpose
    # Currently the analyst is fetching summaries to be used to train CNN
    # Current issue is merging of rows in excel for labeling
    def __init__(self, search, n, sl, fall_through, write_file):
        # Get most recent news Links
        # WebScrape each Link
        # Summarize each WebScrape
        # Analyze each Summary

        # Item(s) to research
        self.search = search
        # Number of sources
        self.n = n
        # Summary length (sentences)
        self.sl = sl
        # Return to main thread before all existing threads terminate
        self.fall_through = fall_through
        # Write the researched summaries to a file
        self.write_file = write_file
        # List of sources
        self.urls = []
        # List of summaries
        self.summaries = []
        # List of threads
        self.threads = []
        # Gather Links to WebScrape
        for url in LinkScraper(search, n).urls:
            # Create and append a thread for each link and url Request
            # self.scrape_and_summarize(url, search) # Single-threaded alternative
            # If n > x, multi-threading will be faster
            self.threads.append(
                threading.Thread(
                    target=self.scrape_and_summarize, args=(
                        url, search
                    )
                )
            )
        # Start all threads in list
        for thread in self.threads:
            thread.start()

        # fall_through refers to returning without waiting for open threads
        if not fall_through:
            # waiting for all threads to complete
            for thread in self.threads:
                thread.join()

        # if the user wishes to write the analysis to a text file
        if write_file:
            # Deny fall through to ensure all data is saved
            for thread in self.threads:
                thread.join()
            # Write the file out
            # Avoid UnicodeEncodeError with WebScraped html using utf-8
            with open(
                "training_data.txt", "w", encoding='utf-8'
            ) as file1:
                file1.writelines(self.summaries)
            # TODO: Also write the relevant source (paired with summary)

    # Primary target of thread
    def scrape_and_summarize(self, url, search):
        try:
            print('Analyzing: ', url)
            # Append the generated summary to list
            self.summaries.append(
                Summarizer(
                    ContentScraper(url).article_text, search, self.sl
                ).summary
            )
            # Append the respective link to list
            # summaries and urls essentially have 'paired keys'
            self.urls.append(url)
        # UnicodeError is a subclass of ValueError, so it is caught first
        except UnicodeError:
            print('Unicode Encode Error')
        except ValueError:
            print('Value Error')
        except TimeoutError:
            print('Timeout Error')
        except urllib.error.URLError:
            print('URL Error')
        except Exception:
            print('Exception not Anticipated')
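
The thread-per-URL pattern with an optional wait can be exercised in isolation. The sketch below is an assumption-laden illustration, not the pipeline itself: run_workers and fake_summary are hypothetical names, and the "scraping" is faked so nothing touches the network. It shows starting one thread per item, optionally joining them all, and skipping items whose work raises an error.

```python
import threading


def run_workers(items, work, wait=True):
    """Run work(item) in one thread per item and collect (item, result) pairs."""
    results = []
    lock = threading.Lock()  # guards the shared results list

    def target(item):
        try:
            value = work(item)
        except Exception as exc:
            # Mirror the pipeline's behaviour: report the failure and move on
            print("Skipping", item, "because of", repr(exc))
            return
        with lock:
            results.append((item, value))

    threads = [threading.Thread(target=target, args=(item,)) for item in items]
    for thread in threads:
        thread.start()
    if wait:
        # Equivalent to fall_through=False: block until every thread finishes
        for thread in threads:
            thread.join()
    return results, threads


def fake_summary(url):
    """Hypothetical stand-in for scraping and summarizing one page."""
    if "blocked" in url:
        raise ValueError("request refused")
    return url.upper()


results, threads = run_workers(
    ["https://a.example", "https://blocked.example", "https://b.example"],
    fake_summary,
)
print(sorted(results))
```

Threads finish in no guaranteed order, which is why the results are sorted before printing and why the pipeline keeps summaries and URLs in parallel lists rather than relying on position in the original search ranking.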

Let’s now instantiate and run an instance of this WebScraper class…


WebScraper('AAPL', 10, 3, False, False)

Running the previous code results in the following output…


We have successfully extracted a few summaries from top search results about AAPL. Some sites have this type of request blocked as seen in the console output. Nevertheless, this has been a comprehensive starter guide to web scraping in Python.



