How to Build a Web Scraper in Python

Web Scraping

Web scraping is an awesome tool for analysts to sift through and collect large amounts of public data. Using keywords relevant to the topic in question, a good web scraper can gather large amounts of data very quickly and aggregate it into a dataset. There are several libraries in Python that make this extremely easy to accomplish. In this article, I will illustrate an architecture that I have been using for web scraping and summarizing search engine data. The article will be broken up into the following sections…

  • Link Scraping
  • Content Scraping
  • Content Summarizing
  • Building a Pipeline

All of the code will be provided herein.

Link Scraping

First, we need a way to gather URLs relevant to the topic we are scraping data for. Fortunately, the Python library googlesearch makes it easy to gather URLs in response to an initial google search. Let’s build a class that uses this library to search our keywords and append a fixed number of URLs to a list for further analysis…

import googlesearch


class LinkScraper:
    """
    LinkScraper class is used to gather the highest-ranking websites
    for the arg:search based on googlesearch (Google's search algorithm)
    """

    def __init__(self, search, n):
        # List of returned urls
        self.urls = []
        # For each url returned, append it to the list of urls
        for url in googlesearch.search(search, stop=n):
            self.urls.append(url)
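
As a quick check, the class can be run on its own; a minimal usage sketch, where the query string and result count are only illustrative:

# Gather the top 5 result URLs for an illustrative query
scraper = LinkScraper('AAPL stock news', 5)
for url in scraper.urls:
    print(url)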

Content Scraping

This is arguably the most important part of the web scraper, as it determines what data on a webpage will be gathered. Using a combination of urllib and Beautiful Soup (bs4), we are able to retrieve and parse the HTML for each URL in our LinkScraper class. Beautiful Soup lets us specify the tags we want to extract data from. In the case below, I establish a URL request, parse the HTML response with bs4, and store all the information found in the paragraph (<p></p>) tags…

import urllib.request

import bs4 as bs


class ContentScraper:
    """
    ContentScraper class is used to parse the HTML of a url to extract data
    from certain tags
    """

    # TODO: Extract text from a pdf
    def __init__(self, url):
        # Adds a User-Agent header to the url request
        req = urllib.request.Request(
            url,
            data=None,
            headers={
                'User-Agent':
                'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/'
                '537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
            }
        )
        # Opens a url request for the initialized urllib.request.Request (req)
        scraped_data = urllib.request.urlopen(req)
        # Raw scraped HTML
        article = scraped_data.read()
        # Parses with bs4 and the lxml parser
        parsed_article = bs.BeautifulSoup(article, 'lxml')
        # Find all paragraphs
        paragraphs = parsed_article.find_all('p')
        # Article text string to append parsed HTML to
        self.article_text = ""
        # Append all paragraphs to the article_text variable
        for p in paragraphs:
            # self.article_text += '\n'  # For readability of raw data
            self.article_text += p.text
        # print(self.article_text, '\n\n')  # For readability of raw data
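
A minimal usage sketch, assuming one of the gathered URLs is passed in (the Wikipedia URL is only an example):

# Scrape the paragraph text of a single page
page = ContentScraper('https://en.wikipedia.org/wiki/Apple_Inc.')
# Print the first few hundred characters of the extracted text
print(page.article_text[:500])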

Content Summarizing

This is where we create a summary of the text extracted from each page’s HTML by our ContentScraper. To do this we will be using a combination of libraries, mainly NLTK. The way in which we generate the summary is relatively elementary and there are many ways to improve this method, but it’s a great start. After some formatting and removal of filler (stop) words, the words are tokenized and ranked by frequency to generate a few sentences that aim to accurately summarize the article…

import heapq
import re

import nltk


class Summarizer:
    """
    Summarizer class is used to summarize the paragraph text parsed by the
    ContentScraper based on frequency and relevance to the search
    """

    def __init__(self, article_text, search, n):
        # Preprocessing
        # Removing square brackets and extra spaces
        article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
        article_text = re.sub(r'\s+', ' ', article_text)
        # Removing special characters and digits
        formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
        formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
        sentence_list = nltk.sent_tokenize(article_text)
        stopwords = nltk.corpus.stopwords.words('english')
        # Find weighted frequency of occurrence
        word_frequencies = {}
        for word in nltk.word_tokenize(formatted_article_text):
            if word not in stopwords:
                if word not in word_frequencies:
                    word_frequencies[word] = 0
                # TODO: Analyze frequency against relevance scores
                # If the word appears in the keyword argument, rank it higher
                if word in search.split():
                    word_frequencies[word] += 5
                else:
                    word_frequencies[word] += 1
        maximum_frequency = max(word_frequencies.values())
        for word in word_frequencies.keys():
            word_frequencies[word] = (
                word_frequencies[word] / maximum_frequency
            )
        # Score each sentence by the frequencies of the words it contains
        sentence_scores = {}
        for sent in sentence_list:
            for word in nltk.word_tokenize(sent.lower()):
                if word in word_frequencies:
                    if len(sent.split(' ')) < 30:
                        if sent not in sentence_scores:
                            sentence_scores[sent] = word_frequencies[word]
                        else:
                            sentence_scores[sent] += word_frequencies[word]
        # Take the n highest-scoring sentences
        summary_sentences = heapq.nlargest(
            n, sentence_scores, key=sentence_scores.get
        )
        # New line between summary sentences
        self.summary = '\n'.join(summary_sentences)
        # TODO: Get Excel plugin for training_data extraction
        # Custom delimiter for extracting training_data
        self.summary += '***'
        print(self.summary)
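
Note that NLTK’s sentence tokenizer and stop-word corpus must be downloaded once before the Summarizer will run. A minimal sketch of that one-time setup, plus an illustrative call that reuses the page scraped in the previous example:

import nltk

# One-time download of the resources the Summarizer relies on
nltk.download('punkt')
nltk.download('stopwords')

# Summarize the previously scraped text into 3 sentences for the query 'AAPL'
print(Summarizer(page.article_text, 'AAPL', 3).summary)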

Building a Pipeline

This is the part where we put everything together. One class instantiates each of the other components as needed to build and run our web scraper. The WebScraper class takes a few parameters…

  • search — String, search engine query
  • n — Integer, number of URL sources to analyze
  • sl — Integer, sentence length of the summary
  • fall_through — Boolean, whether to return to the main thread before all scraping threads finish
  • write_file — Boolean, whether to write the summaries to a file

# Imports for the full script (LinkScraper, ContentScraper, Summarizer,
# and WebScraper combined in one file)
import heapq
import re
import threading
import urllib.error
import urllib.request

import bs4 as bs
import googlesearch
import nltk


class WebScraper:
    """
    WebScraper class is used to pipeline the LinkScraper, ContentScraper, and
    Summarizer classes to better understand the subject of the query
    The comprehension comes from the desired Nx1 output vector for each summary
    """

    # TODO: Give the analyst proper hierarchical purpose
    # Currently the analyst is fetching summaries to be used to train a CNN
    # Current issue is merging of rows in Excel for labeling
    def __init__(self, search, n, sl, fall_through, write_file):
        # Get the most recent news links
        # Scrape each link
        # Summarize each scrape
        # Analyze each summary

        # Item(s) to research
        self.search = search
        # Number of sources
        self.n = n
        # Summary length (sentences)
        self.sl = sl
        # Return to the main thread before all existing threads terminate
        self.fall_through = fall_through
        # Write the researched summaries to a file
        self.write_file = write_file
        # List of sources
        self.urls = []
        # List of summaries
        self.summaries = []
        # List of threads
        self.threads = []
        # Gather links to scrape
        for url in LinkScraper(search, n).urls:
            # Create and append a thread for each link and url request
            # If n is large, multithreading will be faster than calling
            # self.scrape_and_summarize(url, search) sequentially
            self.threads.append(
                threading.Thread(
                    target=self.scrape_and_summarize, args=(url, search)
                )
            )
        # Start all threads in the list
        for thread in self.threads:
            thread.start()

        # fall_through refers to returning without waiting for open threads
        if not fall_through:
            # Wait for all threads to complete
            for thread in self.threads:
                thread.join()

        # If the user wishes to write the analysis to a text file
        if write_file:
            # Deny fall-through to ensure all data is saved
            for thread in self.threads:
                thread.join()
            # Write the file out once all summaries are collected
            # Avoid UnicodeEncodeError with scraped HTML by using utf-8
            with open("training_data.txt", "w", encoding='utf-8') as file1:
                # TODO: Also write the relevant source (paired with summary)
                file1.write(''.join(self.summaries))

    # Primary target of each thread
    def scrape_and_summarize(self, url, search):
        try:
            print('Analyzing: ', url)
            # Append the generated summary to the list
            self.summaries.append(
                Summarizer(
                    ContentScraper(url).article_text, search, self.sl
                ).summary
            )
            # Append the respective link to the list
            self.urls.append(url)
            # summaries and urls essentially have 'paired keys'
        except ValueError:
            print('Value Error')
        except TimeoutError:
            print('Timeout Error')
        except urllib.error.URLError:
            print('URL Error')
        except UnicodeError:
            print('Unicode Encode Error')
        except Exception:
            print('Exception not Anticipated')

Let’s now instantiate and run an instance of this WebScraper class…

WebScraper('AAPL', 10, 3, False, False)

Running the previous code results in the following output…

Analyzing:  http://t1.gstatic.com/images?q=tbn:ANd9GcSjoU2lZ2eJX3aCMfiFDt39uRNcDu9W7pTKcyZymE2iKa7IOVaIAnalyzing:  https://en.wikipedia.org/wiki/Apple_Inc.
Analyzing: https://www.bloomberg.com/news/articles/2020-08-26/apple-plans-augmented-reality-content-to-boost-tv-video-serviceAnalyzing: https://www.marketwatch.com/story/apple-stock-rises-after-wedbush-hikes-target-to-new-street-high-of-600-2020-08-26Analyzing: https://www.marketwatch.com/story/tesla-and-apple-have-had-a-great-run-heres-why-theyre-poised-to-rocket-even-higher-in-the-next-year-2020-08-26Analyzing: https://finance.yahoo.com/quote/AAPL/Analyzing: https://seekingalpha.com/article/4370830-apple-sees-extreme-bullishness
Analyzing: https://seekingalpha.com/news/3608898-apples-newest-street-high-price-target-700-bull-caseAnalyzing: https://www.marketwatch.com/investing/stock/aaplAnalyzing: https://stocktwits.com/symbol/AAPLencoding error : input conversion failed due to input error, bytes 0x9D 0x09 0x96 0xA3
encoding error : input conversion failed due to input error, bytes 0x9D 0x09 0x96 0xA3Value Error***For more information you can review our Terms of Service and Cookie Policy.For inquiries related to this message please contact our support team and provide the reference ID below.***URL ErrorValue Error"China remains a key ingredient in Apple's recipe for success as we estimate roughly 20% of iPhone upgrades will be coming from this region over the coming year."
Ives points to recent signs of momentum in China, which he expects will continue for the next six to nine months.
Real-time last sale data for U.S. stock quotes reflect trades reported through Nasdaq only.***By comparison, Amazon AMZN, +0.58% has split its stock three times, rallying an average of 209% the following year.
Apple's history isn't quite as stellar as all those, with its four previous splits resulting in an average gain of 10.4% in the following year.
Real-time last sale data for U.S. stock quotes reflect trades reported through Nasdaq only.***The truck and fuel cell maker "could be a major horse in the EV race," wrote the analyst, while voicing concerns about the stock's valuation.
stocks edged lower Wednesday, a day after the S&P 500 set its first record close since February, after Federal Reserve officials highlighted the uncertainties facing the economy.
stock-market benchmarks mostly opened higher on Wednesday, pushing the key benchmarks to further records after an economic report came in better than expected.***Apple is the world's largest information technology company by revenue, the world's largest technology company by total assets, and the world's second-largest mobile phone manufacturer after Samsung.
Two million iPhones were sold in the first twenty-four hours of pre-ordering and over five million handsets were sold in the first three days of its launch.
The same year, Apple introduced System 7, a major upgrade to the operating system which added color to the interface and introduced new networking capabilities.***Howley also weighs in on Nintendo potentially releasing an upgraded Switch in 2021.***[Finished in 5.443s]

We have successfully extracted a few summaries from top search results about AAPL. Some sites have this type of request blocked as seen in the console output. Nevertheless, this has been a comprehensive starter guide to web scraping in Python.

Translated from: https://towardsdatascience.com/how-to-build-a-web-scraper-in-python-c75563ee60b7
