How to Make a Web Crawler in Under 50 Lines of Python Code

GoodTekken

Tags: python, spider, web crawler

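Want to know how Google, Bing, or Yahoo work? Wondering what it takes to crawl the web, and what a simple web crawler looks like? Here is a simple web crawler in under 50 lines of Python (version 3) code. (The full source with comments is at the bottom of this article.)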

Let's look at how it runs. Note that you enter the starting website, the word to find, and the maximum number of pages to search.

Okay, but how does it work?

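Let's first talk about what a web crawler is for. As described on the Wikipedia page, a web crawler is a program that browses the World Wide Web in a methodical fashion to collect information. What kind of information does a web crawler collect? Typically two things:

Web page content (the text and multimedia on a page)
Links (to other web pages on the same website, or to other websites entirely)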

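That's exactly what this little "robot" does. It starts at the website that you type into the spider() function and looks at all the content on that website. This particular bot doesn't examine any multimedia; it only looks for "text/html", as described in the code. Each time it visits a web page it collects two sets of data: all the text on the page, and all the links on the page. If the word isn't found in the text on the page, the robot takes the next link in its collection and repeats the process, again collecting the text and the set of links on the next page. It repeats this over and over until it either finds the word or runs into the limit that you passed to the spider() function.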
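The loop described above can be tried without touching the network. The following is only a sketch under stated assumptions: `FAKE_WEB`, the `site.test` URLs, and the `crawl` function are made up for illustration and are not part of the article's code. It also keeps a `seen` set so that no page is queued twice, which the full source at the bottom of this article does not do.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, links found on that page).
FAKE_WEB = {
    "http://site.test/": ("welcome home", ["http://site.test/a", "http://site.test/b"]),
    "http://site.test/a": ("nothing to see", ["http://site.test/", "http://site.test/c"]),
    "http://site.test/b": ("still nothing", []),
    "http://site.test/c": ("a secure page", []),
}


def crawl(start_url, word, max_pages, web=FAKE_WEB):
    """Visit pages in first-in, first-out order until `word` is found
    or `max_pages` pages have been visited.

    Returns (url_where_found_or_None, list_of_visited_urls).
    """
    to_visit = deque([start_url])
    seen = {start_url}          # assumption: never queue the same URL twice
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        visited.append(url)
        text, links = web.get(url, ("", []))
        if word in text:
            return url, visited
        # Word not found: remember this page's links for later visits.
        for link in links:
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return None, visited


found_at, visited = crawl("http://site.test/", "secure", max_pages=10)
print(found_at)
print(visited)
```

Calling `crawl` with a smaller `max_pages` shows the other way the loop can end: the limit is reached before the word turns up, and the first element of the result is `None`.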

Is this how Google works?

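Sort of. Google has a whole fleet of web crawlers constantly crawling the web, and crawling is a big part of discovering new content (or keeping up with websites that are constantly changing or adding new material). However, you probably noticed that this search took a while to complete, maybe a few seconds. A harder search term might take even longer. There is another big component to search engines, called indexing. Indexing is what you do with all the data that the web crawler collects. Indexing means that you parse (go through and analyze) the web page content and create a big collection (think database or table) of easily accessible and quickly retrievable information. So when you visit Google and type in "kitty cat", your search terms go straight* to the collection of data that has already been crawled, parsed, and analyzed. In fact, your search results are already sitting there, waiting for the magic phrase "kitty cat" to unleash them. That's why you can get over 14 million results within 0.14 seconds.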
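One simple way to picture such a collection is an inverted index: a table that maps each word to the set of pages containing it. The sketch below is only an illustration of that idea and makes no claim about how Google actually stores its data. `PAGES`, `build_index`, and `search` are hypothetical names, and the page texts are made up.

```python
from collections import defaultdict

# Hypothetical output of a crawl: URL -> page text.
PAGES = {
    "http://site.test/1": "kitty cat pictures",
    "http://site.test/2": "dog pictures",
    "http://site.test/3": "my cat is a kitty",
}


def build_index(pages):
    """Map every lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index(PAGES)          # done once, ahead of time
print(sorted(search(index, "kitty cat")))
```

The slow part (reading every page) happens once in `build_index`. Each later `search` is a few dictionary lookups and set intersections, which is why answering a query can be much faster than crawling for it.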

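*Your search terms actually hit a number of databases at the same time, such as spell checkers, translation services, and analytics and tracking servers.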

Let's look at the code in more detail!

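The following code should be fully functional for Python 3.x. It was written and tested with Python 3.2.2 in September 2011. Go ahead and copy and paste it into your Python IDE, then run it or modify it!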

from html.parser import HTMLParser
from urllib.request import urlopen
from urllib import parse

# We are going to create a class called LinkParser that inherits some
# methods from HTMLParser which is why it is passed into the definition
class LinkParser(HTMLParser):

    # This is a function that HTMLParser normally has
    # but we are adding some functionality to it
    def handle_starttag(self, tag, attrs):
        # We are looking for the beginning of a link. Links normally look
        # like <a href="www.someurl.com"></a>
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    # We are grabbing the new URL. We are also adding the
                    # base URL to it. For example:
                    # www.netinstructions.com is the base and
                    # somepage.html is the new URL (a relative URL)
                    #
                    # We combine a relative URL with the base URL to create
                    # an absolute URL like:
                    # www.netinstructions.com/somepage.html
                    newUrl = parse.urljoin(self.baseUrl, value)
                    # And add it to our collection of links:
                    self.links.append(newUrl)

    # This is a new function that we are creating to get links
    # that our spider() function will call
    def getLinks(self, url):
        self.links = []
        # Remember the base URL which will be important when creating
        # absolute URLs
        self.baseUrl = url
        # Use the urlopen function from the standard Python 3 library
        response = urlopen(url)
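        # Make sure that we are looking at HTML and not other things that
        # are floating around on the internet (such as
        # JavaScript files, CSS, or .PDFs for example).
        # The header often carries a charset suffix, for example
        # "text/html; charset=utf-8", and may be missing altogether, so
        # test for the media type instead of comparing the whole header.
        if 'text/html' in (response.getheader('Content-Type') or ''):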
            htmlBytes = response.read()
            # Note that feed() handles Strings well, but not bytes
            # (A change from Python 2.x to Python 3.x)
            htmlString = htmlBytes.decode("utf-8")
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "",[]

# And finally here is our spider. It takes in an URL, a word to find,
# and the number of pages to search through before giving up
def spider(url, word, maxPages):
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    # The main loop. Create a LinkParser and get all the links on the page.
    # Also search the page for the word or string
    # In our getLinks function we return the web page
    # (this is useful for searching for the word)
    # and we return a set of links from that web page
    # (this is useful for where to go next)
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
        numberVisited = numberVisited +1
        # Start from the beginning of our collection of pages to visit:
        url = pagesToVisit[0]
        pagesToVisit = pagesToVisit[1:]
        try:
            print(numberVisited, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            if data.find(word)>-1:
                foundWord = True
            # Add the links that we found on this page to the end of our
            # collection of pages to visit:
            pagesToVisit = pagesToVisit + links
            print(" **Success!**")
        except Exception:
            print(" **Failed!**")
    if foundWord:
        print("The word", word, "was found at", url)
    else:
        print("Word never found")
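The link-collecting step can be checked on its own, without fetching anything. The sketch below is a hypothetical offline variant of the parser above: `OfflineLinkParser` and `SAMPLE_HTML` are names made up for this example, and the base URL is just a sample value. It feeds a small HTML string to `HTMLParser` and shows how `urljoin` turns relative links into absolute ones.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OfflineLinkParser(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags in an HTML string."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                # An href attribute without a value arrives as None; skip it.
                if key == "href" and value is not None:
                    self.links.append(urljoin(self.base_url, value))


# Made-up page with one relative, one root-relative and one absolute link.
SAMPLE_HTML = """
<html><body>
<a href="somepage.html">relative</a>
<a href="/about">root-relative</a>
<a href="http://example.com/x">absolute</a>
<img src="logo.png">
</body></html>
"""

parser = OfflineLinkParser("http://www.netinstructions.com/blog/")
parser.feed(SAMPLE_HTML)
links = parser.links
for link in links:
    print(link)
```

Note that `urljoin` needs a base URL with a scheme such as `http://` to resolve relative links this way, so the starting URL handed to the crawler should include one.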