3 Starting to Crawl (1)

Traversing a Single Domain

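Even if you have never heard of "Six Degrees of Wikipedia", you have almost certainly heard of its namesake, "Six Degrees of Kevin Bacon". In both games, the goal is to connect two unrelated subjects through a chain of no more than six links.

In this section we start a project that will grow into a "Six Degrees of Wikipedia" solver. That is, we will be able to start from the Eric Idle page and find the smallest number of link clicks that leads to the Kevin Bacon page.

Fetch an arbitrary Wikipedia page and list the links on it: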

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import ssl


try:
    # Skip certificate verification for this one request only
    context = ssl._create_unverified_context()
    html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon", context=context)

except HTTPError as e:
    print("The server could not fulfill the request.")
    print(e.code)
except URLError as e:
    print("Failed to reach the server.")
    print(e.reason)
else:
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    bsObj = BeautifulSoup(html, "html.parser")
    # Print the href of every <a> tag that has one
    for link in bsObj.find_all("a"):
        if 'href' in link.attrs:
            print(link.attrs['href'])

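Note the import ssl line and the two lines that create context and pass it to urlopen. They differ from the code we wrote earlier, because the earlier form raises an error here: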

[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:749)

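The reason is a behavior introduced in Python 2.7.9 (and in Python 3 from 3.4.3 on): when urlopen opens an https link, it verifies the server's SSL certificate. The URL above uses http, but Wikipedia redirects to https, so the check still applies. If the target site uses a self-signed certificate, or the certificate cannot otherwise be verified, the call fails with urllib2.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)> (in Python 3 the exception class is urllib.error.URLError). The details are in PEP 476 (https://www.python.org/dev/peps/pep-0476/).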
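The first listing sidesteps this check by calling ssl._create_unverified_context(). As a rough illustration of what that changes, the sketch below compares the settings of a default context and an unverified one without opening any network connection. It uses only the standard ssl module; the variable names verified and unverified are chosen here for illustration.

```python
import ssl

# Default client context: verifies the certificate chain and the host name.
verified = ssl.create_default_context()

# Unverified context (a private helper of the ssl module): skips both checks.
unverified = ssl._create_unverified_context()

print("default verifies certificates:", verified.verify_mode == ssl.CERT_REQUIRED)
print("default checks host name:", verified.check_hostname)
print("unverified verifies certificates:", unverified.verify_mode != ssl.CERT_NONE)
print("unverified checks host name:", unverified.check_hostname)
```

Because the unverified context accepts any certificate, it is best kept to throwaway experiments like this crawler.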

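The first listing works around the error by creating an unverified context with ssl and passing it to urlopen through the context argument. Another option is to disable certificate verification globally, as follows: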

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import ssl

try:
    # Per-request alternative from the first listing:
    #   context = ssl._create_unverified_context()
    #   html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon", context=context)
    # Global alternative: every later https request skips verification
    ssl._create_default_https_context = ssl._create_unverified_context
    html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")

except HTTPError as e:
    print("The server could not fulfill the request.")
    print(e.code)
except URLError as e:
    print("Failed to reach the server.")
    print(e.reason)
else:
    bsObj = BeautifulSoup(html, "html.parser")
    for link in bsObj.find_all("a"):
        if 'href' in link.attrs:
            print(link.attrs['href'])

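With either approach we get every link on the page, including all the articles you would expect: "Apollo 13", "Philadelphia", "Primetime Emmy Award", and so on. However, there are also some addresses you do not want: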

//wikimediafoundation.org/wiki/Privacy_policy
//en.wikipedia.org/wiki/Wikipedia:Contact_us

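In fact, every Wikipedia page is full of sidebar, header, and footer links, along with links to category pages, talk pages, and other pages that do not contain a different article:

/wiki/Category:Articles_with_unsourced_statements_from_April_2014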
/wiki/Talk:Kevin_Bacon
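The hrefs printed so far come in more than one shape: protocol-relative (//host/path) and root-relative (/wiki/...). As a small sketch, urljoin from the standard urllib.parse module shows how each would resolve against the URL of the page they were found on. The sample hrefs are the ones listed above, and the base URL is the page fetched in the listings.

```python
from urllib.parse import urljoin

# URL of the page the hrefs were found on
base = "http://en.wikipedia.org/wiki/Kevin_Bacon"

hrefs = [
    "//wikimediafoundation.org/wiki/Privacy_policy",  # protocol-relative
    "/wiki/Talk:Kevin_Bacon",                          # root-relative
]

# Resolve every href to an absolute URL
resolved = [urljoin(base, href) for href in hrefs]
for url in resolved:
    print(url)
```

A protocol-relative href keeps the scheme of the base URL but switches host, while a root-relative href stays on en.wikipedia.org.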

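If we examine the links that do point to article pages, they usually share three traits:

They sit inside the div whose id is bodyContent
Their URLs do not contain a colon
Their URLs start with /wiki/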
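The second and third traits can be tried out in isolation with a regular expression, before any page is fetched. The sketch below applies the same pattern that the following listing uses to a handful of sample hrefs. The names ARTICLE_LINK, samples, and article_links are illustrative, and the first trait (the bodyContent div) is not covered here because it needs the parsed page.

```python
import re

# Starts with /wiki/ and has no colon anywhere after that prefix
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")

samples = [
    "/wiki/Kevin_Bacon",                             # article
    "/wiki/Eric_Idle",                               # article
    "/wiki/Talk:Kevin_Bacon",                        # talk page: contains a colon
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",  # does not start with /wiki/
    "#cite_note-1",                                  # illustrative in-page anchor
]

# Keep only the hrefs that look like article links
article_links = [href for href in samples if ARTICLE_LINK.match(href)]
print(article_links)
```

The negative lookahead (?!:) is evaluated at every character position, which is what rules out a colon anywhere in the rest of the URL.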

We can use these rules to write code that retrieves only the article links we want:

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import re
import ssl

try:
    # Disable certificate verification globally, as in the previous listing
    ssl._create_default_https_context = ssl._create_unverified_context
    html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon")

except HTTPError as e:
    print("The server could not fulfill the request.")
    print(e.code)
except URLError as e:
    print("Failed to reach the server.")
    print(e.reason)
else:
    bsObj = BeautifulSoup(html, "html.parser")
    # Only links inside div#bodyContent that start with /wiki/ and contain no colon
    articleLink = re.compile(r"^(/wiki/)((?!:).)*$")
    for link in bsObj.find("div", {"id": "bodyContent"}).find_all("a", href=articleLink):
        if 'href' in link.attrs:
            print(link.attrs['href'])

The output looks similar to the following:
[Figure: screenshot of the printed /wiki/ article links]

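We now need to restructure this code to meet the following requirements:

A single function, getLinks, that takes a Wikipedia article URL of the form /wiki/<Article_Name> and returns a list of all the article URLs that page links to.
A main routine that calls getLinks, picks a random article link from the result, and calls getLinks again, until we stop the program or the new page has no article links.

The code is then: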

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import ssl
import datetime
import random
import re

# Seed with a number: newer Python versions reject a datetime object here
random.seed(datetime.datetime.now().timestamp())

# Disable certificate verification globally, as in the earlier listings
ssl._create_default_https_context = ssl._create_unverified_context

def getLinks(articleURL):
    """Return the article links found on the page at /wiki/<Article_Name>."""
    try:
        html = urlopen("http://en.wikipedia.org" + articleURL)
    except (HTTPError, URLError):
        # Python has no NULL; an empty list lets the loop below end cleanly
        return []
    bsObj = BeautifulSoup(html, "html.parser")
    body = bsObj.find("div", {"id": "bodyContent"})
    if body is None:
        return []
    return body.find_all("a", href=re.compile(r"^(/wiki/)((?!:).)*$"))


links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    # Pick one of the article links at random and follow it
    newArticle = random.choice(links).attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)
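The loop above only stops at a page with no article links, so on the real site it may run for a very long time. As a sketch of one way to bound it, the version below adds a step limit and runs against a small in-memory link table instead of Wikipedia, so it can be tried offline. FAKE_LINKS, get_links_offline, random_walk, and max_steps are all hypothetical names introduced for this illustration and are not part of the original code.

```python
import random

# Hypothetical link table standing in for Wikipedia: page -> article links on it
FAKE_LINKS = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/C"],
    "/wiki/C": [],  # dead end: no article links
}

def get_links_offline(article_url):
    """Stand-in for getLinks that looks the links up in FAKE_LINKS."""
    return FAKE_LINKS.get(article_url, [])

def random_walk(start, max_steps=10, rng=None):
    """Follow random links from start until a dead end or max_steps hops."""
    rng = rng or random.Random()
    path = [start]
    links = get_links_offline(start)
    # len(path) - 1 is the number of hops taken so far
    while links and len(path) <= max_steps:
        next_article = rng.choice(links)
        path.append(next_article)
        links = get_links_offline(next_article)
    return path

path = random_walk("/wiki/A", rng=random.Random(0))
print(path)
```

Passing the random generator in as a parameter keeps the walk reproducible in tests, while swapping get_links_offline for the real getLinks (and reading the href attribute of each tag) would turn this back into the live crawler.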