Python Web Crawler Notes (Deep Crawling of Wikipedia Pages)

张章章Sam

Tags: python, web crawler, regular expressions, images, utf-8

Article link: https://blog.csdn.net/qq_16103331/article/details/52680818


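#!/usr/bin/env python3
# coding=utf-8
# Random walk across Wikipedia article links (updated from Python 2 to Python 3).
import datetime
import random
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Seed the generator from the system clock so each run takes a different path.
# A float timestamp is used because Python 3.11+ rejects datetime objects as seeds.
random.seed(datetime.datetime.now().timestamp())


def getLinks(articleUrl):
    """Return the article links found in the body of a Wikipedia page."""
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only hrefs that start with /wiki/ and contain no colon,
    # which skips namespace pages such as Category: or Talk:.
    return bsObj.find("div", {"id": "bodyContent"}).findAll(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))


links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    # Pick a random link on the current page and follow it.
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)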
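
To see what the href pattern passed to findAll in getLinks actually keeps, here is a minimal offline sketch. The pattern is copied from the code above; the list of candidate hrefs is made up for illustration and is not taken from a real page.

```python
import re

# Same pattern as in getLinks: must start with /wiki/ and contain no colon.
ARTICLE_LINK = re.compile("^(/wiki/)((?!:).)*$")

# Hypothetical hrefs of the kind that might appear on a Wikipedia page.
candidates = [
    "/wiki/Kevin_Bacon",
    "/wiki/Category:American_film_actors",
    "/wiki/Talk:Kevin_Bacon",
    "//en.wikipedia.org/wiki/Main_Page",
    "#cite_note-1",
    "/wiki/Footloose_(1984_film)",
]

# Keep only the hrefs the pattern accepts.
article_links = [href for href in candidates if ARTICLE_LINK.search(href)]
print(article_links)
# ['/wiki/Kevin_Bacon', '/wiki/Footloose_(1984_film)']
```

The negative lookahead `(?!:)` is what rejects namespace pages such as Category: and Talk:, while the leading `^(/wiki/)` rejects anchors and links that do not point to an article path.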

PSEUDORANDOM NUMBERS AND RANDOM SEEDS
In the previous example, I used Python’s random number generator to select an article at random on each page in
order to continue a random traversal of Wikipedia. However, random numbers should be used with caution.
While computers are great at calculating correct answers, they’re terrible at just making things up. For this reason,
random numbers can be a challenge. Most random number algorithms strive to produce an evenly distributed and
hard-to-predict sequence of numbers, but a “seed” number is needed to give these algorithms something to work
with initially. The exact same seed will produce the exact same sequence of “random” numbers every time, which is why
I’ve used the system clock as a starter for producing new sequences of random numbers and, thus, new
sequences of random articles. This makes the program a little more exciting to run.
For the curious, the Python pseudorandom number generator is powered by the Mersenne Twister algorithm. While
it produces random numbers that are difficult to predict and uniformly distributed, it is slightly processor intensive.
Random numbers this good don’t come cheap!
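
The claim that the same seed reproduces the same sequence can be checked directly. The following is a small sketch, not part of the original crawler: the helper name `sample` and the seed value 42 are arbitrary choices for illustration.

```python
import random
import time


def sample(seed, count=5):
    """Return `count` pseudorandom integers from a generator seeded with `seed`."""
    # A separate Random instance leaves the module-level generator untouched.
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(count)]


# The same seed always reproduces the same sequence.
first = sample(42)
second = sample(42)
print(first == second)  # True

# Seeding from the clock gives a different starting point on each run,
# so this line usually prints different numbers every time.
clock_seed = time.time_ns()
print(sample(clock_seed))
```

A fixed seed is handy when a crawl needs to be reproduced for debugging, while a clock-based seed gives the varied traversal described above.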