Python Web Scraping Notes (Deep Crawl of Wikipedia Pages)

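#!/usr/bin/env python3
# coding=utf-8
# Python 3 port of the original Python 2 script (urllib2 became urllib.request).
import datetime
import random
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

# Seed from the system clock so each run follows a different path.
random.seed(datetime.datetime.now().timestamp())


def getLinks(articleUrl):
    """Return the article links (/wiki/... with no colon) found in the page body."""
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    return bsObj.find("div", {"id": "bodyContent"}).find_all(
        "a", href=re.compile("^(/wiki/)((?!:).)*$"))


links = getLinks("/wiki/Kevin_Bacon")
# Keep hopping to a randomly chosen linked article until a page has no article links.
while len(links) > 0:
    newArticle = random.choice(links).attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)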

PSEUDORANDOM NUMBERS AND RANDOM SEEDS
In the previous example, I used Python’s random number generator to select an article at random on each page in
order to continue a random traversal of Wikipedia. However, random numbers should be used with caution.
While computers are great at calculating correct answers, they’re terrible at just making things up. For this reason,
random numbers can be a challenge. Most random number algorithms strive to produce an evenly distributed and
hard-to-predict sequence of numbers, but a “seed” number is needed to give these algorithms something to work
with initially. The exact same seed will produce the exact same sequence of “random” numbers every time, so for
this reason I’ve used the system clock as a starter for producing new sequences of random numbers, and, thus, new
sequences of random articles. This makes the program a little more exciting to run.
For the curious, the Python pseudorandom number generator is powered by the Mersenne Twister algorithm. While
it produces random numbers that are difficult to predict and uniformly distributed, it is slightly processor intensive.
Random numbers this good don’t come cheap!
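
To make the point about seeds concrete, here is a minimal sketch using only the standard `random` module. The helper name `sample_sequence` and the seed values 42 and 43 are illustrative assumptions, not part of the crawler above. It shows that two generators started from the same seed yield the same sequence.

```python
import random


def sample_sequence(seed, count=5):
    """Return `count` pseudorandom integers from a generator seeded with `seed`.

    `sample_sequence` is a hypothetical helper used only for illustration.
    """
    # A separate Random instance leaves the module-level generator untouched.
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(count)]


first = sample_sequence(42)
second = sample_sequence(42)
other = sample_sequence(43)

# Same seed, same sequence: this comparison is always True.
print(first == second)
# A different seed starts the generator from a different state.
print(first, other)
```

Fixing the seed this way can be handy when debugging the crawler, because the "random" walk then follows the same choices on every run, as long as the pages themselves do not change.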
