Study Notes - Python Crawler (2): The Crawler Challenge Series

awakeBird

Tags: crawler

Article link: https://blog.csdn.net/awakeBird/article/details/53792868

I stumbled upon a really fun website: a crawler challenge series (爬虫闯关).

Level 1:

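This is the simplest kind of static page scraping: grab the five-digit number shown on the page, append it to the URL, request the new URL, and repeat.
There is one small bug, though: on the last page the crawler picks up a different string of digits further down the page.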
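The end-of-chain bug can be reproduced offline. The sketch below is only an illustration: the two page fragments are made up (the real site's markup may differ), and `next_number_loose` / `next_number_strict` are hypothetical helper names. It shows why a lazy pattern compiled with `re.S` can run past the `<h3>` into later markup, and how restricting the search to the `<h3>` text avoids that.

```python
import re

# Hypothetical page fragments; the real site's markup may differ.
NORMAL_PAGE = '''
<div class="row">
  <h3>The next number is 48763, keep going.</h3>
</div>
'''
FINAL_PAGE = '''
<div class="row">
  <h3>Congratulations, you have passed this level.</h3>
</div>
<div class="footer">Visitor count: 13579</div>
'''

# Loose pattern: with re.S, ".*?" can run past </h3> into later markup.
LOOSE = re.compile(r'<div class="row">.*?<h3>.*?(\d{5})', re.S)
# Stricter approach: isolate the <h3> text first, then search inside it.
H3_TEXT = re.compile(r'<h3>(.*?)</h3>', re.S)


def next_number_loose(html):
    """Return the first five digits found anywhere after the <h3>."""
    match = LOOSE.search(html)
    return match.group(1) if match else None


def next_number_strict(html):
    """Return the first five digits inside the <h3> text only."""
    h3 = H3_TEXT.search(html)
    if h3 is None:
        return None
    digits = re.search(r'\d{5}', h3.group(1))
    return digits.group(0) if digits else None


print(next_number_loose(NORMAL_PAGE), next_number_strict(NORMAL_PAGE))
print(next_number_loose(FINAL_PAGE), next_number_strict(FINAL_PAGE))
```

On the normal fragment both helpers agree. On the final fragment the loose pattern returns the unrelated footer number, while the strict one returns `None`, which can serve as the stop signal.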

# Python 2
import re
import urllib2

url = 'http://www.heibanke.com/lesson/crawler_ex00/'
plus_str = ''
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
urlp = ''
# Lazily match the first five-digit number that follows the <h3> in the "row" div.
pattern = re.compile(r'<div class="row">.*?<h3>.*?(\d\d\d\d\d)', re.S)
while True:
    try:
        # Append the number found on the previous page to the base URL.
        urlp = url + plus_str
        print urlp
        request = urllib2.Request(urlp, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read().decode('utf-8')
        plus = re.findall(pattern, content)
        plus_str = ''.join(plus)
    except urllib2.URLError, e:
        # A failed request ends the loop.
        print e
        print urlp
        break

The same thing implemented with the bs4 and urllib modules (Python 3.5):

# -*- coding:utf-8 -*-

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import re

def getTitle(newNum):
    # Request the page for the current number (an empty string means the start page).
    url = 'http://www.heibanke.com/lesson/crawler_ex00/' + newNum
    print("Calling..." + url)
    try:
        page = urlopen(url)
    except (HTTPError, URLError) as e:
        print(e)
        return
    bsobj = BeautifulSoup(page.read(), "html.parser")
    # Search only the text of the first <h3> for a five-digit number.
    next_Num = re.findall(r"\d{5}", bsobj.h3.get_text())
    if not next_Num:
        print("Finished.")
        return
    return getTitle(next_Num[0])

getTitle('')
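
The recursive version makes one nested call per page, so a long chain of numbers could in principle run into Python's recursion limit. As a rough alternative, here is a loop-based sketch using only the standard library. The names `fetch`, `extract_number`, `crawl`, and `max_steps` are my own, the regex assumes a plain `<h3>` tag without attributes, and `FAKE_SITE` is made-up data so the example runs without network access.

```python
import re
from urllib.request import urlopen

BASE_URL = 'http://www.heibanke.com/lesson/crawler_ex00/'


def fetch(url):
    """Download a page and return its text (assumes UTF-8)."""
    with urlopen(url) as page:
        return page.read().decode('utf-8')


def extract_number(html):
    """Return the first five-digit number inside the first <h3>, or None."""
    h3 = re.search(r'<h3>(.*?)</h3>', html, re.S)
    if h3 is None:
        return None
    digits = re.search(r'\d{5}', h3.group(1))
    return digits.group(0) if digits else None


def crawl(fetch_page=fetch, max_steps=1000):
    """Follow the chain of numbers; return the numbers visited in order."""
    visited = []
    number = ''
    for _ in range(max_steps):  # hard cap so a looping chain cannot run forever
        html = fetch_page(BASE_URL + number)
        number = extract_number(html)
        if number is None:
            break
        visited.append(number)
    return visited


# Offline demonstration with made-up pages, so no network access is needed.
FAKE_SITE = {
    BASE_URL: '<h3>Next number: 11111</h3>',
    BASE_URL + '11111': '<h3>Next number: 22222</h3>',
    BASE_URL + '22222': '<h3>Well done, level cleared.</h3>',
}
print(crawl(fetch_page=FAKE_SITE.__getitem__))
```

Calling `crawl()` with no arguments would use the real `fetch` and hit the site. Passing the page fetcher in as a parameter is just a convenience for testing the loop logic offline.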