Heibanke Python Crawler Challenge, Level 1: A Simple Solution with urllib

Evixenon

Tags: python

Article link: https://blog.csdn.net/XenonL/article/details/105019822

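The Problem

Following the instructions, append the number to the end of the URL and press Enter:

After repeating this a few times the page still looks the same, and it frankly tells me that there are several hundred such numbers.

Right-click and view the page source: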



<!DOCTYPE html>
<html lang="zh-CN" >
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" >
    <meta name="baidu-site-verification" content="f2dwZkRE36" />
    <meta name="description" content="该博客用于记录网易云课堂中相关课程讲述的思路和一些扩展." />
    <meta name="keywords" content="blog, 博客, 黑板客, 爬虫闯关, 机器学习, django后端开发, 用python做些事." />
    <meta name="author" content="heibanke" />
    <head>
        <title>爬虫闯关——1</title>

        <link rel="stylesheet" href="/static/libs/bootstrap/css/bootstrap.min.css">
        <link href="/static/libs/jquery-ui-1.11.2/jquery-ui.min.css" rel="stylesheet" type="text/css"/>
 
        <!--[if lt IE 9]><script src="/static/libs/js/ie8-responsive-file-warning.js"></script><![endif]-->


        <!--[if lt IE 9]>
          <script src="/static/libs/js/html5shiv.js"></script>
          <script src="/static/libs/js/respond.min.js"></script>
        <![endif]-->
        <script src="/static/libs/jquery-ui-1.11.2/external/jquery/jquery.js"></script>
        <script src="/static/libs/bootstrap/js/bootstrap.min.js"></script>
        <script src="/static/libs/js/d3.min.js"></script>
        <script src="/static/libs/jquery-ui-1.11.2/jquery-ui.min.js"></script>
        
        <script>
            var _hmt = _hmt || [];
            (function() {
              var hm = document.createElement("script");
              hm.src = "//hm.baidu.com/hm.js?74e694103cf02b31b28db0a346da0b6b";
              var s = document.getElementsByTagName("script")[0]; 
              s.parentNode.insertBefore(hm, s);
            })();
        </script>
    
         
    </head>
    <body>
        <div class="container-fluid">
         
        <div class="row">
            <div class="col-sm-1 col-md-2 col-lg-3"></div>
            <div class="col-xs-12 col-sm-10 col-md-8 col-lg-6">
                
<h1>这里是黑板客爬虫闯关的第一关</h1>
<h3>下一个你需要输入的数字是43396. </h3>



                <div class="text-center">
<table class="table">
<thead>
<tr>
<td><a href="http://www.miitbeian.gov.cn" target="_blank" ><small>京ICP备14042321号</small></a></td>
<td><a href="http://www.heibanke.com/jizhang" target="_blank" ><small>云记账demo</small></a></td>
<td><a href="http://www.heibanke.com" target="_blank" ><small>黑板客blog</small></a></td>
<td><a href="http://study.163.com/course/courseMain.htm?courseId=1000035" target="_blank" ><small>用Python做些事</small></a></td>

</tr>
</thead>
</table>
</div>    
            </div>
            <div class="col-sm-1 col-md-2 col-lg-3"></div>
        </div>  
        </div>
    </body>
</html>

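The part we care about is here. The number sits inside the h3 tag, whose text reads "The next number you need to enter is 43396.":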

<h1>这里是黑板客爬虫闯关的第一关</h1>
<h3>下一个你需要输入的数字是43396. </h3>
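As a quick check of the idea, the number can be pulled out of that h3 line with one regular expression and a capture group. This is only a sketch: `extract_number` and `H3_NUMBER` are hypothetical names, and the assumption that the number is a run of five digits comes from the sample page above, not from any documentation of the site.

```python
import re

# Assumption: the number is a run of five digits inside an <h3> tag,
# as in the sample page source shown above.
H3_NUMBER = re.compile(r"<h3>[^<]*?(\d{5})[^<]*</h3>")


def extract_number(html):
    """Return the five-digit number from the first matching <h3>, or None."""
    match = H3_NUMBER.search(html)
    return match.group(1) if match else None


sample = "<h1>这里是黑板客爬虫闯关的第一关</h1>\n<h3>下一个你需要输入的数字是43396. </h3>"
print(extract_number(sample))
```

Returning None when nothing matches gives the caller a natural way to detect the last page.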

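Analysis

The program we write needs to do the following:

1. Send a request to the URL and get the response source
2. Find the number in the response source
3. Append the number to the original URL to get a new URL
4. Send a request to the new URL and get its response source (recursion)
5. Go back to step 2, unless no number can be found (level complete)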
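The steps above can also be sketched as a plain loop instead of recursion. With several hundred numbers to follow, a loop does not depend on Python's recursion limit (1000 frames by default in CPython). In this sketch `follow_numbers`, its `fetch` parameter, and the `max_steps` cap are all hypothetical, and the fake pages stand in for the real site so the control flow can be tried offline.

```python
import re

BASE_URL = "http://www.heibanke.com/lesson/crawler_ex00/"
NUMBER_IN_H3 = re.compile(r"<h3>.*?(\d{5}).*?</h3>")


def follow_numbers(fetch, start_url=BASE_URL, max_steps=1000):
    """Follow the chain of numbers and return the list of URLs visited.

    `fetch` is any callable that takes a URL and returns the page text
    (a hypothetical parameter, so the loop can run without the network).
    """
    visited = []
    url = start_url
    for _ in range(max_steps):                   # safety cap, an assumption of this sketch
        visited.append(url)
        match = NUMBER_IN_H3.search(fetch(url))  # steps 1 and 2
        if match is None:                        # step 5: no number, level complete
            break
        url = start_url + match.group(1) + "/"   # step 3
    return visited


# A tiny fake site with two hops, standing in for the real pages.
fake_pages = {
    BASE_URL: "<h3>next 11111. </h3>",
    BASE_URL + "11111/": "<h3>next 22222. </h3>",
    BASE_URL + "22222/": "<h3>finished</h3>",
}
print(follow_numbers(fake_pages.get))
```

Passing the fetch function in as an argument keeps the loop logic separate from the network code, which makes it easy to swap in a real `urlopen`-based fetcher later.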

Code

I am new to both the urllib and re modules, so please forgive the code for not being very elegant.

#coding:utf-8

import re
import urllib.request


url1 = "http://www.heibanke.com/lesson/crawler_ex00/"
# pattern is the regular expression that matches an h3 tag containing a run of five digits
pattern = re.compile(r"<h3>.*\d{5}.*</h3>")
count = 0

def get_url(url):
    # Increment the counter on every call
    global count
    count += 1

    # 1. Fetch the URL with urlopen() and read the page source
    response = urllib.request.urlopen(url, timeout=1000)    # timeout is in seconds
    text = response.read().decode('utf-8')    # text holds the page source
    s = re.search(pattern, text)    # s is a Match object for the h3 tag (or None); s.group() returns the matched text

    # 5. Stopping condition: re.search returns None when nothing matches
    if s is None:
        print("finished")

    # If the match succeeded, make the next call
    else:
        # Print the call count and the h3 tag as simple logging, so the script gives some feedback while it runs
        print(count, ' ', s.group())
        # 2. Read the number out of the h3 tag
        pt = re.compile(r"\d{5}")
        s2 = re.search(pt, s.group())
        # 3. Append the number to get the new URL
        newUrl = url1 + s2.group() + "/"
        # 4. Call itself recursively
        get_url(newUrl)

if __name__ == '__main__':
    get_url(url1)
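One possible refinement, offered as a sketch and not as part of the original solution: the script above lets any network error abort the whole run, and `timeout=1000` means a stalled request could block for more than sixteen minutes. The hypothetical helper `fetch_text` below wraps `urlopen` with a shorter timeout (10 seconds is an arbitrary choice) and returns None on failure.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Return the decoded body of `url`, or None if the request fails."""
    try:
        # The response object is a context manager, so it is closed automatically.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print("request failed:", url, exc)
        return None
```

A caller could retry or stop cleanly when `fetch_text` returns None, instead of losing the progress made over several hundred requests.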