1. Crawlers: Python Web Crawler Basics (16 lectures, class notes)

Latest recommended article published on 2022-11-21 23:21:24
gzg----rxq
Article link: https://blog.csdn.net/gzgrxq521/article/details/81074191
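Python Web Crawler Basics
1. A brief look at HTTP
    1.1 HTTP request format
        When a browser sends a request to a web server, it passes the server a block of data, the request message. An HTTP request message consists of three parts:
         * Request line: request method, URL, protocol/version
         * Request headers (Request Header), e.g. for a request to http://www.baidu.com/httpttest.php
         * Request body
            For example:
                GET /samp.jsp HTTP/1.1   --> request method, URL, protocol version
                Accept: image/gif, image/jpeg, */*
                Accept-Language: zh-cn
                Connection: Keep-Alive
                Host: localhost
                User-Agent: *********************
                Accept-Encoding: gzip, deflate  --> request headers
                username=jingqiao&password=123  --> request body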
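To make the three-part layout concrete, the sketch below splits a raw request message into its request line, headers and body. It is only a minimal illustration, not a full HTTP parser. The function name `split_http_request` and the sample message are hypothetical, and it assumes Python 3 and the standard blank line (CRLF CRLF) that separates the headers from the body on the wire.

```python
def split_http_request(raw):
    """Split a raw HTTP request into (request_line, headers, body).

    Hypothetical helper for illustration; assumes CRLF line endings and
    a blank line between the headers and the body.
    """
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Request line: method, URL, protocol/version
    method, url, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values may contain colons
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return (method, url, version), headers, body


# Sample message modelled on the example above (sent as POST here, because
# a request body is conventional for POST rather than GET).
raw_request = (
    "POST /samp.jsp HTTP/1.1\r\n"
    "Host: localhost\r\n"
    "Accept-Language: zh-cn\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
    "username=jingqiao&password=123"
)

request_line, headers, body = split_http_request(raw_request)
print(request_line)
print(headers["Host"])
print(body)
```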
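    1.2 HTTP request methods
        * The common HTTP request methods are GET and POST
        * GET is the simpler HTTP request: the data sent to the web server is appended directly to the request URL,
            in the form ?key1=value1&key2=value2. It is only suitable for requests that carry little data and have no security requirements
        * POST encodes the data to be sent to the web server and places it in the request body. It can carry large amounts of data and offers some degree of security (the data does not appear in the URL), so it is commonly used for form submission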
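The difference between the two methods can be sketched with the standard library alone. The example below assumes Python 3, where `urlencode` lives in `urllib.parse` (in the Python 2 used by these notes it is `urllib.urlencode`). The URL `http://example.com/search` and the parameter values are placeholders, and nothing is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {"key1": "value1", "key2": "value 2"}
encoded = urlencode(params)  # spaces become '+', pairs are joined with '&'

# GET: the encoded data is appended to the URL after '?'
get_url = "http://example.com/search?" + encoded

# POST: the same encoded data travels in the request body, as bytes
post_body = encoded.encode("ascii")

# The receiving side can recover the original values from the query string
recovered = parse_qs(urlsplit(get_url).query)

print(get_url)
print(post_body)
print(recovered)
```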
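    1.3 Browser developer tools
        * Press F12 in the browser to open the web developer tools

    1.4 Viewing the contents of an HTTP GET request
        * On the page, press F12 and open the Network tab
    1.5 Viewing the contents of an HTTP POST request
        * On the page, press F12 and open the Network tab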
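2. Using the urllib and urllib2 modules
    2.1 Introduction to urllib and urllib2
        * urllib and urllib2 are both powerful network programming libraries. With them, accessing files over the network is much like accessing files on the local computer: with a simple function call, almost anything a URL
          points to can be used as program input
        * Combined with the re module (regular expressions), urllib and urllib2 can download web pages, extract data from those pages, and automatically generate reports
    2.2 Comparing urllib and urllib2
        * urllib2 can accept an instance of the Request class to set the headers of a URL request, whereas urllib only accepts a URL. With urllib alone you therefore cannot spoof your User-Agent string (i.e. pretend to be a browser)
        * urllib provides the urlencode method for building GET query strings and urllib2 does not, which is why urllib is often used together with urllib2.
        * The main advantage of urllib2 is that urllib2.urlopen accepts a Request object as its argument, which gives control over the headers of the HTTP request
        * However, urllib.urlretrieve and the quote/unquote family of functions such as urllib.quote were not added to urllib2, so urllib is still needed at times.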
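For readers on Python 3, where the two modules were merged into the `urllib` package, the division of labour described above can be sketched as follows: `Request` now lives in `urllib.request`, and `quote`/`unquote` in `urllib.parse`. The URL and the User-Agent value are placeholders, and the request is only constructed, never sent.

```python
from urllib.parse import quote, unquote
from urllib.request import Request

# quote/unquote percent-encode and decode a single URL component
encoded_part = quote("a b&c=d")
decoded_part = unquote(encoded_part)

# A Request object carries custom headers, e.g. a browser-like User-Agent
req = Request(
    "http://example.com/" + encoded_part,
    headers={"User-Agent": "Mozilla/5.0 (illustrative value)"},
)

print(encoded_part)
print(decoded_part)
print(req.full_url)
# Request normalises header names with str.capitalize(), hence 'User-agent'
print(req.get_header("User-agent"))
```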

    2.3 Using urllib2 to fetch a given URL and get the page content


            import urllib2

            def download_with_retry(url, num_retries=2):
                print 'Downloading', url
                try:
                    html = urllib2.urlopen(url).read()
                except urllib2.URLError as e:
                    html = None
                    if hasattr(e, 'reason'):
                        print "Failed to reach the server"
                        print "Reason:", e.reason
                    if hasattr(e, 'code'):
                        print "The server couldn't fulfill the request"
                        print "Error code:", e.code
                        # retry only on 5xx server errors (500 included)
                        if num_retries > 0 and 500 <= e.code < 600:
                            return download_with_retry(url, num_retries - 1)
                return html

            # download_with_retry('http://www.bai.com')
            download_with_retry('http://httpstat.us/500')
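The retry logic can also be sketched in Python 3, where `urllib2.urlopen` and `urllib2.URLError` became `urllib.request.urlopen` and `urllib.error.URLError`. This is one possible port, not the original author's code. The `opener` parameter and the fake opener `always_500` are hypothetical additions, so the behaviour can be checked offline.

```python
import io
import urllib.error
import urllib.request
from email.message import Message


def download_with_retry(url, num_retries=2, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if the download fails.

    `opener` is a hypothetical hook (not in the original notes) so the
    retry logic can be exercised without touching the network.
    """
    try:
        return opener(url).read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status code
        print("Error code:", e.code)
        if num_retries > 0 and 500 <= e.code < 600:
            return download_with_retry(url, num_retries - 1, opener)
    except urllib.error.URLError as e:
        # The server could not be reached at all
        print("Reason:", e.reason)
    return None


# Offline demonstration: a fake opener that always answers 500
calls = []

def always_500(url):
    calls.append(url)
    raise urllib.error.HTTPError(url, 500, "Internal Server Error", Message(), None)

failed = download_with_retry("http://example.com/", num_retries=2, opener=always_500)
ok = download_with_retry("http://example.com/", opener=lambda u: io.BytesIO(b"<html></html>"))
print(failed, len(calls), ok)
```

`HTTPError` is caught before `URLError` because it is a subclass of it. With `num_retries=2` the fake opener is called three times in total: the first attempt plus two retries.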
    2.4 Extracting page information with urllib2 and re
            import urllib2

            def download(url, headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}, num_retries=2):
                print 'Downloading', url
                # build the request with browser-like headers
                request = urllib2.Request(url, headers=headers)
                try:
                    html = urllib2.urlopen(request).read()
                except urllib2.URLError as e:
                    html = None
                    if hasattr(e, 'reason'):
                        print "Failed to reach the server"
                        print "Reason:", e.reason
                    if hasattr(e, 'code'):
                        print "The server couldn't fulfill the request"
                        print "Error code:", e.code
                        # retry only on 5xx server errors (500 included)
                        if num_retries > 0 and 500 <= e.code < 600:
                            return download(url, headers, num_retries - 1)
                return html

            download('http://www.baidu.com')
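Once a page has been downloaded, the re module can pull data out of it. The sketch below runs on a hard-coded stand-in for the downloaded HTML (hypothetical content), so it works offline. In Python 3 the bytes returned by `read()` would first need to be decoded, e.g. with `.decode("utf-8")`. Regular expressions are fine for simple, regular markup like this, but they are fragile on arbitrary HTML.

```python
import re

# Stand-in for the HTML string returned by download(); hypothetical content
html = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <a href="http://example.com/a">First</a>
    <a href="http://example.com/b">Second</a>
  </body>
</html>
"""

# Non-greedy group captures the text between <title> and </title>
title_match = re.search(r"<title>(.*?)</title>", html, re.S)
title = title_match.group(1) if title_match else None

# findall with two groups returns a list of (href, link text) tuples
links = re.findall(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', html, re.S)

print(title)
print(links)
```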
    2.5 Accessing pages through a proxy IP with urllib2
        import urllib2
        import random

        def get_html(url, headers, proxies, num_retries=2):
            print 'Downloading', url

            req = urllib2.Request(url)
            # pick a random User-Agent from the list
            req.add_header("User-Agent", random.choice(headers['User-Agent']))

            # pick a random proxy; map both schemes so https URLs are proxied too
            proxy = random.choice(proxies)
            proxy_support = urllib2.ProxyHandler({'http': proxy, 'https': proxy})
            opener = urllib2.build_opener(proxy_support)
            urllib2.install_opener(opener)
            try:
                html = urllib2.urlopen(req).read()
            except urllib2.URLError as e:
                html = None
                if hasattr(e, 'reason'):
                    print "Failed to reach the server"
                    print "Reason:", e.reason
                if hasattr(e, 'code'):
                    print "The server couldn't fulfill the request"
                    print "Error code:", e.code
                    # retry only on 5xx server errors (500 included)
                    if num_retries > 0 and 500 <= e.code < 600:
                        return get_html(url, headers, proxies, num_retries - 1)
            return html

        headers = {
            "User-Agent": ["Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"]
        }
        proxies = ["220.189.249.80:80", "124.248.32.43:80"]
        html = get_html("https://www.tmall.com", headers, proxies)
        print html
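The proxy setup can be sketched in Python 3 with `urllib.request.ProxyHandler` and `build_opener`. The helper `build_proxy_opener` is hypothetical. It returns the opener instead of installing it globally, and the example only inspects the configuration, so no traffic is sent through the (probably stale) proxy addresses from the notes.

```python
import random
import urllib.request


def build_proxy_opener(proxies, user_agents, rng=random):
    """Build an opener that routes traffic through one randomly chosen proxy.

    Hypothetical helper; `proxies` is a list of "host:port" strings and
    `user_agents` a list of User-Agent strings, as in the notes above.
    """
    proxy = rng.choice(proxies)
    handler = urllib.request.ProxyHandler({
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    })
    opener = urllib.request.build_opener(handler)
    # Headers in addheaders are sent with every request made via this opener
    opener.addheaders = [("User-Agent", rng.choice(user_agents))]
    return opener


proxies = ["220.189.249.80:80", "124.248.32.43:80"]
user_agents = ["Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"]
opener = build_proxy_opener(proxies, user_agents)

# Inspect the configuration without sending anything
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
# opener.open(url).read() would then fetch a page through the chosen proxy
```

Returning the opener rather than calling `install_opener` keeps the proxy choice local to the caller. That helps when each request should go through a different proxy.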

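Using the requests module
        1 Introduction to the Requests module
            1.1 Requests is an HTTP library written in Python, built on top of urllib3 and released under the Apache2 Licensed open-source license. It is more convenient than urllib, saves a great deal of work, and fully meets the needs of HTTP testing.
                Requests is an HTTP library written for humans
                Installation: conda install requests or pip install requests
                Reference: http://docs.python-requests.org/zh_CN/latest/user/quickstart.html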

        2 Common Requests API usage
            2.1
                # -*- coding: utf-8 -*-
                # Explore the requests module with dir + help
                import requests
                # dir(requests)
                # help(requests)
                # help(requests.get)
                # help(requests.post)

                ------------------------------------------------
                # HTTP POST request with the requests module
                payload = {"key": "value", "key2": "values"}

                # try:
                #     r = requests.post("http://httpbin.org/post", data=payload)
                # except requests.exceptions.ConnectionError as e:
                #     r = None
                #     print("Failed to connect to the server")
                # if r is not None:
                #     print(r)
                #     print(r.text)

                ------------------------------------------------
                # HTTP GET request with the requests module
                # r = requests.get('http://yangrong.blog.cto.com/6945369/1339593/')
                # print(r)
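To see what a POST with `data=payload` actually puts on the wire, the request can be built and prepared without being sent. This is a small sketch using `requests.Request` and its `prepare()` method. It assumes the requests package is installed and reuses the httpbin URL from the notes only as a label, since no connection is made.

```python
import requests

payload = {"key": "value", "key2": "values"}

# Build the request and prepare it, without sending it
request = requests.Request("POST", "http://httpbin.org/post", data=payload)
prepared = request.prepare()

print(prepared.method)                   # request method
print(prepared.url)                      # target URL
print(prepared.headers["Content-Type"])  # set automatically for form data
print(prepared.body)                     # form-encoded payload
```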

        3 Setting request headers to simulate browser access
                # -*- coding: utf-8 -*-
                    # Set request headers to simulate a browser
                    import requests
                    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"}
                    html = requests.get("https://www.tmall.com", headers=headers)

                    print(html.status_code)  # HTTP status code
                    print(html.content)      # raw response body (bytes)
                    print(html.text)         # response body decoded to text

        4 Using a proxy IP


                import requests
                # Use proxy IPs; the proxy URLs should include the scheme
                headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0"}
                proxies = {"http": "http://112.72.32.73:80",
                           "https": "http://58.67.159.50:80"}
                html = requests.get('https://www.tmall.com', headers=headers,
                                    proxies=proxies)
                print(html.text)