A Python Beginner's Journey (Part 3)

Preface

This exercise builds a crawler that disguises itself as a normal user visiting a page, with the goal of inflating the page's view count.


2019-04-11: Failed.
Reason: the program runs, but it keeps reporting proxy errors. The next step is to try some high-anonymity (elite) proxies.
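
Before retrying with elite proxies, it may help to filter out dead or transparent ones first. Below is a minimal sketch, assuming `http://httpbin.org/ip` (a public IP-echo service, not part of the original code) as the test target:

```python
import requests

def check_proxy(ip_port, timeout=5):
    """Return True if the proxy responds and hides the caller's real IP.

    Uses http://httpbin.org/ip as the echo endpoint (an assumption;
    any service that returns the requester's IP would work).
    """
    proxies = {'http': 'http://' + ip_port, 'https': 'http://' + ip_port}
    try:
        r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=timeout)
        r.raise_for_status()
        # A high-anonymity (elite) proxy reports its own IP, not ours.
        return ip_port.split(':')[0] in r.json()['origin']
    except (requests.RequestException, ValueError):
        return False
```

Running this over the scraped list and keeping only the proxies that return True should cut down on the "bad proxy" errors above.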


Reference articles:
https://blog.csdn.net/c406495762/article/details/72793480#23-正常的访问速度 -- (highly recommended; very detailed and well worth reading)
https://blog.csdn.net/dala_da/article/details/79401163
https://www.jb51.net/article/99984.htm
https://blog.csdn.net/qq_41782425/article/details/84993073
https://blog.csdn.net/wenxuhonghe/article/details/85226490

Main Content

This exercise also covered how to scrape tables from a web page, i.e. content wrapped in 'tr' and 'td' tags.
Using BeautifulSoup and its get_text() method, the desired data can be pulled out like this:

    for tr in soup.findAll('tr'):
        # Search within each row; searching the whole soup here instead
        # would append every IP once per row.
        for td in tr.findAll('td', {"data-title": "IP"}):
            ip_list.append(td.get_text())

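Since the snippet above relies on variables defined elsewhere, here is a self-contained sketch of the same technique on a tiny inline table (the rows are made-up sample data, not real proxies):

```python
from bs4 import BeautifulSoup

# Hypothetical rows mimicking the proxy-list markup scraped above.
html = """
<table>
  <tr><td data-title="IP">1.2.3.4</td><td data-title="PORT">8080</td></tr>
  <tr><td data-title="IP">5.6.7.8</td><td data-title="PORT">3128</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')  # stdlib parser; no lxml needed here
ip_list = []
for tr in soup.find_all('tr'):
    # Search inside each row, not the whole document, to avoid duplicates.
    td = tr.find('td', {'data-title': 'IP'})
    if td:
        ip_list.append(td.get_text())

print(ip_list)  # → ['1.2.3.4', '5.6.7.8']
```

The `{'data-title': 'IP'}` dict filters on an arbitrary HTML attribute, which is handy when the table has no useful classes or ids.
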
Everything else comes from the first article (highly recommended): how to disguise the crawler as a normal user. To borrow the author's advice: while crawling, be considerate of the site on the other end; add time.sleep calls and crawl politely.
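
That politeness advice can be sketched as two small helpers: randomized headers and a randomized pause. The pools here are assumed sample values; the full code below carries a much larger UA list.

```python
import random
import time

# Small sample pools (assumed values for illustration).
UA_POOL = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
    'Opera/8.0 (Windows NT 5.1; U; en)',
]
REFERERS = ['https://www.baidu.com/', 'http://blog.csdn.net/']

def random_headers():
    """Pick a fresh User-Agent/Referer pair so consecutive requests differ."""
    return {
        'User-Agent': random.choice(UA_POOL),
        'Referer': random.choice(REFERERS),
    }

def polite_sleep(min_s=2.0, max_s=6.0):
    """Sleep a random interval so hits don't arrive at a fixed, bot-like cadence."""
    time.sleep(random.uniform(min_s, max_s))
```

A fixed sleep interval is itself a fingerprint; `random.uniform` makes the request timing look less mechanical.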

The full code is below; corrections from more experienced readers are welcome!

from bs4 import BeautifulSoup
import requests
import re,random,time

#Collect free proxy IPs into a list for later use
def get_proxy():
    user_agent = random.choice(UA)
    #print(user_agent)
    headers = {'User-Agent':user_agent}

    try:
        r = requests.get('https://www.kuaidaili.com/free/intr/', headers = headers, timeout = 10)
        r.raise_for_status()
    except requests.RequestException as e:
        # Without this early return, a failed request leaves r undefined below.
        print(e)
        return

    soup = BeautifulSoup(r.text, "lxml")

    for tr in soup.findAll('tr'):
        # Search within each row, not the whole soup, to avoid duplicates.
        for td in tr.findAll('td', {"data-title": "IP"}):
            ip_list.append(td.get_text())

    #print(ip_list)

def main():
    times = 0
    finish_time = 0

    # Plain strings, not {…} sets: random.choice must return a string
    # for the Referer header.
    referer_list = [
        'http://blog.csdn.net/dala_da/article/details/79401163',
        'http://blog.csdn.net/',
        'https://www.sogou.com/tx?query=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F&hdq=sogou-site-706608cfdbcc1886-0001&ekv=2&ie=utf8&cid=qb7.zhuye&',
        'https://www.baidu.com/s?tn=98074231_1_hao_pg&word=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F'
    ]

    url = 'https://blog.csdn.net/neolz/article/details/89025388'

    headers = {
        'User-Agent': random.choice(UA),
        'Referer':random.choice(referer_list)
    }

    while True:
        # random.choice avoids the off-by-one of randint(1, len(ip_list)).
        ip = random.choice(ip_list)
        proxy_ip = 'http://' + ip
        proxy_ips = 'https://' + ip
        proxy = {'https': proxy_ips, 'http': proxy_ip}

        try:
            response = requests.get(url, headers = headers, proxies = proxy, timeout = 10)
            print(response)
        except requests.RequestException:
            print('Bad proxy: ' + proxy['https'])
            time.sleep(5)
        else:
            finish_time += 1
            # The % must be inside print(); print('…') % (…) raises TypeError.
            print('Refreshed %d times, latest proxy %s' % (finish_time, proxy['https']))

        times += 1
        # Pause after each full pass through the proxy list.
        if not times % len(ip_list):
            time.sleep(10)

if __name__ == '__main__':
    UA = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Opera/8.0 (Windows NT 5.1; U; en)',
        'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
    ]
    ip_list = []
    get_proxy()
    main()

