Preface
This exercise uses a crawler that disguises itself as a normal user visiting a page, in order to inflate that page's view count.
2019.4.11: failed
Reason for failure: the program runs, but every request fails with a proxy error; the next step is to try some high-anonymity (elite) proxies
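Before blaming anonymity, it may help to check whether each scraped proxy is even alive. A minimal sketch, assuming `requests` is available; http://httpbin.org/ip is only an example echo endpoint (my choice, not from the original articles), and any stable page would do:

```python
import requests

def proxy_works(ip, timeout=5):
    """Rough liveness check: make one request through the proxy."""
    proxy = {'http': 'http://' + ip, 'https': 'http://' + ip}
    try:
        r = requests.get('http://httpbin.org/ip',
                         proxies=proxy, timeout=timeout)
        return r.status_code == 200   # proxy answered and relayed the request
    except requests.RequestException:
        return False                  # dead, refused, or timed out
```

Filtering the scraped list with something like `ip_list = [ip for ip in ip_list if proxy_works(ip)]` before the main loop would separate dead proxies from ones that are merely not anonymous enough.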
The articles I consulted:
https://blog.csdn.net/c406495762/article/details/72793480#23-正常的访问速度 -- (strongly recommended; very detailed, and I learned a lot from it)
https://blog.csdn.net/dala_da/article/details/79401163
https://www.jb51.net/article/99984.htm
https://blog.csdn.net/qq_41782425/article/details/84993073
https://blog.csdn.net/wenxuhonghe/article/details/85226490
Main Content
In this exercise I also learned how to scrape a table out of a web page, i.e. content held in 'tr' and 'td' tags.
BeautifulSoup together with its get_text() method pulls out the wanted data; code as follows:
for tr in soup.findAll('tr'):
    # search within the current row only; calling soup.findAll here
    # would append every IP once per table row
    td = tr.find('td', {"data-title": "IP"})
    if td:
        ip_list.append(td.get_text())
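As a self-contained illustration of the same pattern (the mini table below is made up for the example):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td data-title="IP">1.2.3.4</td><td data-title="PORT">8080</td></tr>
  <tr><td data-title="IP">5.6.7.8</td><td data-title="PORT">3128</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
ip_list = []
for tr in soup.find_all('tr'):                  # iterate the rows
    td = tr.find('td', {"data-title": "IP"})    # the IP cell of this row
    if td:
        ip_list.append(td.get_text())

print(ip_list)  # ['1.2.3.4', '5.6.7.8']
```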
Everything else came from the first article (the one I strongly recommend), which shows how to disguise the crawler as a normal user. To borrow a line from its author: when crawling, spare a thought for the site on the other end; add plenty of time.sleep calls and crawl politely.
The full code is attached below; corrections from more experienced readers are welcome~~
from bs4 import BeautifulSoup
import requests
import random, time

# Store the scraped proxy IPs in a list for later use
def get_proxy():
    user_agent = random.choice(UA)
    # print(user_agent)
    headers = {'User-Agent': user_agent}
    try:
        r = requests.get('https://www.kuaidaili.com/free/intr/', headers=headers)
    except requests.HTTPError as e:
        print(e)
        print("httpError")
        return
    except requests.RequestException as e:
        print(e)
        return
    except Exception:
        print("Unknown Error")
        return
    soup = BeautifulSoup(r.text, "lxml")
    for tr in soup.findAll('tr'):
        # search within the current row only; scanning the whole soup here
        # would append every IP once per table row
        td = tr.find('td', {"data-title": "IP"})
        if td:
            ip_list.append(td.get_text())
    # print(ip_list)

def main():
    times = 0
    referer_list = [
        'http://blog.csdn.net/dala_da/article/details/79401163',
        'http://blog.csdn.net/',
        'https://www.sogou.com/tx?query=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F&hdq=sogou-site-706608cfdbcc1886-0001&ekv=2&ie=utf8&cid=qb7.zhuye&',
        'https://www.baidu.com/s?tn=98074231_1_hao_pg&word=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F'
    ]
    url = 'https://blog.csdn.net/neolz/article/details/89025388'
    while 1:
        # pick a fresh identity for every request; the header values must be
        # plain strings (not one-element sets) for requests to accept them
        headers = {
            'User-Agent': random.choice(UA),
            'Referer': random.choice(referer_list)
        }
        # random.choice avoids the off-by-one of randint(1, len(ip_list))
        ip = random.choice(ip_list)
        proxy = {'https': 'https://' + ip, 'http': 'http://' + ip}
        try:
            response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            print(response)
        except requests.RequestException:
            print('Proxy failed: ' + proxy['https'])
            time.sleep(5)
        else:
            times += 1
            # the format string must be applied inside print(), not to its return value
            print('Refreshed %d times, %s' % (times, proxy['https']))
            if not times % len(ip_list):
                time.sleep(10)

if __name__ == '__main__':
    UA = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Opera/8.0 (Windows NT 5.1; U; en)',
        'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
    ]
    ip_list = []
    get_proxy()
    if ip_list:
        main()
    else:
        # guard: random.choice and % len(ip_list) would both fail on an empty list
        print('No proxies scraped; nothing to refresh')