Using a crawler to inflate blog view counts with random proxy IPs and a random user_agent

机尾云拉长

Category: python. Tags: web crawler, proxy IP, view-count inflation, python

Article link: https://blog.csdn.net/qq_41011336/article/details/83216791

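I am rather embarrassed to be writing this post, but one look at my pitiful view count will tell you that I have been a thoroughly well-behaved kid!!! Heh. As the old rhyme goes: "Say not that the stone man has but one eye; once he is dug up, all the land along the Yellow River will rise in revolt."
First, let's look at the common ways anti-crawler systems detect crawlers.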

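Frequency monitoring: Some sites track request frequency. If a single IP makes requests faster than a configured threshold within a given time window, the site concludes that the IP belongs to a crawler bot, and it either throttles the IP or temporarily bans it.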

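Count monitoring: Similar to frequency monitoring, but it counts the number of pages a user has visited and applies restrictions once that count exceeds a threshold.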
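
To make the two monitoring schemes above concrete, here is a minimal sketch of how a site might implement them. This is only an illustration under my own assumptions: the class name `VisitMonitor`, its parameters, and the threshold values are all hypothetical, and a real site's detection is certainly more elaborate.

```python
import time
from collections import defaultdict, deque


class VisitMonitor:
    """Toy per-IP monitor combining a frequency check and a count check.

    Hypothetical illustration only; names and thresholds are assumptions.
    """

    def __init__(self, max_per_window=10, window_seconds=1.0, max_total=1000):
        self.max_per_window = max_per_window  # frequency threshold
        self.window_seconds = window_seconds  # length of the sliding window
        self.max_total = max_total            # count threshold
        self.recent = defaultdict(deque)      # ip -> timestamps inside the window
        self.total = defaultdict(int)         # ip -> pages requested so far

    def allow(self, ip, now=None):
        """Record one request; return False if the IP looks like a bot."""
        now = time.time() if now is None else now
        window = self.recent[ip]
        # Drop timestamps that have fallen out of the sliding window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        window.append(now)
        self.total[ip] += 1
        too_fast = len(window) > self.max_per_window  # frequency monitoring
        too_many = self.total[ip] > self.max_total    # count monitoring
        return not (too_fast or too_many)
```

Seen from this side, the random sleeps in the crawler are aimed at the frequency check, and the rotating proxy IPs are aimed at both checks, since each IP keeps its own counters.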

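Headers identification: The usual checks here are on the UA and the Referer. The User-Agent identifies the operating system and browser the user is running. The Referer indicates which link the request came from.
Here we randomize the UA. The Referer in headers can be randomized in the same way as the user_agent; the better the disguise, the lower the chance of being caught.
Enough talk, on to the code...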

# -*- coding: utf-8 -*-
import requests
import time
import random
# User-Agent list, so that each request can be sent with a randomly chosen User-Agent
user_agent_list=[
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
            'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)',
            'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
            'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
            'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
            'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',  
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36' 
]
# URL list: the pages to visit
url_list=[
        'https://blog.csdn.net/qq_41011336/article/details/83213550',
        'https://blog.csdn.net/qq_41011336/article/details/83211781',
        'https://blog.csdn.net/qq_41011336/article/details/83009231',
        'https://blog.csdn.net/qq_41011336/article/details/82984319',
        'https://blog.csdn.net/qq_41011336/article/details/82982616',
        'https://blog.csdn.net/qq_41011336/article/details/82591545',
        'https://blog.csdn.net/qq_41011336/article/details/83217577',
        'https://blog.csdn.net/qq_41011336/article/details/83016709',
        'https://blog.csdn.net/qq_41011336/article/details/83015986',
        'https://blog.csdn.net/qq_41011336/article/details/82785311',
        'https://blog.csdn.net/qq_41011336/article/details/82498124',
        'https://blog.csdn.net/qq_41011336/article/details/82528008',
        'https://blog.csdn.net/qq_41011336/article/details/83216791'
        
        ]

# Build the proxy list so requests can go through proxy IPs; here the addresses are read from a txt file
def get_proxy_ip_list():
    proxy_ip_list = []
    print("Loading proxy_list...")
    # Raw string, so the backslashes in the Windows path are not treated as escapes
    with open(r"C:\Users\Administrator.USER-20160909TG\Desktop\ip1.txt") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                proxy_ip_list.append('http://' + line)
    return proxy_ip_list

# Another way to build the proxy list; here the addresses are written directly in ip_list
def get_proxy_ip_list2():
    ip_list = ['61.135.217.7:80', '118.190.95.35:9001', '112.115.57.20:3128', '124.235.181.175:80']
    proxy_ip_list2 = ['http://' + ip for ip in ip_list]
    print(proxy_ip_list2)
    return proxy_ip_list2

# The crawler: keep visiting randomly chosen pages
def visit(urls, user_agents, proxies):
    times = 0
    while True:
        url = random.choice(urls)
        # The header name is 'User-Agent' (with a hyphen); 'User_Agent' would be sent
        # as an unrelated header and the default requests User-Agent would remain
        header = {'User-Agent': random.choice(user_agents)}
        # The target URLs are https, so register the proxy for both schemes
        proxy_ip = random.choice(proxies)
        proxy = {'http': proxy_ip, 'https': proxy_ip}
        try:
            # The timeout keeps a dead proxy from hanging the loop forever
            res = requests.get(url, headers=header, proxies=proxy, timeout=10)
            print(res.headers)
        except requests.RequestException:
            print('wrong')
            time.sleep(0.1)
        else:
            times += 1
            print('visit %d times' % times)
            time.sleep(random.random())
            # Take a longer break after every 20 successful visits
            if times % 20 == 0:
                time.sleep(15)

if __name__ == "__main__":
    proxy_ip_list2 = get_proxy_ip_list2()
    visit(url_list, user_agent_list, proxy_ip_list2)
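
As mentioned earlier, the Referer can be randomized the same way as the UA. The script above does not do this, so here is a rough sketch of what it might look like. The helper name `random_headers` and the sample values in `USER_AGENTS` and `REFERERS` are my own assumptions for illustration; they are not part of the script above.

```python
import random

# Hypothetical sample values, for illustration only.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]
REFERERS = [
    'https://www.baidu.com/',
    'https://blog.csdn.net/',
]


def random_headers(user_agents=USER_AGENTS, referers=REFERERS):
    """Return a headers dict with a randomly chosen User-Agent and Referer."""
    return {
        'User-Agent': random.choice(user_agents),
        'Referer': random.choice(referers),
    }
```

With this in place, the request would become something like `requests.get(url, headers=random_headers(), proxies=proxy)`.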
