On Anti-Crawling: Policies and Countermeasures

I write blog posts partly so that I can quickly review what I have learned and straighten out my thoughts later, and partly to help others who run into similar problems. But keeping a blog going is hard, for all sorts of reasons; at the bottom of it, though, is the lack of resonance with readers.

Lofty mountains and flowing waters; a true kindred spirit is hard to find.

In fact, the habit of blogging is built out of little things: watching the page views and likes tick up each day, seeing someone comment on one of your articles, and so on.


Enough chatter. Today let's talk about inflating page views. It drifts far from the original purpose of blogging, but it is still worth understanding this kind of problem; after all, "the technology itself breaks no law!"

Anti-crawler (and anti-anti-crawler) mechanisms

To talk about anti-crawling we first have to talk about crawlers. The concept is simple: a crawler just hands work you would otherwise do by hand over to code to automate. Anti-crawling is a site's way of probing whether a visitor is a real user or a program, and anti-anti-crawling is the set of countermeasures aimed at those mechanisms.

They say a double negative makes a positive, so a crawler and an anti-anti-crawler should be the same thing. Not quite: the surface behavior is identical, but an anti-anti-crawler does much more work than a simple little crawler.

Broadly speaking, anti-crawler measures work at the following levels:
- Headers: the browser's request headers
  - User-Agent: identifies the kind of client making the request
  - Referer: which page the request was linked from (the usual starting point for hotlink protection)
  - Host: handy for same-origin checks
- IP: many requests from the same IP within a short window are very likely a crawler, and anti-crawler systems act on that.
- Request frequency: bursts of highly concurrent requests over a short time are almost always problematic traffic.
These are the common anti-crawler measures. There are of course more sophisticated mechanisms, such as the most obnoxious of all, CAPTCHAs (tesseract can handle recognition of the simpler ones; see the sketch below), user-behavior analysis, and so on.
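
As a taste of the CAPTCHA side, here is a minimal sketch of recognizing a simple image CAPTCHA with the pytesseract wrapper around tesseract. It assumes the Tesseract binary plus the pytesseract and Pillow packages are installed, and "captcha.png" is just a hypothetical local file; real CAPTCHAs usually need far more preprocessing than this.

from PIL import Image
import pytesseract

img = Image.open("captcha.png")
gray = img.convert("L")                           # grayscale
bw = gray.point(lambda p: 255 if p > 140 else 0)  # crude binarization to drop light noise
text = pytesseract.image_to_string(bw, config="--psm 7")  # treat the image as one text line
print(text.strip())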

Once the common anti-crawler mechanisms are understood, building the corresponding "policy vs. countermeasure" anti-anti-crawler side is no longer so mysterious. For each of the limits above there is a countermeasure:

  • For User-Agent checks, collect a pool of common browser UA strings and pick one at random on every request.
  • For IP checks, use proxy IPs.
  • For frequency limits, a random sleep between requests works quite well.
  • … (a combined sketch of these three follows below)
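
Putting those three countermeasures together, a single request might look like the following minimal sketch. The UA strings are just samples and the proxy pool is hypothetical; it only illustrates randomizing the User-Agent, routing through a proxy, and sleeping a random interval.

import random
import time

import requests

# A couple of sample UA strings; in practice keep a much larger pool.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
]
# Hypothetical proxy pool; fill it from a real proxy source.
PROXIES = ["1.2.3.4:8080", "5.6.7.8:3128"]

def polite_get(url):
    """One GET with a random UA, a random proxy, and a random pause."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    proxies = {"http": "http://{}".format(random.choice(PROXIES))}
    time.sleep(random.uniform(1, 3))  # random sleep to stay under frequency limits
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)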

In practice

I have been blogging on CSDN, and frankly its anti-crawler mechanisms are fairly shallow. Partly the need isn't that great, and partly aggressive anti-crawling isn't economical, so presumably they don't want to spend much effort on it.

So inflating view counts on CSDN is pretty easy. Here is my approach:
- Crawl proxy IPs, validate and clean them, and refresh them periodically.
- Collect browser User-Agent strings to add randomness to the requests.
- Random sleep strategy, logging, error recording, scheduled retries, and so on.

Proxy IP handling

# coding: utf8
# @Author: 郭 璞
# @File: proxyip.py
# @Time: 2017/10/5
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Crawl proxy IPs and save them into the relevant Redis keys.
import requests
from bs4 import BeautifulSoup
from redishelper import RedisHelper


class ProxyIP(object):
    """
    Crawl, clean, and validate proxy IPs.
    """
    def __init__(self):
        self.rh = RedisHelper()

    def crawl(self):
        """
        Store everything first, whether http or https.
        """
        # Handle http-mode proxy IPs first.
        httpurl = "http://www.xicidaili.com/nn/"
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
        }
        html = requests.get(url=httpurl, headers=headers).text
        soup = BeautifulSoup(html, "html.parser")
        ips = soup.find_all("tr")
        for index in range(1, len(ips)):
            tds = ips[index].find_all('td')
            ip = tds[1].text
            port = tds[2].text
            ipinfo = "{}:{}".format(ip, port)
            if self._check(ip):
                self.rh.sAddAvalibeIp(ipinfo)
            # print(ipinfo)

    def _check(self, ip):
        """
        Check whether a proxy IP is usable.
        """
        checkurl = "http://47.94.19.186/common/checkip.php"
        localip = self._getLocalIp()
        # print("Local: {}, proxy: {}".format(localip, ip))
        return False if localip == ip else True

    def _getLocalIp(self):
        """
        Get the local IP address. The API approach is unreliable; for now just copy it
        manually from https://www.baidu.com/s?ie=UTF-8&wd=ip
        """
        return "223.91.239.159"

    def clean(self):
        ips = self.rh.sGetAllAvalibleIps()
        for ipinfo in ips:
            ip, port = ipinfo.split(":")
            if self._check(ip):
                self.rh.sAddAvalibeIp(ipinfo)
            else:
                self.rh.sRemoveAvalibleIp(ipinfo)

    def update(self):
        pass


if __name__ == '__main__':
    pip = ProxyIP()
    # result = pip._check("223.91.239.159", 53281)
    # print(result)
    pip.crawl()
    # pip.clean()
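
One thing worth noting: _check above only compares the candidate IP against the local IP and never routes a request through the proxy itself, so dead proxies can still end up in the pool. A stricter check might look like this sketch, which uses http://httpbin.org/ip as an assumed echo endpoint (not part of the original code):

import requests

def check_proxy(ipinfo, timeout=5):
    """Return True only if the proxy actually relays a request within the timeout."""
    # ipinfo is an "ip:port" string, as stored in the Redis set.
    proxies = {"http": "http://{}".format(ipinfo)}
    try:
        resp = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False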
  
  

Redis helper class

# coding: utf8
# @Author: 郭 璞
# @File: redishelper.py
# @Time: 2017/10/5
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Utility methods for the Redis operations involved.
import redis


class RedisHelper(object):
    """
    Stores the crawled blog article links.
    Stores proxy IPs.
    """
    def __init__(self):
        self.articlepool = "redis:set:article:pool"
        self.avalibleips = "redis:set:avalible:ips"
        self.unavalibleips = "redis:set:unavalibe:ips"
        pool = redis.ConnectionPool(host="localhost", port=6379)
        self.redispool = redis.Redis(connection_pool=pool)

    def sAddArticleId(self, articleid):
        """
        Add a crawled article id.
        :param articleid:
        :return:
        """
        self.redispool.sadd(self.articlepool, articleid)

    def sRemoveArticleId(self, articleid):
        self.redispool.srem(self.articlepool, articleid)

    def popupArticleId(self):
        return int(self.redispool.srandmember(self.articlepool))

    def sAddAvalibeIp(self, ip):
        self.redispool.sadd(self.avalibleips, ip)

    def sRemoveAvalibeIp(self, ip):
        self.redispool.srem(self.avalibleips, ip)

    def sGetAllAvalibleIps(self):
        return [ip.decode('utf8') for ip in self.redispool.smembers(self.avalibleips)]

    def popupAvalibeIp(self):
        return self.redispool.srandmember(self.avalibleips)

    def sAddUnavalibeIp(self, ip):
        self.redispool.sadd(self.unavalibleips, ip)

    def sRemoveUnavaibleIp(self, ip):
        self.redispool.srem(self.unavalibleips, ip)
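
For reference, a quick interactive check of the helper could look like this (it assumes a local Redis instance listening on the default port 6379):

from redishelper import RedisHelper

rh = RedisHelper()
rh.sAddAvalibeIp("223.91.239.159:53281")   # store an ip:port pair in the available set
print(rh.sGetAllAvalibleIps())             # ['223.91.239.159:53281']
print(rh.popupAvalibeIp())                 # a random member, returned as bytes by redis-py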
  
  

CSDN blog post helper class

# coding: utf8
# @Author: 郭 璞
# @File: csdn.py
# @Time: 2017/10/5
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Utility class for crawling all blog post links of a blogger, plus the other operations involved.
import re
import requests
from bs4 import BeautifulSoup


class BlogScanner(object):
    """
    Crawl all article link ids under a blogger's id.
    """
    def __init__(self, bloger="marksinoberg"):
        self.bloger = bloger
        # self.blogpagelink = "http://blog.csdn.net/{}/article/list/{}".format(self.bloger, 1)

    def _getTotalPages(self):
        blogpagelink = "http://blog.csdn.net/{}/article/list/{}?viewmode=contents".format(self.bloger, 1)
        html = requests.get(url=blogpagelink).text
        soup = BeautifulSoup(html, "html.parser")
        # A rather hacky move; better not to be this casual in real development.
        temptext = soup.find('div', {"class": "pagelist"}).find("span").get_text()
        restr = re.findall(re.compile(r"(\d+).*?(\d+)"), temptext)
        # print(restr)
        pages = restr[0][-1]
        return pages

    def _parsePage(self, pagenumber):
        blogpagelink = "http://blog.csdn.net/{}/article/list/{}?viewmode=contents".format(self.bloger, int(pagenumber))
        html = requests.get(url=blogpagelink).text
        soup = BeautifulSoup(html, "html.parser")
        links = soup.find("div", {"id": "article_list"}).find_all("span", {"class": "link_title"})
        articleids = []
        for link in links:
            temp = link.find("a").attrs['href']
            articleids.append(temp.split("/")[-1])
        # print(len(articleids))
        # print(articleids)
        return articleids

    def get_all_articleids(self):
        pages = int(self._getTotalPages())
        articleids = []
        for index in range(pages):
            tempids = self._parsePage(int(index + 1))
            articleids.extend(tempids)
        return articleids


if __name__ == '__main__':
    bs = BlogScanner(bloger="marksinoberg")
    # print(bs._getTotalPages())
    # bs._parsePage(1)
    articleids = bs.get_all_articleids()
    print(len(articleids))
    print(articleids)
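
The page-count extraction in _getTotalPages leans on a fairly brittle regex. As a sanity check, here is what it matches against a hypothetical pagelist string of the shape CSDN used at the time (the sample text is made up for illustration):

import re

temptext = "160条  共11页"   # hypothetical text from the pagelist <span>
restr = re.findall(re.compile(r"(\d+).*?(\d+)"), temptext)
print(restr)           # [('160', '11')]
print(restr[0][-1])    # '11' -> the total number of pages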
  
  

Brush helper class

# coding: utf8
# @Author: 郭 璞
# @File: brushhelper.py
# @Time: 2017/10/5
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Start brushing.
import requests
import random
import time
from redishelper import RedisHelper


class FakeUserAgent(object):
    """
    A collection of User-Agent strings; pop out different UAs each time
    to reduce the impact of anti-crawler mechanisms.
    More at: http://www.73207.com/useragent
    """
    def __init__(self):
        self.uas = [
            "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "JUC (Linux; U; 2.3.7; zh-cn; MB200; 320*480) UCWEB7.9.3.103/139/999",
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0a1) Gecko/20110623 Firefox/7.0a1 Fennec/7.0a1",
            "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/1A542a Safari/419.3",
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7",
            "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10",
            "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
            "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
            "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
            "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER) ",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
            "Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
            "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
            "Openwave/ UCWEB7.0.2.37/28/999",
            "NOKIA5700/ UCWEB7.0.2.37/28/999",
            "UCWEB7.0.2.37/28/999",
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
            "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
            "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        ]

    def _generateIndexes(self):
        numbers = random.randint(0, len(self.uas))
        indexes = []
        while len(indexes) < numbers:
            temp = random.randrange(0, len(self.uas))
            if temp not in indexes:
                indexes.append(temp)
        return indexes

    def popupUAs(self):
        uas = []
        indexes = self._generateIndexes()
        for index in indexes:
            uas.append(self.uas[index])
        return uas


class Brush(object):
    """
    Brush up the view counts.
    """
    def __init__(self, bloger="marksinoberg"):
        self.bloger = "http://blog.csdn.net/{}".format(bloger)
        self.headers = {
            'Host': 'blog.csdn.net',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }
        self.rh = RedisHelper()

    def getRandProxyIp(self):
        ip = self.rh.popupAvalibeIp()
        proxyip = {}
        ipinfo = "http://{}".format(str(ip.decode('utf8')))
        proxyip['http'] = ipinfo
        # print(proxyip)
        return proxyip

    def brushLink(self, articleid, randuas=[]):
        # http://blog.csdn.net/marksinoberg/article/details/78058279
        bloglink = "{}/article/details/{}".format(self.bloger, articleid)
        for ua in randuas:
            self.headers['User-Agent'] = ua
            timeseed = random.randint(1, 3)
            print("Random sleep: {} seconds".format(timeseed))
            time.sleep(timeseed)
            for index in range(timeseed):
                # requests.get(url=bloglink, headers=self.headers, proxies=self.getRandProxyIp())
                requests.get(url=bloglink, headers=self.headers)


if __name__ == '__main__':
    # fua = FakeUserAgent()
    # indexes = [0, 2, 5, 7]
    # indexes = generate_random_numbers(0, 18, 7)
    # randuas = fua.popupUAs(indexes)
    # randuas = fua.popupUAs()
    # print(len(randuas))
    # print(randuas)
    # print(fua._generateIndexes())
    brush = Brush("marksinoberg")
    # brush.brushLink(78058279, randuas)
    print(brush.getRandProxyIp())
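
A quick way to exercise these two classes together is the snippet below. It is only a sketch: it assumes the Redis pool already holds at least one proxy and that article 78058279 exists, as in the author's own test.

from brushhelper import FakeUserAgent, Brush

fua = FakeUserAgent()
uas = fua.popupUAs()              # a random-sized batch of UA strings
brush = Brush(bloger="marksinoberg")
brush.brushLink(78058279, uas)    # one small request burst per UA, with a random sleep in between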
  
  

Entry point

# coding: utf8
# @Author: 郭 璞
# @File: Main.py
# @Time: 2017/10/5
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Entry point.
from csdn import *
from redishelper import RedisHelper
from brushhelper import *
import threading


def main():
    rh = RedisHelper()
    bs = BlogScanner(bloger="marksinoberg")
    fua = FakeUserAgent()
    brush = Brush(bloger="marksinoberg")
    counter = 0
    while counter < 12:
        # Start brushing.
        print("Round {}!".format(counter))
        try:
            uas = fua.popupUAs()
            articleid = rh.popupArticleId()
            brush.brushLink(articleid, uas)
        except Exception as e:
            print(e)
            # TODO: add a logging handler here
        counter += 1


if __name__ == '__main__':
    for i in range(280):
        temp = threading.Thread(target=main)
        temp.start()
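
The plan also called for logging, error recording, and scheduled retries, which the prototype leaves as a TODO in the except branch above. A minimal sketch of what that could look like, using only the standard library (the retry count and log file name are arbitrary choices, not part of the original code):

import logging
import time

logging.basicConfig(filename="brush.log", level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def with_retry(func, retries=3, delay=5):
    """Call func(); on failure, log the error and retry a few times before giving up."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception as e:
            logging.error("attempt %d failed: %s", attempt, e)
            time.sleep(delay)
    logging.warning("giving up after %d attempts", retries)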
  
  

Results

I tested it on an article I had written earlier.
Article link: http://blog.csdn.net/marksinoberg/article/details/78058279

Before the run, the article had 301 views; after a quick round of brushing, the view count looked like the figure below:

(Figure: the view count after a quick round of brushing)


Summary

That is roughly it. At best this counts as a prototype, since the code is only about 45% complete. If you are interested, add me on QQ at 1064319632, or leave your suggestions in the comments, and we can learn from each other.

           
