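I write blog posts partly so that I can quickly review what I have learned and organize my thoughts, and partly to help others who run into similar problems. But blogging is hard to keep up, for all sorts of reasons. In the end it comes down to a lack of "resonance".
As in the old tale of "High Mountains and Flowing Water", a listener who truly understands is hard to find.
In fact, the habit of blogging is built from small things: watching the view count and the likes go up every day, seeing someone comment on your post, and so on.
Enough preamble. Today's topic is inflating view counts. This strays far from the original purpose of blogging, but it is still worth understanding how it works; after all, "technology itself is not illegal!".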
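Anti-(anti-)crawler mechanisms
Before talking about anti-crawling we have to talk about crawlers. The concept is simple: a crawler hands work you would otherwise do by hand over to code that does it automatically. Anti-crawling is a way of telling whether a visitor is a real user or a program. And anti-anti-crawling is a way of getting around those anti-crawler checks.
They say a double negative makes a positive, so a crawler and an anti-anti-crawler ought to be the same thing. Not quite: on the surface they behave the same, but an anti-anti-crawler does a good deal of extra work and is more than a simple little crawler.
Broadly speaking, anti-crawler checks work at the following levels:
- Headers: the browser's request headers
  - User-Agent: identifies the client that is making the request
  - Referer: the page from which the request was linked (the natural place to start for hotlink protection)
  - Host: used to check that the request targets the expected site; often very useful
- IP: many requests from the same IP in a short time very likely come from a crawler, and the anti-crawler will act on them.
- Request frequency: bursts of highly concurrent requests in a short time are almost always suspicious.
These are the common anti-crawler measures. There are more sophisticated ones too, such as the dreaded CAPTCHA (tesseract can handle recognition of the simpler ones), user behavior analysis, and so on.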
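To make the checks above concrete, here is a minimal sketch of what the server side might do. It is not how CSDN or any real site works: the function name `looks_like_crawler`, the thresholds, and the list of bot markers are all assumptions made for illustration, and treating a missing Referer as suspicious is a deliberately crude heuristic.

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds; a real site would tune these.
MAX_HITS = 5          # requests allowed per IP ...
WINDOW_SECONDS = 10   # ... within this sliding window
KNOWN_BOT_MARKERS = ("python-requests", "curl", "scrapy")

_hits = defaultdict(deque)  # ip -> timestamps of recent requests


def looks_like_crawler(ip, headers, now=None):
    """Return a reason string if the request looks automated, else None."""
    now = time.time() if now is None else now

    # Header check: an empty or tool-like User-Agent is an easy giveaway.
    ua = headers.get("User-Agent", "").lower()
    if not ua or any(marker in ua for marker in KNOWN_BOT_MARKERS):
        return "suspicious User-Agent"
    if "Referer" not in headers:
        return "missing Referer"

    # IP / frequency check: count this IP's hits inside the sliding window.
    recent = _hits[ip]
    recent.append(now)
    while recent and now - recent[0] > WINDOW_SECONDS:
        recent.popleft()
    if len(recent) > MAX_HITS:
        return "too many requests from one IP"
    return None
```

Each countermeasure discussed in this post targets one of these branches: a browser-like User-Agent passes the first check, and proxy IPs plus pauses between requests keep the per-IP counter low.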
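Now that we know the common anti-crawler mechanisms, working out a countermeasure for each one to build an anti-anti-crawler is no longer a shot in the dark. For the restrictions above, we have these answers:
- User-Agent: collect some common browser User-Agent strings and pick one at random for each request.
- IP: use proxy IPs.
- Frequency limits: sleep for a random interval between requests.
- ...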
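The first and third countermeasures can be sketched in a few lines. This is only an illustration: `next_request_plan`, the delay bounds, and the three sample User-Agent strings are assumptions, and the function makes no network request itself.

```python
import random

# Assumed sample values; any list of real browser User-Agent strings would do.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
    "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
]


def next_request_plan(min_delay=1.0, max_delay=3.0):
    """Pick a random User-Agent and a random pause for the next request."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    delay = random.uniform(min_delay, max_delay)  # seconds to wait first
    return headers, delay
```

A caller would sleep for `delay` seconds and then pass `headers` to its HTTP client.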
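Hands-on
I have been blogging on CSDN all along. Frankly, its anti-crawler defenses are fairly shallow: for one thing there is little need for them, and for another, serious anti-crawling is not economically worthwhile, so they probably don't want to spend resources on it.
So inflating view counts on CSDN is pretty easy. Here is my approach:
- Proxy IPs: crawl them, validate and clean the data, and refresh it regularly.
- Browser User-Agents: collect a list to make the requests look more random.
- Random sleep strategy, logging, error recording, timed retries, and so on.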
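The logging and retry items in the last bullet are not implemented in the code of this post, so the following is only one possible shape for them. The helper name `with_retry` and its parameters are hypothetical; `sleep` is injectable so the helper can be tried without actually waiting.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("brush")


def with_retry(task, attempts=3, wait_seconds=1.0, sleep=time.sleep):
    """Run task(); on failure log the error, wait, and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(wait_seconds)
```

Wrapping a single page request in `with_retry` would record each failure in the log instead of only printing it.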
Proxy IP handling

    # coding: utf8
    # @Author: 郭 璞
    # @File: proxyip.py
    # @Time: 2017/10/5
    # @Contact: 1064319632@qq.com
    # @blog: http://blog.csdn.net/marksinoberg
    # @Description: crawl proxy IPs and save them under the relevant Redis key
    import requests
    from bs4 import BeautifulSoup
    from redishelper import RedisHelper


    class ProxyIP(object):
        """
        Crawl proxy IPs, clean them, and validate them.
        """

        def __init__(self):
            self.rh = RedisHelper()

        def crawl(self):
            """
            Store everything for now, whether it is http or https.
            """
            # Handle the http-mode proxy IPs first
            httpurl = "http://www.xicidaili.com/nn/"
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
            }
            html = requests.get(url=httpurl, headers=headers).text
            soup = BeautifulSoup(html, "html.parser")
            ips = soup.find_all("tr")
            # Row 0 is the table header, so start from row 1
            for index in range(1, len(ips)):
                tds = ips[index].find_all('td')
                ip = tds[1].text
                port = tds[2].text
                ipinfo = "{}:{}".format(ip, port)
                if self._check(ip):
                    self.rh.sAddAvalibeIp(ipinfo)
                # print(ipinfo)

        def _check(self, ip):
            """
            Check whether the proxy IP is usable.
            Note: for now this only rejects our own IP; checkurl is not queried yet.
            """
            checkurl = "http://47.94.19.186/common/checkip.php"
            localip = self._getLocalIp()
            # print("Local: {}, proxy: {}".format(localip, ip))
            return localip != ip

        def _getLocalIp(self):
            """
            Get this machine's IP address. Doing it through an API is not very
            reliable, so for now look it up manually at
            https://www.baidu.com/s?ie=UTF-8&wd=ip and paste it in.
            """
            return "223.91.239.159"

        def clean(self):
            ips = self.rh.sGetAllAvalibleIps()
            for ipinfo in ips:
                ip, port = ipinfo.split(":")
                if self._check(ip):
                    self.rh.sAddAvalibeIp(ipinfo)
                else:
                    # Method name as spelled in redishelper.py
                    self.rh.sRemoveAvalibeIp(ipinfo)

        def update(self):
            pass


    if __name__ == '__main__':
        pip = ProxyIP()
        # result = pip._check("223.91.239.159", 53281)
        # print(result)
        pip.crawl()
        # pip.clean()
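`crawl` stores whatever `ip:port` text it scrapes from the table, and `_check` does not yet test the proxy itself. As one small cleaning step, the strings could at least be checked for shape before they go into Redis. The helper below is a sketch of that idea; `is_well_formed` is a hypothetical name and is not part of proxyip.py.

```python
import ipaddress


def is_well_formed(ipinfo):
    """True if ipinfo looks like 'IPv4:port' with a port in 1..65535."""
    host, sep, port = ipinfo.strip().rpartition(":")
    if not sep or not port.isdecimal():
        return False
    try:
        ipaddress.IPv4Address(host)  # raises on anything that is not IPv4
    except ipaddress.AddressValueError:
        return False
    return 1 <= int(port) <= 65535
```

This only filters out malformed rows; whether a proxy actually works still has to be tested by sending a request through it.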
Redis helper class

    # coding: utf8
    # @Author: 郭 璞
    # @File: redishelper.py
    # @Time: 2017/10/5
    # @Contact: 1064319632@qq.com
    # @blog: http://blog.csdn.net/marksinoberg
    # @Description: helper methods for the Redis operations
    import redis


    class RedisHelper(object):
        """
        Stores the ids of the crawled blog posts.
        Stores the proxy IPs.
        """

        def __init__(self):
            # Names of the Redis sets used below
            self.articlepool = "redis:set:article:pool"
            self.avalibleips = "redis:set:avalible:ips"
            self.unavalibleips = "redis:set:unavalibe:ips"
            pool = redis.ConnectionPool(host="localhost", port=6379)
            self.redispool = redis.Redis(connection_pool=pool)

        def sAddArticleId(self, articleid):
            """
            Add a crawled blog post id.
            :param articleid:
            :return:
            """
            self.redispool.sadd(self.articlepool, articleid)

        def sRemoveArticleId(self, articleid):
            self.redispool.srem(self.articlepool, articleid)

        def popupArticleId(self):
            # Returns a random member without removing it from the set
            return int(self.redispool.srandmember(self.articlepool))

        def sAddAvalibeIp(self, ip):
            self.redispool.sadd(self.avalibleips, ip)

        def sRemoveAvalibeIp(self, ip):
            self.redispool.srem(self.avalibleips, ip)

        def sGetAllAvalibleIps(self):
            return [ip.decode('utf8') for ip in self.redispool.smembers(self.avalibleips)]

        def popupAvalibeIp(self):
            # Returns a random member (as bytes) without removing it
            return self.redispool.srandmember(self.avalibleips)

        def sAddUnavalibeIp(self, ip):
            self.redispool.sadd(self.unavalibleips, ip)

        def sRemoveUnavaibleIp(self, ip):
            self.redispool.srem(self.unavalibleips, ip)
CSDN post helper class

    # coding: utf8
    # @Author: 郭 璞
    # @File: csdn.py
    # @Time: 2017/10/5
    # @Contact: 1064319632@qq.com
    # @blog: http://blog.csdn.net/marksinoberg
    # @Description: helper class that crawls all of a blogger's post links, plus related operations
    import re

    import requests
    from bs4 import BeautifulSoup


    class BlogScanner(object):
        """
        Collect the ids of all posts under a blogger's id.
        """

        def __init__(self, bloger="marksinoberg"):
            self.bloger = bloger
            # self.blogpagelink = "http://blog.csdn.net/{}/article/list/{}".format(self.bloger, 1)

        def _getTotalPages(self):
            blogpagelink = "http://blog.csdn.net/{}/article/list/{}?viewmode=contents".format(self.bloger, 1)
            html = requests.get(url=blogpagelink).text
            soup = BeautifulSoup(html, "html.parser")
            # A rather hacky approach; don't be this casual in real development
            temptext = soup.find('div', {"class": "pagelist"}).find("span").get_text()
            # Raw string so that \d is not treated as a string escape
            restr = re.findall(re.compile(r"(\d+).*?(\d+)"), temptext)
            # print(restr)
            pages = restr[0][-1]
            return pages

        def _parsePage(self, pagenumber):
            blogpagelink = "http://blog.csdn.net/{}/article/list/{}?viewmode=contents".format(self.bloger, int(pagenumber))
            html = requests.get(url=blogpagelink).text
            soup = BeautifulSoup(html, "html.parser")
            links = soup.find("div", {"id": "article_list"}).find_all("span", {"class": "link_title"})
            articleids = []
            for link in links:
                temp = link.find("a").attrs['href']
                # The post id is the last path segment of the link
                articleids.append(temp.split("/")[-1])
            # print(len(articleids))
            # print(articleids)
            return articleids

        def get_all_articleids(self):
            pages = int(self._getTotalPages())
            articleids = []
            for index in range(pages):
                tempids = self._parsePage(int(index+1))
                articleids.extend(tempids)
            return articleids


    if __name__ == '__main__':
        bs = BlogScanner(bloger="marksinoberg")
        # print(bs._getTotalPages())
        # bs._parsePage(1)
        articleids = bs.get_all_articleids()
        print(len(articleids))
        print(articleids)
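The page count in `_getTotalPages` comes from the regular expression `(\d+).*?(\d+)`, which captures two numbers from the pager text and keeps the second. The sketch below isolates that step. The sample strings are assumptions about what the pager text looked like, and `total_pages` is a hypothetical wrapper rather than part of csdn.py.

```python
import re

PAGE_INFO = re.compile(r"(\d+).*?(\d+)")


def total_pages(text):
    """Return the second number captured from text, or None if no match."""
    found = PAGE_INFO.findall(text)
    return int(found[0][-1]) if found else None


print(total_pages("106条  共8页"))  # post count first, then page count
print(total_pages("12页"))          # only one number in the text
```

The second call shows why the original comment calls this hacky: when the text holds a single multi-digit number, the pattern splits it into `1` and `2` and reports 2 pages, so the approach depends on the pager always containing two separate numbers.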
Brush helper class

    # coding: utf8
    # @Author: 郭 璞
    # @File: brushhelper.py
    # @Time: 2017/10/5
    # @Contact: 1064319632@qq.com
    # @blog: http://blog.csdn.net/marksinoberg
    # @Description: start brushing
    import requests
    import random
    import time
    from redishelper import RedisHelper


    class FakeUserAgent(object):
        """
        A collection of User-Agent strings. Each call to popupUAs returns a
        different selection, to reduce the effect of anti-crawler checks.
        More: http://www.73207.com/useragent
        """

        def __init__(self):
            self.uas = [
                "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
                "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
                "JUC (Linux; U; 2.3.7; zh-cn; MB200; 320*480) UCWEB7.9.3.103/139/999",
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0a1) Gecko/20110623 Firefox/7.0a1 Fennec/7.0a1",
                "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
                "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
                "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/1A542a Safari/419.3",
                "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7",
                "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10",
                "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
                "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
                "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
                "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
                "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
                "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
                "Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
                "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
                "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
                "Openwave/ UCWEB7.0.2.37/28/999",
                "NOKIA5700/ UCWEB7.0.2.37/28/999",
                "UCWEB7.0.2.37/28/999",
                "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
                "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
                "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
                "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
            ]

        def _generateIndexes(self):
            # Choose how many UAs to return (possibly zero), then draw that
            # many distinct indexes at random
            numbers = random.randint(0, len(self.uas))
            indexes = []
            while len(indexes) < numbers:
                temp = random.randrange(0, len(self.uas))
                if temp not in indexes:
                    indexes.append(temp)
            return indexes

        def popupUAs(self):
            uas = []
            indexes = self._generateIndexes()
            for index in indexes:
                uas.append(self.uas[index])
            return uas


    class Brush(object):
        """
        Inflate the view count.
        """

        def __init__(self, bloger="marksinoberg"):
            self.bloger = "http://blog.csdn.net/{}".format(bloger)
            self.headers = {
                'Host': 'blog.csdn.net',
                # Header name fixed: the original had stray spaces around the hyphens
                'Upgrade-Insecure-Requests': '1',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
            }
            self.rh = RedisHelper()

        def getRandProxyIp(self):
            ip = self.rh.popupAvalibeIp()
            proxyip = {}
            ipinfo = "http://{}".format(str(ip.decode('utf8')))
            proxyip['http'] = ipinfo
            # print(proxyip)
            return proxyip

        def brushLink(self, articleid, randuas=None):
            # http://blog.csdn.net/marksinoberg/article/details/78058279
            bloglink = "{}/article/details/{}".format(self.bloger, articleid)
            # None instead of a mutable [] default argument
            for ua in randuas or []:
                self.headers['User-Agent'] = ua
                timeseed = random.randint(1, 3)
                print("Sleeping for {} s".format(timeseed))
                time.sleep(timeseed)
                for index in range(timeseed):
                    # requests.get(url=bloglink, headers=self.headers, proxies=self.getRandProxyIp())
                    requests.get(url=bloglink, headers=self.headers)


    if __name__ == '__main__':
        # fua = FakeUserAgent()
        # indexes = [0, 2,5,7]
        # indexes = generate_random_numbers(0, 18, 7)
        # randuas = fua.popupUAs(indexes)
        # randuas = fua.popupUAs()
        # print(len(randuas))
        # print(randuas)
        # print(fua._generateIndexes())
        brush = Brush("marksinoberg")
        # brush.brushLink(78058279, randuas)
        print(brush.getRandProxyIp())
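`_generateIndexes` and `popupUAs` together pick a random number of distinct list positions and return the strings at those positions. The standard library can express the same selection more directly with `random.sample`. The function below is a sketch of that alternative; `pick_random_subset` is a hypothetical name, not something brushhelper.py defines.

```python
import random


def pick_random_subset(items):
    """Return a random-sized selection of distinct positions from items."""
    count = random.randint(0, len(items))  # may be 0, may be all of them
    return random.sample(items, count)     # sampling without replacement
```

Like the original, this can return an empty list, in which case `brushLink` makes no requests for that round.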
Entry point

    # coding: utf8
    # @Author: 郭 璞
    # @File: Main.py
    # @Time: 2017/10/5
    # @Contact: 1064319632@qq.com
    # @blog: http://blog.csdn.net/marksinoberg
    # @Description: entry point
    from csdn import *
    from redishelper import RedisHelper
    from brushhelper import *
    import threading


    def main():
        rh = RedisHelper()
        bs = BlogScanner(bloger="marksinoberg")
        fua = FakeUserAgent()
        brush = Brush(bloger="marksinoberg")
        # Note: rh.popupArticleId() below assumes the article pool in Redis has
        # already been filled, e.g. from bs.get_all_articleids() via rh.sAddArticleId()
        counter = 0
        while counter < 12:
            # Start brushing
            print("Round {}!".format(counter))
            try:
                uas = fua.popupUAs()
                articleid = rh.popupArticleId()
                brush.brushLink(articleid, uas)
            except Exception as e:
                print(e)
                # TODO: add a logging handler
            counter += 1


    if __name__ == '__main__':
        # 280 threads, each running 12 rounds
        for i in range(280):
            temp = threading.Thread(target=main)
            temp.start()
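The entry script starts 280 threads at once. If that proves too heavy for the machine, one alternative is a bounded thread pool from the standard library, which queues the jobs and runs only a fixed number at a time. This is a generic sketch: `run_jobs` and `max_workers=8` are assumptions, and the job here is a stand-in rather than the `main` function above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_jobs(job, job_count, max_workers=8):
    """Run job(i) for i in range(job_count) on at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps results in submission order
        return list(pool.map(job, range(job_count)))


squares = run_jobs(lambda i: i * i, 5)
```

With a pool, raising `job_count` increases the total work without increasing the number of simultaneous connections.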
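Results
I ran a test on a post I had written earlier.
Post link: http://blog.csdn.net/marksinoberg/article/details/78058279
Before the run the post had 301 views. After a short run, the view count was as shown in the figure below:
[Screenshot: the post's view count after the run]
Summary
That's roughly it. This is a prototype at best, since the code is only about 45% complete. If you're interested, add me on QQ (1064319632) or leave your suggestions in the comments, so we can exchange ideas and learn together.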