Preface
This exercise uses a crawler that disguises itself as a normal user visiting a page, in order to inflate that page's view count.
2019.4.11: failed
Reason for failure: the program runs, but every request fails with a proxy error; the next step is to try some high-anonymity (elite) proxies
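Before blaming anonymity, it may help to check whether each scraped proxy is even alive. A minimal sketch, assuming `requests` is available; http://httpbin.org/ip is only an example echo endpoint (my choice, not from the original articles), and any stable page would do:

```python
import requests

def proxy_works(ip, timeout=5):
    """Rough liveness check: make one request through the proxy."""
    proxy = {'http': 'http://' + ip, 'https': 'http://' + ip}
    try:
        r = requests.get('http://httpbin.org/ip',
                         proxies=proxy, timeout=timeout)
        return r.status_code == 200   # proxy answered and relayed the request
    except requests.RequestException:
        return False                  # dead, refused, or timed out
```

Filtering the scraped list with something like `ip_list = [ip for ip in ip_list if proxy_works(ip)]` before the main loop would separate dead proxies from ones that are merely not anonymous enough.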
The articles I consulted:
https://blog.csdn.net/c406495762/article/details/72793480#23-正常的访问速度 -- (strongly recommended; very detailed, and I learned a lot from it)
https://blog.csdn.net/dala_da/article/details/79401163
https://www.jb51.net/article/99984.htm
https://blog.csdn.net/qq_41782425/article/details/84993073
https://blog.csdn.net/wenxuhonghe/article/details/85226490
Main Content
In this exercise I also learned how to scrape a table out of a web page, i.e. content held in 'tr' and 'td' tags.
BeautifulSoup together with its get_text() method pulls out the wanted data; code as follows:
for tr in soup.findAll('tr'):
    # search within the current row only; calling soup.findAll here
    # would append every IP once per table row
    td = tr.find('td', {"data-title": "IP"})
    if td:
        ip_list.append(td.get_text())
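As a self-contained illustration of the same pattern (the mini table below is made up for the example):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td data-title="IP">1.2.3.4</td><td data-title="PORT">8080</td></tr>
  <tr><td data-title="IP">5.6.7.8</td><td data-title="PORT">3128</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
ip_list = []
for tr in soup.find_all('tr'):                  # iterate the rows
    td = tr.find('td', {"data-title": "IP"})    # the IP cell of this row
    if td:
        ip_list.append(td.get_text())

print(ip_list)  # ['1.2.3.4', '5.6.7.8']
```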
Everything else came from the first article (the one I strongly recommend), which shows how to disguise the crawler as a normal user. To borrow a line from its author: when crawling, spare a thought for the site on the other end; add plenty of time.sleep calls and crawl politely.
The full code is attached below; corrections from more experienced readers are welcome~~
from bs4 import BeautifulSoup
import requests
import random, time

# Store the scraped proxy IPs in a list for later use
def get_proxy():
    user_agent = random.choice(UA)
    # print(user_agent)
    headers = {'User-Agent': user_agent}
    try:
        r = requests.get('https://www.kuaidaili.com/free/intr/', headers=headers)
    except requests.HTTPError as e:
        print(e)
        print("httpError")
        return
    except requests.RequestException as e:
        print(e)
        return
    except Exception:
        print("Unknown Error")
        return
    soup = BeautifulSoup(r.text, "lxml")
    for tr in soup.findAll('tr'):
        # search within the current row only; scanning the whole soup here
        # would append every IP once per table row
        td = tr.find('td', {"data-title": "IP"})
        if td:
            ip_list.append(td.get_text())
    # print(ip_list)

def main():
    times = 0
    referer_list = [
        'http://blog.csdn.net/dala_da/article/details/79401163',
        'http://blog.csdn.net/',
        'https://www.sogou.com/tx?query=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F&hdq=sogou-site-706608cfdbcc1886-0001&ekv=2&ie=utf8&cid=qb7.zhuye&',
        'https://www.baidu.com/s?tn=98074231_1_hao_pg&word=%E4%BD%BF%E7%94%A8%E7%88%AC%E8%99%AB%E5%88%B7csdn%E8%AE%BF%E9%97%AE%E9%87%8F'
    ]
    url = 'https://blog.csdn.net/neolz/article/details/89025388'
    while 1:
        # pick a fresh identity for every request; the header values must be
        # plain strings (not one-element sets) for requests to accept them
        headers = {
            'User-Agent': random.choice(UA),
            'Referer': random.choice(referer_list)
        }
        # random.choice avoids the off-by-one of randint(1, len(ip_list))
        ip = random.choice(ip_list)
        proxy = {'https': 'https://' + ip, 'http': 'http://' + ip}
        try:
            response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
            print(response)
        except requests.RequestException:
            print('Proxy failed: ' + proxy['https'])
            time.sleep(5)
        else:
            times += 1
            # the format string must be applied inside print(), not to its return value
            print('Refreshed %d times, %s' % (times, proxy['https']))
            if not times % len(ip_list):
                time.sleep(10)

if __name__ == '__main__':
    UA = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Opera/8.0 (Windows NT 5.1; U; en)',
        'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
        'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
        'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
    ]
    ip_list = []
    get_proxy()
    if ip_list:
        main()
    else:
        # guard: random.choice and % len(ip_list) would both fail on an empty list
        print('No proxies scraped; nothing to refresh')