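The target is a questionnaire hosted on 问卷星 (sojump.com), which has the sneakiest anti-crawling measures of any site of its kind.
Rate limiting (more than 22 submissions from the same IP within a short time gets blocked), a check code, and CAPTCHAs all stand in the way of a crawler.
In fact, I could not find a working crawler for it on GitHub.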
# coding=utf-8
import requests
from time import time, strftime, localtime
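# Request headers: pretend to be a desktop Chrome browser.
a1 = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
# Query-string parameters sent along with the submission.
a2 = {
    'curid': '15231557',
    'starttime': strftime("%Y/%m/%d %H:%M:%S", localtime()),
    'source': 'directphone',
    'submittype': '1',
    # The digits after the decimal point in rn are presumably a check code
    # that changes on every submission; if the crawler's submission fails,
    # this is the reason.
    'rn': '1960163843.31946821',
    't': str(int(time() * 1000))}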
print(a2)
try:
    # submitdata encodes the answers as question$option pairs joined by '}'.
    r = requests.post(
        'https://sojump.com/m/15231557.aspx?from=timeline',
        headers=a1,
        params=a2,
        data={'submitdata': '1$1}2$3}3$2}4$1}5$2}6$2}7$2}8$1}9$2}10$3}11$3}12$2}13$2}14$1}15$1}16$2}17$1}18$1}19$1}20$1'},
    )
    print(r.text)
except requests.RequestException:
    print('failure')
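
The submitdata string above appears to be a list of question$option pairs joined by '}'. That format is only inferred from the literal in this script, not from any documentation. Rather than typing the string by hand, it could be built from a dict. This is a minimal sketch: the helper name build_submitdata is hypothetical, and it assumes every question is single-choice.

```python
def build_submitdata(answers):
    """Join {question_number: option_number} into 'q$a}q$a}...' form.

    Assumption: the format is inferred from the hard-coded string in the
    script; only single-choice questions are handled.
    """
    # Sort by question number so the output order is deterministic.
    return '}'.join('%d$%d' % (q, a) for q, a in sorted(answers.items()))


# Hypothetical answers for the first three questions.
answers = {1: 1, 2: 3, 3: 2}
print(build_submitdata(answers))
```

The result can then be passed as data={'submitdata': build_submitdata(answers)} in the request.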