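The target is a questionnaire hosted on 问卷星 (sojump.com), which has the sneakiest anti-scraping measures of any site of its kind.
Blocking of frequent access (more than 22 submissions from the same IP within a short time), check codes, and CAPTCHAs all get in a crawler's way.
In fact, I could not find a working crawler for it on GitHub.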
# coding=utf-8
import requests
from time import time, strftime, localtime
# Request headers: present a desktop Chrome User-Agent
a1 = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
}
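# Query-string parameters sent along with the form data
a2 = {
    'curid': '15231557',
    'starttime': strftime("%Y/%m/%d %H:%M:%S", localtime()),
    'source': 'directphone',
    'submittype': '1',
    # The digits after the decimal point in rn are presumably a check code that
    # changes on every submission; if the crawler's submission fails, this is why.
    'rn': '1960163843.31946821',
    't': str(int(time() * 1000)),  # current time in milliseconds
}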
print(a2)
try:
    r = requests.post(
        'https://sojump.com/m/15231557.aspx?from=timeline',
        headers=a1,
        params=a2,
        data={'submitdata': '1$1}2$3}3$2}4$1}5$2}6$2}7$2}8$1}9$2}10$3}11$3}12$2}13$2}14$1}15$1}16$2}17$1}18$1}19$1}20$1'},
    )
    print(r.text)
except requests.RequestException:
    print('failure')
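
The submitdata string in the script above looks like a list of question$choice pairs joined with "}". That reading is inferred only from the one string used here, so treat the sketch below as a guess at the format rather than a description of the site's protocol. The helper names build_submitdata and parse_submitdata are hypothetical, and how multi-select or free-text questions are encoded is not covered.

```python
def build_submitdata(answers):
    """Join {question_number: choice} into 'q$choice}q$choice...' (assumed format)."""
    return '}'.join('%d$%s' % (q, answers[q]) for q in sorted(answers))


def parse_submitdata(submitdata):
    """Inverse of build_submitdata: return {question_number: choice_string}."""
    result = {}
    for item in submitdata.split('}'):
        question, choice = item.split('$', 1)
        result[int(question)] = choice
    return result


# The first three answers from the script above
answers = {1: 1, 2: 3, 3: 2}
encoded = build_submitdata(answers)
print(encoded)                    # 1$1}2$3}3$2
print(parse_submitdata(encoded))
```

Keeping the answers in a dict makes it easier to see which choice belongs to which question than editing the long literal by hand.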