Python crawler example: dynamic IPs + packet capture + automatic CAPTCHA recognition

Latest recommended article published on 2024-02-07 11:30:00

weixin_39849387


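Recently, for reasons best left unmentioned, I needed to scrape some data. The page in question shows a price-trend chart: hovering the mouse over it displays the price at a given moment, and I needed to scrape the dates and prices.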

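The first step, of course, is to read the page source, where I found an iframe. The price history appears to be delivered through this iframe, so I opened it and read its source as well.

From that source it is fairly clear that the data is fetched with a GET request for a JSON file, and what we want should be inside that JSON. Opening the browser's developer tools (F12) and going through the JSON responses one by one turns up the right one.

OK, we have found what we wanted. Analyzing the URL of this JSON reveals a pattern: it can be built directly from the product URL on the first page, except for one token parameter. I did find where the token comes from in the page source, but...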

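As a beginner I had no idea what to do with it, so I fell back to driving a real browser and capturing what it loads. Once the page has finished loading, the token can be read from the src attribute of that frame. Problem solved, time to crawl!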
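The URL construction described above can be sketched in isolation. This is only a minimal sketch, not the author's code: the helper names extract_token and build_history_url are made up for illustration, the query-string layout is copied from the script in this article, and the URLs in the usage note are placeholders. It splits the query string by hand so that the token is passed through exactly as it appears in the frame's src.

```python
from urllib.parse import urlparse

# Endpoint and fixed parameters as they appear in the article's script.
HISTORY_API = 'http://tool.manmanbuy.com/history.aspx'


def extract_token(iframe_src):
    """Return the raw value of the `token` query parameter, or None if absent."""
    for part in urlparse(iframe_src).query.split('&'):
        name, _, value = part.partition('=')
        if name == 'token':
            return value
    return None


def build_history_url(page_url, token):
    """Reuse the query string of the historyLowest page and append the token."""
    query = urlparse(page_url).query
    return (HISTORY_API + '?DA=1&action=gethistory&' + query +
            '&bjid=&spbh=&cxid=&zkid=&w=951&token=' + token)
```

With a frame src such as 'http://example.com/frame.aspx?w=951&token=abc123' (a placeholder), extract_token returns 'abc123', which build_history_url then appends to the query string taken from the product's historyLowest page.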

The code for the steps above is as follows:

# coding=utf8
# Imports as in the original script; this part uses only some of them.
import gc
import json
import os
import random
import re
import sys
import time
import urllib.request

import cv2
import numpy as np
import pytesseract
import pytz
import requests
from bs4 import BeautifulSoup
from matplotlib import pyplot as plt
from PIL import Image, ImageEnhance
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Module-level settings; the article fills these in further down.
headers = {}
proxy = None

# The script treats a response that points at this item as a failed lookup.
FALLBACK_SP_URL = 'https://detail.tmall.com/item.htm?id=544471454551'


def get_data_one_page(source, options, page):
    # The listing HTML contains product links in two different forms.
    pattern1 = re.compile(r'a href="(http://tool\.manmanbuy\.com/historyLowest\.aspx\?[^"]+)" target')
    pattern2 = re.compile(r'a href="(http://www\.manmanbuy\.com/disSitePro[^"]+)" v')
    html = requests.get(source, headers=headers).text

    # Use regular expressions to collect the URLs we need.
    url = re.findall(pattern1, html)
    # The second form redirects, so follow it to get the final URL.
    for link in re.findall(pattern2, html):
        url.append(requests.get(link).url)

    cnt = 0
    pattern_token = re.compile('token=.+')
    driver = webdriver.Firefox(options=options)  # Selenium 4 keyword
    # Set the page-load timeout.
    driver.set_page_load_timeout(8)
    os.makedirs('data', exist_ok=True)

    i = -1
    try_time = 3
    while i < len(url) - 1:
        i += 1
        this_url = url[i]
        try:
            driver.get(this_url)
        except Exception:
            if try_time:  # retry the same URL while attempts remain
                i -= 1
                try_time -= 1
            continue

        # Get the token: find the frame, read its src, extract the token from it.
        ret = driver.find_element(By.ID, 'iframeId').get_attribute('src')
        token = re.findall(pattern_token, ret)
        json_url = this_url.replace('http://tool.manmanbuy.com/historyLowest.aspx?', '')
        json_url = json_url.replace('item.tmall', 'detail.tmall')
        json_url = ('http://tool.manmanbuy.com/history.aspx?DA=1&action=gethistory&' +
                    json_url + '&bjid=&spbh=&cxid=&zkid=&w=951&' + token[0])

        # Fetch the JSON file, parse it and write it to a file.
        try:
            data = requests.get(json_url, proxies=proxy, headers=headers)
            data = json.loads(data.text)
        except Exception:
            if try_time:
                i -= 1
                try_time -= 1
            continue

        if 'spUrl' not in data or data['spUrl'] == FALLBACK_SP_URL:
            # Try again with the item.tmall form of the URL.
            json_url = json_url.replace('detail.tmall', 'item.tmall')
            try:
                data = requests.get(json_url, proxies=proxy, headers=headers)
                data = json.loads(data.text)
            except Exception:
                if try_time:
                    i -= 1
                    try_time -= 1
                continue

        if 'spName' in data:
            print(data['spName'])

        if 'spUrl' not in data or data['spUrl'] == FALLBACK_SP_URL:
            if try_time:
                i -= 1
                try_time -= 1
            else:
                # Out of retries: record the failing URL for later inspection.
                with open('data/error_data_' + str(page) + '_' + str(cnt), 'w') as file:
                    file.write(json_url + '\n')
                    file.write(str(data.get('datePrice', '')) + '\n')
            continue

        with open('data/data_' + str(page) + '_' + str(cnt), 'w') as file:
            if 'spName' in data:
                file.write(data['spName'] + '\n')
            file.write(data['datePrice'] + '\n')
        cnt += 1
        try_time = 3

    # After closing the browser, remember to release the memory manually.
    driver.quit()
    del driver
    gc.collect()


def get_data():
    print('firefox start')
    options = webdriver.FirefoxOptions()
    options.add_argument('-headless')  # replaces the removed set_headless()
    source = 'http://s.manmanbuy.com/Default.aspx?key=%BF%DA%BA%EC&btnSearch=%CB%D1%CB%F7'
    driver = webdriver.Firefox(options=options)
    driver.get(source)
    print('pass')

    current_page = 0  # 0-based index; the page box on the site is 1-based
    get_data_one_page(source, options, current_page)
    while current_page <= 1200:
        current_page += 1
        while True:
            try:
                driver.get(source)
            except Exception:
                time.sleep(2)
                continue
            break
        pagenum = driver.find_element(By.ID, 'pagenum')
        pagenum.send_keys(str(current_page + 1))
        button = driver.find_element(By.XPATH, '/html/body/div[1]/div[5]/div[3]/div[5]/input[2]')
        button.click()
        # Note: get_data_one_page requests `source` again, as in the original script.
        get_data_one_page(source, options, current_page)
    driver.close()


if __name__ == '__main__':
    # Redirect print output to a log file ('a' creates the file if it is missing).
    old = sys.stdout
    sys.stdout = open('log', 'a')
    try:
        get_data()
    finally:
        sys.stdout.close()
        sys.stdout = old
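The retry bookkeeping in the script (stepping i back and counting try_time down) could also be factored into a small helper. The following is a rough sketch under my own assumptions, not the author's method: with_retries is a hypothetical name, and unlike the script, which shares one budget of three retries across all steps for an item, it gives each call its own budget.

```python
import time


def with_retries(action, attempts=3, delay=0.0):
    """Call action() until it succeeds or attempts run out.

    Returns the result of the first successful call; re-raises the last
    exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            # Pause only when another attempt will follow.
            if delay and attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

In the crawler this might wrap a single step, for example with_retries(lambda: driver.get(this_url)), so the main loop no longer has to adjust its own index.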

And then my IP got banned and a CAPTCHA showed up.

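The IP ban is easy to deal with: use dynamic IPs and switch to a new one after every few items. As for where to get the IPs, paying for a few is more stable; free ones exist too, but most of them do not work.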
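"Switch after every few items" can be expressed as a small counter around a proxy pool. This is a minimal sketch, assuming the pool is a plain list of [host, port] pairs; the class name ProxyRotator, the uses_per_proxy parameter and the addresses in the usage note are all made up for illustration.

```python
import itertools


class ProxyRotator:
    """Hand out the same proxy a fixed number of times, then move to the next."""

    def __init__(self, pool, uses_per_proxy=5):
        if not pool:
            raise ValueError('proxy pool is empty')
        self._cycle = itertools.cycle(pool)  # wraps around at the end of the pool
        self._uses_per_proxy = uses_per_proxy
        self._uses = 0
        self._current = next(self._cycle)

    def get(self):
        # Switch once the current proxy has been handed out often enough.
        if self._uses >= self._uses_per_proxy:
            self._current = next(self._cycle)
            self._uses = 0
        self._uses += 1
        return self._current
```

For instance, ProxyRotator([['192.0.2.10', '8080'], ['192.0.2.11', '3128']], uses_per_proxy=2) returns the first address twice, then the second twice, then starts over. The addresses are documentation placeholders, not working proxies.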

There is one more issue: the code above never changed the request headers, which may also be why it got banned.

First, set the headers:

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.5',
}

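To decide what to fill in, visit the site normally and look at the headers the browser actually sends.

Then add the following before the browser is started: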

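# user_agent is assumed to be a list of User-Agent strings defined elsewhere.
agent = random.choice(user_agent)
headers['User-Agent'] = agent  # random.choice(agent) would pick a single character
# ip = ['183.159.82.25', '18118']
ip = random_ip()
# Firefox reads these settings as preferences; the user-agent= and
# --proxy-server= arguments are Chrome switches.
options.set_preference('general.useragent.override', agent)
options.set_preference('network.proxy.type', 1)  # 1 = manual proxy configuration
options.set_preference('network.proxy.http', ip[0])
options.set_preference('network.proxy.http_port', int(ip[1]))
options.set_preference('network.proxy.ssl', ip[0])
options.set_preference('network.proxy.ssl_port', int(ip[1]))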

random_ip simply picks one at random from the IPs that are known to work.
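The article does not show random_ip itself, so the following is only a guess at its shape. It assumes the usable proxies are kept as text with one host:port entry per line; parse_proxy_list is a hypothetical helper, and this random_ip takes the pool as an argument, whereas the one in the article takes none.

```python
import random


def parse_proxy_list(text):
    """Turn 'host:port' lines into [host, port] pairs, skipping blank or malformed lines."""
    pool = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        host, _, port = line.rpartition(':')
        if host and port:
            pool.append([host, port])
    return pool


def random_ip(pool):
    """Return one [host, port] pair chosen at random from the pool."""
    if not pool:
        raise ValueError('no usable proxies')
    return random.choice(pool)
```

The returned pair has the same shape as the ip value used in the article, so ip[0] is the host and ip[1] is the port as a string.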

Also, before the requests.get call, add:

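# requests picks the proxy by the scheme of the target URL, and the JSON URL is
# plain http, so register the proxy for both schemes.
address = 'http://' + ip[0] + ':' + ip[1]
proxy = {'http': address, 'https': address}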

and change the requests.get call to:

data = requests.get(json_url, proxies=proxy, headers=headers)