Scraping dynamic ASPX pages with Python: how to scrape an ASPX page

I am trying to scrape a site, https://www.searchiqs.com/nybro/ (you have to click "Log In as Guest" to get to the search form). If I search for a Party 1 term such as "Andrew", the results are paginated, the request type is POST so the URL does not change, and the session times out very quickly: if I wait ten minutes and refresh the search URL, the page gives me a timeout error.

I got started with scraping recently, so I have mostly been doing GET requests where I can decipher the URL. So far I have realized that I will have to look at the DOM. Using Chrome DevTools I have found the headers, and from the Network tab I have found the following form data that is passed from the search page to the results page:

__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:/wEPaA8FDzhkM2IyZjUwNzg...(i have truncated this for length)
__VIEWSTATEGENERATOR:F92D01D0
__EVENTVALIDATION:/wEdAJ8BsTLFDUkTVU3pxZz92BxwMddqUSAXqb... (i have truncated this for length)
BrowserWidth:1243
BrowserHeight:705
ctl00$ContentPlaceHolder1$scrollPos:0
ctl00$ContentPlaceHolder1$txtName:david
ctl00$ContentPlaceHolder1$chkIgnorePartyType:on
ctl00$ContentPlaceHolder1$txtFromDate:
ctl00$ContentPlaceHolder1$txtThruDate:
ctl00$ContentPlaceHolder1$cboDocGroup:(ALL)
ctl00$ContentPlaceHolder1$cboDocType:(ALL)
ctl00$ContentPlaceHolder1$cboTown:(ALL)
ctl00$ContentPlaceHolder1$txtPinNum:
ctl00$ContentPlaceHolder1$txtBook:
ctl00$ContentPlaceHolder1$txtPage:
ctl00$ContentPlaceHolder1$txtUDFNum:
ctl00$ContentPlaceHolder1$txtCaseNum:
ctl00$ContentPlaceHolder1$cmdSearch:Search

All the fields whose names are in caps are hidden inputs. I have also managed to figure out the structure of the results.
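(For context: an ASP.NET postback only succeeds if you echo back the hidden __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION values from the page you were just served. A minimal sketch of that round trip with requests and BeautifulSoup is below; the field names come from the captured form data above, but the flow assumes you already hold a valid guest session cookie, which is exactly the complication the answer below runs into.)

import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"  # assumption: same page handles the POST

session = requests.Session()  # keeps the ASP.NET session cookie alive between requests

# 1. GET the search page and harvest every hidden input (__VIEWSTATE etc.)
page = session.get(SEARCH_URL)
soup = BeautifulSoup(page.text, "html.parser")
form_data = {
    inp["name"]: inp.get("value", "")
    for inp in soup.select("input[type=hidden]")
}

# 2. Add the visible search fields, mirroring the captured form data above
form_data.update({
    "ctl00$ContentPlaceHolder1$txtName": "david",
    "ctl00$ContentPlaceHolder1$chkIgnorePartyType": "on",
    "ctl00$ContentPlaceHolder1$cboDocGroup": "(ALL)",
    "ctl00$ContentPlaceHolder1$cboDocType": "(ALL)",
    "ctl00$ContentPlaceHolder1$cboTown": "(ALL)",
    "ctl00$ContentPlaceHolder1$cmdSearch": "Search",
})

# 3. POST it back; the response should be the first page of results
results = session.post(SEARCH_URL, data=form_data)
print(results.status_code)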

My script thus far is really pathetic, as I am completely blank on what to do next. I still have to do the form submission, analyze the pagination and scrape the results, but I have absolutely no idea how to proceed.

import re
import urlparse  # Python 2; in Python 3 this is urllib.parse

import mechanize
from bs4 import BeautifulSoup


class DocumentFinderScraper(object):

    def __init__(self):
        self.url = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"
        self.br = mechanize.Browser()
        self.br.addheaders = [('User-agent',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]

    def scrape(self):
        ##TO DO
        ##submit form
        #get return URL
        #scrape results
        #analyze pagination
        pass


if __name__ == '__main__':
    scraper = DocumentFinderScraper()
    scraper.scrape()

Any help would be dearly appreciated.

Solution

I disabled JavaScript and visited https://www.searchiqs.com/nybro/, and the form looks like this:

[screenshot: the login form, with the Log In and Log In as Guest buttons grayed out]

As you can see, the Log In and Log In as Guest buttons are disabled. This makes it impossible for Mechanize to work, because it cannot process JavaScript and you won't be able to submit the form.

For this kind of problem you can use Selenium, which drives a full browser, with the disadvantage of being slower than Mechanize.

This code should log you in using Selenium:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

usr = ""
pwd = ""

driver = webdriver.Firefox()
driver.get("https://www.searchiqs.com/nybro/")
assert "IQS" in driver.title

# Fill in the credentials and submit the login form
elem = driver.find_element_by_id("txtUserID")
elem.send_keys(usr)
elem = driver.find_element_by_id("txtPassword")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)
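From here the remaining steps from the question (submitting the search, scraping the results, walking the pagination) can continue in the same driver session. A rough sketch follows; note that the guest-login button id, the rendered ids of the search fields (ASP.NET usually turns a name like ctl00$ContentPlaceHolder1$txtName into an id like ctl00_ContentPlaceHolder1_txtName), and the "Next" pager link are all assumptions, not verified against the live site:

from bs4 import BeautifulSoup

# All element ids below are guesses derived from the form field names in the
# question; verify them in the browser's inspector before relying on this.
driver.find_element_by_id("btnGuestLogin").click()  # hypothetical id for "Log In as Guest"

elem = driver.find_element_by_id("ctl00_ContentPlaceHolder1_txtName")
elem.send_keys("Andrew")
driver.find_element_by_id("ctl00_ContentPlaceHolder1_cmdSearch").click()

while True:
    # The rendered results page is ordinary HTML, so BeautifulSoup can parse it
    soup = BeautifulSoup(driver.page_source, "html.parser")
    # ... extract rows from the results table here ...

    # ASP.NET grids usually paginate via a "Next" postback link; the link text
    # here is an assumption
    next_links = driver.find_elements_by_link_text("Next")
    if not next_links:
        break
    next_links[0].click()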
