Scraping dynamic ASPX pages with Python: how to scrape an ASPX page

I am trying to scrape a site, https://www.searchiqs.com/nybro/ (you have to click "Log In as Guest" to get to the search form). If I search for a Party 1 term such as "Andrew", the results are paginated, the request type is POST so the URL does not change, and the session times out very quickly: if I wait ten minutes and refresh the search URL, the page gives me a timeout error.

I got started with scraping recently, so I have mostly been doing GET requests where I can decipher the URL. So far I have realized that I will have to look at the DOM. Using Chrome DevTools I have found the headers, and from the Network tab I have found the following form data that is passed from the search page to the results page:

__EVENTTARGET:
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:/wEPaA8FDzhkM2IyZjUwNzg...(i have truncated this for length)
__VIEWSTATEGENERATOR:F92D01D0
__EVENTVALIDATION:/wEdAJ8BsTLFDUkTVU3pxZz92BxwMddqUSAXqb... (i have truncated this for length)
BrowserWidth:1243
BrowserHeight:705
ctl00$ContentPlaceHolder1$scrollPos:0
ctl00$ContentPlaceHolder1$txtName:david
ctl00$ContentPlaceHolder1$chkIgnorePartyType:on
ctl00$ContentPlaceHolder1$txtFromDate:
ctl00$ContentPlaceHolder1$txtThruDate:
ctl00$ContentPlaceHolder1$cboDocGroup:(ALL)
ctl00$ContentPlaceHolder1$cboDocType:(ALL)
ctl00$ContentPlaceHolder1$cboTown:(ALL)
ctl00$ContentPlaceHolder1$txtPinNum:
ctl00$ContentPlaceHolder1$txtBook:
ctl00$ContentPlaceHolder1$txtPage:
ctl00$ContentPlaceHolder1$txtUDFNum:
ctl00$ContentPlaceHolder1$txtCaseNum:
ctl00$ContentPlaceHolder1$cmdSearch:Search

All the fields whose names are in caps are hidden inputs. I have also managed to figure out the structure of the results.
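(For context: an ASP.NET postback only succeeds if you echo back the hidden __VIEWSTATE, __VIEWSTATEGENERATOR and __EVENTVALIDATION values from the page you were just served. A minimal sketch of that round trip with requests and BeautifulSoup is below; the field names come from the captured form data above, but the flow assumes you already hold a valid guest session cookie, which is exactly the complication the answer below runs into.)

import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"  # assumption: same page handles the POST

session = requests.Session()  # keeps the ASP.NET session cookie alive between requests

# 1. GET the search page and harvest every hidden input (__VIEWSTATE etc.)
page = session.get(SEARCH_URL)
soup = BeautifulSoup(page.text, "html.parser")
form_data = {
    inp["name"]: inp.get("value", "")
    for inp in soup.select("input[type=hidden]")
}

# 2. Add the visible search fields, mirroring the captured form data above
form_data.update({
    "ctl00$ContentPlaceHolder1$txtName": "david",
    "ctl00$ContentPlaceHolder1$chkIgnorePartyType": "on",
    "ctl00$ContentPlaceHolder1$cboDocGroup": "(ALL)",
    "ctl00$ContentPlaceHolder1$cboDocType": "(ALL)",
    "ctl00$ContentPlaceHolder1$cboTown": "(ALL)",
    "ctl00$ContentPlaceHolder1$cmdSearch": "Search",
})

# 3. POST it back; the response should be the first page of results
results = session.post(SEARCH_URL, data=form_data)
print(results.status_code)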

My script thus far is really pathetic, as I am completely blank on what to do next. I still have to do the form submission, analyze the pagination and scrape the results, but I have absolutely no idea how to proceed.

import re
import urlparse  # Python 2; in Python 3 this is urllib.parse

import mechanize
from bs4 import BeautifulSoup


class DocumentFinderScraper(object):

    def __init__(self):
        self.url = "https://www.searchiqs.com/nybro/SearchResultsMP.aspx"
        self.br = mechanize.Browser()
        self.br.addheaders = [('User-agent',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7')]

    def scrape(self):
        ##TO DO
        ##submit form
        #get return URL
        #scrape results
        #analyze pagination
        pass


if __name__ == '__main__':
    scraper = DocumentFinderScraper()
    scraper.scrape()

Any help would be dearly appreciated.

Solution

I disabled JavaScript and visited https://www.searchiqs.com/nybro/, and the form looks like this:

[screenshot: the login form, with the Log In and Log In as Guest buttons grayed out]

As you can see, the Log In and Log In as Guest buttons are disabled. This makes it impossible for Mechanize to work, because it cannot process JavaScript and you won't be able to submit the form.

For this kind of problem you can use Selenium, which drives a full browser, with the disadvantage of being slower than Mechanize.

This code should log you in using Selenium:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

usr = ""
pwd = ""

driver = webdriver.Firefox()
driver.get("https://www.searchiqs.com/nybro/")
assert "IQS" in driver.title

# Fill in the credentials and submit the login form
elem = driver.find_element_by_id("txtUserID")
elem.send_keys(usr)
elem = driver.find_element_by_id("txtPassword")
elem.send_keys(pwd)
elem.send_keys(Keys.RETURN)
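From here the remaining steps from the question (submitting the search, scraping the results, walking the pagination) can continue in the same driver session. A rough sketch follows; note that the guest-login button id, the rendered ids of the search fields (ASP.NET usually turns a name like ctl00$ContentPlaceHolder1$txtName into an id like ctl00_ContentPlaceHolder1_txtName), and the "Next" pager link are all assumptions, not verified against the live site:

from bs4 import BeautifulSoup

# All element ids below are guesses derived from the form field names in the
# question; verify them in the browser's inspector before relying on this.
driver.find_element_by_id("btnGuestLogin").click()  # hypothetical id for "Log In as Guest"

elem = driver.find_element_by_id("ctl00_ContentPlaceHolder1_txtName")
elem.send_keys("Andrew")
driver.find_element_by_id("ctl00_ContentPlaceHolder1_cmdSearch").click()

while True:
    # The rendered results page is ordinary HTML, so BeautifulSoup can parse it
    soup = BeautifulSoup(driver.page_source, "html.parser")
    # ... extract rows from the results table here ...

    # ASP.NET grids usually paginate via a "Next" postback link; the link text
    # here is an assumption
    next_links = driver.find_elements_by_link_text("Next")
    if not next_links:
        break
    next_links[0].click()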
