asp.net – How to submit a query to a .aspx page from Python

As an overview, you will need to perform four main tasks:

> submit request(s) to the web site,

> retrieve the response(s) from the site,

> parse these responses,

> have some logic to iterate over the tasks above, with parameters associated with navigation (to the "next" page in the results list).
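The four tasks above can be sketched as a simple fetch/parse/paginate loop. This is a minimal sketch: `fetch_page` and `parse_results` are hypothetical stubs (working on canned data here) standing in for the real HTTP and parsing logic described below.

```python
import re

def fetch_page(page_id):
    # Stub: in the real scraper this would issue the HTTP POST to the site.
    # Each entry maps a page id to (html body, id of the "next" page or None).
    pages = {
        1: ('<li>result A</li><li>result B</li>', 2),
        2: ('<li>result C</li>', None),  # None: no "next" link on the last page
    }
    return pages[page_id]

def parse_results(html):
    # Stub: real code would use HTMLParser or Beautiful Soup instead of a regex.
    return re.findall(r'<li>(.*?)</li>', html)

results = []
page = 1
while page is not None:                   # task 4: iterate with navigation params
    html, next_page = fetch_page(page)    # tasks 1 and 2: request and response
    results.extend(parse_results(html))   # task 3: parse the response
    page = next_page

print(results)  # all results, accumulated across pages
```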

The http request and response handling is done with methods and classes from Python's standard library modules urllib and urllib2. The parsing of the html pages can be done with Python's standard library HTMLParser, or with other modules such as Beautiful Soup.
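As a taste of the parsing side, here is a minimal sketch of an HTMLParser subclass that collects every link on a page; the sample HTML is made up for illustration. The try/except import keeps it working under both Python 2 and Python 3, where the module was renamed.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# Illustrative input only; a real run would feed the body of an http response.
sample = '<html><body><a href="/page1.aspx">One</a> <a href="/page2.aspx">Two</a></body></html>'
parser = LinkCollector()
parser.feed(sample)
print(parser.links)  # ['/page1.aspx', '/page2.aspx']
```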

The following snippet demonstrates requesting and receiving a search on the site indicated in the question. The site is ASP-driven, and as a result we need to make sure that we send several form fields, some of them with "horrible" values, since these are used by the ASP logic to maintain state and to authenticate the request to some extent. Indeed, the requests have to be sent with the http POST method, since that is what the ASP application expects. The main difficulty lies in identifying the form fields and associated values which ASP expects (getting pages with Python is the easy part).
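The hidden state fields can be lifted from an initial GET of the page, as the comments in the code below suggest. Here is a minimal sketch using a regular expression; the input fragment and the `hidden_field` helper are illustrative assumptions, not part of the original code.

```python
import re

# Sample fragment of the hidden state inputs an ASP.NET page emits
# (values shortened here; real ones run to hundreds of characters).
page = '''
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtNTA3..." />
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWDwL..." />
'''

def hidden_field(html, name):
    # Grab the value attribute of the hidden input with the given name.
    m = re.search(r'name="%s"[^>]*value="([^"]*)"' % re.escape(name), html)
    return m.group(1) if m else None

print(hidden_field(page, '__VIEWSTATE'))        # dDwtNTA3...
print(hidden_field(page, '__EVENTVALIDATION'))  # /wEWDwL...
```

In a real scraper these two values, refreshed from the live page, would replace the hardcoded ones in the formFields tuple below.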

This code is functional, or more precisely, was functional, until I removed most of the VSTATE value and possibly introduced a typo or two by adding the comments.

import urllib
import urllib2

uri = 'http://legistar.council.nyc.gov/Legislation.aspx'

# The http headers are useful to simulate a particular browser (some sites deny
# access to non-browsers, i.e. bots etc.)
# We also need to pass the content type for the POST body.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded',
}

# We group the form fields and their values in a list (any
# iterable, actually) of name-value tuples. This helps
# with clarity and also makes it easy to encode them later.
formFields = (
    # The viewstate is actually 800+ characters in length! It is truncated
    # here for this sample code. It can be lifted from the first page
    # obtained from the site. It may be OK to hardcode this value, or
    # it may have to be refreshed each time / each day, by essentially
    # running an extra page request-and-parse for this specific value.
    (r'__VSTATE', r'7TzretNIlrZiKb7EOB3AQE ... ...2qd6g5xD8CGXm5EftXtNPt+H8B'),

    # Following are more of these ASP form fields.
    (r'__VIEWSTATE', r''),
    (r'__EVENTVALIDATION', r'/wEWDwL+raDpAgKnpt8nAs3q+pQOAs3q/pQOAs3qgpUOAs3qhpUOAoPE36ANAve684YCAoOs79EIAoOs89EIAoOs99EIAoOs39EIAoOs49EIAoOs09EIAoSs99EI6IQ74SEV9n4XbtWm1rEbB6Ic3/M='),
    (r'ctl00_RadScriptManager1_HiddenField', ''),
    (r'ctl00_tabTop_ClientState', ''),
    (r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
    (r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),

    # But then we come to the fields of interest: the search
    # criteria, the collections to search from, etc.
    # Check boxes
    (r'ctl00$ContentPlaceHolder1$chkOptions$0', 'on'),  # file number
    (r'ctl00$ContentPlaceHolder1$chkOptions$1', 'on'),  # legislative text
    (r'ctl00$ContentPlaceHolder1$chkOptions$2', 'on'),  # attachment
    # etc. (not all listed)
    (r'ctl00$ContentPlaceHolder1$txtSearch', 'york'),               # search text
    (r'ctl00$ContentPlaceHolder1$lstYears', 'All Years'),           # years to include
    (r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'),       # types to include
    (r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation')  # the Search button itself
)

# These have to be url-encoded into the POST body.
encodedFields = urllib.urlencode(formFields)

req = urllib2.Request(uri, encodedFields, headers)
f = urllib2.urlopen(req)  # that's the actual call to the http site

# *** Here would normally be the in-memory parsing of f's
# contents, but instead we store them to a file.
# This is useful during design, allowing us to have a
# sample of what is to be parsed in a text editor, for analysis.
try:
    fout = open('tmp.htm', 'w')
    fout.writelines(f.readlines())
    fout.close()
except IOError:
    print('Could not open output file')

That is about it for getting the initial page. As said above, one then needs to parse the page, i.e. find the parts of interest and gather them as appropriate, and store them to file/database/wherever. This job can be done in very many ways: using html parsers, XSLT-type technologies (indeed after parsing the html to xml), or even crude jobs such as simple regular expressions. Also, one item that is typically extracted is the "next info", i.e. the various links of sorts that can be used in a new request to the server to get subsequent pages.
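As an example of extracting such "next info": ASP.NET grids typically paginate via javascript:__doPostBack(...) links, whose two arguments become the __EVENTTARGET and __EVENTARGUMENT form fields of the next POST. A minimal sketch, where the sample snippet (including the control name) is made up for illustration:

```python
import re

# Illustrative sample of the paging links an ASP.NET grid emits.
html = '''
<a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gridMain','Page$2')">2</a>
<a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gridMain','Page$3')">3</a>
'''

# Each match yields the (__EVENTTARGET, __EVENTARGUMENT) pair that the next
# POST would need to carry to request that page of results.
pages = re.findall(r"__doPostBack\('([^']+)','(Page\$\d+)'\)", html)
print(pages)
```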

This should give you a rough flavor of what "long hand" html scraping is about. There are many other approaches, such as dedicated utilities, scripts in Mozilla's (FireFox) GreaseMonkey plug-in, XSLT...
