Crawling Through Web Forms and Login Windows

Dmnq

Category: python. Tags: python, requests, forms

Article link: https://blog.csdn.net/weixin_42750907/article/details/89316529

1 Submitting a basic form
Page: http://pythonscraping.com/pages/files/form.html
The source of this form is:

<form method="post" action="processing.php">
First name: <input type="text" name="firstname"><br>
Last name: <input type="text" name="lastname"><br>
<input type="submit" value="Submit" id="submit">
</form>

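The real work of this form happens in processing.php (that is, http://pythonscraping.com/pages/files/processing.php, since the action attribute is resolved relative to the form page). Every POST request from the form is sent to that page. To stress the point: the only purpose of an HTML form is to help visitors send a well-formed request to the page on the server that actually handles it, a page the visitor does not see in the form itself.

The code is: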
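The relationship between the form and its target page can be checked mechanically. The following is a minimal sketch using only the standard library, assuming the form markup shown above; `FormInspector` is a hypothetical helper written for this illustration, not part of any library. It reads the form's `method` and `action`, collects the named inputs, and resolves the relative `action` against the form page's URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FORM_PAGE = 'http://pythonscraping.com/pages/files/form.html'
FORM_HTML = '''
<form method="post" action="processing.php">
First name: <input type="text" name="firstname"><br>
Last name: <input type="text" name="lastname"><br>
<input type="submit" value="Submit" id="submit">
</form>
'''


class FormInspector(HTMLParser):
    """Hypothetical helper: records a form's method, action, and named inputs."""

    def __init__(self):
        super().__init__()
        self.method = None
        self.action = None
        self.field_names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            # HTML forms default to GET when no method is given.
            self.method = attrs.get('method', 'get').lower()
            self.action = attrs.get('action', '')
        elif tag == 'input' and 'name' in attrs:
            # Only inputs with a name attribute are submitted.
            self.field_names.append(attrs['name'])


inspector = FormInspector()
inspector.feed(FORM_HTML)

# Resolve the relative action against the page that contains the form.
target_url = urljoin(FORM_PAGE, inspector.action)

print(inspector.method)       # post
print(target_url)             # http://pythonscraping.com/pages/files/processing.php
print(inspector.field_names)  # ['firstname', 'lastname']
```

The submit button has no `name` attribute, so it does not appear among the fields; the two names that do appear are the keys the POST request needs to carry.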

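import requests

# The page that processes the form, not the page that displays it
url = 'http://pythonscraping.com/pages/files/processing.php'
# Keys must match the name attributes of the form's input fields
params = {'firstname': 'Ran', 'lastname': 'Mitchell'}
r = requests.post(url, data=params)
print(r.text)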

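Note that the URL here is processing.php, the page where the action takes place, not the HTML page that contains the form.

After the form is submitted, the program prints the source of the page that handled it, which includes this line: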

Hello there,Ran Mitchell!
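To see what such a POST actually carries, the request can be built without being sent. This is a small sketch, assuming the same URL and field values as above; it uses `requests.Request(...).prepare()` so that it runs without any network access.

```python
import requests

url = 'http://pythonscraping.com/pages/files/processing.php'
params = {'firstname': 'Ran', 'lastname': 'Mitchell'}

# Build the request but do not send it, so nothing goes over the network.
prepared = requests.Request('POST', url, data=params).prepare()

print(prepared.method)                   # POST
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
print(prepared.body)                     # firstname=Ran&lastname=Mitchell
```

The dictionary passed as `data` is form-encoded into the request body, which is the same shape of request a browser would produce when the form is submitted.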

2 Handling logins and cookies

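import requests

# A Session keeps cookies between requests
session = requests.Session()
params = {'username': 'username', 'password': 'password'}
s = session.post('http://pythonscraping.com/pages/cookies/welcome.php', data=params)
print('Cookie is set to:')
# The attribute is "cookies" (plural); "cookie" raises AttributeError
print(s.cookies.get_dict())
print('..........')
print('Going to profile page...')
# The same session sends the login cookies along with this request
s = session.get('http://pythonscraping.com/pages/cookies/profile.php')
print(s.text)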

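The session object (obtained by calling requests.Session()) keeps track of session information such as cookies, headers, and even information about the protocols running on top of HTTP, such as HTTPAdapter (which provides a uniform interface for HTTP and HTTPS connections).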
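The state a session carries can be inspected offline. The sketch below is illustrative only: the `loggedin` cookie, the `example-scraper/0.1` User-Agent string, and the retry count are assumptions made for the example, not values the site is known to use. It sets a cookie and a header on the session, mounts an `HTTPAdapter`, and then prepares (without sending) a request to show that the session attaches all of them.

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()

# Assumed values, for illustration only.
session.headers.update({'User-Agent': 'example-scraper/0.1'})
session.cookies.set('loggedin', '1', domain='pythonscraping.com', path='/')

# Mount an adapter for this site; max_retries=3 is an arbitrary choice.
adapter = HTTPAdapter(max_retries=3)
session.mount('http://pythonscraping.com', adapter)

# Prepare a request through the session without sending it.
request = requests.Request(
    'GET', 'http://pythonscraping.com/pages/cookies/profile.php')
prepared = session.prepare_request(request)

print(prepared.headers['User-Agent'])
print(prepared.headers.get('Cookie'))
print(session.get_adapter(prepared.url) is adapter)
```

Because the cookie and header live on the session rather than on a single request, every later `session.get` or `session.post` to the same site carries them automatically, which is what lets the profile page recognize the earlier login.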