2021SC@SDUSC
Main parameters of the class method classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])
are as follows:
Parameters
response (Response object) – the response containing an HTML form, which is used to pre-populate the form fields.
formname (str) – if given, the form whose name attribute is set to this value is used.
formid (str) – if given, the form whose id attribute is set to this value is used.
formxpath (str) – if given, the first form that matches the XPath is used.
formcss (str) – if given, the first form that matches the CSS selector is used.
formnumber (int) – the index of the form to use when the response contains multiple forms. The first one (and the default) is 0.
formdata (dict) – fields to override in the form data. If a field is already present in the response <form> element, its value is overridden by the one passed in this parameter. If a value passed in this parameter is None, the field is not included in the request, even if it was present in the response <form> element.
clickdata (dict) – attributes used to look up the clicked control. If not given, the form data is submitted simulating a click on the first clickable element. In addition to HTML attributes, the control can be identified by its zero-based index relative to the other submittable inputs inside the form, via the nr attribute.
dont_click (bool) – if True, the form data is submitted without clicking any element.
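The formdata override rule described above can be illustrated with a small standard-library sketch. This is not Scrapy's code: merge_form_fields and the sample field names are hypothetical, and the sketch only models the documented behaviour (overrides replace form values, and None removes a field).

```python
def merge_form_fields(form_fields, formdata=None):
    """Combine fields parsed from a <form> with caller-supplied overrides.

    form_fields: list of (name, value) pairs found in the HTML form.
    formdata: dict of overrides; a value of None drops the field entirely.
    Hypothetical helper, written only to illustrate the documented rule.
    """
    overrides = dict(formdata or {})
    # Keep only the form's own fields that the caller did not mention.
    merged = [(name, value) for name, value in form_fields if name not in overrides]
    # Append the overrides, skipping any that were explicitly set to None.
    merged.extend((name, value) for name, value in overrides.items() if value is not None)
    return merged


# Sample fields as they might appear in a login form (made-up values).
page_fields = [("csrf_token", "abc123"), ("username", ""), ("newsletter", "on")]
result = merge_form_fields(page_fields, {"username": "john", "newsletter": None})
print(result)
```

Here the hidden csrf_token is kept untouched, username is replaced, and newsletter is left out of the request because its override is None.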
@classmethod
def from_response(cls, response, formname=None, formid=None, formnumber=0, formdata=None,
                  clickdata=None, dont_click=False, formxpath=None, formcss=None, **kwargs):

    kwargs.setdefault('encoding', response.encoding)

    if formcss is not None:
        from parsel.csstranslator import HTMLTranslator
        formxpath = HTMLTranslator().css_to_xpath(formcss)

    form = _get_form(response, formname, formid, formnumber, formxpath)
    formdata = _get_inputs(form, formdata, dont_click, clickdata, response)
    url = _get_form_url(form, kwargs.pop('url', None))

    method = kwargs.pop('method', form.method)
    if method is not None:
        method = method.upper()
        if method not in cls.valid_form_methods:
            method = 'GET'

    return cls(url=url, method=method, formdata=formdata, **kwargs)
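The method handling at the end of from_response can be isolated as a small sketch. The helper name normalize_form_method is hypothetical, and the tuple ("GET", "POST") is an assumption standing in for cls.valid_form_methods, which the listing above does not show.

```python
# Assumed stand-in for cls.valid_form_methods; the listing does not show its value.
VALID_FORM_METHODS = ("GET", "POST")


def normalize_form_method(method, valid_methods=VALID_FORM_METHODS):
    """Mirror the branch in from_response: upper-case the method and
    fall back to GET when it is not an accepted form method."""
    if method is None:
        # from_response leaves None untouched and passes it on to the constructor.
        return None
    method = method.upper()
    if method not in valid_methods:
        method = "GET"
    return method


print(normalize_form_method("post"))
print(normalize_form_method("put"))
```

In other words, a form declaring an unsupported method such as put or dialog is submitted as a GET request, while a lower-case post is simply upper-cased.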
class: scrapy.http.FormRequest(url[, formdata, ...])
The FormRequest class adds a new keyword argument to the __init__ method. The remaining arguments are the same as for the Request class and are not documented here.
Parameters
formdata (dict or collections.abc.Iterable) – a dictionary (or iterable of (key, value) tuples) containing HTML form data, which is URL-encoded and assigned to the body of the request.
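To make "URL-encoded and assigned to the body" concrete, here is a minimal sketch using only the standard library. encode_form_body is a hypothetical helper, not part of Scrapy, and it returns text rather than the bytes a real request body would hold.

```python
from urllib.parse import urlencode


def encode_form_body(formdata):
    """URL-encode form data given as a dict or as an iterable of (key, value) tuples.

    Hypothetical helper for illustration; returns a str for readability.
    """
    items = formdata.items() if isinstance(formdata, dict) else formdata
    # doseq=True also expands list values into repeated key=value pairs.
    return urlencode(list(items), doseq=True)


body_from_dict = encode_form_body({"name": "John Doe", "age": "27"})
body_from_pairs = encode_form_body([("tag", "a"), ("tag", "b")])
print(body_from_dict)   # name=John+Doe&age=27
print(body_from_pairs)  # tag=a&tag=b
```

The iterable-of-tuples form is useful when the same key has to appear more than once, which a plain dict cannot express directly.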
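A few request usage examples:
Using FormRequest to send data via HTTP POST:
If you need to simulate an HTML form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this: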
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
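Using FormRequest.from_response() to simulate a user login
Websites usually provide pre-populated form fields through <input type="hidden"> elements, such as session-related data or authentication tokens (for login pages). When scraping, you want these fields to be pre-populated automatically and to override only a few of them, such as the username and password. The FormRequest.from_response() method handles this. An example: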
import scrapy


def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass


class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...
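The authentication_failed placeholder above is left as a TODO. One possible way to fill it in, as a sketch only, is to look for an error message in the page text. The marker string is a made-up assumption (each site reports login failures differently), and SimpleNamespace objects stand in for real responses so the snippet runs without Scrapy.

```python
from types import SimpleNamespace

# Hypothetical failure message; inspect the target site's login page to find the real one.
FAILURE_MARKER = "Invalid username or password"


def authentication_failed(response, marker=FAILURE_MARKER):
    """Return True if the response body contains the failure marker."""
    return marker in response.text


# Stand-ins exposing only the .text attribute that this check relies on.
failed_page = SimpleNamespace(text="<p>Invalid username or password</p>")
ok_page = SimpleNamespace(text="<h1>Welcome back, john</h1>")

print(authentication_failed(failed_page))  # True
print(authentication_failed(ok_page))      # False
```

Checking for a marker that only appears after a successful login (for example a logout link) is an equally reasonable design, and it fails safe when the error wording changes.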