Python: interesting breakthroughs from web-crawling work

巴啦啦小魔仙变身

Category: Python. Tags: python

Article link: https://blog.csdn.net/qq_22038327/article/details/97934096

Python crawling: interesting points

1. ASP.NET sites

Posting the hidden form fields (__EVENTTARGET, __VIEWSTATE, __EVENTVALIDATION) straight back to the server is enough.
For example:

scrapy.FormRequest(self.startUrl, 
                   callback=self.parseSpflist, 
                   formdata=formData, 
                   dont_filter=False)

Here FormRequest is a subclass of Request.
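
The formData passed above has to come from the page itself. The following is a minimal sketch, using only the standard library, of how the hidden fields could be collected and combined with the postback target. The names HiddenFieldParser, build_postback_formdata, SAMPLE_PAGE and the control name pager$next are hypothetical, and the __EVENTARGUMENT field is an assumption added here because ASP.NET WebForms pages normally carry it next to __EVENTTARGET.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_formdata(html, event_target, event_argument=""):
    """Return the dict to post back: hidden fields plus the postback target."""
    parser = HiddenFieldParser()
    parser.feed(html)
    formdata = dict(parser.fields)
    # Assumption: the server reads the triggering control from these two fields.
    formdata["__EVENTTARGET"] = event_target
    formdata["__EVENTARGUMENT"] = event_argument
    return formdata


# Hypothetical page fragment; real __VIEWSTATE values are much longer.
SAMPLE_PAGE = """
<form method="post" action="./list.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz">
  <input type="hidden" name="__EVENTVALIDATION" value="AbCdEf">
  <a href="javascript:__doPostBack('pager$next','')">next</a>
</form>
"""

form_data = build_postback_formdata(SAMPLE_PAGE, "pager$next")
print(form_data)
```

Inside a spider, the resulting dict can be passed as formdata to scrapy.FormRequest. Scrapy also provides FormRequest.from_response(response, formdata=...), which pre-fills the form fields found in the response, so it may make the manual collection unnecessary.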

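2. Sites such as 66ip, where the cookie changes with the IP, the browser, and the time

How can this be solved?
No success yet; I will look at it again when I have time.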

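3. Fetching the next page when there are several a tags with no name and no id: locate the a tag by its text

3.1. See "python爬虫：scrapy框架xpath和css选择器语法" (Python crawling: XPath and CSS selector syntax in the Scrapy framework)
Locate with XPath: aTag = response.xpath("//a[contains(text(),'下一页')]"), where '下一页' is the link text meaning "next page"
Then read the link: href = aTag.xpath('./@href').get()

- 3.2. Locating by text with CSS: how? Can anyone suggest a method?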

4. Having found an input tag inside a td tag that has no attributes, go back up and read the td's text

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Find the input tag inside an attribute-less td tag, then read the td's text</title>
</head>
<body>
    <table>
        <tr>
            <td>
                <input name='radiobuild' bid="bid1">
                td1
            </td>
            <td>
                <input name='radiobuild' bid="bid2">
                td2
            </td>
            <td>
                <input name='radiobuild' bid="bid3">
                td3
            </td>
            <td>
                <input name='radiobuild' bid="bid4">
                td4
            </td>
        </tr>
    </table>
</body>
</html>

4.1. Get the inputs: inputTags = response.xpath("//input[@name='radiobuild']")
4.2. Get the td text. There are several ways; see "python爬虫：scrapy框架xpath和css选择器语法" (Python crawling: XPath and CSS selector syntax in the Scrapy framework)
4.2.0. inputTags[0].xpath("..") returns the td tag. Note: "../" does not work and raises an error; also, "../." is equivalent to ".."
4.2.1. inputTags[0].xpath("../text()").extract_first()
4.2.2. inputTags[0].xpath("string(..)").extract_first()
4.2.3. inputTags[0].xpath("./parent::*/text()").extract_first()

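5. Locate elements with CSS, then read an attribute value

scrapy shell http://newhouse.0557fdc.com/
response.css("[onclick]") returns all the a tags; how can the id attribute of each a tag be read from there?
With XPath it looks like this: response.css("[onclick='reurl(this)']")[0].xpath("./@id")[0]
With CSS: ?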