A small scraping example with selenium + GraphQuery in Python

待鸣

Category: Backend technology. Tags: crawler, python

Article link: https://blog.csdn.net/oZuoYu123/article/details/93891445


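Anti-scraping measures change by the day, and crawler techniques keep escalating to match. After running into IP bans, JavaScript-based anti-bot checks and similar defenses, I settled on a scraping combination that works reasonably well.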

GraphQuery: https://github.com/storyicon/graphquery

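There does not seem to be much Chinese-language material on it, but it is quite capable: you describe the data you want in a query, much like calling an API, and get it back already structured.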

selenium:

A tool for driving Chrome and automating browser actions.

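The basic idea:

selenium loads the page and returns its rendered source; that source is passed straight to GraphQuery together with an expression describing the fields wanted, and GraphQuery returns exactly that data.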
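
The flow just described can be sketched without a browser or a running GraphQuery service by passing both in as plain callables. This is only an illustration: `collect_links`, `fetch_html` and `run_query` are hypothetical names, and the response shape `{"data": {"url": [...]}}` is assumed from what this post's script reads back. The expression itself is the one used in the post.

```python
import json
from urllib.parse import urljoin

# GraphQuery expression from this post: select every <a> inside a table
# and return each one's href attribute.
LINK_EXPR = r"""
{
    url `css("table a")` [u `attr("href")`]
}
"""


def collect_links(page_url, fetch_html, run_query):
    """Fetch a page, ask GraphQuery for its table links, return absolute URLs.

    fetch_html(url) -> str        (hypothetical; e.g. selenium's page_source)
    run_query(html, expr) -> str  (hypothetical; e.g. an HTTP POST to GraphQuery)
    """
    html = fetch_html(page_url)
    raw = run_query(html, LINK_EXPR)
    # Assumed response shape: {"data": {"url": ["relative/link", ...]}}
    hrefs = json.loads(raw)["data"]["url"]
    # Resolve relative hrefs against the page they were found on.
    return [urljoin(page_url, href) for href in hrefs]
```

In real use `fetch_html` would wrap `driver.get` plus `driver.page_source`, and `run_query` would wrap the POST request to the GraphQuery service. Keeping them as parameters makes the parsing step easy to check with canned input.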

Here is the full script I used. It is fairly simple, just a throwaway tool:

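import json

import requests
from selenium import webdriver

# Step 1: collect all the detail-page links from the listing pages.
# Example listing URL:
# url = "http://yjj.sh.gov.cn/XingZhengChuFa/xxgk2.aspx?pu=&qymc=&slrqstart=&slrqend=&pageindex=1&pagesize=100"

driver_path = r"G:\chromedriver_win32\chromedriver.exe"
opt = webdriver.ChromeOptions()

# Run Chrome without a visible window.
opt.add_argument('--headless')
opt.add_argument('--disable-gpu')

# executable_path is the Selenium 3 form. Recent Selenium 4 releases removed it;
# there, pass service=Service(driver_path) with
# `from selenium.webdriver.chrome.service import Service`.
driver = webdriver.Chrome(executable_path=driver_path, options=opt)

BASE_URL = 'http://yjj.sh.gov.cn/XingZhengChuFa/'
KEYWORDS = ["沙拉", "色拉"]  # two spellings of "salad"

# Detail-page URLs that mention any of the keywords.
res = []


def GraphQuery(document, expr):
    """Send the HTML document and a GraphQuery expression to the local
    GraphQuery HTTP service and return the raw response text."""
    response = requests.post("http://127.0.0.1:8559", data={
        "document": document,
        "expression": expr,
    })
    return response.text


def go(url):
    """Open one listing page, extract its detail links, and record the
    ones whose page text contains a keyword."""
    driver.get(url)
    content = driver.page_source
    conseq = GraphQuery(content, r"""
                    {
                        url `css("table a")` [u `attr("href")`]
                    }
                """)
    count = json.loads(conseq)
    # print(count)
    # Step 2: open each detail link in turn.
    for href in count['data']['url']:
        u = BASE_URL + href
        page = requests.get(u)
        text = page.content.decode("UTF-8")
        # Record the link once if any keyword appears
        # (the per-keyword loop could record the same link twice).
        if any(key in text for key in KEYWORDS):
            res.append(u)
            with open('test.txt', 'a+') as f1:
                f1.write(u + '\n')


# Listing pages 100 to 619, 50 rows per page.
for i in range(100, 620):
    url = "http://yjj.sh.gov.cn/XingZhengChuFa/xxgk2.aspx?pu=&qymc=&slrqstart=&slrqend=&pageindex=" + str(i) + "&pagesize=50"
    print(i)
    try:
        go(url)
        print(res)
    except Exception as exc:
        # Skip pages that fail, but say why instead of failing silently.
        print("page", i, "failed:", exc)
        continue

print(res)
driver.quit()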
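
Two pieces of the script are easy to check in isolation: building the paginated listing URL and testing a page for keywords. The sketch below assumes the same query-string parameters as the script. `build_page_url` and `contains_any` are hypothetical helper names introduced here, not part of the original tool.

```python
from urllib.parse import urlencode

LISTING_URL = "http://yjj.sh.gov.cn/XingZhengChuFa/xxgk2.aspx"


def build_page_url(page_index, page_size=50):
    """Build the listing URL for one page, with the same parameters and
    order as the hand-concatenated string in the script."""
    params = {
        "pu": "",
        "qymc": "",
        "slrqstart": "",
        "slrqend": "",
        "pageindex": page_index,
        "pagesize": page_size,
    }
    return LISTING_URL + "?" + urlencode(params)


def contains_any(text, keywords):
    """Return True if at least one keyword occurs in the text."""
    return any(keyword in text for keyword in keywords)


if __name__ == "__main__":
    print(build_page_url(100))
    print(contains_any("水果沙拉", ["沙拉", "色拉"]))
```

Using `urlencode` avoids escaping mistakes if a filter such as `qymc` is ever filled in, and `contains_any` yields one result per page however many keywords match.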