Search Engine API Research Report

Article link: https://blog.csdn.net/star1210644725/article/details/140364419

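This article is for anyone building AI search. I hope these results save you from repeating the same dead ends.

Here are a few more observations. We have also been wondering why the existing AI search products are so fast and so stable at online lookups. The better AI search products in China include kimi, 秘塔 (Metaso), and 360 AI Search. 秘塔 is controlled by 猎豹 (Cheetah), which is itself a search engine company, and 360 also runs its own search engine. Their online search does not work by crawling web pages. I analyzed kimi's online lookup a while ago, and it was most likely backed by a Bing API, though that may no longer be the case.

Search engine results usually contain only snippet information, that is, not the full body text of the page. Scraping a search engine once takes roughly 1 s (assuming anti-crawling measures are bypassed). Fetching the body text from the returned page URLs then takes at least 5 s, and longer still for dynamically loaded pages or pages behind security checks. On average, fetching page details takes around 10 s, which is consistent with jina reader.

This article covers the search engine API providers and the page-detail providers I have come across.

Requirement: an online lookup interface that is fast (fetches and returns the body text within 5 s), stable (gets past verification and anti-crawling measures, and can parse dynamically loaded pages), supports high concurrency, and returns the content of the detail pages.

Quick summary: only jina reader meets the requirements above (although its parsing is not strong enough, and it cannot get past some permission checks).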

Reference article:

15 Best Search Engine Results Page (SERP) APIs of 2023 | 代理 • Proxy

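1. Test Conclusions

1.1 The search process has two steps

Step one fetches search snippets from the search engine for the query. Step two fetches the page content from the returned URLs.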
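The two steps above can be sketched roughly as follows. This is a minimal illustration, not any provider's actual API: `search_with_content`, `search_snippets`, `fetch_page`, and the `url` / `content` keys are hypothetical names, and the 5 s budget simply mirrors the requirement stated earlier. Pages are fetched in parallel, and any page that fails or misses the budget falls back to snippet only.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout
from typing import Callable, Dict, List


def search_with_content(
    query: str,
    search_snippets: Callable[[str], List[Dict[str, str]]],  # hypothetical step-1 client
    fetch_page: Callable[[str], str],                        # hypothetical step-2 client
    budget_s: float = 5.0,
    max_workers: int = 10,
) -> List[Dict[str, str]]:
    """Step 1: get snippets for the query. Step 2: fetch each URL's body in parallel."""
    results = search_snippets(query)  # snippets only, no body text yet
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {pool.submit(fetch_page, r["url"]): r for r in results}
    try:
        for fut in as_completed(futures, timeout=budget_s):
            item = futures[fut]
            try:
                item["content"] = fut.result()
            except Exception:
                item["content"] = ""  # this page failed; keep its snippet only
    except FuturesTimeout:
        pass  # time budget exhausted; remaining pages stay without content
    finally:
        # Do not block on slow pages (cancel_futures needs Python 3.9+).
        pool.shutdown(wait=False, cancel_futures=True)
    for r in results:
        r.setdefault("content", "")
    return results
```

Passing the two clients in as plain callables keeps the sketch independent of whichever SERP or page-detail provider ends up being used.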

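Apart from crawlbase, the results returned by the other search engine APIs contain only search snippets.

With crawlbase, the page-detail endpoint has to be called separately. The price is $149 per month for 1,000,000 requests in total. It supports 30 concurrent detail-page fetches per second, which works out to 3 AI search requests per second (each request fetching 10 result pages). It can parse pages that require dynamic loading, with a latency of 4 s to 10 s.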
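The conversion above is plain arithmetic, spelled out here as a small sketch. It assumes that every detail-page fetch counts as one request against the monthly quota and that the snippet search itself is not counted; neither assumption was verified against crawlbase's billing rules.

```python
# Figures quoted in the text above.
MONTHLY_PRICE_USD = 149.0
MONTHLY_REQUESTS = 1_000_000
DETAIL_FETCHES_PER_SECOND = 30
PAGES_PER_AI_SEARCH = 10

# Assumption: one detail-page fetch == one billed request.
cost_per_fetch = MONTHLY_PRICE_USD / MONTHLY_REQUESTS
cost_per_ai_search = cost_per_fetch * PAGES_PER_AI_SEARCH
ai_searches_per_second = DETAIL_FETCHES_PER_SECOND // PAGES_PER_AI_SEARCH
ai_searches_per_month = MONTHLY_REQUESTS // PAGES_PER_AI_SEARCH

print(f"cost per detail fetch: ${cost_per_fetch:.6f}")
print(f"cost per AI search:    ${cost_per_ai_search:.5f}")
print(f"AI searches / second:  {ai_searches_per_second}")
print(f"AI searches / month:   {ai_searches_per_month}")
```

Under these assumptions the plan covers about 100,000 AI searches per month at 3 per second.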

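1.2 Comparison with jina reader

jina reader provides an endpoint that runs an online search and returns the data of the top 5 pages. The supported rate is 40 requests per minute (an upgrade is available, but it requires submitting company information, and no upgrade price is given). The average response time is 10 s. Pricing is based on output tokens: $20 per 1 billion tokens.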
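To stay under a limit such as 40 requests per minute, a client can throttle itself before calling the API. The sketch below is a generic sliding-window limiter, not part of jina reader. The class name `SlidingWindowLimiter` and its methods are hypothetical, and how the service actually counts requests is an assumption.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most max_calls calls within any period_s-second window."""

    def __init__(self, max_calls: int = 40, period_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.period_s = period_s
        self.clock = clock                     # injectable so it can be tested
        self._calls: Deque[float] = deque()    # timestamps of recent calls

    def _drop_expired(self, now: float) -> None:
        while self._calls and now - self._calls[0] >= self.period_s:
            self._calls.popleft()

    def wait_time(self) -> float:
        """Seconds until the next call is allowed (0.0 if allowed now)."""
        now = self.clock()
        self._drop_expired(now)
        if len(self._calls) < self.max_calls:
            return 0.0
        return self.period_s - (now - self._calls[0])

    def try_acquire(self) -> bool:
        """Record a call and return True if under the limit, else return False."""
        if self.wait_time() > 0.0:
            return False
        self._calls.append(self.clock())
        return True
```

A caller would check `try_acquire()` before each request and sleep for `wait_time()` seconds when it returns `False`.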

Reader API

2. Search Engines

Used to fetch search snippets.

2.1 brightdata

Sign-in

GitHub or Google account; email verification is required.

$5 trial credit

SERP API - SERP scraper API - Free Trial

Trial playground URL

Bright Data - Web Data Platform

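Problems:

The provided sample code does not return any data.

The data obtained on the playground page contains only snippets, with no body text.

Both the generated shell script and the Python code fail when run, and the request is terminated:

urllib.error.URLError: <urlopen error [WinError 10054] An existing connection was forcibly closed by the remote host>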

echo -e "\n\nThis is the VERBOSE version sample cURL code for SERP API.\nIn order to instantly use SERP API, you need to either install an SSL certificate\nor to ignore SSL errors in your code.\n\nThis cURL includes the '-k' option to ignore SSL errors.\n\nPress Enter to continue..." && read input && echo -e "\nThanks. I am going to run the following cURL command now:\n" && echo "curl --proxy brd.superproxy.io:22225 --proxy-user brd-customer-hl_8f2ca9c0-zone-serp_api1:tpyf2k4i5eyr -k \"https://www.google.com/search?q=pizza\"" && echo -e "\nCopy this cURL if you want to run it in non-verbose mode.\n\nHere's the result of the cURL:\n" && curl --proxy brd.superproxy.io:22225 --proxy-user brd-customer-hl_8f2ca9c0-zone-serp_api1:tpyf2k4i5eyr -k "https://www.google.com/search?q=pizza" && echo -e "\n\nFor additional information visit:\nhttps://docs.brightdata.com/general/account/ssl-certificate\n"

#!/usr/bin/env python
print('If you get error "ImportError: No module named \'six\'" install six:\n'+\
    '$ sudo pip install six');
print('To enable your free eval account and get CUSTOMER, YOURZONE and ' + \
    'YOURPASS, please contact sales@brightdata.com')
import sys
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
if sys.version_info[0]==2:
    import six
    from six.moves.urllib import request
    opener = request.build_opener(
        request.ProxyHandler(
            {'http': 'http://brd-customer-hl_8f2ca9c0-zone-serp_api1:tpyf2k4i5eyr@brd.superproxy.io:22225',
            'https': 'http://brd-customer-hl_8f2ca9c0-zone-serp_api1:tpyf2k4i5eyr@brd.superproxy.io:22225'}))
    print(opener.open('http://www.google.com/search?q=pizza').read())
if sys.version_info[0]==3:
    import urllib.request
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler(
            {'http': 'http://brd-customer-hl_8f2ca9c0-zone-serp_api1:tpyf2k4i5eyr@brd.superproxy.io:22225',
            'https': 'http://brd-customer-hl_8f2ca9c0-zone-serp_api1:tpyf2k4i5eyr@brd.superproxy.io:22225'}))
    print(opener.open('http://www.google.com/search?q=pizza').read())
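
For comparison, here is a Python 3 only variant of the sample above, as a rough sketch. It is not Bright Data's official code and is not claimed to resolve the WinError 10054 failure. It only keeps the credentials out of the source, limits the disabled certificate check to a single opener, adds a timeout, and reports connection errors instead of crashing. The helper names `build_proxy_url`, `make_opener`, and `fetch` are hypothetical; the proxy host and port are copied from the sample.

```python
import ssl
import urllib.request
from typing import Optional
from urllib.parse import quote

PROXY_HOST = "brd.superproxy.io"  # taken from the vendor sample
PROXY_PORT = 22225                # taken from the vendor sample


def build_proxy_url(user: str, password: str,
                    host: str = PROXY_HOST, port: int = PROXY_PORT) -> str:
    """Build the proxy URL, percent-encoding the credentials."""
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


def make_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Create an opener that routes HTTP and HTTPS traffic through the proxy."""
    # Like the vendor sample, certificate verification is disabled, but only
    # for this opener rather than for the whole process.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url}),
        urllib.request.HTTPSHandler(context=ctx),
    )


def fetch(url: str, opener: urllib.request.OpenerDirector,
          timeout_s: float = 10.0) -> Optional[bytes]:
    """Return the response body, or None if the connection fails."""
    try:
        with opener.open(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError as exc:  # URLError and ConnectionResetError are both OSError
        print(f"request failed: {exc!r}")
        return None
```

Typical use would read the user name and password from environment variables (for example `os.environ["BRD_USER"]` and `os.environ["BRD_PASS"]`, both hypothetical names), then call `fetch("https://www.google.com/search?q=pizza", make_opener(build_proxy_url(user, password)))`.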