Crawling and Searching Entire Domains with Diffbot

In this tutorial, I’ll show you how to build a custom SitePoint search engine that far outdoes anything WordPress could ever put out. We’ll be using Diffbot as a service to extract structured data from SitePoint automatically, and this matching API client to do both the searching and crawling.



I’ll also be using my trusty Homestead Improved environment for a clean project, so I can experiment in a VM that’s dedicated to this project and this project alone.


What’s What?

To make a SitePoint search engine, we need to do the following:


  1. Build a Crawljob which will index and process the entire SitePoint.com domain and keep itself up to date with newly published content.

  2. Build a GUI for submitting search queries to the saved set produced by this crawljob. Searching is done via the Search API. We’ll do this in a followup post.

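As a preview of step 2, the same client also wraps the Search API. A rough sketch, with a placeholder token, and method names (`search()`, `call()`) taken from my reading of the client's README rather than verified signatures:

```php
<?php
// Sketch only: query the saved set a finished crawljob has produced.
// "my_token" is a placeholder; verify method names against the
// swader/diffbot-php-client documentation before running.

require 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

$diffbot = new Diffbot('my_token');

// Full-text search across the processed, saved pages.
$results = $diffbot->search('diffbot')->call();

foreach ($results as $entity) {
    echo $entity->getTitle(), "\n"; // assumed entity accessor
}
```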

A Diffbot Crawljob does the following:


  1. It spiders a URL pattern for links. This does not mean processing – it means looking for links to process on all the pages it can find, starting from the domain you originally passed in as the seed. For the difference between crawling and processing, see here.


  2. It processes the pages found on the spidered URLs with the designated API engine – for example, using Product API, it processes all products it found on Amazon.com and saves them into a structured database of items on offer.


Creating a Crawljob

Jobs can be created through Diffbot’s GUI, but I find creating them via the crawl API is a more customizable experience. In an empty folder, let’s first install the client library.


composer require swader/diffbot-php-client

I now need a job.php file into which I’ll just dump the job creation procedure, as per the README:

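A minimal sketch of such a job.php, based on the client's README. The token (`my_token`), the job name (`sp_search`), and the seed URL are placeholders, and the method names (`crawl()`, `setSeeds()`, `call()`) are my reading of the README rather than verified signatures – check them against the client's current documentation:

```php
<?php
// job.php – minimal crawljob creation sketch.
// "my_token" and "sp_search" are placeholder values.

require 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

// Instantiate the client with a Diffbot API token.
$diffbot = new Diffbot('my_token');

// Create a crawljob named "sp_search" and seed it with the domain
// to spider; pages found under it will be processed and saved.
$job = $diffbot->crawl('sp_search');
$job->setSeeds(['https://www.sitepoint.com']);

// Submit the job to Diffbot, which queues it and begins crawling.
$job->call();
```

Running `php job.php` would then submit the job, after which its progress can be monitored in Diffbot's GUI.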
