Run Your Own Scraping API with PhearJS


So-called "client-side dynamic rendering" gives users a slick experience, but it makes pages harder for machines to read. If you want to do data mining, scrape websites or send static versions of your single-page application to Altavista, you essentially need a browser in the loop. This matters all the more given the number of sites that use React, Angular, jQuery or some other fancy JavaScript framework or library.

PhearJS is open-source software that exposes the power of the PhantomJS headless browser through an HTTP API. You make HTTP requests to your PhearJS API to fetch a web page and get back JSON containing the rendered HTML and relevant metadata.

In this tutorial we'll look at how to set up PhearJS and use it.

Setting up

PhearJS runs on at least the popular, recent Linux distros and Mac OS X. First we need some dependencies:

  • Memcached: run brew install memcached. Replace brew with something like apt-get depending on your OS.

  • NodeJS: you probably have it already, but if not, get it.

  • PhantomJS 2+: installation for version 2+ currently differs quite a bit between operating systems, so it's best to follow their installation instructions.

Woo! Dependencies down, now get PhearJS:


git clone https://github.com/Tomtomgo/phearjs.git
cd phearjs
npm install


Boom, that's it! You can verify that PhearJS works by running it; you should see some info in the terminal:

node phear.js


If you open your browser and go to http://localhost:8100/status it should show you some stats on the server.


Making requests

Okay, so by now we have PhearJS running. Rendering a web page is simple. I'll use cURL here, but you can also use your browser with a JSON viewer plugin:

# The value of fetch_url is URL-encoded, like you'd do with encodeURIComponent().
# The whole URL is passed as one quoted argument so curl sees a single address.
curl "http://localhost:8100/?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F"


In about five seconds you will see a JSON response with the rendered HTML and metadata, such as request headers. Try it again and the response comes back instantly.

But wait, why does it take five seconds the first time? Well, these five seconds are a delay that we use on purpose. It gives PhearJS some time to complete AJAX requests and render the page. Subsequent requests are served from cache and are therefore quick.

Now if you are on a slow connection or know that you will be scraping heavy pages you could increase this delay:


# parse_delay is in milliseconds; force=true forces a cache refresh
curl "http://localhost:8100/?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F&parse_delay=10000&force=true"


This is the simplest usage of PhearJS. Many more configuration and run-time options are documented on GitHub.
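
If you would rather not hand-encode these query strings, the request URL can be assembled in Node. This is only a sketch: buildPhearUrl and its option names (parseDelay, force) are hypothetical helpers made up for illustration, and only the fetch_url, parse_delay and force query parameters come from the examples in this section.

```javascript
// Sketch: build a PhearJS request URL from a page URL and optional settings.
// buildPhearUrl and its option names are hypothetical, not part of PhearJS.
function buildPhearUrl(endpoint, pageUrl, options = {}) {
  const target = new URL(endpoint);

  // searchParams.set() percent-encodes the value, so no manual
  // encodeURIComponent() call is needed here.
  target.searchParams.set('fetch_url', pageUrl);

  if (options.parseDelay !== undefined) {
    // delay in milliseconds before PhearJS returns the rendered page
    target.searchParams.set('parse_delay', String(options.parseDelay));
  }
  if (options.force) {
    // force a cache refresh
    target.searchParams.set('force', 'true');
  }
  return target.toString();
}

console.log(buildPhearUrl('http://localhost:8100/', 'https://davidwalsh.name/', {
  parseDelay: 10000,
  force: true,
}));
```

The printed URL can be pasted straight into curl or a browser, as long as PhearJS is listening on that endpoint.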

Scraping

Let's look at a common use case for PhearJS: scraping. Say we want to get images from a blog page that are not visible unless JavaScript is enabled, e.g. https://davidwalsh.name/.

Dependencies

We will use Cheerio for parsing the HTML and Request for making requests:

npm install cheerio request


Writing scrape.js

Once that's done we can go through some simple steps to retrieve all images on this page:


// 1. load dependencies
var cheerio = require('cheerio'),
    request = require('request'),
    url = require('url');

var page_url = 'https://davidwalsh.name';
var results = [];

// 2. encode the URL and add to PhearJS endpoint
var target = 'http://localhost:8100?fetch_url=' + encodeURIComponent(page_url);

// 3. use request to GET the page
request.get(target, function(error, response, body) {

    // stop early if PhearJS could not be reached
    if (error) {
        return console.error(error);
    }

    // 4. load the DOM from the response JSON
    var $ = cheerio.load(JSON.parse(body).content);

    // 5. use cheerio's jQuery-style selectors to get all images
    $("img").each(function(i, image) {
        var src = $(image).attr('src');

        // 6. resolve absolute URL and add to our results array
        //    (images without a src attribute are skipped)
        if (src) {
            results.push(url.resolve(page_url, src));
        }
    });

    // 7. and boom! there's our images
    console.log(results);
});
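
Step 6 relies on url.resolve(), which the Node.js documentation marks as a legacy API. As a hedged alternative, the same resolution can be sketched with the WHATWG URL class. resolveImageUrls is a hypothetical helper written for this illustration; it is not part of PhearJS or cheerio, and the sample paths below are made up.

```javascript
// Sketch: resolve image src values against the page URL with the WHATWG URL class.
// resolveImageUrls is a hypothetical helper, not part of PhearJS or cheerio.
function resolveImageUrls(pageUrl, srcs) {
  const resolved = [];
  for (const src of srcs) {
    if (!src) continue; // <img> element without a src attribute
    try {
      // new URL(relative, base) returns the absolute URL
      resolved.push(new URL(src, pageUrl).href);
    } catch (err) {
      // skip values that cannot be parsed as a URL
    }
  }
  return resolved;
}

// Made-up sample values: a relative path, an absolute URL and a missing src.
console.log(resolveImageUrls('https://davidwalsh.name', [
  '/demo/logo.png',
  'https://example.com/a.jpg',
  undefined,
]));
```

In scrape.js the src values would come from the cheerio loop, for instance by collecting $(image).attr('src') into an array first.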


Run it!

Running this script will give you a list of all the images on the page:


# run PhearJS
node phear.js

# in another shell run the script
node scrape.js
[ <url>, ..., <url> ]


Next

This is a very trivial example of scraping with PhearJS. It's up to you to apply it to different scenarios, such as crawling or automated batch scraping. I'd be interested to hear what you've used PhearJS for!
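
As one possible starting point for batch scraping, the sketch below renders a list of pages through a local PhearJS instance, one at a time. renderAll and fetchJson are hypothetical names chosen for this illustration. The endpoint, the fetch_url parameter and the content field are taken from the examples above, and the default fetcher assumes Node 18 or newer, where fetch() is available globally.

```javascript
// Sketch: render several pages through a local PhearJS instance, sequentially.
// renderAll and fetchJson are hypothetical helpers, not part of PhearJS.
async function renderAll(pageUrls, fetchJson, endpoint = 'http://localhost:8100/') {
  const rendered = {};
  for (const pageUrl of pageUrls) {
    const target = endpoint + '?fetch_url=' + encodeURIComponent(pageUrl);
    try {
      const data = await fetchJson(target);
      rendered[pageUrl] = data.content; // rendered HTML from the response JSON
    } catch (err) {
      rendered[pageUrl] = null; // record the failure and continue with the next page
    }
  }
  return rendered;
}

// Default fetcher, assuming Node 18+ with a global fetch().
async function fetchJson(target) {
  const response = await fetch(target);
  if (!response.ok) {
    throw new Error('PhearJS responded with status ' + response.status);
  }
  return response.json();
}
```

With PhearJS running, renderAll(['https://davidwalsh.name'], fetchJson).then(console.log) would print the rendered HTML per page. The loop is deliberately sequential because an uncached render waits for the parse delay; whether parallel requests suit your setup depends on how your PhearJS instance is configured.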

Conclusion

PhearJS is open-source software that allows you to run your own scraping or prerendering "microservice". It renders web pages and returns them as JSON over HTTP.


Here we focussed on how to set up PhearJS for a very simple scraping task. SEO is another important use case, for which the phearjs-express middleware might be relevant.

Translated from: https://davidwalsh.name/run-scraping-api-phearjs
