Image Scraping with Symfony's DomCrawler

A photographer friend of mine implored me to find and download images of picture frames from the internet. I eventually landed on a web page that had a number of them available for free but there was a problem: a link to download all the images together wasn’t present.

I didn’t want to go through the stress of downloading the images individually, so I wrote this PHP class to find, download and zip all images found on the website.

How the Class Works

It searches a URL for images, downloads and saves the images into a folder, creates a ZIP archive of the folder and finally deletes the folder.

The class uses Symfony’s DomCrawler component to search for all image links found on the webpage and a custom zip function that creates the zip file. Credit to David Walsh for the zip function.
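
To make the image search concrete, here is a minimal sketch of the same idea using PHP's built-in DOM extension as a stand-in for DomCrawler. It is not the code the class uses; extractImageSources is a hypothetical helper name, and the sample HTML is made up for illustration.

```php
<?php
// Sketch only: uses PHP's built-in DOM extension as a stand-in for DomCrawler.
// extractImageSources() is a hypothetical helper, not part of the ZipImages class.
function extractImageSources(string $html): array
{
    $doc = new DOMDocument();
    // Suppress the warnings that imperfect real-world HTML tends to trigger.
    @$doc->loadHTML($html);

    $xpath = new DOMXPath($doc);
    $sources = [];
    // '//img' selects every <img> element anywhere in the document.
    foreach ($xpath->query('//img') as $img) {
        $src = $img->getAttribute('src');
        // Skip <img> tags that have no src attribute.
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

// Made-up sample markup: two images with a src and one without.
$html = '<html><body><img src="a.png"><p><img src="/img/b.jpg"></p><img></body></html>';
print_r(extractImageSources($html));
```

The XPath expression //img is the same one the class passes to DomCrawler later on, so the result is comparable: a flat array of src values.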

Coding the Class

The class consists of five private properties and eight public methods including the __construct magic method.

Below is the list of the class properties and their roles.

1. $folder: stores the name of the folder that contains the scraped images.
2. $url: stores the webpage URL.
3. $html: stores the HTML document code of the webpage to be scraped.
4. $fileName: stores the name of the ZIP file.
5. $status: saves the status of the operation, i.e. whether it succeeded or failed.

Let’s get started building the class.

Create the class ZipImages containing the above five properties.

<?php
class ZipImages {
    private $folder;
    private $url;
    private $html;
    private $fileName;
    private $status;

Create a __construct magic method that accepts a URL as an argument. It stores the URL, fetches the page's HTML with file_get_contents and sets up the default image folder.

public function __construct($url) {
    $this->url = $url;
    $this->html = file_get_contents($this->url);
    $this->setFolder();
    // apply the default ZIP file name in case setFileName is never called
    $this->setFileName();
}

The created ZIP archive has a folder that contains the scraped images. The setFolder method below configures this.

By default, the folder name is set to image, but the method lets you change it by passing a different folder name as its argument.

public function setFolder($folder="image") {
    // if folder doesn't exist, attempt to create one and store the folder name in property $folder
    if(!file_exists($folder)) {
        mkdir($folder);
    }
    $this->folder = $folder;
}

setFileName provides an option to change the name of the ZIP file with a default name set to zipImages:

public function setFileName($name = "zipImages") {
    $this->fileName = $name;
}

At this point, we instantiate the Symfony crawler component to search for images, then download and save all the images into the folder.

public function domCrawler() {
    // instantiate the Symfony DomCrawler component
    $crawler = new Crawler($this->html);
    // create an array of all scraped image links
    $result = $crawler
        ->filterXPath('//img')
        ->extract(array('src'));

    // download and save each image to the folder
    foreach ($result as $image) {
        $path = $this->folder."/".basename($image);
        $file = file_get_contents($image);
        if ($file === false) {
            throw new \Exception('Failed to download image');
        }
        // compare against false so an empty file is not reported as a write failure
        if (file_put_contents($path, $file) === false) {
            throw new \Exception('Failed to write image');
        }
    }
}
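
One caveat: the src values pulled from the page may be relative (for example /img/a.png), and file_get_contents cannot download those without the site's address in front. The sketch below shows one way to turn them into absolute URLs before downloading. resolveImageUrl is a hypothetical helper that is not part of the original class, and it only covers the common cases (it does not normalise "../" segments or handle data: URIs).

```php
<?php
// Sketch only: resolveImageUrl() is a hypothetical helper, not part of the original class.
// It covers the common cases and ignores edge cases such as "../" segments.
function resolveImageUrl(string $pageUrl, string $src): string
{
    // Already absolute (http:// or https://): use as-is.
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }

    $parts = parse_url($pageUrl);
    $scheme = $parts['scheme'] ?? 'http';
    $host = $parts['host'] ?? '';
    $origin = $scheme . '://' . $host . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative, e.g. //cdn.example.com/a.png
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }
    // Root-relative, e.g. /images/a.png
    if (strpos($src, '/') === 0) {
        return $origin . $src;
    }
    // Path-relative: resolve against the directory of the page URL.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $src;
}

// Example with a made-up page address.
echo resolveImageUrl('https://example.com/gallery/index.html', 'frames/oak.png'), PHP_EOL;
```

Inside the foreach loop, the download call could then read file_get_contents(resolveImageUrl($this->url, $image)), assuming the helper is available to the class.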

After the download is complete, we compress the image folder to a ZIP Archive using our custom create_zip function.

public function createZip() {
    $folderFiles = scandir($this->folder);
    if (!$folderFiles) {
        throw new \Exception('Failed to scan folder');
    }
    $fileArray = array();
    foreach($folderFiles as $file){
        if (($file != ".") && ($file != "..")) {
            $fileArray[] = $this->folder."/".$file;
        }
    }

    if (create_zip($fileArray, $this->fileName.'.zip')) {
        $this->status = <<<HTML
File successfully archived. <a href="$this->fileName.zip">Download it now</a>
HTML;
    } else {
        $this->status = "An error occurred";
    }
}
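
The create_zip helper itself is not listed in this article. As a rough idea of what such a function does, here is a minimal sketch built on PHP's ZipArchive class (which needs the zip extension). zipFiles is a hypothetical name and this is not David Walsh's implementation; it only assumes the same shape of call: a list of file paths plus a destination, returning true on success.

```php
<?php
// Sketch only: zipFiles() is a hypothetical stand-in for the create_zip helper,
// written against PHP's ZipArchive class (requires the zip extension).
function zipFiles(array $files, string $destination): bool
{
    // Keep only the paths that point at existing files.
    $valid = array_filter($files, 'is_file');
    if (count($valid) === 0) {
        return false;
    }

    $zip = new ZipArchive();
    // CREATE makes a new archive, OVERWRITE replaces one that already exists.
    if ($zip->open($destination, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        return false;
    }
    foreach ($valid as $file) {
        // Store each file under its relative path, so the folder appears inside the archive.
        $zip->addFile($file, $file);
    }
    // close() writes the archive to disk and reports whether that worked.
    return $zip->close();
}
```

Because each entry keeps its folder/file path, the resulting archive contains the image folder rather than a flat list of files, which matches the behaviour described earlier.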

Lastly, we delete the created folder after the ZIP file has been created.

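public function deleteCreatedFolder() {
    $dp = opendir($this->folder) or die ('ERROR: Cannot open directory');
    // compare strictly against false so a file named "0" does not end the loop early
    while (false !== ($file = readdir($dp))) {
        if ($file != '.' && $file != '..') {
            if (is_file("$this->folder/$file")) {
                unlink("$this->folder/$file");
            }
        }
    }
    // release the directory handle before removing the folder
    closedir($dp);
    rmdir($this->folder) or die ('could not delete folder');
}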
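
If you would rather not stop the whole script with die, the same cleanup can be written to throw exceptions, in line with how the download and zip steps report errors. The following is a sketch under that assumption; removeFlatFolder is a hypothetical standalone function, not a method of the original class.

```php
<?php
// Sketch only: removeFlatFolder() is a hypothetical alternative to deleteCreatedFolder().
// It throws exceptions instead of calling die(), so the caller decides how to react.
function removeFlatFolder(string $folder): void
{
    $entries = scandir($folder);
    if ($entries === false) {
        throw new RuntimeException("Cannot read directory: $folder");
    }
    foreach ($entries as $entry) {
        $path = $folder . '/' . $entry;
        // "." and ".." are directories, so the is_file() check skips them.
        if (is_file($path) && !unlink($path)) {
            throw new RuntimeException("Cannot delete file: $path");
        }
    }
    // rmdir() only succeeds on an empty directory, so a folder with subfolders is left in place.
    if (!rmdir($folder)) {
        throw new RuntimeException("Cannot delete folder: $folder");
    }
}
```

Like the original method, this only handles a flat folder of files, which is all the scraper ever creates.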

The getStatus method prints the status of the operation, i.e. whether it succeeded or an error occurred.

public function getStatus() {
    echo $this->status;
}

Finally, the process method runs all of the steps above in order.

public function process() {
    $this->domCrawler();
    $this->createZip();
    $this->deleteCreatedFolder();
    $this->getStatus();
}

You can download the full class from GitHub.

Class Dependency

For the class to work, the DomCrawler component and the create_zip function need to be included. You can download the code for this function here.

Download and install the DomCrawler component via Composer simply by adding the following require statement to your composer.json file:

"symfony/dom-crawler": "2.3.*@dev"

Run $ php composer.phar install to download the library and generate the vendor/autoload.php autoloader file.

Using the Class

  • Make sure all required files are included, via autoload or explicitly.
  • Call the setFolder and setFileName methods and pass in their respective arguments. Only call the setFolder method when you need to change the folder name.
  • Call the process method to put the class to work.

<?php
    require_once 'zipfunction.php';
    require_once 'vendor/autoload.php';
    use Symfony\Component\DomCrawler\Crawler;

    // instantiate the ZipImages class
    $object = new ZipImages('https://www.sitepoint.com');
    // set the folder name
    $object->setFolder('pictureFrames');
    // set the zip file name
    $object->setFileName('myframes');
    // initialize the class process
    $object->process();

Summary

In this article, we learned how to create a simple PHP image scraper that automatically compresses the downloaded images into a ZIP archive. If you have alternative solutions or suggestions for improvement, please leave them in the comments below. All feedback is welcome!

Translated from: https://www.sitepoint.com/image-scraping-symfonys-domcrawler/
