How to Read Big Files with PHP (Without Killing Your Server)

It’s not often that we, as PHP developers, need to worry about memory management. The PHP engine does a stellar job of cleaning up after us, and the web server model of short-lived execution contexts means even the sloppiest code has no long-lasting effects.

There are rare times when we may need to step outside of this comfortable boundary — like when we’re trying to run Composer for a large project on the smallest VPS we can create, or when we need to read large files on an equally small server.

It’s the latter problem we’ll look at in this tutorial.

The code for this tutorial can be found on GitHub.

Measuring Success

The only way to be sure we’re making any improvement to our code is to measure a bad situation and then compare that measurement to another after we’ve applied our fix. In other words, unless we know how much a “solution” helps us (if at all), we can’t know if it really is a solution or not.

There are two metrics we can care about. The first is CPU usage. How fast or slow is the process we want to work on? The second is memory usage. How much memory does the script take to execute? These are often inversely proportional — meaning that we can offload memory usage at the cost of CPU usage, and vice versa.

In an asynchronous execution model (like with multi-process or multi-threaded PHP applications), both CPU and memory usage are important considerations. In traditional PHP architecture, these generally become a problem when either one reaches the limits of the server.

It’s impractical to measure CPU usage inside PHP. If that’s the area you want to focus on, consider using something like top, on Ubuntu or macOS. For Windows, consider using the Linux Subsystem, so you can use top in Ubuntu.

For the purposes of this tutorial, we’re going to measure memory usage. We’ll look at how much memory is used in “traditional” scripts. We’ll implement a couple of optimization strategies and measure those too. In the end, I want you to be able to make an educated choice.

The methods we’ll use to see how much memory is used are:

memory_get_peak_usage();

// formatBytes is taken from the php.net documentation
function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= (1 << (10 * $pow));

    return round($bytes, $precision) . " " . $units[$pow];
}

We’ll use these functions at the end of our scripts, so we can see which script uses the most memory at one time.

What Are Our Options?

There are many approaches we could take to read files efficiently. But there are also two likely scenarios in which we could use them. We could want to read and process data all at the same time, outputting the processed data or performing other actions based on what we read. We could also want to transform a stream of data without ever really needing access to the data.

Let’s imagine, for the first scenario, that we want to be able to read a file and create separate queued processing jobs every 10,000 lines. We’d need to keep at least 10,000 lines in memory, and pass them along to the queued job manager (whatever form that may take).

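Here's a rough sketch of that first scenario, assuming a hypothetical dispatchJob() helper that hands each batch of lines to the queue manager:

$handle = fopen("big-file.txt", "r");
$batch = [];

while (!feof($handle)) {
    $batch[] = trim(fgets($handle));

    if (count($batch) === 10000) {
        dispatchJob($batch); // hypothetical queue manager call
        $batch = [];
    }
}

if (count($batch)) {
    dispatchJob($batch); // don't forget the final, partial batch
}

fclose($handle);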

For the second scenario, let’s imagine we want to compress the contents of a particularly large API response. We don’t care what it says, but we need to make sure it’s backed up in a compressed form.

In both scenarios, we need to read large files. In the first, we need to know what the data is. In the second, we don’t care what the data is. Let’s explore these options…

Reading Files, Line By Line

There are many functions for working with files. Let’s combine a few into a naive file reader:

// from memory.php

function formatBytes($bytes, $precision = 2) {
    $units = array("b", "kb", "mb", "gb", "tb");

    $bytes = max($bytes, 0);
    $pow = floor(($bytes ? log($bytes) : 0) / log(1024));
    $pow = min($pow, count($units) - 1);

    $bytes /= (1 << (10 * $pow));

    return round($bytes, $precision) . " " . $units[$pow];
}

print formatBytes(memory_get_peak_usage());

// from reading-files-line-by-line-1.php

function readTheFile($path) {
    $lines = [];
    $handle = fopen($path, "r");

    while(!feof($handle)) {
        $lines[] = trim(fgets($handle));
    }

    fclose($handle);
    return $lines;
}

readTheFile("shakespeare.txt");

require "memory.php";

We’re reading a text file containing the complete works of Shakespeare. The text file is about 5.5MB, and the peak memory usage is 12.8MB. Now, let’s use a generator to read each line:

// from reading-files-line-by-line-2.php

function readTheFile($path) {
    $handle = fopen($path, "r");

    while(!feof($handle)) {
        yield trim(fgets($handle));
    }

    fclose($handle);
}

readTheFile("shakespeare.txt");

require "memory.php";

The text file is the same size, but the peak memory usage is 393KB. This doesn’t mean anything until we do something with the data we’re reading. Perhaps we can split the document into chunks whenever we see two blank lines. Something like this:

// from reading-files-line-by-line-3.php

$iterator = readTheFile("shakespeare.txt");

$buffer = "";

foreach ($iterator as $iteration) {
    preg_match("/\n{3}/", $buffer, $matches);

    if (count($matches)) {
        print ".";
        $buffer = "";
    } else {
        $buffer .= $iteration . PHP_EOL;
    }
}

require "memory.php";

Any guesses how much memory we’re using now? Would it surprise you to know that, even though we split the text document up into 1,216 chunks, we still only use 459KB of memory? Given the nature of generators, the most memory we’ll use is that which we need to store the largest text chunk in an iteration. In this case, the largest chunk is 101,985 characters.

I’ve already written about the performance boosts of using generators and Nikita Popov’s Iterator library, so go check that out if you’d like to see more!

Generators have other uses, but this one is demonstrably good for performant reading of large files. If we need to work on the data, generators are probably the best way.

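As a minimal sketch of that kind of per-line work, here's how we might count the lines that mention Romeo, without ever holding more than one line in memory:

$count = 0;

foreach (readTheFile("shakespeare.txt") as $line) {
    if (stripos($line, "romeo") !== false) {
        $count++;
    }
}

print $count;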

Piping Between Files

In situations where we don’t need to operate on the data, we can pass file data from one file to another. This is commonly called piping (presumably because we don’t see what’s inside a pipe except at each end … as long as it’s opaque, of course!). We can achieve this by using stream methods. Let’s first write a script to transfer from one file to another, so that we can measure the memory usage:

// from piping-files-1.php

file_put_contents(
    "piping-files-1.txt", file_get_contents("shakespeare.txt")
);

require "memory.php";

Unsurprisingly, this script uses slightly more memory to run than the text file it copies. That’s because it has to read (and keep) the file contents in memory until it has written to the new file. For small files, that may be okay. When we start to use bigger files, not so much…

Let’s try streaming (or piping) from one file to another:

// from piping-files-2.php

$handle1 = fopen("shakespeare.txt", "r");
$handle2 = fopen("piping-files-2.txt", "w");

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

This code is slightly strange. We open handles to both files, the first in read mode and the second in write mode. Then we copy from the first into the second. We finish by closing both files again. It may surprise you to know that the memory used is 393KB.

That seems familiar. Isn’t that what the generator code used to store when reading each line? That’s because the second argument to fgets specifies how many bytes of each line to read (it defaults to reading until it reaches a new line).

The third argument to stream_copy_to_stream is a similar sort of length parameter (defaulting to -1, meaning copy until the end of the stream). stream_copy_to_stream reads from one stream, a chunk at a time, and writes to the other. It skips the part where the generator yields a value, since we don’t need to work with that value.

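To make that concrete, here's a small sketch (the file names are just examples) showing the explicit length argument in both functions:

$handle1 = fopen("shakespeare.txt", "r");
$handle2 = fopen("chunked-copy.txt", "w");

// copy no more than 1024 bytes from one stream to the other...
stream_copy_to_stream($handle1, $handle2, 1024);

// ...or read a single line, but at most 4096 bytes of it
$line = fgets($handle1, 4096);

fclose($handle1);
fclose($handle2);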

Piping this text isn’t useful to us, so let’s think of other examples which might be. Suppose we wanted to output an image from our CDN, as a sort of redirected application route. We could illustrate it with code resembling the following:

// from piping-files-3.php

file_put_contents(
    "piping-files-3.jpeg", file_get_contents(
        "https://github.com/assertchris/uploads/raw/master/rick.jpg"
    )
);

// ...or write this straight to stdout, if we don't need the memory info

require "memory.php";

Imagine an application route brought us to this code. But instead of serving up a file from the local file system, we want to get it from a CDN. We could swap file_get_contents out for something more elegant (like Guzzle), but under the hood it’s much the same.

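For instance, here's a sketch of the same download using Guzzle's sink request option (assuming guzzlehttp/guzzle is installed via Composer), which streams the response body straight to a file instead of buffering it in memory:

require "vendor/autoload.php";

$client = new GuzzleHttp\Client();

// the response body is written to the file as it arrives
$client->request("GET", "https://github.com/assertchris/uploads/raw/master/rick.jpg", [
    "sink" => "guzzle-piped.jpg",
]);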

The memory usage (for this image) is around 581KB. Now, how about we try to stream this instead?

// from piping-files-4.php

$handle1 = fopen(
    "https://github.com/assertchris/uploads/raw/master/rick.jpg", "r"
);

$handle2 = fopen(
    "piping-files-4.jpeg", "w"
);

// ...or write this straight to stdout, if we don't need the memory info

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

The memory usage is slightly less (at 400KB), but the result is the same. If we didn’t need the memory information, we could just as well print to standard output. In fact, PHP provides a simple way to do this:

$handle1 = fopen(
    "https://github.com/assertchris/uploads/raw/master/rick.jpg", "r"
);

$handle2 = fopen(
    "php://stdout", "w"
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

// require "memory.php";

Other Streams

There are a few other streams we could pipe and/or write to and/or read from:

  • php://stdin (read-only)

  • php://stderr (write-only, like php://stdout)

  • php://input (read-only) which gives us access to the raw request body

  • php://output (write-only) which lets us write to an output buffer

  • php://memory and php://temp (read-write) are places we can store data temporarily. The difference is that php://temp will store the data in the file system once it becomes large enough, while php://memory will keep storing in memory until that runs out (there’s a sketch of this after the list).

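As a sketch of that last pair, php://temp also accepts a maxmemory parameter (in bytes) which tunes the point at which it spills over to the file system:

// keep up to 1MB in memory, then transparently switch to a temp file
$handle = fopen("php://temp/maxmemory:1048576", "r+");

fwrite($handle, "some data we only need for a moment");
rewind($handle);

print stream_get_contents($handle);

fclose($handle);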

Filters

There’s another trick we can use with streams called filters. They’re a kind of in-between step, providing a tiny bit of control over the stream data without exposing it to us. Imagine we wanted to compress our shakespeare.txt. We might use the Zip extension:

// from filters-1.php

$zip = new ZipArchive();
$filename = "filters-1.zip";

$zip->open($filename, ZipArchive::CREATE);
$zip->addFromString("shakespeare.txt", file_get_contents("shakespeare.txt"));
$zip->close();

require "memory.php";

This is a neat bit of code, but it clocks in at around 10.75MB. We can do better, with filters:

// from filters-2.php

$handle1 = fopen(
    "php://filter/zlib.deflate/resource=shakespeare.txt", "r"
);

$handle2 = fopen(
    "filters-2.deflated", "w"
);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);
fclose($handle2);

require "memory.php";

Here, we can see the php://filter/zlib.deflate filter, which reads and compresses the contents of a resource. We can then pipe this compressed data into another file. This only uses 896KB.

I know this isn’t the same format, and that making a zip archive has its upsides. You have to wonder though: if you could choose a different format and save 12 times the memory, wouldn’t you?

To uncompress the data, we can run the deflated file back through another zlib filter:

// from filters-2.php

file_get_contents(
    "php://filter/zlib.inflate/resource=filters-2.deflated"
);

Streams have been extensively covered in Understanding Streams in PHP and “Using PHP Streams Effectively”. If you’d like a different perspective, check those out!

Customizing Streams

fopen and file_get_contents have their own set of default options, but these are completely customizable. To define them, we need to create a new stream context:

// from creating-contexts-1.php

$data = join("&", [
    "twitter=assertchris",
]);

$headers = join("\r\n", [
    "Content-type: application/x-www-form-urlencoded",
    "Content-length: " . strlen($data),
]);

$options = [
    "http" => [
        "method" => "POST",
        "header"=> $headers,
        "content" => $data,
    ],
];

$context = stream_context_create($options);

$handle = fopen("https://example.com/register", "r", false, $context);
$response = stream_get_contents($handle);

fclose($handle);

In this example, we’re trying to make a POST request to an API. The API endpoint is secure, but we still need to use the http context property (as is used for http and https). We set a few headers and open a file handle to the API. We can open the handle as read-only since the context takes care of the writing.

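If we also want the response headers, we can pull them out of the stream's metadata, as a sketch (this would need to run before the fclose($handle) above):

$meta = stream_get_meta_data($handle);

// for http(s) streams, wrapper_data holds the raw response headers
print_r($meta["wrapper_data"]);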

There are loads of things we can customize, so it’s best to check out the documentation if you want to know more.

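As a taste, here's a sketch of a few other commonly tweaked context options:

$context = stream_context_create([
    "http" => [
        "timeout" => 5,          // give up after 5 seconds
        "follow_location" => 1,  // follow redirects
    ],
    "ssl" => [
        "verify_peer" => true,   // verify the certificate chain
    ],
]);

$handle = fopen("https://example.com", "r", false, $context);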

Making Custom Protocols and Filters

Before we wrap things up, let’s talk about making custom protocols. If you look at the documentation, you can find an example class to implement:

Protocol {
    public resource $context;
    public __construct ( void )
    public __destruct ( void )
    public bool dir_closedir ( void )
    public bool dir_opendir ( string $path , int $options )
    public string dir_readdir ( void )
    public bool dir_rewinddir ( void )
    public bool mkdir ( string $path , int $mode , int $options )
    public bool rename ( string $path_from , string $path_to )
    public bool rmdir ( string $path , int $options )
    public resource stream_cast ( int $cast_as )
    public void stream_close ( void )
    public bool stream_eof ( void )
    public bool stream_flush ( void )
    public bool stream_lock ( int $operation )
    public bool stream_metadata ( string $path , int $option , mixed $value )
    public bool stream_open ( string $path , string $mode , int $options ,
        string &$opened_path )
    public string stream_read ( int $count )
    public bool stream_seek ( int $offset , int $whence = SEEK_SET )
    public bool stream_set_option ( int $option , int $arg1 , int $arg2 )
    public array stream_stat ( void )
    public int stream_tell ( void )
    public bool stream_truncate ( int $new_size )
    public int stream_write ( string $data )
    public bool unlink ( string $path )
    public array url_stat ( string $path , int $flags )
}

We’re not going to implement one of these, since I think it is deserving of its own tutorial. There’s a lot of work that needs to be done. But once that work is done, we can register our stream wrapper quite easily:

if (in_array("highlight-names", stream_get_wrappers())) {
    stream_wrapper_unregister("highlight-names");
}

stream_wrapper_register("highlight-names", "HighlightNamesProtocol");

$highlighted = file_get_contents("highlight-names://story.txt");

Similarly, it’s also possible to create custom stream filters. The documentation has an example filter class:

Filter {
    public $filtername;
    public $params;
    public int filter ( resource $in , resource $out , int &$consumed ,
        bool $closing )
    public void onClose ( void )
    public bool onCreate ( void )
}

This can be registered just as easily:

$handle = fopen("story.txt", "w+");
stream_filter_append($handle, "highlight-names", STREAM_FILTER_READ);

highlight-names needs to match the filtername property of the new filter class. It’s also possible to use custom filters in a php://filter/highlight-names/resource=story.txt string. It’s much easier to define filters than it is to define protocols. One reason for this is that protocols need to handle directory operations, whereas filters only need to handle each chunk of data.

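To give a feel for the shape of one, here's a minimal sketch of a filter built on the php_user_filter base class (the highlighting itself is just an illustrative str_replace):

class HighlightNamesFilter extends php_user_filter
{
    public function filter($in, $out, &$consumed, $closing): int
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            // a real filter would also handle names split across buckets
            $bucket->data = str_replace("Romeo", "**Romeo**", $bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }

        return PSFS_PASS_ON;
    }
}

stream_filter_register("highlight-names", "HighlightNamesFilter");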

If you have the gumption, I strongly encourage you to experiment with creating custom protocols and filters. If you can apply filters to stream_copy_to_stream operations, your applications are going to use next to no memory even when working with obscenely large files. Imagine writing a resize-image filter or an encrypt-for-application filter.

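As a sketch of that idea (the file names are made up), attaching the built-in zlib.deflate filter to the writing end means stream_copy_to_stream compresses on the fly, one chunk at a time:

$in = fopen("huge-input.txt", "r");
$out = fopen("huge-input.txt.deflated", "w");

// every chunk written to $out passes through the filter first
stream_filter_append($out, "zlib.deflate", STREAM_FILTER_WRITE);

stream_copy_to_stream($in, $out);

fclose($in);
fclose($out);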

Summary

Though this isn’t a problem we frequently suffer from, it’s easy to mess up when working with large files. In asynchronous applications, it’s just as easy to bring the whole server down when we’re not careful about memory usage.

This tutorial has hopefully introduced you to a few new ideas (or refreshed your memory about them), so that you can think more about how to read and write large files efficiently. When we start to become familiar with streams and generators, and stop using functions like file_get_contents, an entire category of errors disappears from our applications. That seems like a good thing to aim for!

Translated from: https://www.sitepoint.com/performant-reading-big-files-php/
