php 读取超大文件,如何用PHP读取大文件（不宕掉你的服务器）

最新推荐文章于 2023-11-27 10:46:15 发布

weixin_40008566

最新推荐文章于 2023-11-27 10:46:15 发布

阅读量594

点赞数 1

文章标签： php 读取超大文件

作为PHP开发人员，我们并不经常担心内存管理。 PHP引擎在我们之后做了很好的清理工作，并且短期执行上下文的Web服务器模型意味着即使是最潦草的代码也没有持久的影响。

衡量成功

确保我们对代码进行任何改进的唯一方法是测量不良情况，然后在修改程序后将该测量结果与另一个进行比较。换句话说，除非我们知道“解决方案”对我们有多大帮助(如果有的话)，否则我们不知道它是否真的是一个解决方案。

我们可以关注两个度量标准。首先是CPU使用率。我们想要处理的过程有多快或多慢？第二个是内存使用情况。脚本需要执行多少内存？这些通常是成反比的 - 这意味着我们可以以CPU使用为代价来卸载内存使用量，反之亦然。

在异步执行模型中(如使用多进程或多线程PHP应用程序)，CPU和内存使用量是重要的考虑因素。在传统的PHP体系结构中，当任何一个达到服务器的限制时，这些通常都会成为问题。

衡量PHP内部的CPU使用情况是不切实际的。不过可以大概的使用top来查看系统CPU的情况。对于Windows，请考虑使用Linux子系统，以便您可以在容器中使用top。

我们常常用来查看使用多少内存的方法是：

memory_get_peak_usage();

function formatBytes($bytes, $precision = 2) {

$units = array("b", "kb", "mb", "gb", "tb");

$bytes = max($bytes, 0);

$pow = floor(($bytes ? log($bytes) : 0) / log(1024));

$pow = min($pow, count($units) - 1);

$bytes /= (1 << (10 * $pow));

return round($bytes, $precision) . " " . $units[$pow];

}

我们将在脚本结尾处使用这些函数，以便我们可以一次查看哪个脚本使用的内存最多。

我们的选择是什么？

有很多方法可以有效地读取文件。但是也有两种可能的情况，我们可以使用它们。我们可能想要同时读取和处理所有数据，输出处理过的数据或根据我们阅读的内容执行其他操作。我们也可能想要转换一个数据流，而不需要真正需要访问数据。

让我们设想一下，对于第一种情况，我们希望能够读取文件并每10,000行创建一个单独的排队处理作业。我们需要在内存中至少保留10,000行，并将它们传递给排队的工作管理器(无论采取何种形式)

对于第二种情况，我们假设我们想要压缩特别大的API响应的内容。我们不在乎它如果编写，但我们需要确保它以压缩形式备份。

在这两种情况下，我们都需要阅读大文件。首先，我们需要知道数据是什么。第二，我们不关心数据是什么。

逐行读取文件

有许多用于处理文件的功能。让我们将几个结合到一个自然的文件阅读器中：

function formatBytes($bytes, $precision = 2) {

$units = array("b", "kb", "mb", "gb", "tb");

$bytes = max($bytes, 0);

$pow = floor(($bytes ? log($bytes) : 0) / log(1024));

$pow = min($pow, count($units) - 1);

$bytes /= (1 << (10 * $pow));

return round($bytes, $precision) . " " . $units[$pow];

}

print formatBytes(memory_get_peak_usage());

function readTheFile($path) {

$lines = [];

$handle = fopen($path, "r");

while(!feof($handle)) {

$lines[] = trim(fgets($handle));

}

fclose($handle);

return $lines;

}

readTheFile("shakespeare.txt");

require "memory.php";

文本文件约5.5MB，峰值内存使用量为12.8MB。现在，让我们使用一个yield来读取每一行：

function readTheFile($path) {

$handle = fopen($path, "r");

while(!feof($handle)) {

yield trim(fgets($handle));

}

fclose($handle);

}

readTheFile("shakespeare.txt");

require "memory.php";

文本文件的大小相同，但峰值内存使用量为393KB。但是这并不意味着什么，除非我们对我们正在阅读的数据做些什么。当我们看到两个空行时，也许我们可以将文档分成块。像这样的东西：

$iterator = readTheFile("shakespeare.txt");

$buffer = "";

foreach ($iterator as $iteration) {

preg_match("/\n{3}/", $buffer, $matches);

if (count($matches)) {

print ".";

$buffer = "";

} else {

$buffer .= $iteration . PHP_EOL;

}

require "memory.php";

即使我们将文本文档拆分为1,216个块，我们仍然只使用459KB的内存。鉴于yield的性质，我们将使用的最多内存是我们需要在迭代中存储最大文本块的内存。在这种情况下，最大的块是101,985个字符。

yield有其他用途，但是这对于大文件的高性能读取来说是非常好的。如果我们需要处理数据，yield可能是最好的方法。

文件之间的管道

在我们不需要操作数据的情况下，我们可以将文件数据从一个文件传递到另一个文件。这通常被称为管道。我们可以通过使用流方法来实现这一点。我们首先编写一个脚本以从一个文件传输到另一个文件，以便我们可以测量内存使用情况：

file_put_contents(

"piping-files-1.txt", file_get_contents("shakespeare.txt")

);

require "memory.php";

让我们尝试从一个文件到另一个文件的流式传输(或管道)：

$handle1 = fopen("shakespeare.txt", "r");

$handle2 = fopen("piping-files-2.txt", "w");

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);

fclose($handle2);

require "memory.php";

这段代码有点奇怪。我们打开两个文件的句柄，第一个处于读取模式，第二个处于写入模式。然后我们从第一个复制到第二个。我们通过再次关闭这两个文件完成。你可能会惊讶地发现你使用的内存是393KB。

这似乎很熟悉。这不是yield在读取每行时用于存储的内容吗？这是因为fgets的第二个参数指定要读取每行的字节数(默认为-1或直到到达新行)。

stream_copy_to_stream的第三个参数是完全相同的参数(具有完全相同的默认值)。 stream_copy_to_stream正在从一个流读取，一次一行，并将其写入其他流。它跳过了yield产生值的部分，因为我们不需要使用该值。

不过管道对我们没有用，所以我们来看看其他可能的例子。假设我们想从我们的CDN输出图像，作为一种重定向的程序路由。我们可以用类似下面的代码来说明它：

file_put_contents(

"piping-files-3.jpeg", file_get_contents(

"https://github.com/assertchris/uploads/raw/master/rick.jpg"

)

);

// ...or write this straight to stdout, if we don't need the memory info

require "memory.php";

想象一下，程序路由将我们带入了这个代码。但不是从本地文件系统提供文件，我们希望从CDN获取文件。我们可以用file_get_contents替代更优雅的东西(比如Guzzle)，但是在引擎盖下它也是一样的。内存使用量(对于这个图像)大约是581KB。现在，我们如何尝试流式传输呢？

$handle1 = fopen(

"https://github.com/assertchris/uploads/raw/master/rick.jpg", "r"

);

$handle2 = fopen(

"piping-files-4.jpeg", "w"

);

// ...or write this straight to stdout, if we don't need the memory info

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);

fclose($handle2);

require "memory.php";

内存使用量略少(在400KB)，但结果是一样的。如果我们不需要内存信息，我们也可以打印到标准输出。事实上，PHP提供了一个简单的方法来实现这一点：

$handle1 = fopen(

"https://github.com/assertchris/uploads/raw/master/rick.jpg", "r"

);

$handle2 = fopen(

"php://stdout", "w"

);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);

fclose($handle2);

// require "memory.php";

其他流方式

还有一些其他流可以管道或写入或读取：

php://stdin 只读取

php://stderr 只写入

php://input 只读取，我们可以访问原始请求主体

php://output 只写入，我们可以访问原始请求主体

php://memory和 php://temp 可读写是我们可以暂时存储数据的地方。不同之处在于，一旦php:// temp足够大，php就会将数据存储在文件系统中，而php://memory将一直保存在内存中直到耗尽。

Filters 过滤器

$zip = new ZipArchive();

$filename = "filters-1.zip";

$zip->open($filename, ZipArchive::CREATE);

$zip->addFromString("shakespeare.txt", file_get_contents("shakespeare.txt"));

$zip->close();

require "memory.php";

这是一个完整的打开文件的代码，但它的内存大约在10.75MB。我们可以做得更好，使用过滤器：

$handle1 = fopen(

"php://filter/zlib.deflate/resource=shakespeare.txt", "r"

);

$handle2 = fopen(

"filters-2.deflated", "w"

);

stream_copy_to_stream($handle1, $handle2);

fclose($handle1);

fclose($handle2);

require "memory.php";

在这里，我们可以看到php://filter/zlib.deflate过滤器，它读取和压缩资源的内容。然后，我们可以将这个压缩数据导入另一个文件。这只用了896KB。

为了解压缩数据，我们可以通过另一个zlib过滤器压缩文件：

file_get_contents(

"php://filter/zlib.inflate/resource=filters-2.deflated"

);

自定义流

fopen和file_get_contents有它们自己的一组默认选项，但这些都是完全可定制的。为了定义它们，我们需要创建一个新的流上下文：

$data = join("&", [

"twitter=assertchris",

]);

$headers = join("\r\n", [

"Content-type: application/x-www-form-urlencoded",

"Content-length: " . strlen($data),

]);

$options = [

"http" => [

"method" => "POST",

"header"=> $headers,

"content" => $data,

];

$context = stream_content_create($options);

$handle = fopen("https://example.com/register", "r", false, $context);

$response = stream_get_contents($handle);

fclose($handle);

在这个例子中，我们试图向API发出一个POST请求。 API端点是安全的，但我们仍然需要使用http上下文属性(用于http和https)。我们设置一些头文件并打开API的文件句柄。由于上下文负责写作，因此我们可以将句柄打开为只读。

制作自定义协议和过滤器

Protocol {

public resource $context;

public __construct ( void )

public __destruct ( void )

public bool dir_closedir ( void )

public bool dir_opendir ( string $path , int $options )

public string dir_readdir ( void )

public bool dir_rewinddir ( void )

public bool mkdir ( string $path , int $mode , int $options )

public bool rename ( string $path_from , string $path_to )

public bool rmdir ( string $path , int $options )

public resource stream_cast ( int $cast_as )

public void stream_close ( void )

public bool stream_eof ( void )

public bool stream_flush ( void )

public bool stream_lock ( int $operation )

public bool stream_metadata ( string $path , int $option , mixed $value )

public bool stream_open ( string $path , string $mode , int $options ,

string &$opened_path )

public string stream_read ( int $count )

public bool stream_seek ( int $offset , int $whence = SEEK_SET )

public bool stream_set_option ( int $option , int $arg1 , int $arg2 )

public array stream_stat ( void )

public int stream_tell ( void )

public bool stream_truncate ( int $new_size )

public int stream_write ( string $data )

public bool unlink ( string $path )

public array url_stat ( string $path , int $flags )

}

if (in_array("highlight-names", stream_get_wrappers())) {

stream_wrapper_unregister("highlight-names");

}

stream_wrapper_register("highlight-names", "HighlightNamesProtocol");

$highlighted = file_get_contents("highlight-names://story.txt");

同样，也可以创建自定义流过滤器。

Filter {

public $filtername;

public $params

public int filter ( resource $in , resource $out , int &$consumed ,

bool $closing )

public void onClose ( void )

public bool onCreate ( void )

}

这可以很容易地注册：

$handle = fopen("story.txt", "w+");

stream_filter_append($handle, "highlight-names", STREAM_FILTER_READ);

突出显示名称需要匹配新筛选器类的filtername属性。也可以在php：//filter/highligh-names/resource=story.txt字符串中使用自定义过滤器。定义过滤器比定义协议容易得多。其中一个原因是协议需要处理目录操作，而过滤器只需处理每个数据块。

如果您有这种想法，我强烈建议尝试创建自定义协议和过滤器。如果可以将过滤器应用于stream_copy_to_stream操作，那么即使在使用大容量文件时，您的应用程序也会在内存旁边使用。想象一下，编写调整图像过滤器或加密应用程序过滤器。

总结

虽然这不是我们经常遇到的问题，但在处理大文件时很容易搞砸。在异步应用程序中，当我们不注意内存使用情况时，将整个服务器关闭很容易。

weixin_40008566

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
php 读取超大文件,如何用PHP读取大文件（不宕掉你的服务器）

作为PHP开发人员，我们并不经常担心内存管理。 PHP引擎在我们之后做了很好的清理工作，并且短期执行上下文的Web服务器模型意味着即使是最潦草的代码也没有持久的影响。衡量成功确保我们对代码进行任何改进的唯一方法是测量不良情况，然后在修改程序后将该测量结果与另一个进行比较。换句话说，除非我们知道“解决方案”对我们有多大帮助(如果有的话)，否则我们不知道它是否真的是一个解决方案。我们可以关注两个度量标...
复制链接

扫一扫

php 读取超大文件,如何用PHP读取大文件（不宕掉你的服务器）

“相关推荐”对你有帮助么？