The content upload nightmare

Latest recommended article published on 2022-06-11 13:25:55

weixin_26723981

Original article: https://medium.com/payfit/the-content-upload-nightmare-d30a13dbf086

“What is this file?”

This is a simple question I didn’t care much about. But that was before I had to properly handle file uploading.

This question looks easy, and anyone can answer it quickly: test.pdf is a pdf because it ends with pdf, and image.jpg is a jpg. That’s when things start to get interesting, because there are a lot of questions you can ask to make someone doubt:

Are you sure it’s a pdf? Someone could have changed the extension.
Your machine says it’s a pdf, but can you trust your machine? How does it know that it is a pdf?
What if someone changed the bytes inside the file to make your machine think it’s a valid pdf?
What if the file you’re trying to open is malware in disguise?

It’s almost impossible for a system that handles file uploading to be sure, but let’s try to find a way to be as close to the truth as we can.

In this article I’ll share with you how you can detect and verify files that are being uploaded to your server (so server-side). Alright, let’s jump into it!

The content-type header

First things first: when a file is uploaded to your server over HTTP through a browser, the request normally carries a content-type header for it. This is a field provided by your browser to help identify what kind of file is being uploaded. Say I’m uploading my test.pdf: Firefox will add the content-type to the request and say it’s an application/pdf.

Figure: Browser content-type header

How does the browser detect the file’s content-type?

The simple answer is “like you and me, the browser checks the extension and then decides what kind of file it is”. It sees a .pdf, so it is a pdf.
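
To see how little an extension-based guess relies on, here is a minimal Python sketch using the standard library’s mimetypes module. It is not what any browser runs; it only illustrates the same extension-to-type idea. The helper name guess_from_name is made up for this example.

```python
import mimetypes


def guess_from_name(filename: str) -> str:
    """Guess a MIME type from the file name alone (no bytes are read)."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # Fall back to the generic binary type when the extension is unknown.
    return guessed or "application/octet-stream"


print(guess_from_name("test.pdf"))
print(guess_from_name("image.jpg"))
print(guess_from_name("mystery.zzzunknown"))
```

Nothing here opens the file, so renaming a file is enough to change the answer.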

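The truth is a bit more complex than that and, to make it a bit more fun, results are not always the same between browsers. This is often loosely called mime-sniffing, although strictly speaking that term covers inspecting a file’s bytes rather than its extension. Let’s look at a few browsers: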

Firefox

  // OK. We want to try the following sources of mimetype information, in this
  // order:
  // 1. defaultMimeEntries array
  // 2. OS-provided information
  // 3. our "extras" array
  // 4. Information from plugins
  // 5. The "ext-to-type-mapping" category
  // Note that, we are intentionally not looking at the handler service, because
  // that can be affected by websites, which leads to undesired behavior.

This comment is from the function nsExternalHelperAppService::GetTypeFromExtension in mozilla-central/uriloader/exthandler/nsExternalHelperAppService.cpp.

So first Firefox looks through an array. Let’s take a look:

/**
 * Default extension->mimetype mappings. These are not overridable.
 * If you add types here, make sure they are lowercase, or you'll regret it.
 */
static const nsDefaultMimeTypeEntry defaultMimeEntries[] = {
    // The following are those extensions that we're asked about during startup,
    // sorted by order used
    {IMAGE_GIF, "gif"},
    {TEXT_XML, "xml"},
    {APPLICATION_RDF, "rdf"},
    {IMAGE_PNG, "png"},
    // -- end extensions used during startup
    {TEXT_CSS, "css"},
    {IMAGE_JPEG, "jpeg"},
    {IMAGE_JPEG, "jpg"},
    {IMAGE_SVG_XML, "svg"},
    {TEXT_HTML, "html"},
    {TEXT_HTML, "htm"},
    {APPLICATION_XPINSTALL, "xpi"},
    {"application/xhtml+xml", "xhtml"},
    {"application/xhtml+xml", "xht"},
    {TEXT_PLAIN, "txt"},
    {APPLICATION_JSON, "json"},
    {APPLICATION_XJAVASCRIPT, "js"},
    {APPLICATION_XJAVASCRIPT, "jsm"},
    {VIDEO_OGG, "ogv"},
    {VIDEO_OGG, "ogg"},
    {APPLICATION_OGG, "ogg"},
    {AUDIO_OGG, "oga"},
    {AUDIO_OGG, "opus"},
    {APPLICATION_PDF, "pdf"},
    {VIDEO_WEBM, "webm"},
    {AUDIO_WEBM, "webm"},
    {IMAGE_ICO, "ico"},
    {TEXT_PLAIN, "properties"},
    {TEXT_PLAIN, "locale"},
    {TEXT_PLAIN, "ftl"},
#if defined(MOZ_WMF)
    {VIDEO_MP4, "mp4"},
    {AUDIO_MP4, "m4a"},
    {AUDIO_MP3, "mp3"},
#endif
#ifdef MOZ_RAW
    {VIDEO_RAW, "yuv"}
#endif
};

This array is from the same file.

This array seems pretty short compared to the number of existing extensions, but should do the trick for most of the files uploaded. What about files that do not match this list?

Then Firefox checks the OS-provided information and…

nsresult nsExternalHelperAppService::GetMIMEInfoFromOS(
    const nsACString& aMIMEType, const nsACString& aFileExt, bool* aFound,
    nsIMIMEInfo** aMIMEInfo) {
  *aMIMEInfo = nullptr;
  *aFound = false;
  return NS_ERROR_NOT_IMPLEMENTED;
}

It’s not implemented ¯\(ツ)/¯.

I won’t go through the whole file, but the extras array has the same shape as the first array, with a bit more information and more extensions handled. Firefox then looks for plugins that could have the answer (which is very smart) and, at the end, queries another database with more information to find the answer (available here).

OK, so Firefox is pretty straightforward: it just looks up the extension in multiple ways to find the content-type. Classic, and easy to fool if you ask me.

Let’s take a look at another one.

Chrome

Chrome is very similar to Firefox in its handling:

// 2) File extension -> MIME type.  Implemented in GetMimeTypeFromExtension().
//
//    Sources are considered in the following order:
//
//    a) kPrimaryMappings.  Order matters here since file extensions can appear
//       multiple times on these lists.  The first mapping in order of
//       appearance in the list wins.
//
//    b) Underlying platform.
//
//    c) kSecondaryMappings.  Again, the order matters.

I won’t copy-paste the full list here, but you can check it out at this URL. Judging from the comments, Chrome does essentially the same thing as Firefox (except, perhaps, that the underlying platform check is actually implemented in Chrome).
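
The ordered lookup described in that comment can be sketched in a few lines of Python. This is only an illustration of the “first source that answers wins” idea: the two tables are hypothetical and heavily shortened, and Python’s mimetypes module stands in for the underlying platform.

```python
import mimetypes
from typing import Optional

# Hypothetical, heavily shortened tables; the real browser lists are much longer.
PRIMARY = {"html": "text/html", "jpg": "image/jpeg", "png": "image/png", "pdf": "application/pdf"}
SECONDARY = {"ai": "application/postscript", "zip": "application/zip"}


def platform_lookup(ext: str) -> Optional[str]:
    """Stand-in for the OS lookup: Python's mimetypes plays the platform here."""
    return mimetypes.guess_type(f"file.{ext}")[0]


def mime_from_extension(filename: str) -> Optional[str]:
    """Try each source in order and return the first answer found."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    for source in (PRIMARY.get, platform_lookup, SECONDARY.get):
        found = source(ext)
        if found:
            return found
    return None


print(mime_from_extension("report.PDF"))
print(mime_from_extension("mystery.zzzunknown"))
```

Because the primary table is consulted first, the platform cannot override the types listed there, which matches the ordering in the comment.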

Other browsers

Sadly, Internet Explorer is not open source, but it’s probably doing the same thing as its two competitors.

Since the hard-coded lists are limited, your browser will always send either a content-type found in its list, or the one provided by your OS.

Furthermore, relying on this header is not really viable since it can easily be tricked. Say I take my file virus.exe and rename it test.jpg. Now I upload it to your server and you receive image/jpeg, so you trust it and try to resize it. Your image library will probably crash because it’s really a .exe file, so your server fails, and you are sad because you absolutely need to resize every uploaded jpg file.
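
One way to avoid handing such a file to the image library is to compare the declared type with the first bytes of the upload, an idea this article comes back to later. The sketch below is an assumption-laden illustration: safe_to_resize is a made-up helper, and the sample payloads are fake byte strings, not real files.

```python
JPEG_SIGNATURE = b"\xff\xd8\xff"  # first bytes of a JPEG file
EXE_SIGNATURE = b"MZ"             # first bytes of a Windows executable


def safe_to_resize(declared_type: str, data: bytes) -> bool:
    """Only pass the upload to the image library if header and bytes agree."""
    return declared_type == "image/jpeg" and data.startswith(JPEG_SIGNATURE)


# Fake payloads for the example: a renamed executable and a JPEG-looking header.
renamed_virus = EXE_SIGNATURE + b"\x90\x00 rest of the executable"
real_jpeg = JPEG_SIGNATURE + b"\xe0\x00\x10JFIF rest of the image"

print(safe_to_resize("image/jpeg", renamed_virus))
print(safe_to_resize("image/jpeg", real_jpeg))
```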

Even without renaming, you can simply intercept the request using a proxy to modify the header, and now you can upload your virus.exe as is. Or simply call the server using curl or Postman and set the content-type manually.
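
To make this concrete, here is a small Python sketch that builds a multipart/form-data body by hand, with a content-type that lies about the file, and then parses it the way a server might. No request is sent; the boundary value and the helper names build_upload and read_upload are invented for the example, and the standard library’s email parser stands in for a web framework.

```python
from email import message_from_bytes

BOUNDARY = "demo-boundary"  # arbitrary value chosen for this example


def build_upload(filename: str, declared_type: str, payload: bytes) -> bytes:
    """Build a multipart/form-data body the way any HTTP client could."""
    head = (
        f"--{BOUNDARY}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {declared_type}\r\n"
        "\r\n"
    ).encode("ascii")
    tail = f"\r\n--{BOUNDARY}--\r\n".encode("ascii")
    return head + payload + tail


def read_upload(body: bytes):
    """Parse the body server-side: (filename, declared type, raw bytes)."""
    headers = (
        "MIME-Version: 1.0\r\n"
        f'Content-Type: multipart/form-data; boundary="{BOUNDARY}"\r\n'
        "\r\n"
    ).encode("ascii")
    part = message_from_bytes(headers + body).get_payload()[0]
    return part.get_filename(), part.get_content_type(), part.get_payload(decode=True)


# The client claims image/jpeg for a payload that starts like an executable.
body = build_upload("virus.exe", "image/jpeg", b"MZ this is not an image")
name, declared, data = read_upload(body)
print(name, declared, data[:2])
```

The server only ever sees what the client chose to write in the part header, which is why the declared type cannot be trusted on its own.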

OK, so we can’t trust the content-type. What do we trust then? Chromium (and hopefully Firefox in the near future?) uses the type provided by the OS (the underlying platform). Let’s check how it works.

File type defined by the OS

For example, on Linux, the file command runs a set of three tests, and the first one that produces an answer is printed:

filesystem test: based on the result of a stat(2) system call, it determines whether the file is a regular file or a special file.
magic number test: checks the first bytes of the file against a list of known byte patterns and their types to determine what type of file it is.
language test: if the file does not match the two prior tests, it is examined to determine whether it is a text file or not.
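
The three tests above can be imitated in Python as a rough sketch. This is not how the file command is implemented; the signature table is a tiny hand-picked sample, the identify function is a name made up for this example, and the text check is deliberately crude (no NUL bytes and valid UTF-8 in the first 4096 bytes).

```python
import codecs
import os
import stat
import tempfile

# A tiny hand-picked sample of signatures; real databases hold many more.
SIGNATURES = {
    b"%PDF": "application/pdf",
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def identify(path: str) -> str:
    # 1. Filesystem test: ask the OS what kind of entry this is.
    mode = os.stat(path).st_mode
    if stat.S_ISDIR(mode):
        return "directory"
    if not stat.S_ISREG(mode):
        return "special file"
    with open(path, "rb") as handle:
        head = handle.read(4096)
    if not head:
        return "empty"
    # 2. Magic number test: compare the first bytes with known signatures.
    for signature, mime in SIGNATURES.items():
        if head.startswith(signature):
            return mime
    # 3. Text test: crude check that the bytes look like UTF-8 text.
    if b"\x00" in head:
        return "application/octet-stream"
    try:
        # The incremental decoder tolerates a character cut off at the end.
        codecs.getincrementaldecoder("utf-8")().decode(head)
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"


with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "upload.bin")
    with open(sample, "wb") as handle:
        handle.write(b"%PDF-1.4\n")
    print(identify(folder), identify(sample))
```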

More info on the file command

Hey so we can use the magic number to find out what type of file is uploaded! If we do this check server-side, we will have no problem!

Note: “magic number” is shorthand for file signature; learn more about it here.

Well, yes and no, but first let’s take a look at what a magic number actually is.

A magic number is a sequence of bytes (sometimes at an offset) that is characteristic of a type of file. Let’s take a pdf file as an example:

» xxd test.pdf | head -n 2                                         
00000000: 2550 4446 2d31 2e34 0a25 f6e4 fcdf 0a31  %PDF-1.4.%.....1
00000010: 2030 206f 626a 0a3c 3c0a 2f54 7970 6520   0 obj.<<./Type

These are the first two lines of the hexadecimal dump of my test.pdf file. It starts with 25 50 44 46, which is the magic number used to determine whether a file is a pdf; these bytes are the ASCII characters %PDF. You could modify the magic number of a file, but if you don’t know what else you are doing, you will just destroy your file, since a lot of these values are linked together.
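
The same check is easy to reproduce in Python. The sketch below converts the hexadecimal values from the dump into bytes and matches them against a very small signature table, including one entry that lives at an offset (tar archives carry ustar at byte 257). The table and the sniff helper are illustrative only, not a complete detector.

```python
PDF_MAGIC = bytes.fromhex("25504446")  # the four bytes 25 50 44 46

# (offset, signature, type): a small sample, not a complete database.
SIGNATURES = [
    (0, PDF_MAGIC, "application/pdf"),
    (0, bytes.fromhex("ffd8ff"), "image/jpeg"),
    (257, b"ustar", "application/x-tar"),
]


def sniff(data: bytes):
    """Return the first type whose signature is found at its offset, else None."""
    for offset, signature, mime in SIGNATURES:
        if data[offset:offset + len(signature)] == signature:
            return mime
    return None


# First line of the xxd output above, as raw bytes.
header = bytes.fromhex("255044462d312e340a25f6e4fcdf0a31")
print(PDF_MAGIC, sniff(header))
```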

Sadly, a magic number is not always enough: a PDF file starts with this magic number, but an Adobe Illustrator file (.ai) also starts with %PDF. We need to check whether the file’s bytes also contain the string Adobe Illustrator to know whether the file is a pdf or an Adobe Illustrator file.

See for yourself in the famous file-type library in JavaScript:

if (checkString('%PDF')) {
	// Check if this is an Adobe Illustrator file
	const isAiFile = await checkSequence('Adobe Illustrator', 1350);
	if (isAiFile) {
		return {
			ext: 'ai',
			mime: 'application/postscript'
		};
	}

	// Assume this is just a normal PDF
	return {
		ext: 'pdf',
		mime: 'application/pdf'
	};
}
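
For a server written in Python, the same two-step decision could look roughly like the sketch below. It assumes, as the 1350 in the snippet above suggests, that the marker only needs to be searched for in the first 1350 bytes; classify_pdf_like is a name made up for this example and is not part of file-type.

```python
def classify_pdf_like(data: bytes):
    """Return (ext, mime) for data starting with %PDF, or None otherwise."""
    if not data.startswith(b"%PDF"):
        return None
    # Assumption: only the first 1350 bytes are searched for the marker.
    if b"Adobe Illustrator" in data[:1350]:
        return ("ai", "application/postscript")
    # Otherwise assume this is just a normal PDF.
    return ("pdf", "application/pdf")


print(classify_pdf_like(b"%PDF-1.4\n... Adobe Illustrator ..."))
print(classify_pdf_like(b"%PDF-1.4\n1 0 obj"))
```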

You can find one of the lists here.

Then again, I can easily break your server if I know you rely on this check: I only need to modify these specific bytes to match what you look for, and you end up with a wonderfully invalid PDF that is considered valid (and who knows what happens next). Do note that this is probably the hardest part to fake, since modifying the bytes of a file to match a file signature can be very painful.
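
A short Python sketch shows how little it takes to pass a signature-only check, and how an extra structural check raises the bar slightly. The second check looks for the %%EOF marker that PDF files are expected to carry near their end; the 1024-byte window is an assumption for this example, and an attacker who knows about the check can of course add the marker too.

```python
def passes_magic_check(data: bytes) -> bool:
    """Signature-only check: does the data start with %PDF?"""
    return data.startswith(b"%PDF")


def has_pdf_trailer(data: bytes) -> bool:
    """Extra sanity check: is the %%EOF marker present near the end?"""
    return b"%%EOF" in data[-1024:]


# An invalid "PDF": the right first bytes followed by arbitrary content.
forged = b"%PDF-1.4\n" + b"not a real document"

print(passes_magic_check(forged), has_pdf_trailer(forged))
```

Neither check proves the file is a usable PDF; only actually parsing it does.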

Before leaving you, let’s talk a bit about what could be your worst nightmare: the octet-stream. When an OS or a browser cannot work out what kind of file it is dealing with, it calls it an application/octet-stream. It’s true, your file is a bunch of bytes, but accepting files of type octet-stream on your server is the same as accepting anything and everything, and never knowing what’s in front of you.
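
In practice this usually means working from an allowlist and treating “unknown” as a refusal. The following Python sketch assumes the detected type comes from a server-side check such as a magic-number lookup; the allowlist contents and the accept_upload name are examples only.

```python
from typing import Optional

# Example allowlist: only types the application actually knows how to handle.
ALLOWED_TYPES = {"application/pdf", "image/jpeg", "image/png"}


def accept_upload(detected_type: Optional[str]) -> bool:
    """Refuse anything unknown instead of guessing."""
    # None and application/octet-stream both mean "we do not know what this is".
    return detected_type in ALLOWED_TYPES


print(accept_upload("application/pdf"))
print(accept_upload("application/octet-stream"))
print(accept_upload(None))
```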

We can never know precisely what file was sent to us unless we try to use it and fail. That is the sad part, but we can always try to get as close to reality as we can, for example by using magic numbers as part of our verification on upload. I hope this short article gave you a better view of content uploading and the nightmare of handling all cases.
