Operational errors are problems that even a correctly written program can run into at run time, such as:

failed to connect to server

failed to resolve hostname

invalid user input

request timeout

server returned a 500 response

socket hang-up

system is out of memory


Programmer errors are bugs in the program itself, and they can usually be fixed by changing the code. Examples:

tried to read property of “undefined”

called an asynchronous function without a callback

passed a “string” where an object was expected

passed an object where an IP address string was expected


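Both categories of error show up in every programming language, but languages differ in how they handle them. C reports errors through function return values, which is the classic and simplest approach. C++, Java, and Python also support return values, but they favor the throw/try/catch pattern. In Java, for example, you will often have to handle IOException, and an exception that is never caught can bring the program down.

In asynchronous programming, neither approach generally works. If an asynchronous function throws, its caller cannot catch the exception: when the caller is ready to catch, the function containing the throw may not have run yet, and by the time the throw happens, the caller has long since moved on. For this reason, asynchronous code generally does not use throw/try/catch to deliver errors. Instead, errors are delivered through callbacks and the EventEmitter pattern. If you are not familiar with callbacks or EventEmitter, see What are callbacks and What are Event Emitters.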
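
The two delivery mechanisms can be sketched as follows. This is a minimal illustration assuming Node.js and its built-in events module; the function names parseLater and parseAsEvents are hypothetical and exist only for this example.

```javascript
const { EventEmitter } = require('events');

// Hypothetical asynchronous function. It reports failure through the
// first argument of the callback instead of throwing, because a throw
// inside setImmediate would run after the caller's try block has exited.
function parseLater(text, callback) {
  setImmediate(() => {
    let value;
    try {
      value = JSON.parse(text);
    } catch (err) {
      callback(err); // deliver the error to the caller
      return;
    }
    callback(null, value); // no error: first argument is null
  });
}

// Hypothetical EventEmitter-based variant: the failure arrives as an
// 'error' event, the parsed value as a 'result' event.
function parseAsEvents(text) {
  const emitter = new EventEmitter();
  setImmediate(() => {
    let value;
    try {
      value = JSON.parse(text);
    } catch (err) {
      emitter.emit('error', err);
      return;
    }
    emitter.emit('result', value);
  });
  return emitter;
}
```

A caller checks the first callback argument, or attaches an 'error' listener, rather than wrapping the call in try/catch. Note that an EventEmitter with no 'error' listener throws when 'error' is emitted, so the listener should always be attached.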

Handling Operational Errors

For any given operational error, there are generally several ways to handle it:

Deal with the failure directly. Sometimes, it’s clear what you have to do to handle an error. If you get an ENOENT error trying to open a log file, maybe this is the first time the program has run on this system and you just need to create the log file first. A more interesting case might be where you’re maintaining a persistent connection to a server (e.g., a database), and you get a “socket hang-up” error. This usually means either the remote side or the network flaked out, and it’s frequently transient, so you’d usually deal with this by reconnecting. (This isn’t the same as retrying, below, since there’s not necessarily an operation going on when you get this error.)
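
The log-file case might look like the sketch below, assuming Node.js and its built-in fs module. The function name openLogFile is hypothetical.

```javascript
const fs = require('fs');

// Hypothetical helper: open an existing log file for reading and
// writing. If the file does not exist yet (ENOENT), create it and
// carry on. Any other error is handed back to the caller unchanged.
function openLogFile(path, callback) {
  fs.open(path, 'r+', (err, fd) => {
    if (err && err.code === 'ENOENT') {
      // Probably the first run on this system: create the file.
      fs.open(path, 'w+', callback);
      return;
    }
    callback(err, fd);
  });
}
```

In practice, opening with the 'a' flag creates a missing file in one step. The two-step version is used here only to show branching on err.code.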

Propagate the failure to your client. If you don’t know how to deal with the error, the simplest thing to do is to abort whatever operation you’re trying to do, clean up whatever you’ve started, and deliver an error back to your client. (How to deliver that error is another question, and it’s discussed below.) This is appropriate when you expect that whatever caused the error is not going to change soon. For example, if the user gave you invalid JSON, it’s not going to help to try parsing it again.

Retry the operation. For errors from the network and remote services (e.g., a web service), it’s sometimes useful to retry an operation that returns an error. For example, if a remote service gives a 503 (Service Unavailable) error, you may want to retry in a few seconds. If you’re going to retry, you should clearly document that you may retry multiple times, how many times you’ll try before failing, and how long you’ll wait between retries. Also, don’t assume that you should always retry an operation. If you’re several layers deep in the stack (e.g., you’re being called by a client, which was called by another client, which is being driven by a human), it’s usually better to fail fast and let the end client deal with retries. If every layer of the stack thinks it needs to retry on errors, the user can end up waiting much longer than they should because each layer didn’t realize that the underlying layer was also retrying.
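
A bounded retry with a documented attempt count and delay could be sketched like this. The function retry and its options (maxAttempts, delayMs, isTransient) are hypothetical names chosen for this example, not part of any library.

```javascript
// Hypothetical helper. Runs `operation` (a function that takes an
// error-first callback) at most `maxAttempts` times in total, waiting
// `delayMs` milliseconds between attempts. Only errors for which
// `isTransient(err)` returns true are retried; anything else is
// propagated to the caller immediately.
function retry(operation, options, callback) {
  const { maxAttempts, delayMs, isTransient } = options;
  let attempt = 0;

  function tryOnce() {
    attempt += 1;
    operation((err, result) => {
      if (!err) {
        callback(null, result);
        return;
      }
      if (attempt >= maxAttempts || !isTransient(err)) {
        callback(err); // give up and propagate the last error
        return;
      }
      setTimeout(tryOnce, delayMs);
    });
  }

  tryOnce();
}
```

The caller decides what counts as transient, for example a 503 status, which keeps the retry policy explicit at one layer of the stack instead of hidden in several.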

Blow up. For errors that truly can’t happen, or would effectively represent programmer errors if they ever did (e.g., failed to connect to a localhost socket that’s supposed to be listening in the same program), it’s fine to log an error message and crash. Other errors like running out of memory effectively can’t be handled in a dynamic language like JavaScript anyway, so it may be totally reasonable to crash. (That said, you can get ENOMEM from discrete operations like child_process.exec, and those you can reasonably handle, and you should consider doing so.) You can also blow up if there’s nothing you can reasonably do about something and an administrator needs to fix things. For example, if you run out of file descriptors or don’t have permission to access your configuration file, there’s nothing you can do about this, and a user will have to log in and fix things anyway.

Log the error — and do nothing else. Sometimes, there’s nothing you can do about something, there’s nothing to retry or abort, and there’s also no reason to crash the program. An example might be if you’re keeping track of a group of remote services using DNS and one of those services falls out of DNS. There’s nothing you can do about it except log a message and proceed with the remaining services. But you should at least log something in this case. (There are exceptions to every rule. If this is something that may happen thousands of times per second, and there’s nothing you can do about it, it’s probably not worth logging it every time it happens. But do log it periodically.)
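
Periodic logging of a high-frequency error might be sketched as below. The helper createThrottledLogger is a hypothetical name; its log and now parameters are assumptions added so the behavior can be exercised without a real clock.

```javascript
// Hypothetical helper. Returns a function that writes at most one log
// line per `intervalMs` and reports how many messages were suppressed
// since the previous line. It returns true when a line was written.
function createThrottledLogger(intervalMs, log = console.error, now = Date.now) {
  let lastLogged = -Infinity; // guarantees the first message is logged
  let suppressed = 0;

  return function logThrottled(message) {
    const t = now();
    if (t - lastLogged < intervalMs) {
      suppressed += 1; // too soon: count it, but stay quiet
      return false;
    }
    const suffix = suppressed > 0
      ? ` (${suppressed} similar messages suppressed)`
      : '';
    log(message + suffix);
    lastLogged = t;
    suppressed = 0;
    return true;
  };
}
```

With an interval of 1000 ms, an error that occurs thousands of times per second produces roughly one line per second, and each line still records how many occurrences were skipped.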

Handling Programmer Errors

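For programmer errors, the best approach is to let the process crash and then fix the bug. If the process keeps running, it is likely to hit further errors later whose cause can no longer be determined.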




