1.3. The Four Phases of Investigation

Good investigation practices should balance the need to solve problems quickly, the need to build your skills, and the effective use of subject matter experts. The need to solve a problem quickly is obvious, but building your skills is important as well.

Imagine walking into a library looking for information about a type of hardwood called “red oak.” To your surprise, you find a person who knows absolutely everything about wood. You have a choice to make. You can ask this person for the information you need, or you can read through several books and resources trying to find the information on your own. In the first case, you will get the answer you need right away...you just need to ask. In the second case, you will likely end up reading a lot of information about hardwood on your quest to find information about red oak. You’re going to learn more about hardwood, probably the various types, relative hardness, and what each is used for. You might even get curious and spend time reading up on the other types of hardwood. This peripheral information can be very helpful in the future, especially if you often work with hardwood.

The next time you need information about hardwood, you go to the library again. You can ask the mysterious and knowledgeable person for the answer or spend some time and dig through books on your own. After a few trips to the library doing the investigation on your own, you will have learned a lot about hardwood and might not need to visit the library any more to get the answers you need. You’ve become an expert in hardwood. Of course, you’ll use your new knowledge and power for something nobler than creating difficult decisions for those walking into a library.

Likewise, every time you encounter a problem, you have a choice to make. You can immediately try to find the answer by searching the Internet or by asking an expert, or you can investigate the problem on your own. If you investigate a problem on your own, you will increase your skills from the experience regardless of whether you successfully solve the problem.

Of course, you need to make sure the skills that you would learn by finding the answer on your own will help you again in the future. For example, a physician may have little use for vast knowledge of hardwood, although she or he may still find it interesting. For a physician who has one question about hardwood every 10 years, it may be better to just ask the expert or look for a shortcut to get the information she or he needs.

The first section of this chapter will outline a useful balance that will solve problems quickly and in many cases even faster than getting a subject matter expert involved (from here on referred to as an expert). How is this possible? Well, getting an expert usually takes time. Most experts are busy with numerous other projects and are rarely available on a minute’s notice. So why turn to them at the first sign of trouble? Not only can you investigate and resolve some problems faster on your own, you can become one of the experts of tomorrow.

There are four phases of problem investigation that, when combined, will both build your skills and solve problems quickly and effectively.

  1. Initial investigation using your own skills.
  2. Search for answers using the Internet or other resources.
  3. Begin deeper investigation.
  4. Ask a subject matter expert for help.

The first phase is an attempt to diagnose the problem on your own. This ensures that you build some skill for every problem you encounter. If the first attempt takes too long (that is, the problem is urgent and you need an immediate solution), move on to the next phase, which is searching for the answer using the Internet. If that doesn’t reveal a solution to the problem, don’t get an expert involved just yet. The third phase is to dive in deeper on your own. It will help to build some deep skill, and your homework will also be appreciated by an expert should you need to get one involved. Lastly, when the need arises, engage an expert to help solve the problem.

The urgency of a problem should help to guide how quickly you go through the phases. For example, if you’re supporting the New York Stock Exchange and you are trying to solve a problem that would bring it back online during the peak hours of trading, you wouldn’t spend 20 minutes surfing the Internet looking for answers. You would get an expert involved immediately.

The type of problem that occurred should also help guide how quickly you go through the phases. If you are a casual at-home Linux user, you might not benefit from a deep understanding of how Linux device drivers work, and it might not make sense to try and investigate such a complex problem on your own. It makes more sense to build deeper skills in a problem area when the type of problem aligns with your job responsibilities or personal interests.

1.3.1. Phase #1: Initial Investigation Using Your Own Skills

Basic information you should always make note of when you encounter a problem is:

  • The exact time the problem occurred
  • Dynamic operating system information (information that can change frequently over time)

The exact time is important because some problems are related to an event that occurred at that time. A common example is an errant cron job that randomly kills off processes on the system. A cron job is a script or program that is run by the cron daemon. The cron daemon is a process that runs in the background on Linux and Unix systems and runs programs or scripts at specific and configurable times (refer to the Linux man pages for more information about cron). A system administrator can accidentally create a cron job that will kill off processes with specific names or for a certain set of user IDs. As a non-privileged user (a user without super user privileges), your tool or application would simply be killed off without a trace. If it happens again, you will want to know what time it occurred and if it occurred at the same time of day (or week, hour, and so on).

The exact time is also important because it may be the only correlation between the problem and the system conditions at the time when the problem occurred. For example, an application often crashes or produces an error message when it is affected by low virtual memory. The symptom of an application crashing or producing an error message can seem, at first, to be completely unrelated to the current system conditions.
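
To illustrate why noting the exact time pays off, here is a minimal sketch that checks whether a problem keeps recurring at the same time of day, which would point toward a scheduled event such as a cron job. The function name recurring_times and the timestamps are hypothetical and exist only for this example.

```python
from collections import Counter
from datetime import datetime


def recurring_times(occurrences, min_count=2):
    """Return (hour, minute) pairs at which the problem was seen at least min_count times."""
    counts = Counter((t.hour, t.minute) for t in occurrences)
    return sorted(hm for hm, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    # Hypothetical timestamps noted down each time the application was killed.
    seen = [
        datetime(2004, 3, 1, 2, 30),
        datetime(2004, 3, 2, 2, 30),
        datetime(2004, 3, 4, 14, 7),
        datetime(2004, 3, 5, 2, 30),
    ]
    for hour, minute in recurring_times(seen):
        print(f"problem recurs at {hour:02d}:{minute:02d}; check what is scheduled then")
```

A match at the same hour and minute on several days is only a hint, but it narrows the search to whatever the system runs at that time.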

The dynamic OS information includes anything that can change over time without human intervention. This includes the amount of free memory, the amount of free disk space, the CPU workload, and so on. This information is important enough that you may even want to collect it any time a serious problem occurs. For example, if you don’t collect the amount of free virtual memory when a problem occurs, you might never get another chance. A few minutes or hours later, the system resources might go back to normal, eliminating any evidence that the system was ever low on memory. In fact, this is so important that distributions such as SUSE LINUX Enterprise Server continuously run sar (a tool that displays dynamic OS information) to monitor the system resources. Sar is a special tool that can collect, report, or save information about the system activity.
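
As a rough sketch of capturing this kind of information at the moment a problem occurs, the following records free memory, free swap, free disk space, and the load average together with a timestamp. It is not a replacement for sar. It assumes a Linux system that exposes /proc/meminfo with MemFree and SwapFree fields, and the function names are made up for this illustration.

```python
import os
import shutil
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def snapshot(meminfo_path="/proc/meminfo", disk_path="/"):
    """Collect a timestamped snapshot of memory, disk space, and load average."""
    snap = {"time": time.strftime("%Y-%m-%d %H:%M:%S")}
    try:
        with open(meminfo_path) as f:
            mem = parse_meminfo(f.read())
        snap["mem_free_kb"] = mem.get("MemFree")
        snap["swap_free_kb"] = mem.get("SwapFree")
    except OSError:
        # The file is missing or unreadable (for example, not a Linux system).
        snap["mem_free_kb"] = snap["swap_free_kb"] = None
    snap["disk_free_bytes"] = shutil.disk_usage(disk_path).free
    try:
        snap["load_avg"] = os.getloadavg()
    except (AttributeError, OSError):
        snap["load_avg"] = None
    return snap


if __name__ == "__main__":
    print(snapshot())
```

Saving the printed snapshot alongside your notes preserves evidence that may be gone a few minutes later.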

The dynamic OS information is also a good place to start investigating many types of problems, which are frequently caused by a lack of resources or changes to the operating system. As part of this initial investigation, you should also make a note of the following:

  • What you were doing when the problem occurred. Were you installing software? Were you trying to start a Web server?
  • A problem description. This should include a description of what happened and a description of what was supposed to happen. In other words, how do you know there was a problem?
  • Anything that may have triggered the problem. This will be pretty problem-specific, but it’s worthwhile to think about it when the problem is still fresh in your mind.
  • Any evidence that may be relevant. This includes error logs from an application that you were using, the system log (/var/log/messages), an error message that was printed to the screen, and so on. You will want to protect any evidence (that is, make sure the relevant files don’t get deleted until you solve the problem).

If the problem isn’t too serious, then just make a mental note of this information and continue the investigation. If the problem is very serious (has a major impact on a business), write this information down or put it into an investigation log (an investigation log is covered in detail later in this chapter).

If you can reproduce the problem at will, strace and ltrace may be good tools to start with. The strace and ltrace utilities can trace an application started from the command line, or they can attach to a running process. The strace command traces all of the system calls (special functions that interact with the operating system), and ltrace traces the library functions that a program calls. The strace tool is probably the most useful problem investigation tool on Linux and is covered in more detail in Chapter 2, “strace and System Call Tracing Explained.”
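
The two ways of using these tools can be sketched as a small helper that builds the command line, either starting a program under the tracer or attaching to a running process with -p. The helper name trace_command and the file names are hypothetical. The -f (follow child processes), -o (write the trace to a file), and -p (attach to a process ID) options are accepted by both strace and ltrace.

```python
import shlex


def trace_command(tool, output_file, program=None, pid=None):
    """Build an argument list for strace or ltrace.

    Pass program (a list of arguments) to start a new process under the tracer,
    or pid to attach to a process that is already running.
    """
    if tool not in ("strace", "ltrace"):
        raise ValueError("tool must be 'strace' or 'ltrace'")
    if (program is None) == (pid is None):
        raise ValueError("give exactly one of program or pid")
    # -f: follow child processes, -o: write the trace to a file
    cmd = [tool, "-f", "-o", output_file]
    if pid is not None:
        cmd += ["-p", str(pid)]
    else:
        cmd += list(program)
    return cmd


if __name__ == "__main__":
    # Print the command lines rather than running them.
    print(shlex.join(trace_command("strace", "trace.out", program=["ls", "/tmp"])))
    print(shlex.join(trace_command("ltrace", "ltrace.out", pid=1234)))
```

The sketch only prints the commands; pass the returned list to subprocess.run if you want to start the trace from Python.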

Every now and then you’ll run into a problem that occurs once every few weeks or months. These problems usually occur on busy, complex systems, and even though they are rare, they can still have a major impact on a business and your personal time. If the problem is serious and cannot be reproduced, be sure to capture as much information as possible given that it might be your only chance. Also, if the problem can’t be reproduced, you should start writing things down because you might need to refer to the information weeks or months into the future. For these types of problems, it may be worthwhile to collect a lot of information about the OS (including the software versions that are installed on it) considering that the problem could be related to something else that may change over weeks or months of time. Problems that take weeks or months to resolve can span several major changes or upgrades to the system, making it important to keep track of the original conditions under which the problem occurred.

Collecting the right OS information can involve running many OS commands, too many for someone to run when the need arises. For your convenience, this book comes with a data collection script that can gather an enormous amount of information about the operating system in a very short period of time. It will save you from having to remember each command and from having to type each command in to collect the right information.
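
The general idea behind such a script can be sketched in a few lines. This is not the script that accompanies the book; it is a minimal illustration under the assumption that the listed commands (uname, uptime, df, ps) are installed, and the command list and the collect function are chosen only for this example.

```python
import subprocess
import sys

# Illustrative command list; not the commands used by the book's script.
DEFAULT_COMMANDS = [
    ["uname", "-a"],
    ["uptime"],
    ["df", "-k"],
    ["ps", "-ef"],
]


def collect(commands, timeout=30):
    """Run each command and return a dict mapping its command line to its output."""
    results = {}
    for cmd in commands:
        key = " ".join(cmd)
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            results[key] = proc.stdout + proc.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of stopping, so one missing tool
            # does not prevent the rest of the data from being collected.
            results[key] = f"[could not run: {exc}]"
    return results


if __name__ == "__main__":
    for name, output in collect(DEFAULT_COMMANDS).items():
        sys.stdout.write(f"===== {name} =====\n{output}\n")
```

Redirecting the output to a file gives a single artifact that can be attached to an email or a problem report.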

The data collection script is particularly useful in two situations. The first situation is that you are investigating a problem on a remote customer system that you can’t log in to. The second situation is a serious problem on a local system that is critical to resolve. In both cases, the script is useful because it will usually gather all the OS information you need to investigate the problem with a single run.

When servicing a remote customer, it will reduce the number of initial requests for information. Without a data collection script, getting the right information for a remote problem can take many emails or phone calls. Each time you ask for more information, the information that is collected is older, further from the time that the problem occurred.

The script is easy to modify, meaning that you can add commands to collect information about specific products (including yours if you have any) or applications that may be important. For a business, this script can improve the efficiency of your support organization and increase the level of customer satisfaction with your support.

Readers that are only using Linux at home may still find the script useful if they ever need to ask for help from a Linux expert. However, the script is certainly aimed more at the business Linux user. For this reason, there is more information on the data collection script in Appendix B, “Data Collection Script” (for the readers who support or use Linux in a business setting).

Do not underestimate the importance of doing an initial investigation on your own, even if the information you need to solve the problem is on the Internet. You will learn more investigating a problem on your own, and that earned knowledge and experience will be helpful for solving problems again in the future. That said, make sure the information you learn is in an area that you will find useful again. For example, improving your skills with strace is a very worthwhile exercise, but learning about a rare problem in a device driver is probably not worth it for the average Linux user. An initial investigation will also help you to better understand the problem, which can be helpful when trying to find the right information on the Internet. Of course, if the problem is urgent, use the appropriate resources to find the right solution as soon as possible.

1.3.1.1. Did Anything Change Recently?

Everything is working as expected and then suddenly, a problem occurs. The first question that people usually ask is “Did anything change recently?” The fact of the matter is that something either changed or something triggered the problem. If something changed and you can figure out what it was, you might have solved the problem and avoided a lengthy investigation.

In general, it is very important to keep changes to a production environment to a minimum. When changes are necessary, be sure to notify the system users of any changes in advance so that any resulting impact will be easier for them to diagnose. Likewise, if you are a user of a system, look to your system administrator to give you a heads up when changes are made to the system. Here are some examples of changes that can cause problems:

  • A recent upgrade or change in the kernel version and/or system libraries and/or software on the system (for example, a software upgrade). The change could introduce a bug or a change in the (expected) behavior of the operating system. Either can affect the software that runs on the system.
  • Changes to kernel parameters or tunable values can cause changes to the behavior of the operating system, which can in turn cause problems for software that runs on the system.
  • Hardware changes. Disks can fail, causing a major outage or possibly just a slowdown in the case of a RAID. If more memory is added to the system and applications start to fail, it could be the result of bad memory. For example, gcc is one of the tools that tend to crash with bad memory.
  • Changes in workload (for example, more users suddenly going to a particular Web site) may push the system close to the limit of its resources. Increases in workload can consume the last bit of memory, causing problems for any software that could be running on the system.

One of the best ways to detect changes to the system is to periodically run a script or tool that collects important information about the system and the software that runs on it. When a difficult problem occurs, you might want to start with a quick comparison of the changes that were recently made on the system—if nothing else, to rule them out as candidates to investigate further.
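
A quick comparison of this kind can be sketched with Python's standard difflib module. The snapshots here are plain text such as the saved output of a collection script; the function name changed_lines and the version strings in the demo are made up for illustration.

```python
import difflib


def changed_lines(old_snapshot, new_snapshot):
    """Return the lines removed ("- ") or added ("+ ") between two text snapshots."""
    diff = difflib.ndiff(old_snapshot.splitlines(), new_snapshot.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; keep only real changes.
    return [line for line in diff if line.startswith(("+ ", "- "))]


if __name__ == "__main__":
    # Hypothetical snapshots taken three weeks apart.
    before = "kernel 2.6.5\nglibc 2.3.3\ngcc 3.3.3"
    after = "kernel 2.6.8\nglibc 2.3.3\ngcc 3.3.3"
    for line in changed_lines(before, after):
        print(line)
```

An empty result rules out the recorded items as candidates; a non-empty one gives a short list of changes to look at first.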

Using information about changes to the system requires a bit of work up front. If you don’t save historical information about the operating environment, you won’t be able to compare it to the current information when something goes wrong. There are some useful tools such as tripwire that can help to keep a history of good, known configuration states.
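
The idea of keeping a known-good baseline can be illustrated with a small sketch that records a checksum for each configuration file and later reports which files differ. This is not how tripwire itself is implemented or configured; it is only a rough illustration of the concept, and the function names and the example file are hypothetical.

```python
import hashlib
import os
import tempfile


def fingerprint(paths):
    """Map each path to the SHA-256 digest of its contents (None if unreadable)."""
    digests = {}
    for path in paths:
        try:
            with open(path, "rb") as f:
                digests[path] = hashlib.sha256(f.read()).hexdigest()
        except OSError:
            digests[path] = None  # missing or unreadable
    return digests


def changed_files(baseline, current):
    """Return the sorted paths whose digest differs between two fingerprints."""
    return sorted(
        p for p in set(baseline) | set(current)
        if baseline.get(p) != current.get(p)
    )


if __name__ == "__main__":
    # Demonstrate with a temporary file standing in for a configuration file.
    with tempfile.TemporaryDirectory() as d:
        conf = os.path.join(d, "example.conf")
        with open(conf, "w") as f:
            f.write("max_connections = 100\n")
        baseline = fingerprint([conf])
        with open(conf, "w") as f:
            f.write("max_connections = 500\n")
        print(changed_files(baseline, fingerprint([conf])))
```

In practice the baseline would be saved to disk while the system is known to be healthy and compared again when something goes wrong.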

Another best practice is to track any changes to configuration files in a revision control system such as CVS. This will ensure that you can “go back” to a stable point in the system’s past. For example, if the system were running smoothly three weeks ago but is unstable now, it might make sense to go back to the configuration three weeks prior to see if the problems are due to any configuration changes.

1.3.2. Phase #2: Searching the Internet Effectively

There are three good reasons to move to this phase of investigation. The first is that your boss and/or customer needs immediate resolution of a problem. The second reason is that your patience has run out, and the problem is going in a direction that will take a long time to investigate. The third is that the type of problem is such that investigating it on your own is not going to build useful skills for the future.

Using what you’ve learned about the problem in the first phase of investigation, you can search online for similar problems, preferably finding the identical problem already solved. Most problems can be solved by searching the Internet using an engine such as Google, or by reading frequently asked question (FAQ) documents, HOW-TO documents, mailing-list archives, USENET archives, or other forums.

1.3.2.1. Google

When searching, pick out unique keywords that describe the problem you’re seeing. Your keywords should contain the application name or “kernel” + unique keywords from the actual output + the name of the function where the problem occurs (if known). For example, keywords consisting of “kernel Oops sock_poll” will yield many results in Google.
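
The keyword recipe above can be sketched as a tiny helper that assembles the terms and URL-encodes them. The function names are hypothetical, and the base search URL is an assumption used only to show the encoding step.

```python
from urllib.parse import quote_plus


def build_query(component, output_keywords, function_name=None):
    """Join the component name, keywords from the actual output, and the function name."""
    terms = [component, *output_keywords]
    if function_name:
        terms.append(function_name)
    return " ".join(terms)


def search_url(query, base="http://www.google.com/search?q="):
    """URL-encode the query so it can be pasted into a browser address bar."""
    return base + quote_plus(query)


if __name__ == "__main__":
    query = build_query("kernel", ["Oops"], "sock_poll")
    print(query)
    print(search_url(query))
```

Keeping the output keywords exactly as they appeared on the screen or in the log tends to matter more than adding extra descriptive words.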

There is so much information about Linux on the Internet that search engine giant Google has created a special search specifically for Linux. This is a great starting place to search for the information you want - http://www.google.com/linux.

There are also some types of problems that can affect a Linux user but are not specific to Linux. In this case, it might be better to search using the main Google page instead. For example, FreeBSD shares many of the same design issues and makes use of GNU software as well, so there are times when documentation specific to FreeBSD will help with a Linux related problem.

1.3.2.2. USENET

USENET is comprised of thousands of newsgroups or discussion groups on just about every imaginable topic. USENET has been around since the beginning of the Internet and is one of the original services that molded the Internet into what it is today. There are many ways of reading USENET newsgroups. One of them is by connecting a software program called a news reader to a USENET news server. More recently, Google provided Google Groups for users who prefer to use a Web browser. Google Groups is a searchable archive of most USENET newsgroups dating back to their infancies. The search page is found at http://groups.google.com or off of the main page for Google. Google Groups can also be used to post a question to USENET, as can most news readers.

1.3.2.3. Linux Web Resources

There are several Web sites that store searchable Linux documentation. One of the more popular and comprehensive documentation sites is The Linux Documentation Project: http://tldp.org.

The Linux Documentation Project is run by a group of volunteers who provide many valuable types of information about Linux including FAQs and HOW-TO guides.

There are also many excellent articles on a wide range of topics available on other Web sites. Two of the more popular sites for articles are:

The first of these sites has useful Linux articles that can help you get a better understanding of the Linux environment and operating system. The second Web site is for learning more about the Linux kernel, not necessarily for fixing problems.

1.3.2.4. Bugzilla Databases

Inspired and created by the Mozilla project, Bugzilla databases have become the most widely used bug tracking database systems for all kinds of GNU software projects such as the GNU Compiler Collection (GCC). Bugzilla is also used by some distribution companies to track bugs in the various releases of their GNU/Linux products.

Most Bugzilla databases are publicly available and can, at a minimum, be searched through an extensive Web-based query interface. For example, GCC’s Bugzilla can be found at http://gcc.gnu.org/bugzilla, and a search can be performed without even creating an account. This can be useful if you think you’ve encountered a real software bug and want to search to see if anyone else has found and reported the problem. If a match is found to your query, you can examine and even track all the progress made on the bug.

If you’re sure you’ve encountered a real software bug, and searching does not indicate that it is a known issue, do not hesitate to open a new bug report in the proper Bugzilla database. Open source software is community-based, and reporting bugs is a large part of what makes the open source movement work. Refer to investigation Phase 4 for more information on opening a bug report.

1.3.2.5. Mailing Lists

Mailing lists are closely related to USENET newsgroups and in some cases are used to provide a more user-friendly front end to the lesser known and less understood USENET interfaces. The advantage of mailing lists is that interested parties explicitly subscribe to specific lists. When a posting is made to a mailing list, everyone subscribed to that list will receive an email. There are usually settings available to subscribers to minimize the impact on their inboxes, such as getting a daily or weekly digest of mailing list posts.

The most popular Linux-related mailing list is the Linux Kernel Mailing List (lkml). This is where most of the Linux pioneers and gurus such as Linus Torvalds, Alan Cox, and Andrew Morton “hang out.” A quick Google search will tell you how you can subscribe to this list, but that would probably be a bad idea due to the high amount of traffic. To avoid the need to subscribe and deal with the high traffic, there are many Web sites that provide fancy interfaces and searchable archives of the lkml. The main one is http://lkml.org.

There are also sites that provide summaries of discussions going on in the lkml. A popular one is at Linux Weekly News (lwn.net) at http://lwn.net/Kernel.

As with USENET, you are free to post questions or messages to mailing lists, though some require you to become a subscriber first.

1.3.3. Phase #3: Begin Deeper Investigation (Good Problem Investigation Practices)

If you get to this phase, you’ve exhausted your attempt to find the information using the Internet. With any luck you’ve picked up some good pointers from the Internet that will help you get a jump start on a more thorough investigation.

Because this is turning out to be a difficult problem, it is worth noting that difficult problems need to be treated in a special way. They can take days, weeks, or even months to resolve and tend to require much data and effort. Collecting and tracking certain information now may seem unimportant, but three weeks from now you may look back in despair wishing you had. You might get so deep into the investigation that you forget how you got there. Also if you need to transfer the problem to another person (be it a subject matter expert or a peer), they will need to know what you’ve done and where you left off.

It usually takes many years to become an expert at diagnosing complex problems. That expertise includes technical skills as well as best practices. The technical skills are what take a long time to learn and require experience and a lot of knowledge. The best practices, however, can be learned in just a few minutes. Here are six best practices that will help when diagnosing complex problems:

  1. Collect relevant information when the problem occurs.
  2. Keep a log of what you’ve done and what you think the problem might be.
  3. Be detailed and avoid qualitative information.
  4. Challenge assumptions until they are proven.
  5. Narrow the scope of the problem.
  6. Work to prove or disprove theories about the problem.

The best practices listed here are particularly important for complex problems that take a long time to solve. The more complex a problem is, the more important these best practices become. Each of the best practices is covered in more detail as follows.

1.3.3.1. Best Practices for Complex Investigations

1.3.3.1.1. Collect the Relevant Information When the Problem Occurs

Earlier in this chapter we discussed how changes can cause certain types of problems. We also discussed how changes can remove evidence for why a problem occurred in the first place (for example, changes to the amount of free memory can hide the fact that it was once low). In the former situation, it is important to collect information because it can be compared to information that was collected at a previous time to see if any changes caused the problem. In the latter situation, it is important to collect information before the changes on the system wipe out any important evidence. The longer it takes to resolve a problem, the better the chance that something important will change during the investigation. In either situation, data collection is very important for complex problems.

本章前面我们讨论了改变如何导致某些类型的问题。我们还讨论了改变如何能够删除第一个问题发生的原因的证据 (例如, 对空闲内存数量的更改可以隐藏它曾经低的事实)。在前一种情况下, 收集信息很重要, 因为它可以与以前收集的信息进行比较, 以查看是否有任何更改导致了问题。在后一种情况下, 重要的是在系统变更前收集信息, 清除任何重要证据。解决问题所需的时间越长, 在调查过程中发生重大变化的可能性就越大。在这两种情况下, 数据收集对于复杂的问题非常重要。

Even reproducible problems can be affected by a changing system. A problem that occurs one day can stop occurring the next day because of an unknown change to the system. If you’re lucky, the problem will never occur again, but that’s not always the case.

即使是可重现的问题也会受到不断变化的系统的影响。发生某一天的问题可能会由于系统的未知改变而在第二天停止发生。如果你很幸运, 问题就不会再发生了, 但情况并非总是如此。

Consider a problem that occurred many years ago in which an application trap occurred in one xterm (a type of terminal window) but not in another. Both xterm windows were on the same system and were identical in every way (or so it seemed at first), yet the problem occurred in only one of them. Even the list of environment variables was the same except for expected differences such as PWD (present working directory). After logging out and back in, the problem could not be reproduced. A few days later the problem came back, again in only one xterm. After a very complex investigation, it turned out that the PWD environment variable was the difference that caused the problem to occur. This isn’t as simple as it sounds. The contents of the PWD environment variable were not the cause of the problem; rather, the difference in the size of the PWD variable between the two xterms shifted the stack (a special memory segment) slightly up or down in the address space. Sure enough, changing PWD to another value made the problem disappear or recur depending on the length of the value. This small difference caused the application to behave differently in the two xterms. In one xterm, a memory corruption in the application landed harmlessly on an inert part of the stack, causing no side effect. In the other xterm, the memory corruption landed on a pointer on the stack (the long description of the problem is beyond the scope of this chapter). The application dereferenced the pointer, and the trap occurred. This is a very rare problem, but it is a good example of how small and seemingly unrelated changes or differences can affect a problem.

考虑一个在许多年前发生的问题, 应用程序bug发生在一个 xterm (一种终端窗口) 窗口中, 而不是另一个。两个 xterm 窗口都在相同的系统上, 在各个方面都是相同的 (在一开始是相同的), 但仍然问题只发生在一个。即使环境变量的列表是相同的, 除了预期的差异, 如PWD(当前工作目录)。在注销和返回后, 问题无法重现。几天后, 问题又回来了, 只有一个 xterm。经过非常复杂的调查, 结果发现环境变量PWD是导致问题发生的差异。这并不像听起来那么简单。PWD环境变量的内容不是问题的原因, 尽管两个 xterms 之间的PWD变量大小的差异迫使堆栈 (特殊内存段) 在地址空间中上下移动。果然, 将PWD更改为另一个值会使问题消失或重复, 具体取决于长度。这一小差异导致了两个 xterms 中应用程序的不同行为。在一个 xterm 中, 应用程序中的内存损坏在堆栈的安全部分无问题地着陆, 导致没有副作用。在其他 xterm 中, 内存损坏落在堆栈上的指针上 (问题的具体描述超出本章的范围)。该指针被应用程序取消, 并发生bug。这是一个非常罕见的问题, 但这是一个很好的例子, 说明小的和看似无关的变化或差异会影响一个问题。。
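
One practical takeaway from this story is to compare the size of each environment variable, and not only its name, when two sessions behave differently. The following is a minimal shell sketch, not part of the original investigation. The function name env_sizes and the file names in the usage comment are made up for illustration, and a variable whose value contains newlines is only counted up to its first line.

```sh
#!/bin/sh
# env_sizes: print "NAME LENGTH" for every environment variable, sorted by
# name, where LENGTH is the size of the value in bytes.
# LC_ALL=C makes awk count bytes rather than multibyte characters.
env_sizes() {
    env | LC_ALL=C awk -F= '
        /^[A-Za-z_][A-Za-z0-9_]*=/ { print $1, length($0) - length($1) - 1 }
    ' | sort
}

# Possible usage (hypothetical file names):
#   in the first xterm:   env_sizes > /tmp/env.A
#   in the second xterm:  env_sizes > /tmp/env.B
#   then compare them:    diff /tmp/env.A /tmp/env.B
```

A difference in the reported length of a variable such as PWD would have pointed to the stack shift described above, even though the set of variable names was the same in both windows.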

If the problem is serious and difficult to reproduce, collect and/or write down the information from 1.3.1: Initial Investigation Using Your Own Skills.

如果问题是严重的, 很难重现, 收集和/或写下的信息从 1.3.1: 使用你自己的技能进行初步调查。

For quick reference, here is the list:

为快速参考, 以下是列表:

  • The exact time the problem occurred
  • 问题发生的确切时间
  • Dynamic operating system information
  • 动态操作系统信息
  • What you were doing when the problem occurred
  • 问题发生时您在做什么
  • A problem description
  • 问题描述
  • Anything that may have triggered the problem
  • 可能引发问题的任何事情
  • Any evidence that may be relevant
  • 任何可能相关的证据

The more serious and complex the problem is, the more you’ll want to start writing things down. With a complex problem, other people may need to get involved, and the investigation may get complex enough that you’ll start to forget some of the information and theories you’re using. The data collector included with this book can make your life easier whenever you need to collect information about the OS.

问题越严重, 越复杂, 你就越想开始写东西。有了一个复杂的问题, 其他人可能需要参与进来, 调查可能会变得很复杂, 以至于你会开始忘记你正在使用的一些信息和理论。当您需要收集有关 OS 的信息时, 本书中包含的数据脚本可以使您的生活更轻松。
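
The data collector that ships with the book is not reproduced here. As a rough stand-in, the sketch below saves a few pieces of volatile information into a timestamped directory. The function name collect_snapshot, the file names, and the choice of what to collect are assumptions made for illustration, and the /proc files are Linux-specific.

```sh
#!/bin/sh
# collect_snapshot DIR: save a timestamped snapshot of volatile system state
# under DIR and print the path of the new snapshot directory.
collect_snapshot() {
    base=${1:-.}
    snap="$base/snapshot-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$snap" || return 1

    date > "$snap/time.txt"                        # exact time of collection
    env | sort > "$snap/env.txt"                   # environment of this shell
    ps -ef > "$snap/ps.txt" 2>/dev/null || true    # running processes
    df -k > "$snap/df.txt" 2>/dev/null || true     # file system usage

    # Linux-specific sources; skipped quietly where they do not exist.
    [ -r /proc/meminfo ] && cat /proc/meminfo > "$snap/meminfo.txt"
    [ -r /proc/loadavg ] && cat /proc/loadavg > "$snap/loadavg.txt"

    echo "$snap"
}

# Possible usage: collect_snapshot /var/tmp/problem-1234/data
```

Running something like this as soon as a problem is noticed preserves evidence, such as the amount of free memory, that later changes on the system might wipe out.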

1.3.3.1.2. Use an Investigation Log

Even if you only ever have one complex, critical problem to work on at a time, it is still important to keep track of what you’ve done. This doesn’t mean well-written, grammatically correct explanations of everything you’ve done, but it does mean enough detail to be useful to you at a later date. Assuming that you’re like most people, you won’t have the luxury of working on a single problem at a time, which makes this even more important. When you’re investigating 10 problems at once, it sometimes gets difficult to keep track of what has been done for each of them. You also stand a good chance of hitting a similar problem again in the future and may want to use some of the information from the first investigation.

即使您每次都有一个复杂的、关键的问题要处理, 跟踪您所做的工作仍然非常重要。这并不意味着很好, 语法正确的解释你所做的一切, 但它确实意味着足够的细节在以后是有用的。假设你和大多数人一样, 你就需要同时处理多个问题。做记录就更重要了。当你一次调查10个问题时, 有时很难记清楚每个问题都做了些什么。你也有可能再次碰到类似的问题, 希望从第一次调查的信息中获得一些有用的资料。

Further, if you ever need to get someone else involved in the investigation, an investigation log can prevent a great deal of unnecessary work. You don’t want others unknowingly spending precious time redoing your hard-earned steps and finding the same results. An investigation log can also point others to what you have done so that they can make sure your conclusions are correct up to a certain point in the investigation.

此外, 如果您需要让其他人参与调查, 调查日志可以防止大量不必要的工作。你不希望别人在花费宝贵的时间重复你的调查工作。调查日志还可以让其他人了解您所做的工作, 以便他们能够确保您的结论是正确的。

An investigation log is a history of what has been done so far for the investigation of a problem. It should include theories about what the problem could be or what avenues of investigation might help to narrow down the problem. As much as possible, it should contain real evidence that helps lead you to the current point of investigation. Be very careful about making assumptions, and be very careful about qualitative proofs (proofs that contain no concrete evidence).

调查日志是迄今为止为调查问题所做的工作记录。它应该包括关于问题可能是什么, 或者哪种调查的方式有助于缩小问题的范围。它应该尽可能地包含真正的证据, 这些证据有助于引导你到目前的调查点。你要非常小心的假设, 并非常小心定性的证明 (没有确切证据的证明)。

The following example shows a very structured and well laid out investigation log. With some experience, you’ll find the format that works best for you. As you read through it, it should be obvious how useful an investigation log is. If you had to take over this problem investigation right now, it should be clear what has been done and where the investigator left off.

下面的示例演示了一个非常结构化且布局良好的调查日志。在做过一些调查日志后,你会找到适合你的格式。当您通读它时, 你会发现调查日志是多么有用。如果你现在要接手调查这个问题, 你应该清楚已经做了什么, 哪些调查人员离开了。

Time of occurrence: Sun Sep 5 21:23:58 EDT 2004
Problem description: Product Y failed to start when run from a cron job.
Symptom:

ProdY: Could not create communication semaphore: 1176688244 (EEXIST)

What might have caused the problem: The error message seems to indicate
that the semaphore already existed and could not be recreated.

Theory #1: Product Y may have crashed abruptly, leaving one or more IPC
resources. On restart, the product may have tried to recreate a semaphore
that it already created from a previous run.

Needed to prove/disprove:
  • The ownership of the semaphore resource at the time of
    the error is the same as the user that ran product Y.
  • That there was a previous crash for product Y that
    would have left the IPC resources allocated.

Proof: Unfortunately, there was no information collected at the time of
the error, so we will never truly know the owner of the semaphore at the
time of the error. There is no sign of a trap, and product Y always
leaves a debug file when it traps. This is an unlikely theory that is
good given we don't have the information required to make progress on
it.

Theory #2: Product X may have been running at the time, and there may
have been an IPC (Inter Process Communication) key collision with
product Y.

Needed to prove/disprove:
  • Check whether product X and product Y can use the same
    IPC key.
  • Confirm that both product X and product Y were actually
    running at the time.

Proof: Started product X and then tried to start product Y. Ran "strace"
on product X and got the following semget:

ion 618% strace -o productX.strace prodX
ion 619% egrep "sem|shm" productX.strace
semget(1176688244, 1, 0)        = 399278084

Ran "strace" on product Y and got the following semget:

ion 730% strace -o productY.strace prodY
ion 731% egrep "sem|shm" productY.strace
semget(1176688244, 1, IPC_CREAT|IPC_EXCL|0x1f7|0666) = EEXIST

The IPC keys are identical, and product Y tries to create the semaphore
but fails. The error message from product Y is identical to the original
error message in the problem description here.

Notes: productX.strace and productY.strace are under the data directory.

Assumption: I still don't know whether product X was running at the
time when product Y failed to start, but given these results, it is very
likely. IPC collisions are rare, and we know that product X and product
Y cannot run at the same time the way they are currently configured.

Note: A semaphore is a special type of inter-process communication mechanism that synchronizes processes (and/or threads). The type of semaphore used here requires a unique “key” so that multiple processes can use the same semaphore. A semaphore can exist without any processes using it, and some applications expect and rely on creating a semaphore before they can run properly. The semget() call in the strace output is a system call (a special type of OS function) that, as the name suggests, gets a semaphore.

注意: 信号量是一种特殊类型的进程间通信机制, 它提供进程 (和/或线程) 之间的同步机制。此处使用的信号量类型需要唯一的 "key", 以便多个进程可以使用相同的信号量。信号量可以在没有任何使用它的进程的情况下存在, 有些应用程序依赖于信号量才能正常运行。下面的 strace 中的 semget () 是一个系统调用 (一种特殊类型的 OS 函数), 顾名思义, 它获取信号量。
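
Theory #1 in the log could not be proven because nobody recorded who owned the semaphore when the error occurred. The sketch below shows one way that evidence could be captured, assuming the ipcs utility is installed. ipcs -s lists semaphore arrays with their keys and owners, and it prints keys in hexadecimal, whereas the error message and the strace output show the key in decimal. The function names key_to_hex and save_semaphores are made up for illustration.

```sh
#!/bin/sh
# key_to_hex KEY: print a decimal IPC key in the 0x%08x form that ipcs uses.
# The key from the log, 1176688244, prints as 0x4622d674.
key_to_hex() {
    printf '0x%08x\n' "$1"
}

# save_semaphores FILE: append the current time and the semaphore list to FILE.
save_semaphores() {
    {
        date
        if command -v ipcs >/dev/null 2>&1; then
            ipcs -s || echo "ipcs -s failed"
        else
            echo "ipcs not available on this system"
        fi
    } >> "$1"
}

# Possible usage:
#   save_semaphores data/ipcs.txt
#   grep "$(key_to_hex 1176688244)" data/ipcs.txt
```

Had output like this been saved at the time of the failure, the owner column would have settled the first item under "Needed to prove/disprove" for Theory #1.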

Notice how detailed the proofs are. Even the commands used to capture the original strace output are included to eliminate any human error. When entering a proof, be sure to ask yourself, “Would someone else need any more proof than this?” This level of detail is often required for complex problems so that others will see the proof and agree with it.

注意证据的详细程度。即使用于捕获原始 strace 输出的命令也包括在内, 以消除任何人为错误。当你进入一个证明, 一定要问自己, "其他人还需要更多的证据吗?”复杂的问题往往需要这一层次的细节, 以便使其他人看到的证据, 并同意它。

The amount of detail in your investigation log should depend on how critical the problem is and how close you are to solving it. If you’re completely lost on a very critical problem, you should include more detail than if you are almost done with the investigation. The high level of detail is very useful for complex problems given that every piece of data could be invaluable later on in the investigation.

调查日志中的详细信息应取决于问题的重要程度, 以及距离您解决这个问题的远近。如果你完全迷失在一个非常关键的问题上, 你应该包括各种尽可能的细节。对于复杂的问题, 高度的细节是非常有用的, 因为在以后的调查中, 每一条数据都可能是无价的。

If you don’t have a good problem tracking system, here is a possible directory structure that can help keep things organized:

如果您没有一个好的问题跟踪系统, 下面的目录结构, 可以帮助你跟踪问题:

<problem identifier>/inv.txt
                    /data/
                    /src/

The problem identifier is for tracking purposes. Use whatever is appropriate for you (even if it is 1, 2, 3, 4, and so on). The inv.txt is the investigation log, containing the various theories and proofs. The data directory is for any data files that have been collected. Having one data directory helps keep things organized and it also makes it easy to refer to data files from your investigation log. The src directory is for any source code or scripts that you write to help investigate the problem.

The problem directory is what you would show someone when referring to the problem you are investigating. The investigation log would contain the flow of the investigation with the detailed proofs and should be enough to get someone up to speed quickly.

问题标识符用于跟踪目的。你可以使用任何适合您的标识 (即使它是1、2、3、4等等)。inv.txt 是调查记录, 包含各种理论和证据。数据目录用于已收集的任何数据文件。拥有一个数据目录有助于使事情组织起来, 而且它也便于从调查日志中引用数据文件。src 目录用于为帮助调查问题而编写的任何源代码或脚本。问题目录是指当你提到你正在调查的问题时, 你会向别人展示的记录。调查日志将包含调查的流程以及详细的证明, 并且应该是足够使某人快速地了解问题。
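
Setting up this layout by hand is easy, but a small helper keeps it consistent across problems. The following is a sketch only. The function name new_investigation is made up, and the headings written into inv.txt simply mirror the example log earlier in this section; use whatever format works for you.

```sh
#!/bin/sh
# new_investigation ID: create ID/inv.txt, ID/data/ and ID/src/.
new_investigation() {
    id=$1
    [ -n "$id" ] || { echo "usage: new_investigation <problem identifier>" >&2; return 1; }
    mkdir -p "$id/data" "$id/src" || return 1
    # Never overwrite an existing investigation log.
    if [ ! -e "$id/inv.txt" ]; then
        cat > "$id/inv.txt" <<EOF
Time of occurrence:
Problem description:
Symptom:
What might have caused the problem:

Theory #1:
Needed to prove/disprove:
Proof:
EOF
    fi
}

# Possible usage: new_investigation 1234
```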

You may also want to save the problem directory for the future or, better yet, put the investigation directories somewhere where others can search through them as well. After all, you worked hard for the information in your investigation log; don’t be too quick to delete it. You never know when you’ll hit a similar (or the same) problem again. The investigation log can also be used to help educate more junior people about investigation techniques.

您可能还保存问题目录以便将来使用, 或者将调查目录放在其他人可以搜索到的地方。毕竟, 你曾经为调查日志中的信息努力工作;不希望这么快就删除它。你永远不知道什么时候你会遇到类似的 (或同样的) 问题。调查日志也可以用来帮助更多的初级人员的学习调查技术。

1.3.3.1.3. Be Detailed (Avoid Qualitative Information)

Be very detailed in your investigation log or any time when discussing the problem. If you prove a theory using an error record from an error log file, include the error record and the name of the error log file as proof in the investigation log. Avoid qualitative proofs such as, “Found an error log that showed that the suspect product was running at the time.” If you transfer a problem to another person, that person will want to see the actual error record to ensure that your assumption was correct. Also if the problem lasts long enough, you may actually start to second-guess yourself as well (which is actually a good thing) and may appreciate that quantitative proof (a proof with real data to back it up).

在您的调查日志或讨论问题的时候都应该非常详细。如果您使用错误日志文件中的错误记录来证明理论, 请在调查日志中写下错误记录和错误日志文件的名称作为证据。避免含糊不清证据, 例如 "发现错误日志, 显示可疑产品正在运行"。如果您将问题转移给其他人, 该人将希望看到实际的错误记录, 以确保您的假设是正确的。此外, 如果问题持续的时间足够长, 你可能实际上开始怀疑自己, 以及 (实际上是一个好东西), 并可能会欣赏的定量证明 (有实际数据来支持它的证明)。

Another example of a qualitative proof is a relative term or description. Descriptions like “the file was very large” and “the CPU workload was high” will mean different things to different people. You need to include details for how large the file was (using the output of the ls command if possible) and how high the CPU workload was (using uptime or top). This will remove any uncertainty that others (or you) have about your theories and proofs for the investigation.

另一个定性证明的例子是相对术语或描述。像 "文件很大" 和 "CPU 工作负载高" 这样的描述对不同的人来说将意味着不同的事情。您需要包含文件大小的详细信息 (如果可能, 使用 ls 命令的输出) 以及 CPU 工作负载的高 (使用uptime或top)。这将消除其他人 (或你) 对你的调查的理论和证明的不确定性。
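
One way to make quantitative proof a habit is to record the exact command and its output in the investigation log instead of summarizing the result. The sketch below assumes the log is a plain text file. The function name record_evidence and the entry format are made up for illustration.

```sh
#!/bin/sh
# record_evidence LOG COMMAND [ARGS...]: append a timestamp, the command line,
# the command's combined output and its exit status to LOG.
record_evidence() {
    log=$1
    shift
    {
        echo "--- $(date)"
        printf '%s\n' "\$ $*"
        rc=0
        "$@" 2>&1 || rc=$?
        echo "(exit status: $rc)"
    } >> "$log"
}

# Possible usage (hypothetical file name):
#   record_evidence inv.txt ls -l /var/log/app.log
#   record_evidence inv.txt uptime
```

An entry produced this way replaces "the file was very large" with the actual size reported by ls, and "the CPU workload was high" with the load averages reported by uptime.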

Similarly, when you are asked to review an investigation, be leery of any proof or absolute statement (for example, “I saw the amount of virtual memory drop to dangerous levels last night”) without the required evidence (that is, a log record, output from a specific OS command, and so on). If you don’t have the actual evidence, you’ll never know whether a statement is true. This doesn’t mean that you have to distrust everyone you work with to solve a problem but rather a realization that people make mistakes. A quick cut and paste of an error log file or the output from an actual command might be all the evidence you need to agree with a statement. Or you might find that the statement is based on an incorrect assumption.

同样, 当您被要求复查调查时, 请注意任何证据或绝对声明 (例如, "昨晚我看到虚拟内存的数量降到了危险级别"), 没有确切的证据 (即日志记录, 来自特定 OS 的命令的输出等)。如果你没有确切的证据, 你永远不会知道一个陈述是否属实。这并不意味着你必须不信任你工作伙伴能解决一个问题, 而是认识到人们会犯错误。快速剪切和粘贴错误日志文件或实际命令的输出可能是您所需要的所有证据。或者, 您可能会发现该基于错误的假设的结论。

1.3.3.1.4. Challenge Assumptions

There is nothing like spending a week diagnosing a problem based on an assumption that was incorrect. Consider an example where a problem has been identified and a fix has been provided ... yet the problem happens again. There are two main possibilities here. The first is that the fix didn’t address the problem. The second is that the fix is good, but you didn’t actually get it onto the system (for the statistically inclined reader: yes there is a chance that the fix is bad and it didn’t get on the system, but the chances are very slim). For critical problems, people have a tendency to jump to conclusions out of desperation to solve a problem quickly. If the group you’re working with starts complaining about the bad fix, you should encourage them to challenge both possibilities. Challenge the assumption that the fix actually got onto the system. (Was it even built into the executable or library that was supposed to contain the fix?)

没有什么比花一个星期的时间来证明一个错误的假设更让人恼火了。例如,问题已经被确定并提供了修复程序...... 但问题又发生了。这里有两种可能性。第一个可能是修复程序没有解决问题。第二个可能是, 修复是好的, 但你实际上并没有把它部署到系统上 (对于苛刻的读者: 是的, 还有一个机会, 修复程序是坏的, 它没有部署到系统, 但这种可能性是非常渺茫)。对于严重的问题, 人们倾向于迅速解决问题。如果你工作的小组开始抱怨坏的修复程序, 你应该鼓励他们挑战这两种可能性。查看修复程序是否真正部署到系统上。(它是否已经合到出现问题的可执行文件或库中?
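
The assumption that the fix actually reached the system can often be checked in a few seconds. The sketch below assumes you have access to both the freshly built file and the installed copy; the function names and the paths in the usage comment are hypothetical.

```sh
#!/bin/sh
# same_contents FILE_A FILE_B: succeed only if the files are byte-for-byte identical.
same_contents() {
    cmp -s "$1" "$2"
}

# check_fix_installed BUILT INSTALLED: report whether the installed file
# matches the file that was built with the fix.
check_fix_installed() {
    if same_contents "$1" "$2"; then
        echo "MATCH: $2 is identical to $1"
    else
        echo "DIFFERENT: $2 does not match $1"
        cksum "$1" "$2" 2>&1    # checksums and sizes, suitable for the log
        return 1
    fi
}

# Possible usage:
#   check_fix_installed build/libprody.so /opt/prody/lib/libprody.so
```

A mismatch does not prove the fix is good, but it does show that the fix has not yet been tested on that system.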

1.3.3.1.5. Narrow Down the Scope of the Problem

Problem determination at the solution level (that is, for a complete IT solution) is difficult enough, but to make matters worse, each application or product in a solution usually requires a different set of skills and knowledge. Even following the trail of evidence can require deep skills for each application, which might mean getting a few experts involved. This is why it is so important to narrow down the scope of a solution-level problem as quickly as possible.

Today’s complex heterogeneous solutions can make simple problems very difficult to diagnose. Computer systems and the software that runs on them are integrated through networks and other mechanisms to work together to provide a solution. A simple problem, even one that has a clear error message, can become difficult because its effect can ripple throughout a solution, causing seemingly unrelated symptoms. Consider the example in Figure 1.1.

解决方案 (即一个完整的 IT 解决方案) 级别的问题定位是非常困难的, 但更糟糕的是, 解决方案中的不同的应用程序或产品通常需要一套不同的技能和知识。即使循着证据的踪迹来调查, 每个应用程序也都需要很高的技能, 这可能意味着要让一些专家参与进来。这就是为什么要尽可能快地将问题的范围缩小的原因。现在复杂的异构解决方案可以使简单的问题很难诊断。计算机系统和运行在它们上的软件通过网络和其他机制集成在一起, 以提供解决方案。由于问题的影响可能会波及整个解决方案, 导致看似不相关的症状, 一个简单的问题, 即使有明确的错误信息, 也可能变得很难判断。请参考图1.1 中的示例。

Figure 1.1. Ripple effect of an error in a solution.

Application A in a solution could return an error code because it failed to allocate memory (effect #1). On its own, this problem could be easy to diagnose. However, this in turn could cause application B to react and return an error of its own (effect #2). Application D may see this as an indication that application B is unavailable and may redirect its requests to a redundant application C (effect #3). Application E, which relies on application D and serves the end user, may experience a slowdown in performance (effect #4) since application D is no longer using the two redundant servers B and C. This in turn can cause an end user to experience the performance degradation (effect #5) and to phone up technical support (effect #6) because the performance is slower than usual. If this seems overly complex, it is actually an oversimplification of real IT solutions where hundreds or even thousands of systems can be connected together. The challenge for the investigator is to follow the trail of evidence back to the original error.

解决方案中的应用程序 A 可能会返回错误代码, 因为它无法分配内存 (效果 #1)。就其本身而言, 这个问题很容易诊断。但是, 这反过来会导致应用程序 B 响应并返回自己的错误 (效果 #2)。应用程序 D 可能将此视为表示应用程序 B 不可用, 并可能将其请求重定向到冗余应用程序 C (效果 #3)。应用程序 E 依赖于应用程序 D 并服务于最终用户, 因此可能会遇到性能下降 (效果 #4), 因为应用程序 D 不再使用两个冗余服务器 B 和 C。这反过来会导致最终用户体验性能下降 (效果 #5)。最终客户因为性能差 而要求技术支持 (效果 #6) 提供电话服务。这似乎有点过于复杂, 但实际上已经对真正的 IT 解决方案的做了很大的简化。 真正的IT解决方案将成百上千个应用系统连接在一起,调查人员面临的挑战是跟踪证据的踪迹,找到最初的错误。

It is particularly important to challenge assumptions when working on a solution-level problem. You need to find out whether each symptom is related to a local system or whether the symptom is related to a change or error condition in another part of a solution.

在处理解决方案级别的问题时, 特别重要的是要挑战假设。您需要了解每个症状是否与本地系统相关, 或者该症状是否与解决方案的另一部分中的更改或错误条件相关。

There are some complex problems that cannot be broken down in scope. These problems require true skill and perseverance to diagnose. Usually this type of problem is a race condition that is very difficult to reproduce. A race condition is a type of problem that depends on timing and the order in which things occur. A good example is a “late read.” A late read is a software defect where memory is freed, but at some point in the very near future, it is used again by a different part of the application. As long as the memory hasn’t been reused, the late read may be okay. However, if the memory block has been reused (and written to), the late read will access the new contents of the memory block, causing unpredictable behavior. Most race conditions can be narrowed in scope in one way or another, but some are so timing-dependent that any changes to the environment (for the purposes of investigation) will cause the problem to not occur.

有些复杂的问题是不能分解的。这些问题需要真正的技能和毅力来诊断。通常这种类型的问题是一个很难重现的race条件。race条件是一种类型的问题, 取决于时间和事情发生的顺序。一个很好的例子是 "late read"。Late read是一种软件缺陷, 内存被释放, 但在不久的将来, 它又被应用程序的另外一部分使用。只要内存没有被重用, late read可能没有问题。但是, 如果内存块已被重用 (并写入), 则late read将访问内存块的新内容, 从而导致不可预知的行为。大多数race条件可以以一种或另一种方式缩小范围, 但有些问题是依赖于时间的。对环境的任何更改 (为了调查目的), 不会导致问题发生。

Lastly, everyone working on an IT solution should be aware of the basic architecture of the solution. This will help the team narrow the scope of any problems that occur. Knowing the basic architecture will help people to theorize where a problem may be coming from and eventually identify the source.

最后, 在 IT 解决方案上工作的每个人都应该了解解决方案的基本体系结构。这将帮助团队缩小问题的范围。了解基本的体系结构将帮助人们推断出问题的根源, 并最终识别出源头。

1.3.3.2. Create a Reproducible Test Case

Assuming you know how the problem occurs (note that the word here is how, not why), it will help others if you can create a test case and/or environment that can reproduce the problem at will. A test case is a term used to refer to a tool or a small set of commands that, when run, can cause a problem to occur.

假设您知道问题是如何发生的 (请注意, 这里的单词是如何, 而不是为什么), 如果您可以创建一个可以随时重现问题的测试用例和/或环境, 它将帮助其他人。测试用例是一个工具或一组命令, 在运行时会导致出现问题。

A successful test case can greatly reduce the time to resolution for a problem. If you’re investigating a problem on your own, you can run and rerun the test case to cause the problem to occur many times in a row, learning from the symptoms and using different investigation techniques to better understand the problem.

成功的测试用例可以大大缩短解决问题的时间。如果您自己正在调查问题, 则可以重复运行测试用例, 以使问题连续发生多次, 从症状中吸取教训, 使用不同的调查技术来更好地理解问题。
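
For problems that do not occur on every run, a small loop around the test case shows how often it fails. This is a sketch only: the function name repro_loop is made up, and a "failure" here simply means a non-zero exit status, which also covers a process killed by a signal.

```sh
#!/bin/sh
# repro_loop COUNT COMMAND [ARGS...]: run COMMAND COUNT times, list the runs
# that failed, and print a summary line.
repro_loop() {
    count=$1
    shift
    i=0
    failures=0
    while [ "$i" -lt "$count" ]; do
        i=$((i + 1))
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
            echo "run $i: failed"
        fi
    done
    echo "failures: $failures of $count"
}

# Possible usage (hypothetical test case): repro_loop 100 ./testcase.sh
```

The summary line is itself quantitative evidence ("failed 7 of 100 runs") that can go straight into the investigation log.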

If you need to ask an expert for help, you will also get much more help if you include a reproducible test case. In many cases, an expert will know how to investigate a problem but not how to reproduce it. Having a reproducible test case is especially important if you are asking a stranger for help over the Internet. In this case, the person helping you will probably be doing so on his or her own time and will be more willing to help out if you make it as easy as you can.

如果你需要技术专家的帮助, 如果你有一个可重现的测试用例, 你会得到更多的帮助。在许多情况下, 专家将知道如何调查一个问题, 但不知道如何重现它。如果你通过互联网向陌生人求助, 那么拥有一个可重现问题的测试用例就显得尤为重要。在这种情况下, 如果你有测试用例,那么帮助你的人很可能在他或她自己的时间里,会更愿意帮助你。

1.3.3.3. Work to Prove and/or Disprove Theories

This is part of any good problem investigation. The investigator will do his best to think of possible avenues of investigation and to prove or disprove them. The real art here is to identify theories that are easy to prove or disprove or that will dramatically narrow the scope of a problem.

这是好的问题调查的一部分。调查人员将尽力考虑可能的调查渠道, 并证明或反驳他们。在这里, 真正的艺术是找出容易证明或反驳的理论, 或者将问题的范围缩小。

Even non-solution-level problems (such as an application that fails when run from the command line) can be easier to diagnose if they are narrowed in scope with the right theory. Consider an application that is failing to start with an obscure error message. One theory could be that the application is unable to allocate memory. This theory is much smaller in scope and easier to investigate because it does not require intimate knowledge about the application. Because the theory is not application-specific, there are more people who understand how to investigate it. If you need to get an expert involved, you only need someone who understands how to investigate whether an application is unable to allocate memory. That expert may know nothing about the application itself (and might not need to).

即使是 nonsolution 级别的问题 (如从命令行运行时失败的应用程序), 也可以更容易地诊断它们是否可以缩小范围。请考虑一个无法正常启动的应用程序,它打出了含糊不清的信息。一种理论可能是应用程序无法分配内存。这一理论在范围上要小得多, 更容易调查, 因为它不需要对应用程序有深入的了解。因为理论不是特定于应用程序的, 所以有更多的人知道如何调查它。如果需要相关专家参与进来, 您只需要了解如何调查应用程序是否无法分配内存的专家。该专家可能对应用程序本身一无所知 (可能不需要)。
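
The memory allocation theory can be tested without knowing anything about the application. One approach, sketched below, is to capture a system call trace with strace -o (as in the investigation log earlier) and count the calls that failed with ENOMEM. The function name count_enomem and the file name in the usage comment are made up for illustration, and a count of zero only rules out allocation failures reported by the kernel during that run.

```sh
#!/bin/sh
# count_enomem STRACE_FILE: print how many traced system calls failed
# with ENOMEM ("Cannot allocate memory").
count_enomem() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    # grep -c prints 0 and exits non-zero when nothing matches; that is
    # still a valid answer here, so the exit status is ignored.
    grep -c ' = -1 ENOMEM ' "$1" || true
}

# Possible usage:
#   strace -o app.strace ./app
#   count_enomem app.strace
```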

1.3.3.4. The Source Code

If you are familiar with reading C source code, looking at the source is always a great way of determining why something isn’t working the way it should. Details of how and when to do this are discussed in several chapters of this book, along with how to make use of the cscope utility to quickly pinpoint specific source code areas.

如果你熟悉 C 源代码, 那么查看源码是一个很好的方法来确定为什么某件事情没有按它应该的方式工作。本书后面的几个章节将详细讨论如何和何时执行此操作, 以及如何使用 cscope 快速定位特定的源代码。

Also included in the source code is the Documentation directory that contains a great deal of detailed documentation on various aspects of the Linux kernel in text files. For specific kernel related questions, performing a search command such as the following can quickly yield some help:

源代码中还包含了文档目录, 其中包含大量关于 Linux 内核的详细文档。对于特定的内核相关问题, 执行诸如下面这样的搜索命令可以快速产生一些帮助:

find /usr/src/linux/Documentation -type f | xargs grep -H <search_pattern> | less

where <search_pattern> is the desired search criteria as documented in grep(1).

其中<search_pattern>是在 grep (1) 中记录的所需搜索条件.
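
If any file names under the documentation tree contain spaces, the find and xargs pipeline can split them incorrectly. The sketch below is one possible variant that lets find pass the file names to grep directly. The function name doc_search is made up, and the default directory is the one used in the command above.

```sh
#!/bin/sh
# doc_search PATTERN [DIR]: print every matching line, prefixed with its
# file name, from the regular files under DIR.
doc_search() {
    pattern=$1
    dir=${2:-/usr/src/linux/Documentation}
    find "$dir" -type f -exec grep -H -- "$pattern" {} +
}

# Possible usage: doc_search semaphore | less
```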

1.3.4. Phase #4: Getting Help or New Ideas

Everyone gets stuck, and once you’ve looked at a problem for too long, it can be hard to view it from a different perspective. Regardless of whether you’re asking a peer or an expert for ideas/help, they will certainly appreciate any homework you’ve done up to this point.

每个人都被卡住了, 一旦你看了一个问题太久, 就很难从不同的角度去分析它。不管你是在向一个同事还是专家的求助, 他们肯定会感谢你做的准备工作。

1.3.4.1. Profile of a Linux Guru

A great many of the key people working on Linux do so as a “side job” (which often receives more time and devotion than their regular full-time jobs). Many of these people were the original “Linux hackers” and are often considered the “Linux gurus” of today. It’s important to understand that these Linux gurus spend a great deal of their own spare time working (sometimes affectionately called “hacking”) on the Linux kernel. If they decide to help you, they will probably be doing so on their own time. That said, Linux gurus are a special breed of people who have great passion for the concept of open source, free software, and the operating system itself. They take the development and correct operation of the code very seriously and have great pride in it. Often they are willing to help if you ask the right questions and show some respect.

许多在 Linux 上工作的关键人物都是 "兼职" (通常比他们的专职工作投入更多的时间和精力)。其中许多人最初是 "linux 黑客", 并且经常被认为是当今的 "linux 大师"。重要的是要了解, 这些 linux 专家花费大量的业余时间做 (有时亲切地被称为 "黑客") linux 内核的工作。如果他们决定帮助你, 他们会利用他们自己的时间。也就是说, Linux 大师是一群特殊的人, 他们对开源、自由软件和操作系统有着极大的热情。他们非常认真地对待代码的开发, 并且有极大的自豪感。如果你提出正确的问题并表现出尊重, 他们往往愿意帮忙你。

1.3.4.2. Effectively Asking for Help

1.3.4.2.1. Netiquette

Netiquette is a commonly used term that refers to Internet etiquette. Netiquette is all about being polite and showing respect to others on the Internet. One of the best and most succinct documents on netiquette is RFC1855 (RFC stands for “Request for Comments”). It can be found at http://www.faqs.org/rfcs/rfc1855.html. Here are a few key points from this document:

礼仪是指互联网礼仪的常用术语。礼仪是在互联网上表现出对他人的礼貌和尊重。描述礼仪最好和最简洁的文档是 RFC1855 (RFC代表 "请求评论")。可以在 http://www.faqs.org/rfcs/rfc1855.html 找到。以下是本文档的几个要点:

  • Read both mailing lists and newsgroups for one to two months before you post anything. This helps you to get an understanding of the culture of the group.
  • 在张贴任何内容之前, 请阅读邮件列表和新闻组一两个月。这有助于您了解该组的文化。
  • Consider that a large audience will see your posts. That may include your present or next boss. Take care in what you write. Remember too, that mailing lists and newsgroups are frequently archived and that your words may be stored for a very long time in a place to which many people have access.
  • 考虑到有大量的观众会看到你的帖子。可能包括你现在或下一位老板。注意你写的东西。还要记住, 邮件列表和新闻组经常存档, 并且您的文字可能会在很长一段时间内存储在许多人可以访问的地方。
  • Messages and articles should be brief and to the point. Don’t wander off-topic, don’t ramble, and don’t send mail or post messages solely to point out other people’s errors in typing or spelling. These, more than any other behavior, mark you as an immature beginner.
  • 邮件和文章应该简明扼要。不要乱逛话题, 不要闲聊, 不要仅仅为了指出别人在打字或拼写上的错误而发送邮件或张贴邮件。这些行为, 比任何其他行为, 标志着你是一个不成熟的初学者。

Note that the first point tells you to read newsgroups and mailing lists for one to two months before you post anything. What if you have a problem now? Well, if you are responsible for supporting a critical system or a large group of users, don’t wait until you need to post a message; start getting familiar with the key mailing lists or newsgroups now.

请注意, 第一点告诉您在发布任何内容之前, 先阅读新闻组和邮件列表1-2个月。如果你现在有问题怎么办?嗯, 如果你负责维护一个关键的系统或一大群用户, 现在就开始熟悉的相关的邮件列表或新闻组,不要等到你需要的时候,才临时抱佛脚。

Besides making people feel more comfortable about how you communicate over the Internet, why should you care so much about netiquette? Well, if you don’t follow the rules of netiquette, people won’t want to answer your requests for help. In other words, if you don’t respect those you are asking for help, they aren’t likely to help you. As mentioned before, many of the people who could help you would be doing so on their own time. Their motivation to help you is governed partially by whether you are someone they want to help. Your message or post is the only way they have to judge who you are.

除了让人们对你在互联网上的交流感到更加舒适之外, 你为什么还要关心礼仪?好吧, 如果你不遵守礼仪的规定, 人们就不会愿意回答你的求助请求。换言之, 如果你不尊重那些你寻求帮助的人, 他们不可能帮助你。如前所述, 许多可以帮助你的人会在自己的时间里这样做。他们帮助你的动机部分取决于你是否是他们想要帮助的人。你的信息或帖子是他们判断你是是不是那种人的唯一方法。

There are many other Web sites that document common netiquette, and it is worthwhile to read some of these, especially when interacting with USENET and mailing lists. A quick search in Google will reveal many sites dedicated to netiquette. Read up!

还有许多其他网站记录了常见的礼仪, 值得一读, 尤其是在与新闻组和邮件列表进行交互时。谷歌的快速搜索将显示许多专门用于礼仪的网站。快点读一下!

1.3.4.2.2. Composing an Effective Message

In this section we discuss how to create an effective message whether for email or for USENET. An effective message, as you can imagine, is about clarity and respect. This does not mean that you must be completely submissive — assertiveness is also important, but it is crucial to respect others and understand where they are coming from. For example, you will not get a very positive response if you post a message such as the following to a mailing list:

在本节中, 我们将讨论如何创建有效的邮件, 无论是电子邮件还是新闻组。一个有效的信息, 你可以想象, 是关于清晰和尊重。这并不意味着你必须完全顺从--自信也是重要的, 但是尊重他们和理解他们是至关重要的。例如, 如果将邮件 (如以下内容) 张贴到邮件列表中, 则不会得到非常积极的响应:

To: linux-kernel-mailing-list
From: Joe Blow
Subject: HELP NEEDED NOW: LINUX SYSTEM DOWN!!!!!!
Message:

MY LINUX SYSTEM IS DOWN!!!! I NEED SOMEONE TO FIX IT NOW!!!! WHY DOES
LINUX ALWAYS CRASH ON ME???!!!!

Joe Blow
Linux System Administrator

First of all, CAPS are considered an indication of yelling in current netiquette. Many people reading this will instantly take offense without even reading the complete message.

首先, 大写被认为是在当前礼仪大喊大叫的征兆。许多阅读这一信息的人, 即使没有阅读完整的讯息, 也会立即感到不舒服。

Second, it’s important to understand that many people in the open source community have their own deadlines and stress (like everyone else). So when asking for help, indicating the severity of a problem is OK, but do not overdo it.

其次, 了解开源社区中的许多人有自己的工作期限和压力 (和其他人一样) 是很重要的。因此, 在寻求帮助时, 指出问题的严重性是可以的, 但不要过分。

Third, bashing the product that you’re asking for help with is a very bad idea. The people who may be able to help you may take offense at such a comment. Sure, you might be stressed, but keep it to yourself.

第三, 批评你要求帮助的产品是一个非常糟糕的想法。那些能帮助你的人可能会对这样的评论感到不爽。当然, 你可能会感到压力, 但要保持克制。

Last, this request for help has no content at all. There is no indication of what the problem is, not even which kernel version is being used. The subject line is also horribly vague. Even respectful messages that do not contain any content are a complete waste of bandwidth. They will always require two more messages (emails or posts): one from someone asking for more detail (assuming that someone cares enough to ask) and one from you to supply that detail.

OK, we’ve seen an example of how not to compose a message. Let’s reword that bad message into something that is far more appropriate:

To: linux-kernel-mailing-list
From: Joe Blow
Subject: Oops in zisofs_cleanup on 2.4.21
Message:

Hello All,
My Linux server has experienced the Oops shown below three times in
the last week while running my database management system. I have
tried to reproduce it, but it does not seem to be triggered by
anything easily executed. Has anyone seen anything like this before?

Unable to handle kernel paging request at virtual address ffffffff7f1bb800
 printing rip:
ffffffff7f1bb800
PML4 103027 PGD 0
Oops: 0010
CPU 0
Pid: 7250, comm: foo Not tainted
RIP: 0010:[zisofs_cleanup+2132522656/-2146435424]
RIP: 0010:[<ffffffff7f1bb800>]
RSP: 0018:0000010059795f10 EFLAGS: 00010206
RAX: 0000000000000000 RBX: 0000010059794000 RCX: 0000000000000000
RDX: ffffffffffffffea RSI: 0000000000000018 RDI: 0000007fbfff8fa8
RBP: 00000000037e00de R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000009
R13: 0000000000000018 R14: 0000000000000018 R15: 0000000000000000
FS: 0000002a957819e0(0000) GS:ffffffff804beac0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffff7f1bb800 CR3: 0000000000101000 CR4: 00000000000006e0
Process foo (pid: 7250, stackpage=10059795000)
Stack: 0000010059795f10 0000000000000018 ffffffff801bc576 0000010059794000
  0000000293716a88 0000007fbfff8da0 0000002a9cf94ff8 0000000000000003
  0000000000000000 0000000000000000 0000007fbfff9d64 0000007fbfff8ed0
Call Trace: [sys_msgsnd+134/976]{sys_msgsnd+134} [system_call+119/124]{system_call+119}
Call Trace: [<ffffffff801bc576>]{sys_msgsnd+134} [<ffffffff801100b3>]{system_call+119}

Thanks in advance,
Joe Blow

The first thing to notice is that the subject is clear, concise, and to the point. The next thing to notice is that the message is polite, but not overly mushy. All necessary information is included: what was running when the Oops occurred, the fact that an attempt was made to reproduce it, and the Oops Report itself. This is a good example because it’s one where further analysis is difficult. That is why the main question in the message is whether anyone has seen anything like it before. This question will encourage the reader, at the very least, to scan the Oops Report. If the reader has seen something similar, there is a good chance that he or she will post a response or send you an email. The keys again are respect, clarity, conciseness, and focused information.
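
The mechanical parts of this advice can be checked before you hit send. The following is only a rough sketch, not an established tool: the function name check_message, the thresholds, and the "looks like a version number" pattern are all assumptions chosen for illustration, and no script can judge tone or respect for you.

```python
import re


def check_message(subject, body):
    """Return a list of warnings for a draft message (heuristics only).

    The 50% upper-case threshold and the version pattern are
    illustrative assumptions, not rules from any netiquette document.
    """
    warnings = []
    text = subject + "\n" + body

    # Mostly upper-case text reads as yelling.
    letters = [c for c in text if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
        warnings.append("mostly upper case: reads as yelling")

    # Runs of three or more '!' or '?' overstate the urgency.
    if re.search(r"[!?]{3,}", text):
        warnings.append("repeated punctuation: overstates urgency")

    # A message with no version-like token (e.g. 2.4.21) probably
    # does not say what software is actually being run.
    if not re.search(r"\d+\.\d+", text):
        warnings.append("no version number found: say what you are running")

    return warnings
```

Run against the two messages above, the first draft would trigger all three warnings, while the reworded one (subject "Oops in zisofs_cleanup on 2.4.21") would trigger none.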

1.3.4.2.3. Giving Back to the Community

The open source community relies on the sharing of knowledge. By searching the Internet for other people’s experiences with the problem you are encountering, you are relying on that sharing. If the problem you experienced was a unique one and required some ingenuity, either on your part or on the part of someone who helped you, it is very important to give back to the community in the form of a follow-up message to the post you made. I have come across many message threads in the past where someone posted a question about exactly the same problem I was having. Thankfully, they responded to their own post, in some cases even prefixed the original subject with “SOLVED:”, and detailed how they solved the problem. If that person had not taken the time to post the second message, I might still be looking for the answer to my question. Also think of it this way: by posting the answer to USENET, you’re also very safely archiving information at no cost to you! You could attempt to save the information locally, but unless you take very good care, you may lose it either by disaster or by simply misplacing it over time.

If someone responded to your plea for help and helped you out, it’s always a very good idea to go out of your way to thank that person. Remember that many Linux gurus provide help on their own time and not as part of their regular jobs.

1.3.4.2.4. USENET

When posting to USENET, common netiquette dictates that you post to a single newsgroup only (or to a very small set of newsgroups) and that you make sure the newsgroup you are posting to is the correct one. If it is not, someone may forward your message if you’re lucky; otherwise, it will just get ignored.

There are thousands of USENET newsgroups, so how do you know which one to post to? Several Web sites host lists of available newsgroups, but the problem is that many of them only list the newsgroups provided by a particular news server. At the time of writing, Google Groups 2 (http://groups-beta.google.com/) is in beta and offers an enhanced interface to the USENET archives in addition to other group-based discussion archives. One key enhancement of Google Groups 2 is the ability to see all newsgroup names that match a query. For example, searching for “gcc” produces about half a million hits, but the matched newsgroup names are listed before all the results. From this listing, you will be able to determine the most appropriate group to post a question to.

Of course, there are other resources beyond USENET you can send a message to. You or your company may have a support contract with a distribution or consulting firm. In this case, the same tips presented in this chapter still apply when sending an email.

1.3.4.2.5. Mailing Lists

As mentioned in the RFC, it is considered proper netiquette not to post a question to a mailing list until you have monitored its emails for a month or two. Active subscribers prefer users to lurk for a while before posting a question. To lurk is to subscribe and read incoming posts from other subscribers without posting anything of your own.

An alternative to posting a message to a newsgroup or mailing list is to open a new bug report in a Bugzilla database, if one exists for the package in question.

1.3.4.2.6. Tips on Opening Bug Reports in Bugzilla

When you open a bug report in Bugzilla, you are asking someone else to look into the problem for you. Any time you transfer a problem to someone else or ask someone to help with a problem, you need to provide clear and concise information about it. This is common sense, and the information collected in Phase #3 will pretty much cover what is needed. In addition to this, there are some Bugzilla-specific pointers, as follows:

  • Be sure to properly characterize the bug in the various drop-down menus of the bug report screen. As an example, see the new bug form for GCC’s Bugzilla, shown in Figure 1.2. It is important to choose the proper version and component because components in Bugzilla have individual owners who get notified immediately when a new bug is opened against their components.

Figure 1.2. Bugzilla

  • Enter a clear and concise summary into the Summary field. This is the first and sometimes only part of a bug report that people will look at, so it is crucial to be clear. For example, entering “Compile aborts” is very bad. Ask yourself the same questions others would ask when reading this summary: “How does it break?” “What error message is displayed?” and “What kind of compile breaks?” A summary of “gcc -c foo.c -O3 for gcc3.4 throws sigsegv” is much more meaningful. (Make it a part of your lurking to get a feel for how bug reports are usually built and model yours accordingly.)
  • In the Description field, be sure to enter a clear report of the bug with as much information as possible. Namely, the following information should be included for all bug reports:
    • Exact version of the software being used
    • Linux distribution being used
    • Kernel version as reported by uname -a
    • How to easily reproduce the problem (if possible)
    • Actual results you see (cut and paste the output if possible)
    • Expected results (detail what you expect to see)
  • Bugzilla databases often include a feature to attach files to a bug report. If this is supported, attach any files that you feel are necessary to help the developers reproduce the problem. See Figure 1.2.

Note: The ability of others to reproduce the problem is crucial. If the bug cannot be easily reproduced, it is unlikely that a developer will investigate it beyond speculating about what the problem may be, based on other known problems.
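
The Description checklist above lends itself to a small helper that assembles the fields in a fixed order so that none are forgotten. This is a minimal sketch under stated assumptions, not part of Bugzilla: the function names and the report layout are invented for illustration, the distribution is read from /etc/os-release (a file that may not exist on every system, in which case "unknown" is used), and the kernel string is built from Python's platform.uname() as an approximation of what uname -a prints.

```python
import platform


def read_distribution(path="/etc/os-release"):
    """Return PRETTY_NAME from an os-release style file, or 'unknown'.

    Assumption: the distribution ships /etc/os-release.
    """
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if line.startswith("PRETTY_NAME="):
                    return line.split("=", 1)[1].strip().strip('"')
    except OSError:
        pass
    return "unknown"


def kernel_string():
    """Approximate `uname -a` using the standard library."""
    u = platform.uname()
    return f"{u.system} {u.node} {u.release} {u.version} {u.machine}"


def build_description(software_version, steps, actual, expected,
                      distribution=None, kernel=None):
    """Assemble the Description text for a bug report (hypothetical layout)."""
    distribution = distribution or read_distribution()
    kernel = kernel or kernel_string()
    lines = [
        f"Software version: {software_version}",
        f"Distribution: {distribution}",
        f"Kernel (uname -a equivalent): {kernel}",
        "",
        "Steps to reproduce:",
    ]
    lines += [f"  {i}. {step}" for i, step in enumerate(steps, 1)]
    lines += ["", "Actual results:", actual, "", "Expected results:", expected]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only; paste your real output.
    print(build_description(
        "gcc 3.4",
        ["gcc -c foo.c -O3"],
        "<paste the exact compiler output here>",
        "<describe what you expected to happen>",
    ))
```

Passing distribution and kernel explicitly is useful when you file the report from a different machine than the one that showed the problem.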

1.3.4.3. Use Your Distribution’s Support

If you or your business has purchased a Linux distribution from one of the distribution companies such as Novell/SuSE, Red Hat, or Mandrake, it is likely that some sort of support offering is in place. Use it! That’s what it is there for. It is still important, though, to do some homework on your own. As mentioned before, it can be faster than simply asking for help at the first sign of trouble, and you are likely to pick up some knowledge along the way. Also, any work you do will help your distribution’s support staff solve your problem faster.

 
