Becoming a Machine Learning Engineer: One Tool Every Machine Learning Engineer Should Master

Programming

Throughout my time working with data, I have found one tool that saves time and makes you more productive, no matter which programming language you use.

The shell.

When you run a program from the terminal, you are actually using a shell to run it. Every command you type in the terminal is interpreted and executed by the shell.

Unfortunately, most of us only learn a small amount of shell, mainly cd and ls to navigate through directories.

Beyond that, we may learn tool-specific commands such as git and docker, and language-specific commands to compile and run programs. However, we tend to treat memorizing these commands as part of learning a tool or a programming language rather than as learning the shell itself.

I have compiled some useful "shell hacks" that I often use in my work and that might help you in yours too.

Chaining Programs

If you work with data, chances are you need to load it from a database or from a file. I personally often use files containing data that has been extracted and filtered from Hadoop.

Learning how to chain programs through the shell will simplify your life and your code.

Suppose you want to read a file, process its content, and output the result. Here is what your program would probably look like in Python.
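A minimal sketch of such a program follows. It assumes hard-coded file names (borrowed from the commands used later in this post) and a hypothetical process_line function that stands in for whatever processing you actually need.

```python
# process_files.py: read one file, process each line, write the result.
# The paths are hard-coded; process_line is a hypothetical placeholder.
from pathlib import Path

INPUT_PATH = Path("data/input01.txt")
OUTPUT_PATH = Path("out/output01.txt")


def process_line(line):
    """Placeholder processing step: trim whitespace and upper-case the line."""
    return line.strip().upper()


def process_file(input_path, output_path):
    """Apply process_line to every line of input_path and write to output_path."""
    output_path = Path(output_path)
    # Create the output directory if it does not exist yet.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(input_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            fout.write(process_line(line) + "\n")


if __name__ == "__main__":
    if INPUT_PATH.exists():
        process_file(INPUT_PATH, OUTPUT_PATH)
    else:
        print(f"input file not found: {INPUT_PATH}")
```

Note that both the reading and the writing logic live inside the script, so changing either file means editing the code.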

What if you want the program to process multiple files?

Surely you don't want to edit the code for every file name, so you may come up with something like this.
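One way this could look is sketched below. The -i and -o flags mirror the command shown in this post; the long option names, the fallback default paths, and the process_line placeholder are assumptions made for illustration.

```python
# process_files.py (second version): file names come from the command line.
import argparse
from pathlib import Path


def process_line(line):
    """Hypothetical placeholder step: trim whitespace and upper-case the line."""
    return line.strip().upper()


def process_file(input_path, output_path):
    """Apply process_line to every line of input_path and write to output_path."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(input_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            fout.write(process_line(line) + "\n")


def parse_args(argv=None):
    """Parse -i/--input and -o/--output; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Process a text file line by line.")
    parser.add_argument("-i", "--input", default="data/input01.txt",
                        help="file to read")
    parser.add_argument("-o", "--output", default="out/output01.txt",
                        help="file to write")
    return parser.parse_args(argv)


def main(argv=None):
    """Return 0 on success, 1 if the input file is missing."""
    args = parse_args(argv)
    if not Path(args.input).exists():
        print(f"input file not found: {args.input}")
        return 1
    process_file(args.input, args.output)
    return 0


if __name__ == "__main__":
    main()
```

With this version, each additional file only needs different -i and -o values on the command line, at the cost of a fair amount of argument-handling code.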

Since you are using argparse, you also need to change the command that runs the script to something like this.

python3 process_files.py -i "data/input01.txt" -o "out/output01.txt"

Let's be honest: not everyone can write argparse code from memory without searching Google. It took me a couple of minutes to put the code together, since I had not used argparse in a while.

Here is what the program could look like if you chain programs in the shell instead.
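A sketch of the chained version, under the same assumption that process_line is a hypothetical placeholder for your own processing: the script no longer opens any file itself. It reads lines from standard input and prints results to standard output.

```python
# process_files_v2.py: read from standard input, write to standard output.
import sys


def process_line(line):
    """Hypothetical placeholder step: trim whitespace and upper-case the line."""
    return line.strip().upper()


def main(stream_in=sys.stdin, stream_out=sys.stdout):
    """Process every line of stream_in and print the result to stream_out."""
    for line in stream_in:
        print(process_line(line), file=stream_out)


if __name__ == "__main__":
    main()
```

All file handling and argument parsing is gone; choosing the input and output files becomes the shell's job.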

And here is the command that runs the program with chaining.

cat data/input01.txt | python3 process_files_v2.py > out/output01.txt

The | character (a pipe) takes the stream of data that cat reads from the file "data/input01.txt" and passes it to the next command.

This allows the data to be read through sys.stdin inside the Python script.

The > character redirects the output stream of the "process_files_v2.py" script into the file "out/output01.txt".

This allows us to simply use print inside the Python script and let the shell direct the stream into the external file.

Creating Aliases

Setting aliases for commonly used commands can also save you a lot of time. Here is an example.

alias ls='ls -a'

Put the alias in your shell configuration file (for bash, typically ~/.bashrc on Linux or ~/.bash_profile on macOS). The next time you open a terminal and type ls, it will act like ls -a instead.

If you want the plain ls rather than ls -a after setting the alias, type \ls so that the shell skips the alias and runs the literal ls.

You can also define an alias under a completely new name, such as the following.

alias gl='git pull origin master'

Or one for connecting over ssh to a work machine whose address you always forget.

alias work='ssh 128.0.0.1'

The possibilities are endless.

Creating Functions

You can take it a step further by creating your own functions instead of just using aliases. Here is another example.

mkcd() {
    # Create the directory (and any missing parents), then enter it.
    # Quoting "$1" keeps directory names that contain spaces intact.
    mkdir -p "$1" && cd "$1"
}

The variable $1 refers to the first positional argument that follows the command. If you type mkcd awesome_project, it will create a new directory called awesome_project and then cd into it.

Miscellaneous

This is a list of commands that may or may not come in handy. I will not go into the details of every command, but I will give a rough description of what each one can do.

>>: similar to >, it directs the output stream of the previous command into a file. However, >> appends to the end of the existing file instead of overwriting it from the start.
grep: really helpful if you are dealing with large text files (over 1GB) and need to search for some words inside them. Opening such files in a text editor, or even vim, takes quite a while.
head and tail: come in handy when you want to feed just a little data into your model pipeline to check that it works. Instead of modifying the code with data = data[:10], you can feed only the head of the data with cat data.txt | head -n 10 | python3 train.py
sort: usually much faster than loading and sorting the file in Python, or in almost any other language. Useful if you are dealing with large *sv files that need sorting. Be careful, as the -k flag in particular is a bit tricky.

For example,

sort -t, -k2,2n -k1,1nr -k3,3 < data.csv

-t, sets the field separator (a comma in this case; by default, fields are separated by the transition from non-blank to blank characters)
-k2,2n makes the second field the first sorting key, compared numerically (n)
-k1,1nr makes the first field the second sorting key, compared numerically and in reverse (nr), which means descending
-k3,3 makes the third field the third sorting key, compared as a string

Make sure to write the key as -k1,1 so that it covers only the first field. If you leave it as -k1, the key starts at field one and extends to the end of the line instead of stopping after the first field.

Another note: do not pipe data into sort, as in cat data.csv | sort ..., because the performance can be much worse. When reading from a pipe, sort cannot see how large the input is, so it starts out assuming the data is small and has to adjust itself multiple times later on. Redirecting the file with < avoids this.
split: splits a file into several smaller ones, which comes in handy when your model cannot load all the training data into memory at once. You can split by either number of lines or byte size.
top or htop: usually used to check the load of a server (CPU, RAM, running processes). Installing htop is highly recommended for a more easily readable view.

You can also use w as a more lightweight way to see who is logged in and what each user is currently running.
watch: you have probably used nvidia-smi to see whether your model is training on the GPU, and noticed that it prints the status once and then returns to the prompt.

Try watch -n 5 nvidia-smi, and it will refresh the status every 5 seconds.
&: want something to run in the background? Just add & at the end of the command. The shell prints the PID of the background process, in case you need to kill it later.
ps: shows a list of processes. The most common usage is ps aux (a: processes of all users, u: display the process owner, x: include processes not attached to a terminal). Also useful if you want to know the PID of your programs.
kill: stops a program. The default is kill -15 PID, which sends a termination signal (SIGTERM) to the specified process and lets the program respond to it, for example by cleaning up before exiting.

Use kill -9 PID for maximum effectiveness: it sends SIGKILL, which the program cannot catch or ignore, so it is stopped without a chance to respond.
man: short for manual, usually more complete than typing --help after a command. For example, type man split to read the manual for the split command.
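To make the key order of the sort example concrete, the same ordering can be written out in Python. This is only a sketch for checking your understanding on a small sample, not a replacement for sort on large files. The sample rows and the sort_rows name are made up, and it assumes the first two fields are always numeric.

```python
import csv
import io


def sort_rows(rows):
    """Order rows like: sort -t, -k2,2n -k1,1nr -k3,3"""
    return sorted(
        rows,
        key=lambda r: (
            float(r[1]),   # -k2,2n : second field, numeric, ascending
            -float(r[0]),  # -k1,1nr: first field, numeric, descending
            r[2],          # -k3,3  : third field, compared as a string
        ),
    )


# Made-up sample data standing in for data.csv.
sample = io.StringIO("1,20,b\n3,10,z\n2,10,a\n3,10,a\n")
rows = list(csv.reader(sample))
for row in sort_rows(rows):
    print(",".join(row))
# prints:
# 3,10,a
# 3,10,z
# 2,10,a
# 1,20,b
```

One difference to keep in mind: sort compares strings according to the current locale, while Python compares them by code point, so the third key can order differently unless sort is run with LC_ALL=C.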

Don't worry if you can't remember all of them. The key to being an engineer is knowing that something exists, so you can Google it later when you forget.