5 Reasons Why You Should Switch from Jupyter Notebook to Scripts

Opinion

Motivation

Like most people, the first tool I used when I started learning data science was Jupyter Notebook. Most online data science courses use Jupyter Notebook as a teaching medium. This makes sense because it is easier for beginners to start writing code in Jupyter Notebook's cells than to write a script with classes and functions.

Another reason Jupyter Notebook is such a common tool in data science is that it makes it easy to explore and plot data. When we press Shift + Enter, we immediately see the results of the code, which makes it easy to tell whether the code works.

However, I noticed several drawbacks of Jupyter Notebook as I worked on more data science projects:

  • Unorganized: As my code grows, it becomes increasingly difficult to keep track of what I wrote. No matter how many Markdown cells I use to separate the notebook into sections, the disconnected cells make it hard to concentrate on what the code does.

  • Difficult to experiment: You may want to try different methods of processing your data, or choose different parameters for your machine learning algorithm, to see if the accuracy increases. But every time you experiment with a new method, you need to rerun the entire notebook. This is time-consuming, especially when the processing or the training takes a long time to run.

  • Not ideal for reproducibility: If you want to use new data with a slightly different structure, it is difficult to identify the source of errors in your notebook.

  • Difficult to debug: When you get an error in your code, it is difficult to know whether the cause is the code or a change in the data. And if the error is in the code, which part of the code is causing the problem?

  • Not ideal for production: Jupyter Notebook does not play very well with other tools. It is not easy to run code from a Jupyter Notebook alongside other tools.

I knew there must be a better way to handle my code, so I decided to give scripts a try. These are the benefits I found when using scripts:

Organized

The cells in Jupyter Notebook make it difficult to organize the code into different parts. With a script, we can create several small functions, each one named for what its code does.
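
As a minimal sketch of this idea, assuming the data is a list of dictionaries, each step can live in its own small, clearly named function. The function names and the sample rows below are hypothetical and are not the author's code:

```python
def drop_columns(rows, columns_to_drop):
    """Return copies of the rows without the unwanted columns."""
    return [
        {key: value for key, value in row.items() if key not in columns_to_drop}
        for row in rows
    ]


def drop_missing(rows, required_columns):
    """Keep only the rows that have a value in every required column."""
    return [
        row for row in rows
        if all(row.get(column) is not None for column in required_columns)
    ]


# Hypothetical sample data
rows = [
    {"sentiment": 0.8, "code": 200, "usage": 3},
    {"sentiment": None, "code": 500, "usage": 1},
]

cleaned = drop_missing(drop_columns(rows, ["code"]), ["sentiment"])
print(cleaned)
```

Because each function does one thing, the last line reads almost like a description of the processing steps.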

Better yet, if these functions belong to the same category, such as functions that process the data, we can put them in the same class!

Whenever we want to process our data, we know the functions in the class Preprocess can be used for this purpose.
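
A minimal sketch of what such a class might look like, assuming rows are plain dictionaries. The constructor arguments mirror the `Preprocess(columns_to_drop, datetime_column, dropna_columns)` call that appears later in this article; the method names and bodies are hypothetical:

```python
class Preprocess:
    """Group the data-processing steps in one place (illustrative only)."""

    def __init__(self, columns_to_drop, datetime_column, dropna_columns):
        self.columns_to_drop = columns_to_drop
        self.datetime_column = datetime_column
        self.dropna_columns = dropna_columns

    def drop_columns(self, rows):
        return [
            {k: v for k, v in row.items() if k not in self.columns_to_drop}
            for row in rows
        ]

    def drop_na(self, rows):
        return [
            row for row in rows
            if all(row.get(c) is not None for c in self.dropna_columns)
        ]

    def truncate_datetime(self, rows):
        # Keep 'YYYY-MM-DDTHH:MM' (the first 16 characters)
        return [
            {**row, self.datetime_column: row[self.datetime_column][:16]}
            for row in rows
        ]

    def process(self, rows):
        rows = self.drop_columns(rows)
        rows = self.drop_na(rows)
        return self.truncate_datetime(rows)


processor = Preprocess(["code"], "crawled", ["sentiment", "crawled"])
result = processor.process([
    {"sentiment": 1, "crawled": "2020-07-30T23:25:31.036+03:00", "code": 200},
    {"sentiment": None, "crawled": "2020-07-30T23:26:00.000+03:00", "code": 200},
])
print(result)
```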

Encourage Experiment

When we want to experiment with a different approach to preprocessing the data, we can add or remove a function just by commenting it in or out, without being afraid of breaking the code! Even if we happen to break the code, we know exactly where to fix it.
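
A minimal sketch of this kind of experiment, using hypothetical text-cleaning steps. Disabling one step is a one-line change, and the other steps are untouched:

```python
def strip_whitespace(texts):
    return [text.strip() for text in texts]


def lowercase(texts):
    return [text.lower() for text in texts]


def remove_empty(texts):
    return [text for text in texts if text]


def preprocess(texts):
    texts = strip_whitespace(texts)
    # texts = lowercase(texts)  # commented out to try the run without it
    texts = remove_empty(texts)
    return texts


print(preprocess(["  Good ", "", " BAD"]))
```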

We can also experiment with different parameters by changing the input of the functions. For example, if we want to see how different methods of resampling a pandas Series affect the results, we can just switch from method_of_resample='sum' to method_of_resample='average'. How neat!
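
A minimal sketch of switching behavior through a parameter. To stay self-contained it uses only the standard library instead of pandas and groups (timestamp, value) pairs by hour. The function name `resample_by_hour` and the sample data are hypothetical; only the parameter name `method_of_resample` comes from the text above:

```python
from collections import defaultdict
from statistics import mean


def resample_by_hour(points, method_of_resample="sum"):
    """Aggregate (timestamp, value) pairs per hour with the chosen method."""
    aggregators = {"sum": sum, "average": mean}
    if method_of_resample not in aggregators:
        raise ValueError(f"Unknown method: {method_of_resample!r}")

    buckets = defaultdict(list)
    for timestamp, value in points:
        buckets[timestamp[:13]].append(value)  # key is 'YYYY-MM-DDTHH'

    aggregate = aggregators[method_of_resample]
    return {hour: aggregate(values) for hour, values in buckets.items()}


# Hypothetical sample data
points = [
    ("2020-07-30T23:05", 2),
    ("2020-07-30T23:40", 4),
    ("2020-07-31T00:10", 6),
]

by_sum = resample_by_hour(points, method_of_resample="sum")
by_average = resample_by_hour(points, method_of_resample="average")
```

With pandas, the same switch would typically be between calls like `series.resample(rule).sum()` and `series.resample(rule).mean()`.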

You can still use functions in a notebook, but when the number of functions gets really big, you might want to split them across different notebooks, and importing functions from one notebook into another is not easy.

Ideal for Reproducibility

With classes and functions, we can make the code general enough to work with other data.

For example, if we want to drop different columns in new data, we just need to change columns_to_drop to the list of columns we want to drop, and the code will run smoothly!

# Read the column settings from the config instead of hard-coding them
columns_to_drop = config.columns.to_drop
datetime_column = config.columns.datetime.sentiment
dropna_columns = config.columns.drop_na

processor = Preprocess(columns_to_drop, datetime_column, dropna_columns)

I can also create a pipeline that specifies the steps to process the data and train on it! Once I have a pipeline, all I need to do is use

pipeline.fit_transform(data)

to apply the same processing to both the train and test data.
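
A minimal sketch of the idea behind a pipeline, written with the standard library only so that it runs on its own. The `Pipeline`, `DropNegative`, and `Scaler` classes here are hypothetical stand-ins, not the author's code or any library's implementation. One convention worth keeping in mind: `fit_transform` is usually called on the training data and plain `transform` on the test data, so that anything learned (here, the maximum value) comes from the training data only.

```python
class DropNegative:
    """Stateless step: remove negative values."""

    def fit(self, values):
        return self

    def transform(self, values):
        return [value for value in values if value >= 0]


class Scaler:
    """Stateful step: scale values by the maximum seen during fit."""

    def fit(self, values):
        self.max_ = max(values)
        return self

    def transform(self, values):
        return [value / self.max_ for value in values]


class Pipeline:
    """Apply a fixed list of steps in order."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return values

    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values


pipeline = Pipeline([DropNegative(), Scaler()])
train = pipeline.fit_transform([-1, 2, 4])  # learns max_ = 4 from the train data
test = pipeline.transform([1, -3, 8])       # reuses max_ = 4 on the test data
```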

Easy to Debug

With functions, it is easier to test whether each function produces the output we expect. We can quickly spot where in the code we should make a change to produce the output we want:

import numpy as np


def extract_date_hour_minute(string: str):
    '''Extract date, hour, and minute from a datetime string'''
    try:
        return string[:16]
    except TypeError:
        # Non-string values such as None cannot be sliced
        return np.nan


def test_extract_date_hour_minute():
    '''Test whether the function extracts the date, hour, and minute'''
    string = '2020-07-30T23:25:31.036+03:00'
    assert extract_date_hour_minute(string) == '2020-07-30T23:25'

If all of the tests pass but there is still an error when running our code, we know the data is where we should look next.

For example, after passing the test above, I still got a TypeError when running the script, which told me that my data had null values. I just needed to take care of that for the code to run smoothly.
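
A minimal sketch of a test for that case, assuming missing values show up as `None`. It repeats the function from above so that the block runs on its own, and uses `math.nan` from the standard library in place of `np.nan` (both are the same float NaN value). The test name is hypothetical:

```python
import math


def extract_date_hour_minute(string):
    """Same logic as above, with math.nan standing in for np.nan."""
    try:
        return string[:16]
    except TypeError:
        return math.nan


def test_extract_date_hour_minute_with_null():
    """A null value should produce NaN instead of raising TypeError."""
    result = extract_date_hour_minute(None)
    assert isinstance(result, float) and math.isnan(result)


test_extract_date_hour_minute_with_null()
```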

Ideal for Production

We can use functions from multiple scripts together in one place, like this

from preprocess import preprocess
from model import run_model
from predict import predict


def main(config):
    df = preprocess(config)
    df = run_model(config)
    df, df_scale, min_day, max_day, accuracy = predict(df, config)

or add a config file to control the values of the variables. This saves us from wasting time tracking down a specific variable in the code just to change its value.

columns:
  to_drop:
    #- keywords
    #- entities
    - code
    - error
    - warnings
  binary_columns: 
    - sentiment 
    - Diff
  datetime:
    time: Date 
    sentiment: crawled
  drop_na: 
    - sentiment
    - usage
    - crawled
    - emotion
  to_predict: sentiment
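
A minimal sketch of how values like these could be read in code with attribute-style access such as `config.columns.to_drop`. In practice a configuration library would load the YAML file; here a plain dictionary stands in for the parsed file, and a hypothetical helper `to_namespace` converts it, so the example needs only the standard library:

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively convert dictionaries into attribute-style objects."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    return value


# Stand-in for the parsed YAML file (a subset of the config above)
raw_config = {
    "columns": {
        "to_drop": ["code", "error", "warnings"],
        "datetime": {"time": "Date", "sentiment": "crawled"},
        "drop_na": ["sentiment", "usage", "crawled", "emotion"],
        "to_predict": "sentiment",
    }
}

config = to_namespace(raw_config)
columns_to_drop = config.columns.to_drop
datetime_column = config.columns.datetime.sentiment
dropna_columns = config.columns.drop_na
```

Changing which columns are dropped then means editing the config file, not the processing code.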

We can also easily add tools to track experiments, such as MLflow, or tools to handle configuration, such as Hydra!

I Didn't Like the Idea of Using Scripts until I Pushed Myself out of My Comfort Zone

I used to use Jupyter Notebook all the time. When some data scientists advised me to switch from Jupyter Notebook to scripts to avoid some of the problems listed above, I didn't understand and resisted. I didn't like the uncertainty of not being able to see the outcome as soon as I ran a cell.

But the disadvantages of Jupyter Notebook grew as I started my first real data science project at my new company, so I decided to push myself out of my comfort zone and experiment with scripts.

In the beginning, I felt uncomfortable, but I started to notice the benefits of using scripts. I felt more organized once my code was split into functions, classes, and multiple scripts, with each script serving a different purpose such as preprocessing, training, or testing.

So Are You Suggesting That I Stop Using Jupyter Notebook?

Don’t get me wrong. I still use Jupyter Notebook if my code is small and I don’t plan to put it into production. I use Jupyter Notebook when I want to explore and visualize the data. I also use it to explain how to use some Python libraries. For example, I mostly use Jupyter Notebooks in this repository as the medium to explain the code mentioned in all of my articles.

If you don’t feel comfortable coding everything in scripts, you can use both scripts and Jupyter Notebook for different purposes. For example, you can create classes and functions in scripts and then import them into the notebook so that the notebook is less messy.

Another alternative is to turn the notebook into a script after writing it. I personally don't like this approach because it often takes me longer to organize the notebook code afterwards, for example by moving it into functions and classes and writing test functions.

I find that writing a small function and then a small test function for it is faster and safer. If I later want to speed up my code with a new Python library, I can use the test function I already wrote to make sure the code still works as I expect.

With that being said, I believe there are more ways to address the disadvantages of Jupyter Notebook than the ones I mentioned here, such as how Netflix puts notebooks into production and schedules them to run at a certain time.

Conclusion

Everybody has their own way of making their workflow more efficient, and for me, it is to take advantage of scripts. If you have just switched from Jupyter Notebook to scripts, writing code in scripts might not feel intuitive, but trust me, you will get used to it eventually.

Once that happens, you will start to see the many benefits of scripts over a messy Jupyter Notebook and will want to write most of your code in scripts.

If you don’t feel comfortable with the big change, start small.

Big changes start with small steps

I like to write about basic data science concepts and play with different algorithms and data science tools. You could connect with me on LinkedIn and Twitter.

Star this repo if you want to check out the code for all of the articles I have written. Follow me on Medium to stay informed about my latest data science articles.

Translated from: https://towardsdatascience.com/5-reasons-why-you-should-switch-from-jupyter-notebook-to-scripts-cb3535ba9c95
