Python Done the R Markdown Way

李_涛

Original article: https://towardsdatascience.com/python-done-the-r-markdown-way-d03bec4b96b

Introduction

Starting to work with a new tool may not be easy, especially when that “tool” means a new programming language. At the same time, there might be an opportunity to build upon something already known and so make the transition smoother and less painful.

In my case, it was the transition from R to Python. Unfortunately, my former colleagues, Pythonistas, did not share my genuine fondness for R. Also, whether R enthusiasts like it or not, Python is a widely used tool in data analysis/engineering/science and beyond. So, I concluded that learning at least some Python was a reasonable thing to do.

For me, the first steps were maybe the most difficult ones. Having settled into the comfort of RStudio, I found that IDEs like PyCharm or Atom did not feel familiar. This experience led me to begin in a well-known environment and test its limits when it comes to working with Python.

To tell the truth, I did not end up using RStudio as the weapon of choice for using Python in a general setting. Hopefully, the following text will explain why. However, I am convinced that for some use-cases, like integrating R and Python in an ad hoc analysis the R Markdown way, RStudio still represents a viable way to go.

More importantly, it could be a convenient starting line for people whose primary background is in R.

So, what did I find?

Analysis

Packages and environment

First of all, let us set the environment and load the required packages.

Global environment:

# Globally limit printed numbers to two significant digits
options(digits = 2)
# Force R to use regular numbers instead of the e+10-like notation
options(scipen = 999)

R libraries:

# Load the required packages.
# If these are not available, install them first.
ipak <- function(pkg) {
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}

packages <- c("tidyverse", # Data wrangling
              "gapminder", # Data source
              "knitr",     # R Markdown styling
              "htmltools") # .html files manipulation

ipak(packages)

tidyverse gapminder     knitr htmltools 
     TRUE      TRUE      TRUE      TRUE

In this analysis, I will be working with the reticulate library, a package developed at RStudio. However, feel free to look for alternatives.

Also, I am going to import the reticulate package separately for the sake of a clearer flow.

Python activation:

# The "weapon of choice"
library(reticulate)
# Specify which version of Python to use.
use_python("/home/vg/anaconda3/bin/python3.7",
           required = TRUE) # Locate and run Python

Python libraries:

import pandas as pd # data wrangling
import numpy as np # arrays, matrices, math
import statistics as st # functions for calculating statistics
import plotly.express as px # plotting package

One of the limits of working with Python in R (Studio) is that in some cases you do not receive the Python traceback by default. That is a problem because, to oversimplify, a traceback helps you identify where the problem is, which in turn helps you fix it. So, think of it as an error message (or the lack of one).

For example, when you call a library that is not installed, the Python chunk in R Markdown gives you green lights (so everything looks up and running), but this does not mean that the code ran the way you would expect (e.g. that it actually imported the library).

To deal with that, I suggest checking in the Terminal that your libraries are installed. If they are not, you can always install them and import them afterwards.
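
One way to avoid relying on a silently passing chunk is to ask the interpreter up front whether it can locate each module. The following is only a minimal sketch using the standard library; the helper name `missing_modules` and the list of module names are illustrative assumptions, not part of the original analysis.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list of modules an analysis might expect to import.
required = ["json", "statistics", "surely_not_installed_pkg_123"]
missing = missing_modules(required)

if missing:
    # Report the problem explicitly instead of letting the chunk pass silently.
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Running such a check in the first Python chunk makes a missing package visible even when no traceback is shown.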

For example, I will first import the json package which is already installed on my machine. I will do so by using Terminal here in RStudio. In addition, let me try to import the TensorFlow package.

From the following picture, you can see that there is no TensorFlow package. So, let me switch back to bash and install the package:

Then go to the directory where the newly installed TensorFlow package is located, switch to Python and import the package once again:

All right. For more information, take a look at the Installing Python Modules page.

Data

Now that we have R and Python set up, let us import some data to play with. I will be working with a sample from Gapminder, an intriguing project related to the socio-demography of the world’s population. To be more specific, I will be working with the latest available data within the gapminder library.

Data import in R:

# Let us begin with the most recent set of data
gapminder_latest <- gapminder %>%
  filter(year == max(year))

So, now we have the data loaded in R. Unfortunately, it is not possible to access R objects (e.g. vectors or tibbles) directly from Python. So, we need to convert the R object(s) first.

Data import in Python:

# Convert R Data Frame (tibble) into the pandas Data Frame
gapminder_latest_py = r['gapminder_latest']

One important thing to realise when working with Python objects (e.g. arrays or pandas DataFrames) is that they are not explicitly stored in the environment the way R objects are.

In other words, if we want to know what is stored in the workspace, we must call functions like dir(), globals() or locals():

['R', '__annotations__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', 'contact_window_days', 'contact_window_days_style', 'fig', 'gapminder_latest_count_py', 'gapminder_latest_max_lifeExp_py', 'gapminder_latest_mean_lifeExp_py', 'gapminder_latest_median_lifeExp_py', 'gapminder_latest_min_lifeExp_py', 'gapminder_latest_py', 'gapminder_latest_shape', 'gapminder_latest_stdev_lifeExp_py', 'lifeExpHist', 'np', 'pd', 'px', 'r', 'st', 'sys', 'variable', 'variable_grouping', 'variable_name']

Great, among the present objects, we can clearly see the data (gapminder_latest_py) or libraries (e.g. px).
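
The raw dir() listing mixes data, imported modules and internal names. As a rough sketch, the listing can be narrowed down to the objects you created yourself; the helper `user_objects` and the two sample variables are hypothetical and only stand in for the objects from the analysis.

```python
import types


def user_objects(namespace):
    """Map public, non-module names in a namespace to their type names."""
    return {
        name: type(value).__name__
        for name, value in namespace.items()
        if not name.startswith("_") and not isinstance(value, types.ModuleType)
    }


# Hypothetical objects standing in for the data created in the analysis.
life_exp_sample = [43.8, 76.4, 72.3]
n_rows = len(life_exp_sample)

# Print each remaining name together with the type of the object it refers to.
for name, kind in sorted(user_objects(globals()).items()):
    print(f"{name}: {kind}")
```

Filtering out names that start with an underscore and module objects leaves mostly data and functions, which is closer to what the RStudio environment pane shows for R.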

So, let us explore the data a bit!

Life Expectancy

For demonstration purposes, I will focus on life expectancy, that is, the average number of years a person is expected to live.

Descriptive statistics

Let’s begin with calculating some descriptive statistics like mean, median or the number of rows in the data using Python:

# Descriptive statistics for the inline code in Python

## Data Frame overview
### Number of rows
gapminder_latest_shape = gapminder_latest_py.shape[0]
### Number of distinct values within the life expectancy variable
gapminder_latest_count_py = gapminder_latest_py['lifeExp'].nunique()

## Life expectancy
### Median
gapminder_latest_median_lifeExp_py = st.median(gapminder_latest_py['lifeExp'])
### Mean
gapminder_latest_mean_lifeExp_py = st.mean(gapminder_latest_py['lifeExp'])
### Minimum
gapminder_latest_min_lifeExp_py = min(gapminder_latest_py['lifeExp'])
### Maximum
gapminder_latest_max_lifeExp_py = max(gapminder_latest_py['lifeExp'])
### Standard deviation
gapminder_latest_stdev_lifeExp_py = st.stdev(gapminder_latest_py['lifeExp'])
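
To see what these calls return without the R session, here is a small self-contained sketch of the same statistics on a plain list. The five values are made up for illustration and are not the Gapminder data; the sketch also shows that st.stdev is the sample standard deviation, while st.pstdev is the population version.

```python
import statistics as st

# Hypothetical life expectancy values in years (not the Gapminder data).
life_exp = [39.6, 52.0, 67.0, 71.9, 82.6]

summary = {
    "count": len(life_exp),
    "min": min(life_exp),
    "max": max(life_exp),
    "mean": st.mean(life_exp),
    "median": st.median(life_exp),
    "stdev": st.stdev(life_exp),    # sample standard deviation (divides by n - 1)
    "pstdev": st.pstdev(life_exp),  # population standard deviation (divides by n)
}

# Print every statistic rounded to two decimals.
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```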

Nice. Unfortunately, we are not able to use the Python objects for inline code, one of the key features of literate programming in R Markdown. So, if we want to use the results in inline code, we need to transform the Python objects back to R:

# Descriptive statistics for the inline code in Python - transformed to R

## Data Frame overview
### Number of rows
gapminder_latest_nrow_r = py$gapminder_latest_shape
### Number of distinct values within the life expectancy variable
gapminder_latest_count_r = py$gapminder_latest_count_py

## Life expectancy
### Median
gapminder_latest_median_lifeExp_r = py$gapminder_latest_median_lifeExp_py
### Mean
gapminder_latest_mean_lifeExp_r = py$gapminder_latest_mean_lifeExp_py
### Minimum
gapminder_latest_min_lifeExp_r = py$gapminder_latest_min_lifeExp_py
### Maximum
gapminder_latest_max_lifeExp_r = py$gapminder_latest_max_lifeExp_py
### Standard deviation
gapminder_latest_stdev_lifeExp_r = py$gapminder_latest_stdev_lifeExp_py

So, what can we say about life expectancy in 2007?

First of all, there were 142 countries on the list. The minimum value of life expectancy was 39.61 years, the maximum 82.6 years.

The mean life expectancy was 67.01 years, and the median was 71.94 years, meaning that half of the countries had a life expectancy of 71.94 years or more. Lastly, the standard deviation was 12.07 years.

Graphs (using Plotly)

Okay, let’s move to something else, like graphs.

For example, we can take a look at how life expectancy is distributed across the globe using Plotly:

fig = px.histogram(gapminder_latest_py, # package.function; Data Frame
                   x="lifeExp", # Variable on the X axis
                   range_x=(gapminder_latest_min_lifeExp_py,
                            gapminder_latest_max_lifeExp_py), # Minimum and maximum values for the X axis
                   labels={'lifeExp':'Life expectancy - in years'}, # Naming of the interactive part
                   color_discrete_sequence=['#005C4E']) # Colour of fill

lifeExpHist = fig.update_layout(
  title="Figure 1. Life Expectancy in 2007 Across the Globe - in Years", # The name of the graph
  xaxis_title="Years", # X-axis title
  yaxis_title="Count", # Y-axis title
  font=dict( # "css"
    family="Roboto",
    size=12,
    color="#252A31"
  ))

lifeExpHist.write_html("lifeExpHist.html") # Save the graph as a .html object

Unfortunately, it is not possible to print interactive Plotly graphs in R Markdown via Python. Or, to be more precise, printing the object (e.g. print(lifeExpHist)) only returns a Figure object:

Figure({
    'data': [{'alignmentgroup': 'True',
              'bingroup': 'x',
              'hovertemplate': 'Life expectancy - in years=%{x}<br>count=%{y}<extra></extra>',
              'legendgroup': '',
              'marker': {'color': '#005C4E'},
              'name': '',
              'offsetgroup': '',
              'orientation': 'v',
              'showlegend': False,
              'type': 'histogram',
              'x': array([43, 76, 72, 42, 75, 81, 79, 75, 64, 79, 56, 65, 74, 50, 72, 73, 52, 49,
                          59, 50, 80, 44, 50, 78, 72, 72, 65, 46, 55, 78, 48, 75, 78, 76, 78, 54,
                          72, 74, 71, 71, 51, 58, 52, 79, 80, 56, 59, 79, 60, 79, 70, 56, 46, 60,
                          70, 82, 73, 81, 64, 70, 70, 59, 78, 80, 80, 72, 82, 72, 54, 67, 78, 77,
                          71, 42, 45, 73, 59, 48, 74, 54, 64, 72, 76, 66, 74, 71, 42, 62, 52, 63,
                          79, 80, 72, 56, 46, 80, 75, 65, 75, 71, 71, 71, 75, 78, 78, 76, 72, 46,
                          65, 72, 63, 74, 42, 79, 74, 77, 48, 49, 80, 72, 58, 39, 80, 81, 74, 78,
                          52, 70, 58, 69, 73, 71, 51, 79, 78, 76, 73, 74, 73, 62, 42, 43]),
              'xaxis': 'x',
              'yaxis': 'y'}],
    'layout': {'barmode': 'relative',
               'font': {'color': '#252A31', 'family': 'Roboto', 'size': 12},
               'legend': {'tracegroupgap': 0},
               'margin': {'t': 60},
               'template': '...',
               'title': {'text': 'Figure 1. Life Expectancy in 2007 Across the Globe - in Years'},
               'xaxis': {'anchor': 'y', 'domain': [0.0, 1.0], 'range': [39.613, 82.603], 'title': {'text': 'Years'}},
               'yaxis': {'anchor': 'x', 'domain': [0.0, 1.0], 'title': {'text': 'Count'}}}
})

So, we import the previously created .html file instead (e.g. using the includeHTML function from the htmltools package):

htmltools::includeHTML("lifeExpHist.html") # Render the graph

So, here comes a basic, yet interactive histogram made in the Python version of Plotly.

However, producing this graph in RStudio required quite a workaround. At the same time, the size of graphs produced this way could easily reach tens of MB.

Consequently, a .html report containing such graphs would require the reader to download a lot of data, and the page would take longer to render.
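
One contributor to the file size is that write_html embeds the whole plotly.js library in every exported file by default. The sketch below compares that default with include_plotlyjs="cdn", which links to the library instead. It is an illustration under assumptions: the helper `html_sizes` and the sample values are hypothetical, the comparison only runs if plotly is installed, and a CDN-based file needs an internet connection when it is viewed.

```python
import importlib.util
import os
import tempfile


def html_sizes(values):
    """Write one histogram twice and return both file sizes in bytes."""
    import plotly.graph_objects as go  # imported lazily; needs plotly installed

    fig = go.Figure(go.Histogram(x=values))
    sizes = {}
    with tempfile.TemporaryDirectory() as tmp:
        for label, mode in [("embedded", True), ("cdn", "cdn")]:
            path = os.path.join(tmp, f"hist_{label}.html")
            # include_plotlyjs=True embeds plotly.js in the file (the default);
            # "cdn" replaces it with a <script> tag that loads the library online.
            fig.write_html(path, include_plotlyjs=mode)
            sizes[label] = os.path.getsize(path)
    return sizes


# Hypothetical values; the real analysis would pass the lifeExp column.
sample = [43, 76, 72, 42, 75, 81, 79, 75, 64, 79]

# Only run the comparison when plotly can actually be imported.
sizes = html_sizes(sample) if importlib.util.find_spec("plotly") else None
print(sizes)
```

Whether the CDN variant suits a given report depends on where it will be read, since offline readers would see an empty graph.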

Summary tables (using pandas)

One of the common use-cases for pandas is to provide a data description. The respective code runs just fine. However, the output cannot be styled the way you may be used to in pandas.

Speaking of pandas, we can easily create a summary table of commonly used statistics, like the mean or standard deviation of life expectancy for individual continents:

# Create a pandas Data Frame object containing the relevant variable,
# conduct formatting.
gapminder_latest_py['continent'] = gapminder_latest_py['continent'].astype(str)
variable = 'continent'
variable_name = 'Continents'
gapminder_latest_py['lifeExp'] = gapminder_latest_py['lifeExp'].astype(int)
variable_grouping = 'lifeExp'
# Group by continent and aggregate life expectancy
contact_window_days = gapminder_latest_py.groupby([
                        pd.Grouper(key=variable)])\
                        [variable_grouping]\
                        .agg(['count',
                              'min',
                              'mean',
                              'median',
                              'std',
                              'max'])\
                        .reset_index()
# Rename the columns for presentation
contact_window_days_style = contact_window_days\
                        .rename({'count': 'Count',
                                 'median': 'Median',
                                 'std': 'Standard Deviation',
                                 'min': 'Minimum',
                                 'max': 'Maximum',
                                 'mean': 'Mean',}, axis='columns')

Output:

  continent  Count  Minimum       Mean  Median  Standard Deviation  Maximum
0    Africa     52       39  54.326923    52.0            9.644100       76
1  Americas     25       60  73.040000    72.0            4.495183       80
2      Asia     33       43  70.151515    72.0            7.984834       82
3    Europe     30       71  77.100000    78.0            2.916658       81
4   Oceania      2       80  80.500000    80.5            0.707107       81

Also, note that the R environment settings do not carry over to the Python output. For example, take a look at the number of digits for Mean or Standard Deviation. There are six decimal places instead of the two digits set at the beginning.
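
If two decimals are wanted in the Python output as well, the rounding has to be requested on the pandas side. The following is a sketch on a tiny made-up data frame (the rows and the shortened column names are assumptions, not the Gapminder sample); it uses named aggregation and DataFrame.round, with the display.precision option as an alternative that only affects printing.

```python
import pandas as pd

# Hypothetical rows standing in for the Gapminder sample.
df = pd.DataFrame({
    "continent": ["Africa", "Africa", "Africa", "Europe", "Europe", "Europe"],
    "lifeExp": [52, 43, 60, 78, 81, 79],
})

summary = (
    df.groupby("continent")["lifeExp"]
      .agg(Count="count", Mean="mean", Std="std")  # named aggregation
      .round(2)                                    # round the stored values
      .reset_index()
)

# Alternatively, keep full precision and change only how pandas prints floats.
pd.set_option("display.precision", 2)

print(summary)
```

Rounding the values changes the data that is passed on, whereas the display option leaves the numbers intact, so the choice depends on whether the table is an end product or an intermediate step.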

Closing remarks

Okay, that’s enough for now. If you are hungry for more advanced things like unsupervised learning using the scikit-learn package in R, take a look.

However, before closing this post, let me just say that if you are thinking about switching to Python altogether and using it often, consider IDE alternatives to RStudio.

Many analysts swear by Jupyter notebooks for their interactivity, integration of markdown and the option to run code in various languages like R, Julia or JavaScript. JupyterHub is a platform based on Jupyter notebooks, adding version control (usually, users run analyses in a containerised environment). Another take on interactivity and collaboration could be Colab, which is basically Jupyter notebooks running on Google Cloud.

Last but not least, there is a great piece of software called Visual Studio Code. It not only allows you to create and run code in a plethora of languages and to move seamlessly between pure Python code and interactive Jupyter notebooks but, maybe even more importantly, it also provides you with very efficient version control management (like Git integration and extensions). If you choose this IDE, you can set up VS Code for Python development to work much like RStudio.

But no matter what path to Python you choose, don’t forget that it is a tool suitable for some situations and maybe not so suitable for others, just like R. Try to make the best of it while being aware of the strengths of different tools.