I’ve Tried R for the First Time: How Bad Was It?

Original article: https://towardsdatascience.com/ive-tried-r-for-the-first-time-how-bad-was-it-ba344f22e90b

It’s not a secret that I’m a heavy Python user. Just take a look at my profile and you’ll find over 100 articles on Python itself, or Python in data science. Lately, I’ve been trying out a lot of new languages and technologies, with R being the one I resisted the most. Below you’ll find my discoveries, comparisons with Python, and overall opinion on the language itself.

The biggest reason I’ve ignored R for so long is the lack of information on the language. Everyone I know who used it presented it strictly as a statistical language. Statistics is essential for data science, but what’s the point of building a model if you can’t present it — through a dashboard — and deploy it — as a REST API?

Those were my thoughts until recently, but since then I discovered Shiny and Plumber, which basically solve the issues I had with R in the first place. That being said, there’s no point in avoiding the language anymore, and this article is the first of many in the R series.

Today we’ll compare R and Python in the process of exploratory data analysis and data visualization, both through code and through final outputs. I’m heavily Python-biased, but my conclusions may still surprise you. Keep reading to find out.

Anyhow, let’s start with the comparisons, shall we?

Exploratory data analysis

EDA is where data scientists spend the majority of their time, so an easy-to-write and easy-to-understand language is a must. I’m using external libraries in both languages — Pandas in Python and Tidyverse in R.

Dataset loading

We’ll use the MPG dataset for this article. It’s built into R, but Python has no built-in equivalent. To work around that, I’ve exported the dataset from R as a CSV so both languages start from the same file.

Here’s how to read a CSV file with R:

mpg <- read.csv('mpg.csv')
head(mpg)

The head function is used in R to see the first 6 rows, and the end result looks like this:

Let’s do the same with Python:

import pandas as pd
mpg = pd.read_csv('mpg.csv')
mpg.head()

Great! Looks like we have an extra column — X in R and Unnamed: 0 in Python, so let’s remove those next.
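
The extra column is most likely the row index that R’s write.csv writes by default: R’s read.csv names it X, and pandas names it Unnamed: 0. As a minimal sketch on the Python side, assuming the index sits in the first CSV column, the index_col parameter of read_csv can absorb it at read time. The inline rows below are a tiny illustrative stand-in for mpg.csv, not the real file.

```python
import io
import pandas as pd

# Illustrative stand-in for mpg.csv (assumption: the real file starts with
# the same unnamed index column that R's write.csv produces).
csv_text = '''"","displ","year","cyl","hwy","class"
"1",1.8,1999,4,29,"compact"
"2",2.8,1999,6,26,"compact"
"3",5.3,2008,8,20,"suv"
'''

# Default read: the unnamed first column shows up as "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the row index instead of a data column.
mpg = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_extra.columns.tolist())
print(mpg.columns.tolist())
```

Dropping the column afterwards, as done in the next section, works just as well; this only saves a step.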

Removing attributes

Here’s how to remove the unwanted X column in R:

library(dplyr)  # provides %>% and select()
mpg <- mpg %>%
  select(-X)

And here’s the Python variant:

mpg.drop('Unnamed: 0', axis=1, inplace=True)

Specifying column names like variables (without quotation marks) is not something I’m most comfortable with, but it is what it is.
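
On the Python side, the drop can also be written without inplace=True. The sketch below uses a small made-up frame (the values are illustrative only) and the columns keyword of DataFrame.drop, which returns a new DataFrame and leaves the original as it was.

```python
import pandas as pd

# Illustrative frame with the same stray column name as in the article.
mpg = pd.DataFrame({
    'Unnamed: 0': [1, 2, 3],
    'displ': [1.8, 2.8, 5.3],
    'cyl': [4, 6, 8],
})

# drop returns a new DataFrame; mpg itself is unchanged unless reassigned.
cleaned = mpg.drop(columns=['Unnamed: 0'])

print(cleaned.columns.tolist())
```

Returning a new object is what makes this style easy to chain with further steps.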

Filtering data

Let’s continue with something a bit more interesting: data filtering, or subsetting. We’ll see how to select only those records where the number of cylinders (cyl) is 6.

With R:

head(mpg %>%
  filter(cyl == 6))

Keep in mind that the head function is only here so we don’t get a ton of output in the console. It is not part of the data filtering process.

And the same with Python:

mpg[mpg['cyl'] == 6].head()
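
For readers who like the unquoted column names in R’s filter, pandas offers something close in DataFrame.query, which takes the condition as a string. A small sketch on made-up rows (illustrative values, not the real dataset) showing that both forms select the same records:

```python
import pandas as pd

# Illustrative rows only; the real mpg data has more columns and records.
mpg = pd.DataFrame({
    'displ': [1.8, 2.8, 3.1, 5.3],
    'cyl': [4, 6, 6, 8],
})

# Boolean-mask filtering, as used in the article.
six_cyl_mask = mpg[mpg['cyl'] == 6]

# The same filter expressed as a query string.
six_cyl_query = mpg.query('cyl == 6')

print(six_cyl_query)
```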

Awesome! Let’s see what more we can do.

Creating derived columns

We’ll create a boolean attribute is_newer, which is True if the car was made in 2005 or after, and False otherwise.

Here’s the R syntax:

head(mpg %>%
  mutate(is_newer = year >= 2005))

And here’s the same thing with Python:

mpg['is_newer'] = mpg['year'] >= 2005

And that’s it for the EDA. Let’s make a brief conclusion on it next.

EDA final thoughts

It’s hard to pick a winner here since both languages are great. I repeat, it’s very strange for me not to put quotation marks around the column names, but that’s just something I’ll have to get used to.

Furthermore, I absolutely love how easy it is to chain things in R. Here’s an example:

mpg <-
  read.csv('mpg.csv') %>%
  select(-X) %>% 
  filter(cyl == 6) %>%
  mutate(is_newer = year >= 2005) %>%
  select(displ, year, cyl, is_newer)

Here we basically did everything from above and more, all in a single command. Let’s proceed with the data visualization part.
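
For comparison, pandas can get reasonably close to the same pipeline through method chaining. This is only a sketch of one possible equivalent, not a claim that it reads as nicely as the pipe: it reuses a small inline stand-in for mpg.csv (illustrative rows, with the stray index column assumed to be present) and strings together drop, query, assign, and loc.

```python
import io
import pandas as pd

# Illustrative stand-in for mpg.csv, including the unnamed index column.
csv_text = '''"","displ","year","cyl","hwy","class"
"1",1.8,1999,4,29,"compact"
"2",2.8,1999,6,26,"compact"
"3",3.1,2008,6,27,"compact"
"4",5.3,2008,8,20,"suv"
'''

result = (
    pd.read_csv(io.StringIO(csv_text))
    .drop(columns=['Unnamed: 0'])                      # select(-X)
    .query('cyl == 6')                                 # filter(cyl == 6)
    .assign(is_newer=lambda df: df['year'] >= 2005)    # mutate(...)
    .loc[:, ['displ', 'year', 'cyl', 'is_newer']]      # select(displ, ...)
)

print(result)
```

Wrapping the chain in parentheses lets each step sit on its own line, which is the closest visual match to the R version.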

Data visualization

When it comes to data visualization, one thing is certain — Python doesn’t stand a chance! Well, at least if we’re talking about the default options for both languages. The following libraries were used for this comparison:

ggplot2 — for R
matplotlib — for Python

Let’s start with a simple scatter plot of engine displacement on the X-axis and highway MPG on the Y-axis.

Here’s the R syntax and results:

ggplot(data = mpg, aes(x = displ, y = hwy)) + 
  geom_point()

And for Python:

import matplotlib.pyplot as plt
plt.scatter(x=mpg['displ'], y=mpg['hwy'])

Neither of these looks particularly good out of the box, but R is miles ahead in this department, at least with the default styles.

Let’s now add some colors. The points should be colored according to the class attribute, so we can easily know where each type of car is located.

Here’s the syntax and result for R:

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) + 
  geom_point()

It doesn’t get much easier than that, and the Python example below is a clear indicator. I haven’t managed to find an easy way to map out a categorical variable as colors (at least with Matplotlib), so here’s what I ended up with:

def get_color(car_class):
    # Map each car class to a fixed color name
    colors = {
        'compact'   : 'brown',
        'midsize'   : 'green',
        'suv'       : 'pink',
        '2seater'   : 'red',
        'minivan'   : 'teal',
        'pickup'    : 'blue',
        'subcompact': 'purple'
    }
    return colors[car_class]

colors = mpg['class'].apply(get_color)
plt.scatter(x=mpg['displ'], y=mpg['hwy'], c=colors)

All of that work for a not-so-appealing chart. Point for R.
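
A somewhat shorter route in plain Matplotlib, offered as one possible alternative rather than a fix for the verbosity: group the frame by class and issue one scatter call per group. Each call picks the next color from Matplotlib’s default color cycle, and the label argument feeds a legend. The rows below are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative rows only; column names follow the mpg dataset.
mpg = pd.DataFrame({
    'displ': [1.8, 2.8, 5.3, 4.7, 3.3],
    'hwy': [29, 26, 20, 17, 24],
    'class': ['compact', 'compact', 'suv', 'pickup', 'minivan'],
})

fig, ax = plt.subplots()

# One scatter call per class: colors come from the default cycle,
# so no hand-written class-to-color dictionary is needed.
for car_class, group in mpg.groupby('class'):
    ax.scatter(group['displ'], group['hwy'], label=car_class)

ax.legend(title='class')
# In an interactive session, plt.show() would display the figure.
```

It is still more code than the single color = class mapping in ggplot2, so the point for R stands.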

Let’s now finalize the chart by adding a title and labels for axes. Here’s how to do it in R:

ggplot(data = mpg, aes(x = displ, y = hwy, color = class)) + 
  geom_point(size = 3) + 
  labs(title = 'Engine displacement vs. Highway MPG',
       x = 'Engine displacement (liters)',
       y = 'Highway miles per gallon')

Again, fairly straightforward syntax, and the chart looks amazing (well, kind of).

Here’s how to do the same with Python:

plt.scatter(x=mpg['displ'], y=mpg['hwy'], c=colors, s=75)
plt.title('Engine displacement vs. Highway MPG')
plt.xlabel('Engine displacement (liters)')
plt.ylabel('Highway miles per gallon')
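
The three labeling calls can also be collapsed into one with Matplotlib’s object-oriented interface, which sits a little closer to the single labs() call in R. A minimal sketch on made-up points, using Axes.set to assign the title and both axis labels at once:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative points only.
mpg = pd.DataFrame({
    'displ': [1.8, 2.8, 5.3],
    'hwy': [29, 26, 20],
})

fig, ax = plt.subplots()
ax.scatter(mpg['displ'], mpg['hwy'], s=75)

# Axes.set accepts several properties in one call.
ax.set(
    title='Engine displacement vs. Highway MPG',
    xlabel='Engine displacement (liters)',
    ylabel='Highway miles per gallon',
)
```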

It’s up to you to decide which looks better, but R is a clear winner in my opinion. Visualizations can be tweaked, of course, but I deliberately wanted to use the default libraries for both languages. I know that Seaborn looks better, so there’s no point in telling me that in the comment section.

And that about does it for this article. Let’s wrap things up in the next section.

Final thoughts

This was a rather quick comparison of R and Python in the realm of data science. Choosing one over the other isn’t a simple task, as both are great. Of the two, Python is considered to be a general-purpose language, so it’s the only viable option if you want to build software that uses data science rather than work directly in data science.

You can’t go wrong with either, especially now that I know R supports dashboards, web scraping, and API development. More articles like this one are to come, guaranteed.

Thanks for reading.
