Essential R packages for data science projects

Leverage the brilliant work of the R community in your projects!

More than 16,000 packages on the Comprehensive R Archive Network (CRAN) cover many of the methods commonly used in data science projects. Time flies, and coding even basic functionality yourself can take days. Fortunately, we can leverage these packages and focus on what is essential for a project to succeed!

Quick reminder: install and use packages

The most common way is to install a package directly from CRAN using the following R command:

# this command installs tidyr package from CRAN
install.packages("tidyr")

Once the package is installed on your local machine, you don't need to run this command again unless you want to update the package to its latest version! To check the version of a package you have installed, you can use:

# returns tidyr package version
packageVersion("tidyr")

The RStudio IDE also provides a convenient way to check whether updates are available for installed packages: Tools > Check for Package Updates…

Figure: Update all your packages in a few clicks using RStudio

Last but not least: how to use a package once it is installed :) You can either prefix a function with the name of its package:

stringr::str_replace("Hello world!", "Hello", "Hi")

Or run the following command to load all the package’s functions at once:

# load a package: it will throw an error if package is not installed
library(stringr)
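
Since library() stops with an error when a package is missing, it can help to check first. Here is a minimal sketch of such a check, assuming base R's requireNamespace(); the helper names is_installed and ensure_installed are made up for this example and are not part of any package:

```r
# Hypothetical helper: TRUE if the package can be loaded, without attaching it.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# Hypothetical helper: install a package from CRAN only when it is missing.
# Calling it needs an internet connection, so it is not run here.
ensure_installed <- function(pkg) {
  if (!is_installed(pkg)) {
    install.packages(pkg)
  }
}

# "stats" ships with every R installation, so this check succeeds.
is_installed("stats")
```

Calling ensure_installed("tidyr") before library(tidyr) would then avoid the error on a fresh machine.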

Now you’re ready to go!

If you want to learn just about everything about R package development, I highly recommend Hadley Wickham's book R Packages (a free online version is available).

Fetching data

Fetching data is often the starting point of a data science project. Data can live in a database, an Excel spreadsheet, a comma-separated values (csv) file… It is essential to be able to read it regardless of its format, and to avoid headaches before you even start working with the data!

  • When data is located in a .csv file or any delimited-values file

The readr package provides functions that are up to 10 times faster than base R functions to read rectangular data.

Figure: Great R packages usually have a dedicated hex sticker: https://github.com/rstudio/hex-stickers

Convenient functions exist for reading and writing standard .csv files as well as files that use a custom separator between values:

# read csv data delimited by commas (,)
input_data <- readr::read_csv("./input_data.csv")
# read csv data delimited by semicolons (;)
input_data <- readr::read_csv2("./input_data.csv")
# read text data delimited by a custom symbol, here a pipe (|)
input_data <- readr::read_delim("./input_data.txt", delim = "|")
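
To try both directions without a file of your own, here is a small round-trip sketch. It assumes readr is installed and uses a temporary file and the built-in mtcars data set purely for illustration:

```r
library(readr)

# Write the first rows of a built-in data set to a temporary
# semicolon-delimited file...
csv_path <- tempfile(fileext = ".csv")
write_delim(head(mtcars), csv_path, delim = ";")

# ...and read it back, stating the delimiter explicitly.
cars <- read_delim(csv_path, delim = ";")
dim(cars)
```

Note that readr does not write row names, so the car names stored as row names in mtcars are not part of the file.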

In addition to good-looking stickers, great R packages also come with cheat sheets you can refer to!

  • When data is located in an Excel file

Microsoft Excel has its own file formats (.xls and .xlsx) and is very commonly used to store and edit data. The readxl package reads these files into R efficiently, and you can even read just one specific sheet:

# read Excel spreadsheets
input_data <- readxl::read_excel("input_data.xlsx", sheet = "page2")
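
If you are not sure which sheets a workbook contains, you can list them before reading. The sketch below assumes readxl is installed and relies on the example workbook that ships with the package, so the file and sheet names are those of that example rather than of your own data:

```r
library(readxl)

# Example workbook shipped with readxl, so no file of your own is needed.
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheets first, then read one of them by name.
sheet_names <- excel_sheets(xlsx_path)
cars <- read_excel(xlsx_path, sheet = "mtcars")
```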

  • When data is located in a database or in the cloud

When it comes to fetching data from databases, DBI provides a common interface for connecting to a database server, as long as a backend package exists for it and you provide the required credentials, and for running SQL queries to fetch data. Because there are many different databases and ways to connect depending on your technical stack, I suggest that you refer to the complete documentation provided by RStudio to find the steps that suit your needs: Databases using R.
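
To give an idea of the workflow, here is a minimal sketch that uses an in-memory SQLite database through the RSQLite backend. This backend is my assumption for the example because it needs no server or credentials; the table name "cars" is made up, and a real project would use the backend and connection details of its own database:

```r
library(DBI)

# Open a throwaway in-memory SQLite database (no credentials required).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a built-in data set into a table, then query it with SQL.
dbWriteTable(con, "cars", mtcars)
big_engines <- dbGetQuery(con, "SELECT mpg, cyl, hp FROM cars WHERE cyl >= 6")

# Release the connection when you are done.
dbDisconnect(con)

nrow(big_engines)
```

The same dbConnect / dbGetQuery / dbDisconnect pattern should carry over to other backends; only the first argument and the connection details change.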

Make sure to check if a package exists to connect to your favorite cloud services provider! For example, bigrquery enables fetching data from Google BigQuery platform.

Wrangling data

You may have noticed that many of the previously mentioned packages are part of the tidyverse. This collection of packages forms a powerful toolbox that you can leverage throughout your data science projects. Mastering these packages is key to becoming super efficient with R.

Figure: The pipe operator shipped with the magrittr package is a game changer: https://github.com/tidyverse/magrittr

Data wrangling is made easy by the pipe operator, whose goal is simply to pipe left-hand values into right-hand expressions:

library(magrittr)
# without pipe operator
paste("Hello", "world!")
# with pipe operator
"Hello" %>% paste("world!")

It may not seem obvious in this example, but this is a life-changing trick when you need to perform several sequential operations on a given object, typically a data frame.
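
A slightly longer chain makes the difference easier to see. The sketch below is only an illustration with made-up values, comparing nested calls with the equivalent piped version:

```r
library(magrittr)

values <- c(1, 4, 9, 16)

# Nested calls are read inside-out: sqrt, then sum, then as.character.
nested <- as.character(sum(sqrt(values)))

# The same steps with the pipe are read from top to bottom.
piped <- values %>%
  sqrt() %>%
  sum() %>%
  as.character()

identical(nested, piped)
```

Both versions compute the same result; the piped one simply lists the steps in the order they happen.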

Data frames usually contain your input data, which makes them the R objects you probably work with the most. dplyr is a package that provides useful functions to edit, filter, rearrange or join data frames.

library(dplyr)
# mtcars is a toy data set shipped with base R
# create a column
cars <- mtcars %>% mutate(vehicle = "car")
# filter on a column
big_cars <- mtcars %>% filter(cyl >= 6)
# create a column AND filter on a column
big_cars <- mtcars %>%
  mutate(vehicle = "car") %>%
  filter(cyl >= 6)
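
Rearranging and joining follow the same pattern. Here is a small sketch, assuming dplyr is installed; the engine_sizes lookup table and its labels are invented for the example:

```r
library(dplyr)

# Keep the car names (stored as row names) in a regular column.
cars <- mtcars %>%
  mutate(model = rownames(mtcars)) %>%
  select(model, cyl, mpg)

# Hypothetical lookup table, made up for this example.
engine_sizes <- data.frame(
  cyl = c(4, 6, 8),
  engine = c("small", "medium", "large")
)

# Join on the shared column, then sort by decreasing fuel efficiency.
cars_by_mpg <- cars %>%
  left_join(engine_sizes, by = "cyl") %>%
  arrange(desc(mpg))

head(cars_by_mpg, 3)
```

left_join() keeps every row of the left-hand data frame, which is usually what you want when enriching your main data set with a lookup table.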

Now you should understand my point about the power of the pipe operator :)

There is so much more to say about data wrangling that entire books discuss the topic, such as Data Wrangling with R. In addition, a key reference on leveraging tidyr functionality is R for Data Science. A free online version of the latter is also available. Please note that these are Amazon affiliate links, so I will receive a commission if you decide to buy the books.

Visualization

One of the main reasons R is a very good choice for data science projects may be ggplot2. This package makes it easy, and eventually fun, to build visualizations that look good and convey a lot of information.

Figure source: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html

ggplot2 is also part of the tidyverse collection, which is why it works so well with the shapes of data you typically obtain after tidyr or dplyr wrangling operations. Plotting histograms and scatter plots is rather quick, and many additional elements can then be used to enhance your plots.
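
As an illustration, here is a minimal sketch of both plot types, assuming ggplot2 is installed. The choice of the built-in mtcars data set, the bin count, the labels and the theme are mine, not a recommendation:

```r
library(ggplot2)

# Histogram of fuel consumption (mpg) from the built-in mtcars data set.
mpg_histogram <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Scatter plot of weight against mpg, enhanced with colour, labels and a theme.
weight_scatter <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders") +
  theme_minimal()

# Printing a plot object draws it on the active graphics device.
print(weight_scatter)
```

Each + adds a layer or a setting, so a plot can start simple and be enhanced step by step.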

Machine learning

Another very convenient package is caret, which wraps many of the methods typically used in machine learning workflows. From data preparation to model training and performance assessment, you will find everything you need when working on predictive analytics tasks.

I recommend reading the chapter of the caret documentation on model training, where this key task is discussed. Here is a very simple example of how to train a logistic regression:

library(dplyr)
# say we want to predict iris having a big petal width
observations <- iris %>%
  mutate(y = factor(ifelse(Petal.Width >= 1.5, "big", "small"))) %>%
  select(-Petal.Width)
# set up a 10-fold cross-validation
train_control <- caret::trainControl(method = "cv",
                                     number = 10,
                                     savePredictions = TRUE,
                                     classProbs = TRUE)
# make it reproducible and train the model
set.seed(123)
model <- caret::train(y ~ .,
                      data = observations,
                      method = "glm",
                      trControl = train_control,
                      metric = "Accuracy")
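
Cross-validation is one way to assess performance; another common one is to hold out part of the data. The sketch below is an illustration under my own assumptions (an 80/20 split, the same target as in the example above, plain accuracy as the metric), not the approach of the original example. glm may emit convergence warnings on this small data set, which do not stop the code:

```r
library(caret)

# Same target as in the example above: is the petal width "big" or "small"?
observations <- iris
observations$y <- factor(ifelse(iris$Petal.Width >= 1.5, "big", "small"))
observations$Petal.Width <- NULL

# Keep 80% of the rows for training, stratified on the outcome.
set.seed(123)
train_rows <- createDataPartition(observations$y, p = 0.8, list = FALSE)
train_set <- observations[train_rows, ]
test_set <- observations[-train_rows, ]

# Fit the logistic regression once, without resampling.
model <- train(y ~ ., data = train_set, method = "glm",
               trControl = trainControl(method = "none"))

# Score the held-out rows and compute the share of correct predictions.
predictions <- predict(model, newdata = test_set)
accuracy <- mean(predictions == test_set$y)
table(predicted = predictions, actual = test_set$y)
```

The table at the end is a simple confusion matrix built with base R, which keeps the example free of extra dependencies.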

Final words

Thanks a lot for reading my very first article on Medium! I feel there is so much more to say in each section, and I did not even mention other super useful packages such as boot, shiny, shinydashboard, pbapply… Please share your thoughts in the comments: I am very interested in feedback on what you would like to explore in future articles.

Useful documentation and references

Translated from: https://medium.com/@ericbonucci/essential-r-packages-for-data-science-projects-d79cb5698b96
