Data Frames in R Programming

cunchi4221

Original article: https://www.journaldev.com/35741/data-frames-in-r-programming

Let’s continue our R programming tutorial series and look at data frames in R. If you have ever handled data in databases, you will be familiar with the idea of records. A record is a collection of related variables that together describe one observation. For example, a student record could contain the student’s roll number, name, age, and gender. Database management systems store collections of such records as tables. R has a structure similar to such tables, known as the data frame.

Unlike matrices or vectors, data frames do not require every value to share one data type. Each column holds a single type, but different columns can be numeric, character strings, factors and so on. The only rule for building a data frame in R is that all the columns must be of the same length. Data frames in R come with many functions for handling large amounts of data for statistical processing. Let us get started with data frames.

Creating a Data Frame in R Programming

A data frame can be created using the data.frame() function in R. This function takes any number of equal-length vectors as arguments, along with optional arguments such as stringsAsFactors, which we will discuss shortly. The following is an example of creating a simple data frame.


#Create a character vector to hold the names.
> names <- c("Adam","Antony","Brian","Carl","Doug")

#Create a numeric vector to hold the respective ages.
> ages <- c(23,22,24,25,26)

#Build a data frame from the two vectors.
> playerdata <- data.frame(names,ages,stringsAsFactors = FALSE)

#Display the data frame.
> playerdata
   names ages
1   Adam   23
2 Antony   22
3  Brian   24
4   Carl   25
5   Doug   26

Notice how the data frame holds both the strings and the associated numbers together in one structure. A data frame can have any number of columns like this. R also labels each record with a row name, which by default is its index number, as shown.

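The argument stringsAsFactors is set to FALSE. In versions of R before 4.0.0, where the default was TRUE, R would otherwise convert the names into a factor and treat each name as a level of a categorical variable, as we saw in the earlier factors tutorial. Since R 4.0.0 the default is FALSE.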
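
As a rough illustration of what stringsAsFactors changes, the sketch below builds the same data twice. The vector names player_names and player_ages are made up for this example and are not part of the tutorial's data.

```r
# Hypothetical vectors, used only for this illustration.
player_names <- c("Adam", "Antony", "Brian")
player_ages  <- c(23, 22, 24)

# Strings kept as plain character data.
as_chars <- data.frame(player_names, player_ages, stringsAsFactors = FALSE)

# Strings converted to a factor (a categorical variable).
as_factors <- data.frame(player_names, player_ages, stringsAsFactors = TRUE)

class(as_chars$player_names)     # "character"
class(as_factors$player_names)   # "factor"
levels(as_factors$player_names)  # the distinct names, stored as factor levels
```

Passing the argument explicitly, as the tutorial does, keeps the result the same regardless of the R version in use.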

Accessing Records from Data Frames in R

The components of a data frame can be accessed by index number or by column name. Columns are indexed using the double square bracket notation [[ ]]. To access a column by name, write the name of the data frame followed by a dollar sign $ and the column name.


> playerdata[[2]]
[1] 23 22 24 25 26
> playerdata$names
[1] "Adam"   "Antony" "Brian"  "Carl"   "Doug"
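
The two forms above are not the only ones. As a small sketch (the name-based [[ ]] and the single-bracket form are additions not covered in the text above), the following compares a few equivalent ways to pull out the ages column from a rebuilt copy of playerdata.

```r
# Rebuild the playerdata frame from the example above.
playerdata <- data.frame(
  names = c("Adam", "Antony", "Brian", "Carl", "Doug"),
  ages  = c(23, 22, 24, 25, 26),
  stringsAsFactors = FALSE
)

by_position <- playerdata[[2]]        # numeric vector of ages
by_name     <- playerdata[["ages"]]   # same vector, selected by name
by_dollar   <- playerdata$ages        # same vector again
as_frame    <- playerdata["ages"]     # single brackets keep a data frame

identical(by_position, by_name)   # TRUE
is.data.frame(as_frame)           # TRUE
```

The single-bracket form is handy when later code expects a data frame rather than a bare vector.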

When you wish to access the value at a specific location, such as the 3rd item in the 2nd column, you can use a matrix-like indexing notation. Let us look at an example.


> playerdata[3,2]
[1] 24
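
The same row-and-column notation also selects whole rows, whole columns, or ranges. The following is a minimal sketch on a rebuilt copy of playerdata; the variable names on the left are chosen only for this illustration.

```r
# Rebuild the playerdata frame from the example above.
playerdata <- data.frame(
  names = c("Adam", "Antony", "Brian", "Carl", "Doug"),
  ages  = c(23, 22, 24, 25, 26),
  stringsAsFactors = FALSE
)

one_value  <- playerdata[3, 2]          # row 3, column 2
third_row  <- playerdata[3, ]           # whole row, as a one-row data frame
age_column <- playerdata[, "ages"]      # whole column, as a vector
two_names  <- playerdata[2:3, "names"]  # rows 2 to 3 of the names column

one_value   # 24
two_names   # "Antony" "Brian"
```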

You can verify the size of your data frame using the ncol, nrow and dim functions.


> names <-c("Akash","Amulya","Raju","Charita","Lokesh","Deepa","Ravi")
> sex<-factor(c("M","F","M","F","M","F","M"))
> age<-c(23,24,34,30,45,33,25)

> emp<-data.frame(names,sex,age, stringsAsFactors = FALSE)

> emp
    names sex age
1   Akash   M  23
2  Amulya   F  24
3    Raju   M  34
4 Charita   F  30
5  Lokesh   M  45
6   Deepa   F  33
7    Ravi   M  25

#Check the dimensions of the data frame
> dim(emp)
[1] 7 3
> nrow(emp)
[1] 7
> ncol(emp)
[1] 3
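
Besides the size, it is often useful to check the type of each column. A short sketch using sapply() and str(), which are base R functions not otherwise used in this tutorial, on a rebuilt copy of emp:

```r
# Rebuild the emp data frame from the example above.
emp <- data.frame(
  names = c("Akash", "Amulya", "Raju", "Charita", "Lokesh", "Deepa", "Ravi"),
  sex   = factor(c("M", "F", "M", "F", "M", "F", "M")),
  age   = c(23, 24, 34, 30, 45, 33, 25),
  stringsAsFactors = FALSE
)

# One class per column: character, factor and numeric side by side.
col_types <- sapply(emp, class)
print(col_types)

# str() prints a compact overview: dimensions plus each column's type
# and its first few values.
str(emp)
```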

Extending Data Frames in R

Real-world data is often dynamic in nature. The structure of the data changes as new variables are added, and its length changes as more observations are made. To accommodate this, R provides ways to add and remove both rows and columns in data frames.

Let us try adding a new record to the emp data frame created above. To do this, we first need to create the record to be added as a separate data frame with the same column names. Say we need to add a single record.


> newdata <-data.frame(names="Indu",sex="F",age=29)

Now we add this row to the already created emp dataframe as follows.


> emp <- rbind(emp,newdata)
> emp
    names sex age
1   Akash   M  23
2  Amulya   F  24
3    Raju   M  34
4 Charita   F  30
5  Lokesh   M  45
6   Deepa   F  33
7    Ravi   M  25
8    Indu   F  29

Adding columns to the data frame can be done by using a cbind() function instead.


> salary <-c(10000,12000,20000,12000,21000,15000,13000,10000)

> emp<-cbind(emp,salary)
> emp
    names sex age salary
1   Akash   M  23  10000
2  Amulya   F  24  12000
3    Raju   M  34  20000
4 Charita   F  30  12000
5  Lokesh   M  45  21000
6   Deepa   F  33  15000
7    Ravi   M  25  13000
8    Indu   F  29  10000
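
The examples above only add rows and columns. One common way to remove them, sketched here on a shortened copy of emp (the three-row frame is an assumption made to keep the example small), is to assign NULL to a column and to use a negative row index.

```r
# A shortened, hypothetical copy of the emp data frame.
emp <- data.frame(
  names = c("Akash", "Amulya", "Raju"),
  sex   = factor(c("M", "F", "M")),
  age   = c(23, 24, 34),
  stringsAsFactors = FALSE
)

# Adding a column by assignment is an alternative to cbind().
emp$salary <- c(10000, 12000, 20000)

# Drop a column: assigning NULL removes it.
emp$age <- NULL

# Drop a row: a negative index excludes that row.
emp <- emp[-2, ]

emp   # two rows remain (Akash and Raju), with columns names, sex, salary
```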

Subsetting Data Frames

You might be familiar with SQL (Structured Query Language), which is used to query tables in databases using logical conditions. R offers similar capabilities to query data frames and generate logical subsets of larger data frames.

Suppose that I wish to extract all the data records from the emp frame that belong to male employees. I can do that using the following line of code.


> emp[emp$sex=="M",]
   names sex age salary
1  Akash   M  23  10000
3   Raju   M  34  20000
5 Lokesh   M  45  21000
7   Ravi   M  25  13000

The part emp$sex=="M" gives a logical (Boolean) vector indicating, for each row, whether the value of sex is M. We use this logical vector to index the emp data frame. The comma that follows is necessary for matrix-like indexing: the part before the comma selects rows and the part after it selects columns. Since we left the column part blank, all the columns are selected. We could also choose to display only the names and sex instead.


> emp[emp$sex=="M",1:2]
   names sex
1  Akash   M
3   Raju   M
5 Lokesh   M
7   Ravi   M

Instead of supplying the index number of columns, you can also give the names of the columns like below.


> emp[emp$sex=='F',c("names","sex")]
    names sex
2  Amulya   F
4 Charita   F
6   Deepa   F
8    Indu   F

You can also combine several logical conditions when indexing a data frame. Suppose you wish to extract the records of all the female employees with a salary greater than 12000. You can do that as follows.


> emp[emp$sex=='F' & emp$salary>12000,]
  names sex age salary
6 Deepa   F  33  15000
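
The same query can also be written with the base R subset() function, which lets you refer to columns by bare name. This is a sketch of an alternative, not the approach used in the text above; the result name high_paid_women is made up for the example.

```r
# Rebuild the full emp data frame, including the salary column.
emp <- data.frame(
  names  = c("Akash", "Amulya", "Raju", "Charita",
             "Lokesh", "Deepa", "Ravi", "Indu"),
  sex    = factor(c("M", "F", "M", "F", "M", "F", "M", "F")),
  age    = c(23, 24, 34, 30, 45, 33, 25, 29),
  salary = c(10000, 12000, 20000, 12000, 21000, 15000, 13000, 10000),
  stringsAsFactors = FALSE
)

# Rows: female employees earning more than 12000.
# Columns: only names and salary.
high_paid_women <- subset(emp, sex == "F" & salary > 12000,
                          select = c(names, salary))

high_paid_women   # a single row: Deepa, 15000
```

Bracket indexing, as used in the tutorial, is usually the safer choice inside functions, while subset() tends to read more naturally in interactive work.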

Useful Data Frame Functions

Apart from the utilities listed above, there are some handy functions you may need while handling data frames. This section lists a few of them.

Sorting a Data Frame

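Sorting can be done by calling the order() function inside the row index. order() returns the row positions in sorted order, and indexing with them rearranges the rows.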


#Sort by decreasing order of salaries.
> emp[order(emp$salary,decreasing = TRUE),]
    names sex age salary
5  Lokesh   M  45  21000
3    Raju   M  34  20000
6   Deepa   F  33  15000
7    Ravi   M  25  13000
2  Amulya   F  24  12000
4 Charita   F  30  12000
1   Akash   M  23  10000
8    Indu   F  29  10000

#Sort by ascending alphabetical order of names.
> emp[order(emp$names,decreasing = FALSE),]
    names sex age salary
1   Akash   M  23  10000
2  Amulya   F  24  12000
4 Charita   F  30  12000
6   Deepa   F  33  15000
8    Indu   F  29  10000
5  Lokesh   M  45  21000
3    Raju   M  34  20000
7    Ravi   M  25  13000
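
order() also accepts several keys, using later keys to break ties in earlier ones. The sketch below sorts a rebuilt copy of emp by sex and then by decreasing salary; negating a numeric key is one way to reverse just that key. The name sorted_emp is chosen only for this illustration.

```r
# Rebuild the full emp data frame, including the salary column.
emp <- data.frame(
  names  = c("Akash", "Amulya", "Raju", "Charita",
             "Lokesh", "Deepa", "Ravi", "Indu"),
  sex    = factor(c("M", "F", "M", "F", "M", "F", "M", "F")),
  age    = c(23, 24, 34, 30, 45, 33, 25, 29),
  salary = c(10000, 12000, 20000, 12000, 21000, 15000, 13000, 10000),
  stringsAsFactors = FALSE
)

# First key: sex (factor levels F, then M).
# Second key: salary, highest first within each sex.
sorted_emp <- emp[order(emp$sex, -emp$salary), ]

sorted_emp   # female employees first, each group from highest to lowest salary
```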

`head()` and `tail()` Functions

These are used to get the first few or last few rows of a data frame respectively. They are especially useful when you have a huge dataset, because they let you examine the characteristics of the data without flooding the console by printing the entire dataset.


#Get the top 2 rows of the dataset.
> head(emp,2)
   names sex age salary
1  Akash   M  23  10000
2 Amulya   F  24  12000

#Get the last 3 rows of the dataset.
> tail(emp,3)
  names sex age salary
6 Deepa   F  33  15000
7  Ravi   M  25  13000
8  Indu   F  29  10000

By default, when no number is specified, the head and tail functions return 6 rows.
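
A quick sketch of the default, along with negative counts, which are an addition not covered in the text above. The ten-row data frame df is hypothetical.

```r
# A small hypothetical data frame with ten rows.
df <- data.frame(id = 1:10, value = (1:10)^2)

nrow(head(df))   # 6: the default when n is not given
nrow(tail(df))   # 6

# A negative n means "all except": drop the last 2 rows, or the first 2.
all_but_last_two  <- head(df, -2)
all_but_first_two <- tail(df, -2)

all_but_last_two$id    # 1 to 8
all_but_first_two$id   # 3 to 10
```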

Merging Two Data Frames

Merging data frames is similar to performing database joins on tables. When a separate data frame holds more information about the records identified by one column, we can merge the two data frames on that common column. For example, suppose we have the marital status of some of the employees, as below.


> mar <- data.frame(names=c("Akash","Amulya","Raju","Lokesh","Ravi","Indu"),marital=c("single","married","single","single","single","married"))

We can now merge this with our emp data frame to get the combined information.


> merge(emp,mar,by="names",all=TRUE)
    names sex age salary marital
1   Akash   M  23  10000  single
2  Amulya   F  24  12000 married
3 Charita   F  30  12000    <NA>
4   Deepa   F  33  15000    <NA>
5    Indu   F  29  10000 married
6  Lokesh   M  45  21000  single
7    Raju   M  34  20000  single
8    Ravi   M  25  13000  single

Because all=TRUE requests a full outer join, employees with no entry in mar are still kept in the result, and their missing marital status is filled with NA.
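
The all, all.x and all.y arguments of merge() control which kind of join is performed. The following sketch compares three of them on small, made-up data frames; the employee "Zoya" is hypothetical and appears only in mar.

```r
# Small hypothetical data frames for illustration.
emp <- data.frame(
  names = c("Akash", "Amulya", "Charita", "Deepa"),
  age   = c(23, 24, 30, 33),
  stringsAsFactors = FALSE
)
mar <- data.frame(
  names   = c("Akash", "Amulya", "Zoya"),
  marital = c("single", "married", "single"),
  stringsAsFactors = FALSE
)

# Inner join (the default): only names present in both frames.
inner <- merge(emp, mar, by = "names")

# Left join: every row of emp, with NA where mar has no match.
left <- merge(emp, mar, by = "names", all.x = TRUE)

# Full outer join: every name from either frame.
full <- merge(emp, mar, by = "names", all = TRUE)

nrow(inner)   # 2
nrow(left)    # 4
nrow(full)    # 5
```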
