R Language from Beginner to Advanced (6): Data Frames

Program 1:

What's a data frame?

You may remember from the chapter about matrices that all the elements that you put in a matrix should be of the same type. Back then, your data set on Star Wars only contained numeric elements.

When doing a market research survey, however, you often have questions such as:

'Are you married?' or other 'yes/no' questions (logical)

'How old are you?' (numeric)

'What is your opinion on this product?' or other 'open-ended' questions (character)

...

The output, namely the respondents' answers to the questions formulated above, is a data set of different data types. You will often find yourself working with data sets that contain different data types instead of only one.

A data frame has the variables of a data set as columns and the observations as rows. This will be a familiar concept for those coming from different statistical software packages such as SAS or SPSS.

For installing and using SPSS, see my earlier blog posts, which also describe how to get a cracked, permanent-use copy of SPSS and include my study notes.

Task:

Click 'Submit Answer'. The data from the built-in example data frame mtcars will be printed to the console.

Source code:

 

# Print out built-in R data frame
mtcars
console:
> # Print out built-in R data frame
> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
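
To make the matrix versus data frame distinction concrete, here is a minimal sketch. The survey answers and the names married, age, opinion, survey_matrix and survey_df are made up for illustration and are not part of the exercise.

```r
# Hypothetical survey answers (invented values, one entry per respondent)
married <- c(TRUE, FALSE, TRUE)
age     <- c(34, 27, 45)
opinion <- c("Great", "Too expensive", "Fine")

# cbind() builds a matrix, which can hold only one type,
# so every value is coerced to character
survey_matrix <- cbind(married, age, opinion)
print(class(survey_matrix[, "age"]))

# A data frame keeps a separate type for each column
# (stringsAsFactors = FALSE keeps text as character in every R version)
survey_df <- data.frame(married, age, opinion, stringsAsFactors = FALSE)
print(sapply(survey_df, class))
```

The matrix turns the ages into text, while the data frame keeps them numeric, which is why survey-style data is usually stored in a data frame.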



 

Program 2:

Quick, have a look at your data set

Wow, that is a lot of cars!

Working with large data sets is not uncommon in data analysis. When you work with (extremely) large data sets and data frames, your first task as a data analyst is to develop a clear understanding of their structure and main elements. Therefore, it is often useful to show only a small part of the entire data set.

So how to do this in R? Well, the function head() enables you to show the first observations of a data frame. Similarly, the function tail() prints out the last observations in your data set.

Both head() and tail() print a top line called the 'header', which contains the names of the different variables in your data set.

#mark# Key point

head() shows the first 6 rows of data (the default)

tail() shows the last 6 rows of data (the default)

Task:

Call head() on the mtcars data set to have a look at the header and the first observations.

Source code:

 

# Call head() on mtcars
head(mtcars)
console:
> # Call head() on mtcars
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
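
Six rows is only the default. As a small sketch that goes beyond the exercise, both functions accept an n argument; the variable names first_three and last_two are my own.

```r
# mtcars ships with R, so no setup is needed
first_three <- head(mtcars, n = 3)   # first 3 rows instead of the default 6
last_two    <- tail(mtcars, n = 2)   # last 2 rows

print(first_three)
print(last_two)
```

Both results are still data frames with the same header, just with fewer rows.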



 

Program 3:

Have a look at the structure

Another method that is often used to get a rapid overview of your data is the function str(). The function str() shows you the structure of your data set. For a data frame it tells you:

The total number of observations (e.g. 32 car types)

The total number of variables (e.g. 11 car features)

A full list of the variable names (e.g. mpg, cyl ...)

The data type of each variable (e.g. num)

The first observations

Applying the str() function will often be the first thing that you do when receiving a new data set or data frame. It is a great way to get more insight into your data set before diving into the real analysis.

#mark# Key point

This is one of the first things to do when you get hold of new data

Task:

Investigate the structure of mtcars. Make sure that you see the same numbers, variables and data types as mentioned above.

Source code:

 

# Investigate the structure of mtcars
str(mtcars)
console:
> # Investigate the structure of mtcars
> str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
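
The individual facts that str() reports can also be pulled out one at a time. This is only a sketch using base R functions that the exercise itself does not cover; the variable names are my own.

```r
dims      <- dim(mtcars)            # rows and columns together
n_obs     <- nrow(mtcars)           # number of observations
n_vars    <- ncol(mtcars)           # number of variables
var_names <- names(mtcars)          # variable names
col_types <- sapply(mtcars, class)  # class of each column

print(dims)
print(var_names)
print(col_types)
```

These values should agree with the str() output above: 32 observations, 11 variables, all numeric.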



 

Program 4:

Creating a data frame

Since using built-in data sets is not even half the fun of creating your own data sets, the rest of this chapter is based on your personally developed data set. Put your jet pack on because it is time for some space exploration!

As a first goal, you want to construct a data frame that describes the main characteristics of eight planets in our solar system. According to your good friend Buzz, the main features of a planet are:

The type of planet (Terrestrial or Gas Giant).

The planet's diameter relative to the diameter of the Earth.

The planet's rotation period relative to that of the Earth.

If the planet has rings or not (TRUE or FALSE).

After doing some high-quality research on Wikipedia, you feel confident enough to create the necessary vectors: name, type, diameter, rotation and rings; these vectors have already been coded up for you. The first element in each of these vectors corresponds to the first observation.

You construct a data frame with the data.frame() function. As arguments, you pass the vectors from before: they will become the different columns of your data frame. Because every column has the same length, the vectors you pass should also have the same length. But don't forget that it is possible (and likely) that they contain different types of data.

Task:

Use the function data.frame() to construct a data frame. Pass the vectors name, type, diameter, rotation and rings as arguments to data.frame(), in this order. Call the resulting data frame planets_df.

Source code:

 

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)
planets_df

console:
> # Definition of vectors
> name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
> type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
            "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
> diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
> rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
> rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
>
> # Create a data frame from the vectors
> planets_df <- data.frame(name, type, diameter, rotation, rings)
> planets_df
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
3   Earth Terrestrial planet    1.000     1.00 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
5 Jupiter          Gas giant   11.209     0.41  TRUE
6  Saturn          Gas giant    9.449     0.43  TRUE
7  Uranus          Gas giant    4.007    -0.72  TRUE
8 Neptune          Gas giant    3.883     0.67  TRUE
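
The remark that the vectors should have the same length can be checked directly. The sketch below is my own addition: it deliberately passes vectors of length 3 and 2 and catches the resulting error with tryCatch(). The names result and small_df are hypothetical.

```r
name     <- c("Mercury", "Venus", "Earth")
diameter <- c(0.382, 0.949)   # one value left out on purpose

# data.frame() refuses columns of length 3 and 2; capture the error message
result <- tryCatch(
  data.frame(name, diameter),
  error = function(e) conditionMessage(e)
)
print(result)

# With matching lengths it works; naming the arguments sets the column names
small_df <- data.frame(planet = name, diameter = c(0.382, 0.949, 1))
print(names(small_df))
```

Naming the arguments, as in the second call, is optional; without names, R uses the variable names as column names, which is what the exercise relies on.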



 

Program 5:

Creating a data frame (2)

The planets_df data frame should have 8 observations and 5 variables. It has been made available in the workspace, so you can directly use it.

Task:

Use str() to investigate the structure of the new planets_df variable.

Source code:

 

# Check the structure of planets_df
str(planets_df)

console:
> # Check the structure of planets_df
> str(planets_df)
'data.frame':   8 obs. of  5 variables:
 $ name    : Factor w/ 8 levels "Earth","Jupiter",..: 4 8 1 3 2 6 7 5
 $ type    : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
 $ diameter: num  0.382 0.949 1 0.532 11.209 ...
 $ rotation: num  58.64 -243.02 1 1.03 0.41 ...
 $ rings   : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
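
The name and type columns show up as Factor here because this output comes from an R version where data.frame() converted strings to factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so a current R session will most likely show chr instead. A small sketch that sets the option explicitly (the names as_factors and as_strings are my own):

```r
name <- c("Mercury", "Venus", "Earth")
type <- rep("Terrestrial planet", 3)

# Force the old behaviour: text columns become factors
as_factors <- data.frame(name, type, stringsAsFactors = TRUE)

# Force the newer behaviour: text columns stay character
as_strings <- data.frame(name, type, stringsAsFactors = FALSE)

str(as_factors)
str(as_strings)
```

Setting the argument explicitly makes the result independent of the R version in use.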



 

Program 6:

Selection of data frame elements

Similar to vectors and matrices, you select elements from a data frame with the help of square brackets [ ]. By using a comma, you can indicate what to select from the rows and the columns respectively. For example:

my_df[1,2] selects the value at the first row and second column in my_df.

my_df[1:3,2:4] selects rows 1, 2, 3 and columns 2, 3, 4 in my_df.

Sometimes you want to select all elements of a row or column. For example, my_df[1, ] selects all elements of the first row. Let us now apply this technique on planets_df!

Task:

From planets_df, select the diameter of Mercury: this is the value at the first row and the third column. Simply print out the result.

From planets_df, select all data on Mars (the fourth row). Simply print out the result.

Source code:

 

# The planets_df data frame from the previous exercise is pre-loaded

# Print out diameter of Mercury (row 1, column 3)
planets_df[1,3]
# Print out data for Mars (entire fourth row)
planets_df[4,]
console:
> # The planets_df data frame from the previous exercise is pre-loaded
>
> # Print out diameter of Mercury (row 1, column 3)
> planets_df[1,3]
[1] 0.382
>
>
> # Print out data for Mars (entire fourth row)
>
> planets_df[4,]
  name               type diameter rotation rings
4 Mars Terrestrial planet    0.532     1.03 FALSE
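
The my_df in the text is only a placeholder. To try the indexing forms on something concrete, here is a sketch with an invented four-column data frame; the columns and values are hypothetical.

```r
# Hypothetical data frame standing in for my_df
my_df <- data.frame(
  id     = 1:4,
  score  = c(90, 85, 70, 60),
  passed = c(TRUE, TRUE, TRUE, FALSE),
  grade  = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

single_value <- my_df[1, 2]       # row 1, column 2
block        <- my_df[1:3, 2:4]   # rows 1 to 3, columns 2 to 4
first_row    <- my_df[1, ]        # every column of row 1
second_col   <- my_df[, 2]        # every row of column 2 (a plain vector)

print(single_value)
print(block)
print(first_row)
print(second_col)
```

Leaving one side of the comma empty means "everything" along that dimension. Selecting a row gives back a data frame, while selecting a single column gives back a vector.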



 

Program 7:

Selection of data frame elements (2)

Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.

Suppose you want to select the first three elements of the type column. One way to do this is

planets_df[1:3,2]

A possible disadvantage of this approach is that you have to know (or look up) the column number of type, which gets hard if you have a lot of variables. It is often easier to just make use of the variable name:

planets_df[1:3,"type"]

This is a simpler way to select elements, with no need to know the exact column number

Task:

Select and print out the first 5 values in the "diameter" column of planets_df.

Be careful here: I misread the 5 and thought it meant the fifth row, when it actually means the values in the first five rows

Source code:

 

# The planets_df data frame from the previous exercise is pre-loaded

# Select first 5 values of diameter column
planets_df[1:5,"diameter"]
console:
> # The planets_df data frame from the previous exercise is pre-loaded
>
> # Select first 5 values of diameter column
> planets_df[1:5,"diameter"]
[1]  0.382  0.949  1.000  0.532 11.209
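
A short sketch to confirm that selecting by number and by name give the same result, and that several names can be passed at once. It rebuilds planets_df so it runs on its own; stringsAsFactors = FALSE and the names by_number, by_name and two_cols are my own choices.

```r
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  type     = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE  # keep text columns as character vectors
)

by_number <- planets_df[1:3, 2]        # type is the second column
by_name   <- planets_df[1:3, "type"]   # same selection by column name
print(identical(by_number, by_name))

# A character vector of names selects several columns at once
two_cols <- planets_df[1:5, c("name", "diameter")]
print(two_cols)
```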



Program 8:

Only planets with rings

You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable diameter, for example, both of these will do the trick:

planets_df[,3]

planets_df[,"diameter"]

However, there is a short-cut. If your columns have names, you can use the $ sign:

planets_df$diameter

Yet another shortcut; this one is for selecting an entire column!!!

#mark# Key point

Task:

Use the $ sign to select the rings variable from planets_df. Store the vector that results as rings_vector.

Print out rings_vector to see if you got it right.

Source code:

 

# planets_df is pre-loaded in your workspace

# Select the rings variable from planets_df
rings_vector <- planets_df$rings

# Print out rings_vector
rings_vector
console:
> # planets_df is pre-loaded in your workspace
>
> # Select the rings variable from planets_df
> rings_vector <- planets_df$rings
>
> # Print out rings_vector
> rings_vector
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
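
For comparison, here is a sketch of two related forms that the exercise does not cover: double brackets return the same vector as $, while single brackets without a comma return a one-column data frame. The data frame is rebuilt so the snippet runs on its own, and the variable names are my own.

```r
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  type     = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

rings_dollar <- planets_df$rings        # vector, via the $ shortcut
rings_double <- planets_df[["rings"]]   # the same vector, via double brackets
rings_df     <- planets_df["rings"]     # a data frame with one column

print(identical(rings_dollar, rings_double))
print(class(rings_df))
```

The double-bracket form is handy when the column name is stored in a variable, since $ expects the name to be typed literally.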



 

Program 9:

Only planets with rings (2)

You probably remember from high school that some planets in our solar system have rings and others do not. But due to other priorities at that time (read: puberty) you can not recall their names, let alone their rotation speed, etc.

Could R help you out?

If you type rings_vector in the console, you get:

[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

This means that the first four observations (or planets) do not have a ring (FALSE), but the other four do (TRUE). However, you do not get a nice overview of the names of these planets, their diameter, etc. Let's try to use rings_vector to select the data for the four planets with rings.

Task:

The starter code selects the name column of all planets that have rings. Adapt the code so that instead of only the name column, all columns for planets that have rings are selected.

Source code:

 

# planets_df and rings_vector are pre-loaded in your workspace

# Adapt the code to select all columns for planets with rings
planets_df[rings_vector,]
console:
> # planets_df and rings_vector are pre-loaded in your workspace
>
> # Adapt the code to select all columns for planets with rings
> planets_df[rings_vector, ]
     name      type diameter rotation rings
5 Jupiter Gas giant   11.209     0.41  TRUE
6  Saturn Gas giant    9.449     0.43  TRUE
7  Uranus Gas giant    4.007    -0.72  TRUE
8 Neptune Gas giant    3.883     0.67  TRUE
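
A logical vector used as a row index keeps exactly the rows where it is TRUE. The sketch below, which is my own addition, shows the original name-only selection, the all-columns version, and the negated version with !. The data frame is rebuilt so the code is self-contained.

```r
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  type     = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)
rings_vector <- planets_df$rings

ringed_names  <- planets_df[rings_vector, "name"]   # only the name column
with_rings    <- planets_df[rings_vector, ]         # all columns
without_rings <- planets_df[!rings_vector, ]        # ! flips TRUE and FALSE

print(ringed_names)
print(without_rings)
```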



 

Program 10:

Only planets with rings but shorter

So what exactly did you learn in the previous exercises? You selected a subset from a data frame (planets_df) based on whether or not a certain condition was true (rings or no rings), and you managed to pull out all relevant data. Pretty awesome! By now, NASA is probably already flirting with your CV ;-).

Now, let us move up one level and use the function subset(). You should see the subset() function as a short-cut to do exactly the same as what you did in the previous exercises.

subset(my_df, subset = some_condition)

The first argument of subset() specifies the data set for which you want a subset. By adding the second argument, you give R the necessary information and conditions to select the correct subset.

The code below will give the exact same result as you got in the previous exercise, but this time, you didn't need the rings_vector!

subset(planets_df, subset = rings)

Task:

Use subset() on planets_df to select planets that have a diameter smaller than Earth. Because the diameter variable is a relative measure of the planet's diameter with respect to that of planet Earth, your condition is diameter < 1.

Source code:

 

# planets_df is pre-loaded in your workspace

# Select planets with diameter < 1
subset(planets_df, subset = diameter < 1)
# First comes the data set, then the condition for selecting the subset: diameter < 1
console:
> # planets_df is pre-loaded in your workspace
>
> # Select planets with diameter < 1
> subset(planets_df, subset = diameter < 1)
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
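
To see that subset() really is a shortcut, the sketch below compares it with the bracket form from the earlier exercises. It also tries a combined condition with the select argument, which goes beyond what the exercise asks. The data frame is rebuilt so the code runs on its own, and the variable names are my own.

```r
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  type     = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Same rows, two spellings
small_subset  <- subset(planets_df, subset = diameter < 1)
small_bracket <- planets_df[planets_df$diameter < 1, ]
print(small_subset$name)
print(small_bracket$name)

# Two conditions joined with &, keeping only two columns via select
small_prograde <- subset(planets_df,
                         subset = diameter < 1 & rotation > 0,
                         select = c(name, diameter))
print(small_prograde)
```

Inside subset() the column names can be written bare (diameter rather than planets_df$diameter), which is most of what makes it shorter.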



 

Program 10:

Sorting

Making and creating rankings is one of mankind's favorite affairs. These rankings can be useful (best universities in the world), entertaining (most influential movie stars) or pointless (best 007 look-a-like).

In data analysis you can sort your data according to a certain variable in the data set. In R, this is done with the help of the function order().

#mark# Key point

order() is a function that gives you the ranked position of each element when it is applied on a variable, such as a vector for example:

> a <- c(100, 10, 1000)

> order(a)

[1] 2 1 3

10, which is the second element in a, is the smallest element, so 2 comes first in the output of order(a). 100, which is the first element in a, is the second smallest element, so 1 comes second in the output of order(a).

This means we can use the output of order(a) to reshuffle a:

> a[order(a)]
[1]   10  100 1000

The thing to understand here is that this builds on the result above, 2, 1, 3, and then performs a selection similar to

a[c(2, 1, 3)]

which prints the second, first and third elements, in other words the values in ascending order: 10, 100, 1000

Task:

Experiment with the order() function in the console. Click 'Submit Answer' when you are ready to continue.

This one is a freebie -_-

Source code:

# Play around with the order function in the console
console:
> # Play around with the order function in the console
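
Since the exercise leaves the experimenting open, here is one possible session as a sketch. The decreasing argument and sort() are not mentioned in the exercise text; they are standard base R. The variable names are my own.

```r
a <- c(100, 10, 1000)

positions  <- order(a)                        # indices that sort a ascending
ascending  <- a[positions]                    # reshuffle a with those indices
descending <- a[order(a, decreasing = TRUE)]  # largest value first
shortcut   <- sort(a)                         # sorts the values directly

print(positions)
print(ascending)
print(descending)
print(shortcut)
```

sort() is enough for a single vector. order() becomes useful when the positions are needed to rearrange something else, such as the rows of a data frame.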



 

Program 11:

Sorting your data frame

Alright, now that you understand the order() function, let us do something useful with it. You would like to rearrange your data frame such that it starts with the smallest planet and ends with the largest one. That is a sort on the diameter column.

Task:

Call order() on planets_df$diameter (the diameter column of planets_df). Store the result as positions.

Now reshuffle planets_df with the positions vector as row indexes inside square brackets. Keep all columns. Simply print out the result.

Source code:

 

# planets_df is pre-loaded in your workspace

# Use order() to create positions
positions <- order(planets_df$diameter)

# Use positions to sort planets_df
planets_df[positions,]
console:
> # planets_df is pre-loaded in your workspace
>
> # Use order() to create positions
> positions <- order(planets_df$diameter)
>
> # Use positions to sort planets_df
> planets_df[positions, ]
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
3   Earth Terrestrial planet    1.000     1.00 FALSE
8 Neptune          Gas giant    3.883     0.67  TRUE
7  Uranus          Gas giant    4.007    -0.72  TRUE
6  Saturn          Gas giant    9.449     0.43  TRUE
5 Jupiter          Gas giant   11.209     0.41  TRUE
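
The same pattern extends to other orderings. As a sketch that goes beyond the exercise, the code below sorts from largest to smallest and then by two columns at once. The data frame is rebuilt so the snippet runs on its own, and the variable names are my own.

```r
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  type     = rep(c("Terrestrial planet", "Gas giant"), each = 4),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation = c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Largest planet first
largest_first <- planets_df[order(planets_df$diameter, decreasing = TRUE), ]
print(largest_first)

# Sort by type first, then by diameter within each type
by_type_then_size <- planets_df[order(planets_df$type, planets_df$diameter), ]
print(by_type_then_size)
```

When order() receives several vectors, the later ones are only used to break ties in the earlier ones.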



 

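Read the task carefully:

Now reshuffle planets_df with the positions vector as row indexes inside square brackets

This states it clearly: positions has to be used as the row index inside the square brackets, so it must be written as planets_df[positions, ]

I was careless here...

Well, that's another chapter finished!!!

Next up is learning about tables; tables are the most convincing of all.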


 

Reposted from: https://www.cnblogs.com/zhuhengjie/p/5966854.html
