A Data Analysis Basics (1)

拉卡不是拉胯

Tags: R, data analysis

Article link: https://blog.csdn.net/weixin_46241467/article/details/107786338

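Preface for: Data Analysis Basics

I officially started teaching myself the basics of data analysis in August 2020.
Right after that I also applied for a second major: statistics, or computer science and big data.
My current study plan is: learning R + a data analysis course.
I am writing this post to record how I learn the basics of data analysis, and to see whether I can find friends to study with. I am currently a first-year student at a Project 211 university in western China (well, a second-year once term starts).
QQ for contact: 1917823501
I have previously learned C, C++, Java, Python, JavaScript and other languages, but combining R with data analysis is genuinely difficult, so I record each study session here.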

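Preface for: this blog post

This post is for: readers who want to learn ggplot in R.
I spent a few hours studying the videos of an uploader on Bilibili, and I thank him here.
All the code below is based on his code, although I was too lazy to type out some parts, so…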

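Contents

Preface for: Data Analysis Basics
Preface for: this blog post
Ggplot plotting: data preparation
- Reshaping data from wide to long: gather
- Reshaping data from wide to long: pivot_longer
Getting started with ggplot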

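Ggplot plotting: data preparation

Reshaping data from wide to long: gather

A note first: this kind of reshaping is a little hard to grasp, so make sure you know the existing data well. The best way to learn is to type the code, guess the result, and then work out how the result was produced.

First, be clear: converting data between wide and long form does not lose any information;
it only regroups (recombines) the data.
The goal is to keep a one-to-one pair for one chosen field: (target field, value).
Each such pair is a result classified under the remaining fields.

Steps: start from a standard data set (columns: fields, rows: records) -> decide which n columns (of the same nature) to split -> work out the code structure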

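# Let's go step by step
library( tidyverse )  # provides %>%, gather(), pivot_longer() and ggplot()

## 1 This is a standard data set (worth studying carefully)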
df <- data.frame(
  boss = c( "李老板", "张老板", "李老板", "张老板" ), 
  name = c( "store1","store2","store3","store4" ),
  address = c( "普陀区","黄浦区","徐汇区","浦东新区" ),
  sale2014 = c( 3000,2500,2100,1000 ),
  sale2015 = c( 3020,2800,3900,2000 ),
  sale2016 = c( 5150 ,3600,2700,2500 ),
  sale2017 = c( 4450,4100,4000,3200 )
)
df

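## 2 Idea: I want to look at the sales figures from the different years
### So each record should look like: 李老板 store1 普陀区 sale2014 3000
### In other words, the other (n - 2) fields of the long table locate a record; given x = sale2014, I can then look up y = 3000

## 3 Structure: the key columns are sale2014 sale2015 sale2016 sale2017 (everything except the locating columns)
df %>%
  gather( key = "年份",    # name of the new key column ("year")
          value = "价格",  # name of the new value column ("price")
          -c( boss, name, address ))  # -c( columns that are NOT gathered )

df %>%
  gather( key = "年份", 
          value = "价格", 
          c( sale2014, sale2015, sale2016, sale2017))  # c( columns that ARE gathered )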

## Now look at some other calls (with no selection, every column is gathered)
df %>% 
  gather( key = "x", 
          value = 'y', 
  )

df %>% 
  select( boss, sale2017 ) %>% 
  gather( key = 'x', 
          value = 'y', 
          -c( boss ) )


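# You can ignore the other fields: put them into the final -c( boss, name, address ) as the fields that locate a record
# The locating fields always stay as their original columns,
# while the target fields turn into two columns (key and value), much like the tabular representation of a function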
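
The claim that reshaping loses no information can be checked with a round trip. The following is a minimal sketch, assuming tidyr and dplyr are installed; `df_demo`, `long` and `back` are names made up for this illustration, and the values are a cut-down copy of the store table. `spread()` is used here as the inverse of `gather()`.

```r
library(tidyr)
library(dplyr)

# A smaller table with the same shape as df above (illustrative values)
df_demo <- data.frame(
  name = c("store1", "store2"),
  sale2014 = c(3000, 2500),
  sale2015 = c(3020, 2800)
)

# Wide -> long: one row per (name, year) pair
long <- df_demo %>%
  gather(key = "year", value = "sales", -name)

# Long -> wide: spread() undoes gather()
back <- long %>%
  spread(key = year, value = sales)

nrow(long)                                # 2 stores x 2 years = 4 rows
all(back$sale2014 == df_demo$sale2014)    # the original values come back
all(back$sale2015 == df_demo$sale2015)
```

If the comparisons return TRUE, the long table carried everything needed to rebuild the wide one.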

Reshaping data from wide to long: pivot_longer

df <- data.frame(
  boss = c( "李老板", "张老板", "李老板", "张老板" ), 
  name = c( "store1","store2","store3","store4" ),
  address = c( "普陀区","黄浦区","徐汇区","浦东新区" ),
  sale2014 = c( 3000,2500,2100,1000 ),
  sale2015 = c( 3020,2800,3900,2000 ),
  sale2016 = c( 5150 ,3600,2700,2500 ),
  sale2017 = c( 4450,4100,4000,3200 )
)
df

# Goal: mainly study the 2014 figures
df %>% 
  pivot_longer( cols = c( sale2014 ), 
                names_to = "Year", 
                values_to = 'Price' )

# Goal: study all four years
df %>% 
  pivot_longer( cols = c( sale2014, sale2015, sale2016, sale2017 ), 
                names_to = "Year", 
                values_to = 'Price' )
# Or like this
df %>% 
  pivot_longer( cols = -c( boss, name, address ), 
                names_to = "Year", 
                values_to = 'Price' )
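
After pivoting, the `Year` column still holds strings such as "sale2014". A possible refinement, sketched here under the assumption that tidyr 1.1.0 or later is installed (for `names_transform`), is to strip the prefix and convert the year to an integer in the same call. `df_demo` is a made-up, cut-down copy of the store table.

```r
library(tidyr)

df_demo <- data.frame(
  name = c("store1", "store2"),
  sale2014 = c(3000, 2500),
  sale2015 = c(3020, 2800)
)

long <- pivot_longer(
  df_demo,
  cols = starts_with("sale"),                 # select columns by name pattern
  names_to = "Year",
  names_prefix = "sale",                      # drop the "sale" prefix
  names_transform = list(Year = as.integer),  # turn "2014" into the integer 2014
  values_to = "Price"
)
long
```

With a numeric `Year`, the column can go straight onto a continuous x axis in ggplot.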

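Getting started with ggplot

First, be clear about the ggplot plotting framework

The aim here is to get you using ggplot for basic plots.
You need to understand: the structure of a ggplot plot

Goal: check whether, for each iris species, the measurements generally follow the ordering Sepal.Length > Sepal.Width > Petal.Length > Petal.Width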

# Data needed: value = the four measurements above; key = the property name

# 1 Get the data
df <- iris %>%
  gather( key = "Property",  # Property: name of the measurement
          value = "Value", 
          -c( Species ) )

head( df )

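# 2 Plot: the x axis separates the species; on y, color distinguishes the four properties of each species
# Basic structure:
#   ggplot( data = , mapping = aes() ) + # collect the data and mappings here
#     geom_XXX() +  # draw with that data here
#     ...

ggplot( data = df, mapping = aes( Species, Value, color = Property ) ) +
  geom_point() # the plot shows the ordering holds for setosa; for versicolor and virginica, Petal.Length is larger than Sepal.Width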


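# # First, the basic structure of ggplot
# ggplot() +   # step one: choose the data and mappings such as x and y (no quotes around column names)
#   geom_point() +   # this draws a layer
#   geom_point() ...  # later layers are drawn on top of earlier ones without erasing them

## Now tweak a few details of the plot above
ggplot( data = df, mapping = aes( Species, Value, color = Property ) ) +  # the column is named Property
  geom_point() + ## look closely: the two point layers are stacked on top of each other
  geom_point( color = "red", size = 1, alpha = 0.2, pch = 10 )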
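
The ordering question can also be checked numerically rather than by eye. This is a minimal sketch, assuming dplyr and tidyr are installed; it compares species means only, which is one possible reading of "generally", and the names `means` and `ordering_holds` are made up for this example.

```r
library(dplyr)
library(tidyr)

# Mean of each measurement within each species
means <- iris %>%
  pivot_longer(cols = -Species, names_to = "Property", values_to = "Value") %>%
  group_by(Species, Property) %>%
  summarise(Mean = mean(Value), .groups = "drop") %>%
  pivot_wider(names_from = Property, values_from = Mean)

means

# TRUE where the proposed ordering holds for the species means
means %>%
  mutate(ordering_holds = Sepal.Length > Sepal.Width &
                          Sepal.Width > Petal.Length &
                          Petal.Length > Petal.Width) %>%
  select(Species, ordering_holds)
```

On the built-in iris data the check passes for setosa only: in versicolor and virginica the mean Petal.Length exceeds the mean Sepal.Width.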

Multiple layers, annotations, fitted curves

## Elements of a plot: data, mappings, layers
library( tidyverse )
citation( 'ggplot2' )  # show how to cite ggplot2
citation()

## ggplot needs two things: (data: data) and (mapping: mapping)
# data = , mapping = aes( x, y, color, fill = )

## Data
# Use the diamonds data set
set.seed( 20200804 )
idx = sample( 1:nrow( diamonds ), 5000 )  # draw only 5000 rows
df <- diamonds[idx, ]
str( df )

## A simple plot
ggplot( data = df, mapping = aes( x = carat, y = price, color = cut ) ) +   # data and mapping set here apply to the whole plot
  geom_point( shape = 18, alpha = .5, 
              size = 1 )  # inherits the plot-level data and mapping; only the appearance is adjusted

ggplot() +   # the layer brings its own data and mapping, which other layers do not inherit
  geom_point( data = df, mapping = aes( carat, price, color = cut ) )



## layer: draw several layers, each of which can be a different kind of plot
# Use the iris data set
df <- iris %>%
  gather( key = "Property", 
          value = "Value", 
          -"Species")
ggplot( data = df, mapping = aes( Species, Value, color = Property, shape = Property ) ) + 
  geom_point( size = 2, alpha = 0.5 ) + 
  geom_boxplot( size = 1, alpha = 0.5 )

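## Other kinds of lines
# Fitted curve
ggplot( data = iris, mapping = aes( Sepal.Length, Petal.Length ) ) +   # two length measurements
  geom_point() +  # draw the points
  geom_line( color = 'blue', linetype = 'dotted' ) +  # connect the points in order of x
  geom_smooth( method = 'lm', color = 'red', 
               size = 2, linetype = 3  )  # the default for small data is a loess smoother; here a linear fit is requested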


## Adding other information
# A label for each point
ggplot( data = iris, mapping = aes( Sepal.Length, Petal.Length ) ) + 
  geom_point() + 
  geom_text( aes( label = Species ), check_overlap = T )

# Title (drawn as text inside the panel)
ggplot( data = iris, mapping = aes( Sepal.Length, Petal.Length ) ) + 
  geom_point() +
  geom_text( aes( 4, 7.5, label = "This is title" ), color = 'red' ) + 
  geom_label( aes( 7, 7.5, label = "This is really title " ) )

# Add an annotation
ggplot( data = iris, mapping = aes( Sepal.Length, Petal.Length ) ) + 
  geom_point() +
  annotate("text", x = 6, y = 7.5, 
           label = "This is the annotation", 
           color = 'red', 
           fontface = 'italic' )
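
To make the "data, mapping, layers" idea concrete, a plot object can be inspected after it is built. This is only a sketch, assuming ggplot2 is installed and that the plot object exposes its layers through `p$layers`; the variable name `p` and the label text are made up.

```r
library(ggplot2)

p <- ggplot(data = iris, mapping = aes(Sepal.Length, Petal.Length)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  annotate("text", x = 6, y = 7.5, label = "three layers")

# One entry per geom_* / annotate() call
length(p$layers)

# Class of the geom used by each layer
sapply(p$layers, function(layer) class(layer$geom)[1])
```

Each `+ geom_*()` call appends one layer, which is why later layers are drawn over earlier ones.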

Density plots and histograms

## Geoms

## Load the data
library( tidyverse)
set.seed( 2020 )
idx = sample( nrow( diamonds ), 5000 )
df <- diamonds[idx, ]

## density and histogram
# A histogram and a density plot both show the distribution of a variable
# Both take one-dimensional data
ggplot( data = df, mapping = aes( carat, color = cut, fill = cut ) ) + 
  geom_density( alpha = .5 )

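ggplot( data = df, mapping = aes( carat ) ) +  # histograms do not overlay well, so draw a single one
  geom_histogram( alpha = .5, color = 'white', fill = 'red' )
  # fill is the color inside the bars
  # color is the color of the bar outlines
# Price distribution of diamonds for each color grade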
ggplot( diamonds, aes( price, color = color ) ) + 
  geom_density( size = 2 )

ggplot( df, aes( carat, depth, group = cut ) ) + 
  geom_density_2d()
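
The relationship between the two plot types is easier to see when both are drawn on the same scale. A minimal sketch, assuming ggplot2 3.3.0 or later (for `after_stat()`); `demo`, `p` and `bars` are names made up for this example, and the bin count of 30 is an arbitrary choice.

```r
library(ggplot2)

set.seed(2020)
demo <- diamonds[sample(nrow(diamonds), 5000), ]

# Histogram rescaled to densities, with the density curve on top
p <- ggplot(demo, aes(carat)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "grey80", color = "white") +
  geom_density(color = "red")
p

# On the density scale, the bar areas add up to 1
bars <- ggplot_build(p)$data[[1]]
sum(bars$density * (bars$xmax - bars$xmin))
```

The density curve is a smoothed version of the same distribution the rescaled bars show.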

Bar charts

## Geoms

## Load the data
library( tidyverse)
set.seed( 2020 )
idx = sample( nrow( diamonds ), 5000 )
df <- diamonds[idx, ]

## barplot
# Data preparation
df2 = df[1:100, ]
df3 <- df2 %>%  # summarising beforehand makes the plot easier to control
  group_by( color ) %>%
  summarise( mean = mean( price ) )

ggplot( data = df3, mapping = aes( color, mean, fill = color ) ) +
  geom_bar( stat = 'identity' )

# Stacked bar chart (similar to the layered scatter plot earlier): each bar stacks the prices within a color, split by cut
ggplot( data = diamonds, mapping = aes( color, price, fill = cut ) ) + 
  geom_bar( stat = 'identity' ) 

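# Grouped bar chart (rows with the same color and cut are drawn over each other, so the tallest bar is what shows)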
ggplot( data = diamonds, mapping = aes( color, price, fill = cut ) ) + 
  geom_bar( stat = 'identity', position = position_dodge() ) 

# Mean and total price for one color grade
mean( subset( df2, color == "D" )$price )
sum( subset( df2, color == "D" )$price )
## Horizontal bars: number of diamonds per color, stacked by clarity
ggplot( diamonds, aes( y = color ) ) + 
  geom_bar( aes( fill = clarity ), position = position_stack(reverse = F ))
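
Summarising before plotting, as done for `df3`, has a shorter spelling. The sketch below assumes ggplot2 3.3.0 or later and dplyr; it shows `geom_col()` (equivalent to `geom_bar( stat = 'identity' )`) and, as an alternative, letting ggplot2 compute the mean itself. The names `demo`, `means`, `p1` and `p2` are made up.

```r
library(ggplot2)
library(dplyr)

set.seed(2020)
demo <- diamonds[sample(nrow(diamonds), 5000), ]

# Option 1: summarise first, then draw the values as they are
means <- demo %>%
  group_by(color) %>%
  summarise(mean_price = mean(price))

p1 <- ggplot(means, aes(color, mean_price, fill = color)) +
  geom_col()

# Option 2: let ggplot2 compute the mean for each bar
p2 <- ggplot(demo, aes(color, price, fill = color)) +
  geom_bar(stat = "summary", fun = "mean")

p1
p2
```

Both plots should show the same bar heights; option 1 keeps the summary table around for inspection, option 2 is shorter.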

Line charts and point-line charts

## Geoms

## Load the data
library( tidyverse)
df <- mtcars %>%
  group_by( cyl ) %>%
  summarise( mpg_m = mean( mpg ) )
df


## Basic plots
# Scatter plot
ggplot( df, aes( x = cyl, y = mpg_m ) ) + 
  geom_point( size = 7, shape = 17, color = 'purple' )

# Line chart
ggplot( df, aes( x = cyl, y = mpg_m ) ) + 
  geom_line( size = 1, linetype = 2, color = 'purple' )

# Both combined -> point-line chart
ggplot( df, aes( x = cyl, y = mpg_m ) ) + 
  geom_point( size = 7, shape = 17, color = 'purple' ) + 
  geom_line( size = 1, linetype = 2, color = 'purple' )

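## grouped
# Always pay attention to the type of each variable!
df <- mtcars %>%
  group_by( cyl, am ) %>%
  summarise( mpg_m = mean( mpg ) ) %>%
  mutate_at( vars( am ), factor )   # convert am to a factor; otherwise it is treated as a continuous variable
df

# Distinguishing groups of points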
ggplot( data = df, mapping = aes( x = cyl, y = mpg_m, color = am, 
                                  shape = am ) ) + 
  geom_point()

ggplot( data = df, mapping = aes( x = cyl, y = mpg_m, shape = am ) ) + 
  geom_point()

# Distinguishing groups of lines
ggplot( data = df, mapping = aes( x = cyl, y = mpg_m, color = am, linetype = am ) ) + 
  geom_line( size = 1 )


# Combining both: grouped point-line chart
ggplot( df, aes( x = cyl, y = mpg_m, linetype = am, shape = am, color = am ) ) + 
  geom_point( size = 4 ) + 
  geom_line( size = 2 )

## When x is a factor, set group explicitly: ggplot would otherwise group by every discrete variable, including x, and no lines would be drawn
df$cyl <- as.factor( df$cyl )
ggplot( df, aes( cyl, mpg_m, group = am, color = am, shape = am, linetype = am ) ) + 
  geom_point( size = 2 ) +  
  geom_line( size = 1 )

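## Time series plot
df = read.csv( "./Data1/tsdata.csv" )
df$Date <- as.Date( df$Date )  # convert to Date

Sys.setlocale( "LC_TIME", "English" )  # English date labels; "English" is a Windows locale name, other systems may need e.g. "en_US.UTF-8"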
ggplot( df, aes( Date, Births ) ) + 
  geom_line()
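
Since `tsdata.csv` is a local file, here is a comparable sketch that runs without it, using the `economics` data set bundled with ggplot2 as a stand-in. The cut-off date and the two-year tick spacing are arbitrary choices for illustration, and `ts_demo` is a made-up name.

```r
library(ggplot2)

# economics ships with ggplot2; its date column is already of class Date
ts_demo <- economics[economics$date >= as.Date("2000-01-01"), ]

ggplot(ts_demo, aes(date, unemploy)) +
  geom_line() +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") +  # control tick spacing and label format
  labs(x = "Year", y = "Unemployed (thousands)")
```

`scale_x_date()` only works when the x column is a `Date`, which is why the `as.Date()` conversion above matters.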

Bubble plots, box plots, regression fits, area charts, pie charts

library( tidyverse )

## Bubble plot
ggplot( mtcars, aes( wt, mpg ) ) + 
  geom_point( aes( size = cyl ), alpha = .5 )

ggplot( mtcars, aes( wt, mpg ) ) + 
  geom_point( size = mtcars$cyl, alpha = .5 )

ggplot( mtcars, aes( wt, mpg, color = factor( cyl ) ) ) + 
  geom_point( aes( size = cyl ), alpha = .5 ) + 
  scale_size( breaks = c( 7, 8, 10 ) )

mtcars$cyl <- as.factor( mtcars$cyl )
ggplot( mtcars, aes( wt, mpg, color = cyl ) ) + 
  geom_point( aes( size = cyl ) )


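## Fitted line: fit a line
# A more flexible curve generally fits the observed points more closely, but it tends to predict new data poorly (overfitting) and makes the underlying model harder to read off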
ggplot( mtcars, aes( wt, mpg ) ) + 
  geom_point() + 
  # geom_smooth( method = 'lm', formula = y ~ x, color = 'red', se = F)
  # geom_smooth( method = 'lm', formula = y ~ log( x ), color = 'red', se = F)
  # geom_smooth( method = 'lm', formula = y ~ poly( x, 2 ), color = 'red', se = F )
  # geom_smooth( method = 'lm', formula = y ~ poly( x, 3 ), color = 'red', se = T ) 
  geom_smooth( method = 'loess', formula = y ~ x, color = 'red', se = T ) 

## boxplot
df <- data.frame( group1 = rep( letters[1:5], each = 10, length = 50 ), 
                  group2 = rep( c( 'No.1', 'No.2' ), each = 5, length = 50 ), 
                  value = rnorm( 50, mean = 100, sd = 20 ) ) 

ggplot( df, aes( group1, value, fill = group2 ) ) +
  geom_boxplot( alpha = .5 ) + 
  geom_jitter( width = 0.1, alpha = .5 )


# Violin plot
ggplot( df, aes( group1, value, fill = group2 ) ) + 
  geom_violin( trim = F )  # keep the tails; a violin plot is a density curve mirrored around its axis

# Area chart
df <- read.csv( "./Data1/tsdata.csv" )
head( df )
df$Date <- as.Date( df$Date )
ggplot( df[1:20, ], aes( Date, Births ) ) + 
  geom_area( stat = 'identity', fill = 'orange' )

# Pie chart: pie 
df <- data.frame( cate = LETTERS[1:5], 
                  value = c( 10, 20, 30, 40, 50 ) )
df$per <- round( df$value / sum( df$value ), 4 ) * 100

ggplot( df, aes( x = "", y = value, fill = cate ) ) + 
  geom_bar( stat = 'identity', position = position_stack( reverse = T ) ) +   # first draw a stacked bar chart
  geom_text( aes( y = cumsum( value ) - value / 2, label = paste( per, '%', sep = '' ) ) ) +   # place each label at the middle of its slice
  coord_polar( theta = 'y', start = 0 ) +   # convert to polar coordinates
  theme_void()
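
The remark in the fitted-line part about more flexible curves fitting better can be illustrated with plain `lm()`. This is a rough sketch using base R only; it shows the in-sample side of the argument (residual sum of squares never increases with polynomial degree) and uses AIC as one possible penalty for complexity. It does not measure prediction error directly. `fits` and `rss` are made-up names.

```r
# Fit polynomials of increasing degree to mpg ~ wt
fits <- lapply(1:4, function(d) lm(mpg ~ poly(wt, d), data = mtcars))

# Residual sum of squares on the data used for fitting
rss <- sapply(fits, function(m) sum(residuals(m)^2))
round(rss, 1)

# AIC penalises the extra parameters
sapply(fits, AIC)
```

A falling RSS with a flat or rising AIC is a hint that the extra terms are fitting noise rather than structure.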

Rose charts, heat maps, error bars and segments

library( tidyverse )

## Single-group rose chart
df <- data.frame( cate = LETTERS[1:5], 
                  value = c( 10, 20, 30, 40, 50 ) )
ggplot( data = df, mapping = aes( x = cate, y = value, fill = cate ) ) + 
  geom_bar( stat = 'identity', width = 1 ) + 
  coord_polar( theta = 'x' ) +
  theme_bw()

## Multi-group rose chart
df <- data.frame( cate = rep( LETTERS[1:5], each = 5 ), 
                  group = rep( letters[1:5], 5 ), 
                  value = rnorm( 25, 20, 5 ) )

ggplot( df, aes( cate, value, fill = group ) ) + 
  geom_bar( stat = 'identity' ) + 
  coord_polar()

# Grouped (dodged) version
ggplot( df, aes( cate, value, fill = group ) ) + 
  geom_bar( stat = 'identity', position = position_dodge() ) + 
  coord_polar()

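## Radar chart
library( ggradar )  # not on CRAN; installed from GitHub (ricardo-bion/ggradar)
library( scales )

# Prepare the data
mtcars_radar <- mtcars %>%
  as_tibble( rownames = 'group' ) %>%  # move the row names into a column named group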
  mutate_at( vars( -group ), rescale ) %>%
  tail( 4 ) %>%
  select( 1:10 )

ggradar( mtcars_radar[-c( 8 , 9 )] )

## P.S. swap the x and y axes
ggplot( df, aes( cate, value, fill = cate ) ) + 
  geom_bar( stat = 'identity' ) + 
  coord_flip()

## heat plot
df <- as.matrix( mtcars )  # a numeric matrix is required
heatmap( cor( df ) )

## In ggplot2
df <- cor( mtcars ) 
df <- as.data.frame( df )
df$Variable1 <- rownames( df )
# Convert to long format
df <- df %>%
      gather( key = 'Variable2', 
              value = 'Cor', 
              -c( Variable1 ) )

# Plot
ggplot( df, aes( Variable1, Variable2 ) ) + 
  geom_tile( aes( fill = Cor ), color = 'white' ) + 
  scale_fill_gradient2( low = 'blue', high = 'red', mid = 'white', 
                        midpoint = 0, limits = c( -1, 1 ) )  # argument name spelled out in full


## Segments and error bars
df <- data.frame( a = letters[1:5], 
                  b = 1:5, 
                  c = 6:10 )
ggplot() + 
  geom_point( data = df, aes( a, b ) ) + 
  geom_point( data = df, aes( a, c ) ) + 
  geom_segment( data = df, aes( a, b, xend = a, yend = c ), 
                color = 'red', linetype = 1, 
                arrow = arrow( length = unit( 0.5, 'cm' ) ) )

## Error bars
df$err1 <- df$b*.1
df$err2 <- df$c*.1

ggplot() + 
  geom_point( data = df, aes( a, b ), size = 3, color = 'blue' ) + 
  geom_point( data = df, aes( a, c ), size = 3, color = 'red' ) + 
  geom_errorbar( data = df, aes( a, b, ymin = b - err1, ymax = b + err1 ), 
                 color = 'blue', width = .05 ) + 
  geom_errorbar( data = df, aes( a, c, ymin = c - err2, ymax = c + err2 ), 
                 color = 'red', width = .05 ) + 
  geom_text( data = df, aes( a, b + .75, label = a ) ) + 
  geom_text( data = df, aes( a, c + 1.5, label = a ) )
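
In the example above the error sizes are simply 10% of each value. In practice they usually come from the data. The sketch below is one common choice, mean plus or minus one standard error per group, assuming ggplot2 and dplyr are installed; `stats`, `mean_mpg` and `se` are names made up for this illustration.

```r
library(ggplot2)
library(dplyr)

# Mean and standard error of mpg for each cylinder count
stats <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%   # harmless if cyl is already a factor
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg),
            se = sd(mpg) / sqrt(n()))

ggplot(stats, aes(cyl, mean_mpg)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_mpg - se, ymax = mean_mpg + se), width = .1)
```

Whether to show standard errors, standard deviations or confidence intervals is a reporting decision; the plotting code stays the same apart from how `se` is computed.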

Reference lines, axes, legends, titles, colors, shapes, themes, facets

library( tidyverse )

## The iris data
df <- iris[, -c(3, 4)]  # drop the two Petal columns, keep the Sepal measurements
colnames( df ) <- c( 'SepalL', 'SepalW', '种类' )  # '种类' means species
head( df )

p <- ggplot( data = df, aes( SepalL, SepalW, color = 种类, fill = 种类 ) ) + 
  geom_point()

# Add dashed lines to serve as reference lines
p + geom_hline( aes( yintercept = 3.25), linetype = 2 )+ 
    geom_vline( aes( xintercept = 6), linetype = 2  )

# Line segments: start from a point and extend along one axis
p + geom_linerange( aes( x = 5, y = 2, ymin = 2, ymax = 3 ), color = 'red' ) + 
    geom_linerange( aes( x = 6, y = 2, ymin = 1.5, ymax = 2.5 ), color = 'blue' ) + 
    geom_linerange( aes( x = 6, y = 2, xmin = 5, xmax = 6 ), color = 'black' )

# Title
p + ggtitle( "I\'m \ndoing great !" ) +
  theme( plot.title = element_text( hjust = 0.5, face = 'bold' ) )

# X Y label
p + labs( title = "ENHen\'", x = 'X', y = 'Y' )

# Legend: legend
p + ggtitle( "Title" ) +
  theme( plot.title = element_text( hjust = 0.5, face = 'bold' ) ) + 
  theme( legend.position = 'right' )  # or 'bottom', 'top' 

p + ggtitle( "Title" ) +
  theme( plot.title = element_text( hjust = 0.5, face = 'bold' ) ) + 
  theme( legend.position = c( .8, .9 ) )  # relative coordinates inside the plot

## Color or shape
head( df )
ggplot( df, aes( SepalL, SepalW, color = 种类, shape = 种类 ) ) + 
  geom_point( size = 3, alpha = .75 ) + 
  scale_color_manual( values = c( 'red', '#3D3936', '#862DB3' ) ) + 
  scale_shape_manual( values = c( 'triangle', 'square', 'circle' ) )

## Coordinate system
p + coord_cartesian( xlim = c( 4, 8 ), ylim = c( 2, 4.5 ) ) + 
    scale_x_continuous( breaks = c( 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8 ) )

## theme
# Built-in themes
p + theme_bw()
p + theme_classic()  
p + theme_minimal()

# Modify an existing theme
p + theme_classic() + 
    theme( legend.position = c( .8, .9 ) )

p + theme_classic() + 
  theme( legend.position = c( .8, .9 ), 
         panel.background = element_rect( fill = 'black' ) )

## facet: panels
# One panel per species
ggplot( df, aes( SepalL, SepalW, color = 种类 ) ) + 
  geom_point( size = 3 ) + 
  facet_grid( .~种类 ) + 
  theme( legend.position = c( .5, .8 ) )


library( reshape2 )  # provides the tips data set
head( tips )
# Scatter plot 1
ggplot( data = tips, mapping = aes( total_bill, tip ) ) + 
  geom_point( size = 3 )

# Scatter plot 2
ggplot( data = tips, mapping = aes( total_bill, tip, color = sex ) ) + 
  geom_point( size = 3 ) + 
  facet_grid( .~smoker )

ggplot( data = tips, mapping = aes( total_bill, tip ) ) + 
  geom_point( size = 1 ) + 
  facet_grid( smoker ~ sex )
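
`facet_grid()` lays panels out by one or two variables. When there is a single faceting variable with several levels, `facet_wrap()` is a common alternative. The following sketch assumes ggplot2 and tidyr are installed and ties faceting back to the wide-to-long reshaping from the start of the post; `iris_long` is a made-up name and the two-column layout is arbitrary.

```r
library(ggplot2)
library(tidyr)

# One row per (flower, measurement) pair
iris_long <- pivot_longer(iris, cols = -Species,
                          names_to = "Property", values_to = "Value")

# One panel per measurement, each with its own y range
ggplot(iris_long, aes(Species, Value, fill = Species)) +
  geom_boxplot() +
  facet_wrap(~ Property, ncol = 2, scales = "free_y") +
  theme(legend.position = "none")
```

Free y scales make each panel readable on its own, at the cost of making heights harder to compare across panels.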
