Common R Functions for Data Processing

zhengxj_

Last modified: 2023-02-20 10:23:37


First published: 2022-09-03 21:54:12

Article link: https://blog.csdn.net/zhengxj_/article/details/126683118


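This article covers commonly used R functions for data processing: deduplication, subsetting, data reshaping, control flow, string manipulation and sorting. It focuses on using melt to prepare data for ggplot2 and on if, repeat, while and for. It also covers string replacement, filtering rows by pattern, and approximate comparison of floating-point numbers with near().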


R (RStudio) Keyboard Shortcuts

library(tidyverse)
library(data.table)
library(stringr)

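data(mtcars)  ## demo dataset

data <- fread("", header = TRUE, sep = "\t", data.table = FALSE)  ## put the file path between the quotes
data <- fread("", header = TRUE, sep = ",", data.table = FALSE)
data <- read.table("data.csv", header = TRUE, sep = ",")
sep is a formal argument of these functions. CSV files are comma-separated, so use sep = ","; TSV files are tab-separated, so use sep = "\t".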
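To make the role of sep concrete, here is a minimal base-R sketch that writes the same small table once as CSV and once as TSV, then reads each back with the matching separator. The data frame and the temporary file paths are made up for illustration.

```r
# Toy table (hypothetical data) written to two temporary files
df <- data.frame(gene = c("A", "B"), value = c(1.5, 2.5))

csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write.table(df, csv_path, sep = ",",  row.names = FALSE, quote = FALSE)
write.table(df, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)

# The separator passed when reading must match the one used in the file
from_csv <- read.table(csv_path, header = TRUE, sep = ",")
from_tsv <- read.table(tsv_path, header = TRUE, sep = "\t")
identical(from_csv, from_tsv)
```

Reading the CSV file with sep = "\t" would instead produce a single column, because no tab is ever found.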

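Alt + "-": insert the assignment operator "<-" that R recommends
Tab: autocomplete a partially typed variable name
Ctrl + ↑: recall earlier commands in the console (handy after picking the wrong variable)


Ctrl + 1: move focus to the source editor (upper left)
Ctrl + 2: move focus to the console (lower left)
Ctrl + L: clear the console (lower left)
Ctrl + Shift + N: create a new R script
Ctrl + W / Ctrl + Shift + W: close the current script / all open scripts
Ctrl + O / Ctrl + S: open a file / save the file (the dialog starts in the working directory)
Ctrl + Enter: run the current line or selection
Ctrl + Shift + S: source the whole script
Ctrl + Shift + Enter: source the whole script, echoing each line
Ctrl + Shift + M: insert the pipe operator "%>% "
Ctrl + Shift + 1: maximize the source editor (upper left)
Ctrl + Shift + 3: maximize the lower-right pane
Ctrl + Shift + C: comment or uncomment the selected lines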
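ls()  # list the objects in the current workspace (compare with the Global Environment pane)

rm(x)  # remove the object x from the current workspace

rm(list = ls())  # remove every object from the workspace

ls("package:stringr")  ## list all functions exported by an attached package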


num_char <- as.character(num)  # integer to character
weight_char <- as.character(weight)  # numeric to character
married_char <- as.character(married)  # factor to character
as.numeric()  # convert to numeric
as.data.frame()  # convert to a data frame
as.matrix()  # convert to a matrix

Reading data
data[row.names(data)=="ENSG00000223972.5",]  # select the row whose row name is ENSG00000223972.5

Object

assign
> for (i in 1:5) {
 assign(paste0("name", i), i * 10)      ## create 5 variables in one go: name1 ... name5
 }
> name1
[1] 10
> name2
[1] 20
> name5
[1] 50
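Variables created with assign() can be read back by name with get() or mget(). The sketch below also shows a named list as an alternative that avoids scattering loose variables in the workspace; the name res is only illustrative.

```r
# Create name1 ... name5 as in the loop above
for (i in 1:5) {
  assign(paste0("name", i), i * 10)
}

get("name3")                        # look up one variable by its name
vals <- mget(paste0("name", 1:5))   # collect all five into a named list

# Alternative: keep the values in one named list from the start
res <- setNames(as.list((1:5) * 10), paste0("name", 1:5))
res$name3
```

A list is usually easier to loop over or pass to lapply() than a set of numbered variables.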

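Data tidying

Deduplication: distinct, unique, duplicated

distinct(data, gene, .keep_all = TRUE)  ### drop rows with a repeated gene name, keeping the first one (dplyr)
avereps(expr[,-1], ID = expr$X)  # limma: average the expression of duplicated gene names, then use the gene names as row names

unique() returns the distinct elements of a vector (it also works on data frames)
a=c(1,2,2,3,3,3,3,4,5)
b=unique(a)
> b
[1] 1 2 3 4 5

duplicated() flags repeated elements of a vector or repeated rows of a data frame

distinct(data, column_name, .keep_all = FALSE)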
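As a base-R counterpart to distinct(..., .keep_all = TRUE), duplicated() can be used as a row filter. This is a small sketch on a made-up data frame; the gene names and values are illustrative only.

```r
df <- data.frame(gene = c("TP53", "EGFR", "TP53", "MYC"),
                 expr = c(1.2, 3.4, 5.6, 7.8))

duplicated(df$gene)                   # TRUE marks the second and later occurrences
dedup <- df[!duplicated(df$gene), ]   # keep the first row of each gene
unique(df$gene)                       # distinct gene names only
```

Here dedup keeps three rows and drops the second TP53 row.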

Subsetting with subset

1. Take a subset
subset(mtcars, carb == 4)  ## several conditions can be combined with & and |
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
... (remaining rows omitted)


2. Show only selected columns of the subset
subset(mtcars, carb == 4, select = mpg)
                     mpg
Mazda RX4           21.0
Mazda RX4 Wag       21.0
Duster 360          14.3
... (remaining rows omitted)
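A short sketch of a multi-condition query on the same demo dataset, combining two conditions with & and selecting two columns. The variable name six_cyl_4carb is just illustrative.

```r
data(mtcars)

# Cars with 4 carburetors AND 6 cylinders; keep only mpg and hp
six_cyl_4carb <- subset(mtcars, carb == 4 & cyl == 6, select = c(mpg, hp))
rownames(six_cyl_4carb)
```

Inside subset(), column names can be used unquoted in both the condition and select.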

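Reshaping data with melt for ggplot2

ggplot2 usually expects long ("melted") data; convert with melt() from the reshape2 package.

melt and dcast both come from the reshape2 package. melt converts wide data to long data and dcast converts long data to wide data, so the two act as inverses of each other.

reshape2::melt(data, id.vars, measure.vars, variable.name = "variable", ..., na.rm = FALSE, value.name = "value", factorsAsStrings = TRUE)

id.vars: identifier variables (they stay as columns, unchanged)
measure.vars: measured variables (the columns to stack into a single column)
variable.name: name of the new column that holds the original column names; defaults to "variable"
value.name: name of the new column that holds the values; defaults to "value"

The conversion needs to know which columns are id variables and which are measured variables.

reshape2::dcast(data, variable1 + variable2 ~ variable3, value.var = "value")

Example:

melt(score,id.vars=c("LH","id"))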
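The round trip between wide and long form can be sketched as follows. This assumes the reshape2 package is installed; the score data frame and the column names subject and score are invented for illustration and are not the author's data.

```r
library(reshape2)

# Hypothetical wide table: one row per student, one column per subject
score <- data.frame(LH = c("a", "a", "b"), id = 1:3,
                    math = c(90, 80, 70), english = c(85, 75, 65))

# Wide -> long: math and english are stacked into subject/score pairs
long <- melt(score, id.vars = c("LH", "id"),
             variable.name = "subject", value.name = "score")

# Long -> wide: one column per subject again
wide <- dcast(long, LH + id ~ subject, value.var = "score")
```

The long table has one row per student and subject, which is the shape ggplot2 aesthetics such as fill = subject expect.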

The if statement

1. A single if statement
if(cond) {expr}
x <- c("Bio","Info","Cloud","BioInfoCloud")
if("BioInfoCloud" %in% x) {
   print("BioInfoCloud is found")
} 
[1] "BioInfoCloud is found"

2. if ... else
If the condition after if holds, the statement between if and else runs; otherwise the statement right after else runs. If either branch contains more than one statement, wrap the statements in braces.
if(condition) {expr1}else{expr2}

x=2
if(x==2) {print("x==2")}else{print("x!=2")}
[1] "x==2"

x=3
if(x==2) {print("x==2")}else{print("x!=2")}
[1] "x!=2"

x=2
if(x==2) {print("x==2")
print("x=2*1")}else{print("x!=2")}
[1] "x==2"
[1] "x=2*1"

3. Chained and nested if...else
(1) if (cond1) {expr1} else if (cond2) {expr2} ... else {exprx}, or
(2) if (cond) { if (cond1) {expr1} else {expr2} } else if (cond2) { if (cond3) {...} else {...} } else {...}

x=20
if(x<=60){print("fail")
    } else if(x>60 && x<=70){print("pass")
    } else if(x>70 && x<=80){print("good")
    }else if(x>80 && x<100){print("excellent")
    }else if(x==100){print("full marks")
    } else{print("x not found")
    }

4. ifelse()
# ifelse() is vectorized: it tests every element of y and returns a vector of the same length
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
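The difference between if and ifelse() is easiest to see on a plain vector. The sketch below applies the same out-of-range rule as the diamonds example to a made-up vector y, without needing ggplot2 or dplyr.

```r
y <- c(1, 5, 25, 10)

# ifelse() evaluates the condition element by element
cleaned <- ifelse(y < 3 | y > 20, NA, y)
cleaned

# if() expects a single TRUE/FALSE, so test one element at a time
if (y[1] < 3) print("first value is below 3")
```

Values outside the range 3 to 20 become NA while the others are kept unchanged.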

repeat loop

# commands is the code to run repeatedly; condition is the exit test that ends the loop.
repeat { 
   commands 
   if(condition) {
      break
   }
}


bio = 1
repeat { 
   print(bio)
   bio = bio +1
   if(bio>5) {
      break
   }
}

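while loop

# A while loop repeats a statement until the condition is no longer true.
# cond is the loop condition: the body runs while it is TRUE, and the loop exits once it is FALSE.
while(cond){expr}

i <- 1
# i is the loop counter
while( i <= 4)
{
print(i)
i <- i + 1 ## increment the counter, otherwise the loop never ends
}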
[1] 1
[1] 2
[1] 3
[1] 4

for loop

# A for loop runs its body once for each element of the sequence.
for (value in vector) {
   statements
   }

n=3
num<-array(0,dim=c(n,1))
z=c(5,4,2,6,1)
for (i in 1:n){
    num[i]=z[i]+1 
}
> num
     [,1]
[1,]    6
[2,]    5
[3,]    3

The switch function

switch(expression, case1, case2, case3....)

centre <- function(x, type) {
  switch(type,
         mean = mean(x),
         median = median(x),
         trimmed = mean(x, trim = .1))
}
# type selects which branch of switch runs; the three available options are mean, median and trimmed

x <- rcauchy(10)
> x
 [1]  3.0658318 -2.0691489 -1.1984827  1.2027225 -1.0840769 -1.4111171 -0.2872199  0.4009320 -1.5269106  0.8661515

centre(x, 1)  # same as centre(x, "mean"): a numeric value selects the alternative at that position
[1] -0.2041318


An example with character input:
ccc <- c("b","QQ","a","A","bb")
for(ch in ccc)
  cat(ch,":", switch(EXPR = ch, a = 1, b = 2:3), "\n") 
## cat() prints its arguments, similar to print(), but without the [1] index or quotes
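Two details of switch() are worth a sketch: an unnamed final argument acts as the default branch, and a character value that matches nothing (with no default) yields NULL. The stop() default and the sample vector below are additions for illustration, not part of the original centre() function.

```r
centre <- function(x, type) {
  switch(type,
         mean = mean(x),
         median = median(x),
         trimmed = mean(x, trim = 0.1),
         stop("unknown type: ", type))  # unnamed last argument: default branch
}

x <- c(1, 2, 3, 4, 100)
centre(x, "median")   # selected by name
centre(x, 2)          # selected by position: the 2nd alternative, median

# No match and no default: switch returns NULL invisibly
r <- switch("QQ", a = 1, b = 2:3)
is.null(r)
```

This is why the loop over ccc prints nothing after the colon for "QQ", "A" and "bb".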

String handling

The cat function

cat(..., file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE)

num=10
cat("num+1 = ", num + 1,sep="")
num+1 = 11

cat("num+1 = ", num + 1,8,9,sep="")
num+1 = 1189

cat("num+1 = ", num + 1,8,9,sep=":")
num+1 = :11:8:9

cat("num+1 = ", num + 1,8,9,sep=" ")
num+1 =  11 8 9
Pasting strings with paste

TCGAFPKMgene$id=paste(TCGAFPKMgene$gene,",") #### append a "," marker to each gene name (paste puts sep = " " before it by default)
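Because paste() joins its arguments with a space by default, the result of the line above contains a space before the comma. A minimal sketch of the three common variants, on a made-up gene vector:

```r
gene <- c("TP53", "EGFR")

with_space <- paste(gene, ",")             # default sep = " "
no_space   <- paste0(gene, ",")            # paste0 uses sep = ""
one_string <- paste(gene, collapse = ",")  # collapse joins the elements

with_space
no_space
one_string
```

Use paste0() when the marker must attach directly to the name.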

Repeating with rep and sequences with seq

> x3 <- rep(1:3, 2)
> x3
[1] 1 2 3 1 2 3
> x4 <- seq(0, 10, 2)
> x4
[1]  0  2  4  6  8 10

Printing progress inside a loop
> for(i in 1:10){
+   cat("Processing iteration", i,"... ...", "\n")
+ }
Processing iteration 1 ... ... 
Processing iteration 2 ... ... 
Processing iteration 3 ... ... 
...

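String replacement with gsub

TMP=gsub("-01R","",colnames(tpmdata))

In a regular expression "." is a wildcard that matches any character, so to replace a literal dot it must be escaped or matched as fixed text:
b=gsub("[.]","-",a)
b=gsub("\\.","-",a)
b=gsub(".","-",a,fixed=TRUE)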
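The effect of forgetting to escape the dot can be seen directly. In this sketch the sample string a is made up; the three correct forms are the ones listed above.

```r
a <- "TCGA.AB.1234"

wrong  <- gsub(".", "-", a)                # "." matches every character
right1 <- gsub("[.]", "-", a)              # character class
right2 <- gsub("\\.", "-", a)              # escaped dot
right3 <- gsub(".", "-", a, fixed = TRUE)  # literal matching, no regex

wrong
right1
```

The unescaped version turns the whole string into dashes, while the other three replace only the two dots.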

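grep: useful for filtering rows by pattern

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
The arguments are:
(1) pattern: a character string containing the regular expression to search for; when fixed = TRUE it is treated as a plain string.
(2) x: the character vector to search in.
(3) ignore.case: whether to ignore case. FALSE means case-sensitive matching, TRUE means case-insensitive matching.
(4) perl: whether to use Perl-compatible regular expressions.
(5) value: logical. If FALSE, grep returns the positions (indices) of the matches; if TRUE, it returns the matching elements themselves.
(6) fixed: logical. If TRUE, pattern is matched literally as written, and conflicting arguments are overridden.
(7) useBytes: logical. If TRUE, matching is done byte by byte rather than character by character.
(8) invert: logical. If TRUE, the indices or values of the elements that do not match are returned, i.e. an inverted search.

name=grep(pattern, x)  # indices of the matching elements
name=x[name]  # the matching elements, same as grep(pattern, x, value = TRUE)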
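The main arguments can be compared side by side on a small vector. This is a sketch with made-up gene names; the row filter at the end uses grepl(), the logical-vector relative of grep().

```r
x <- c("TP53", "tp63", "EGFR", "TP73")

idx     <- grep("^TP", x)                                     # positions
vals    <- grep("^TP", x, value = TRUE)                       # matching elements
nocase  <- grep("^TP", x, ignore.case = TRUE, value = TRUE)   # case-insensitive
rest    <- grep("^TP", x, invert = TRUE, value = TRUE)        # non-matching elements

# Filtering data-frame rows by pattern
df   <- data.frame(gene = x, expr = 1:4)
hits <- df[grepl("^TP", df$gene), ]
```

grepl() returns TRUE/FALSE for every element, which makes it convenient for row indexing.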

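Sorting with sort

sort() ## sort a vector (ascending by default)

Generating random numbers with rnorm

rnorm(n, mean = 0, sd = 1)
n is the number of random values to generate (the length of the result), mean is the mean, and sd is the standard deviation.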
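sort() works on vectors; to sort the rows of a data frame, base R uses order() as a row index. A minimal sketch on a toy vector and the mtcars demo dataset:

```r
v <- c(3, 1, 2)
asc  <- sort(v)                     # ascending
desc <- sort(v, decreasing = TRUE)  # descending

# Sort data-frame rows by a column, highest mpg first
data(mtcars)
by_mpg <- mtcars[order(-mtcars$mpg), ]
rownames(by_mpg)[1]
```

order() returns the permutation of row indices rather than the sorted values, so it can reorder a whole table.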

near

sqrt(2) ^ 2 == 2
#> [1] FALSE
1/49 * 49 == 1
#> [1] FALSE

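Computers use finite-precision arithmetic (they obviously cannot store an infinite number of digits), so remember that every number you see is an approximation. To test whether two floating-point numbers are equal, do not use ==; use near() from dplyr instead: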

near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE
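When dplyr is not loaded, the same tolerance-based comparison can be sketched in base R. To my understanding near() compares the absolute difference against a tolerance of sqrt(.Machine$double.eps) by default; treat that default as an assumption to confirm against the dplyr documentation.

```r
# Tolerance assumed to mirror the default used by dplyr::near()
tol <- sqrt(.Machine$double.eps)

exact_cmp  <- sqrt(2)^2 == 2             # exact comparison fails
approx_cmp <- abs(sqrt(2)^2 - 2) < tol   # tolerance-based comparison

# all.equal() is the base-R helper for approximate equality
base_cmp <- isTRUE(all.equal(1 / 49 * 49, 1))
```

all.equal() returns a description of the difference rather than FALSE when values differ, which is why it is wrapped in isTRUE().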

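Other common summary functions

Measures of spread
sd()
IQR() # interquartile range
mad() # median absolute deviation

Measures of location
mean()
median()

Measures of rank
quantile(x, 0.25) # the value of x that is greater than 25% of the values and less than the remaining 75%
max(x)

Measures of position
first(x)
nth(x, 2)
last(x)

Counts
n() # number of rows in the current group (dplyr); see also count()
sum(!is.na(x))  # number of non-missing values
n_distinct(x)   # number of distinct values

Counts and proportions of logical values
sum(x > 10)
mean(y == 0)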
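A quick base-R sketch of several of these summaries on a small made-up vector. The dplyr helpers first(), nth(), last(), n() and n_distinct() are left out so that the block runs without extra packages.

```r
x <- c(2, 4, 4, 6, 14)

median(x)           # location
IQR(x)              # spread: distance between the 25% and 75% quantiles
quantile(x, 0.25)   # rank
sum(!is.na(x))      # count of non-missing values
length(unique(x))   # count of distinct values (base-R analogue of n_distinct)

# Logical values are treated as 0/1, so sum counts and mean gives a proportion
n_above    <- sum(x > 5)
prop_above <- mean(x > 5)
```

Here two of the five values exceed 5, so the count is 2 and the proportion is 0.4.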

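List access

Accessing S3 attributes (data access): one major difference between S3 and S4 is that S3 objects expose their components through $, while S4 objects expose their slots through @.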
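The two access operators can be sketched with toy objects. The class names record and Record and their fields are invented for illustration.

```r
library(methods)

# S3: typically a list with a class attribute; components are read with $
s3_obj <- structure(list(name = "geneA", value = 1.5), class = "record")
s3_obj$name

# S4: a formally defined class; slots are read with @
setClass("Record", representation(name = "character", value = "numeric"))
s4_obj <- new("Record", name = "geneA", value = 1.5)
s4_obj@name

isS4(s3_obj)   # FALSE
isS4(s4_obj)   # TRUE
```

isS4() is a quick way to check which operator applies to an unfamiliar object.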

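File access

.  ## the current directory
.. ## the parent directory
/  ## the root directory


files <- list.files() ### file names in the current directory
for (f in files){
  newname<-sub('-01A-01R.gene.quantification','',f,fixed=TRUE) # strip this suffix from the file name (matched literally)
  file.rename(f,newname) ## rename the file
}   # rename files in bulk to simplify batch reading and matching later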



clinical <- read.csv(file="./TARGET OS.csv", header = TRUE)
colnames(clinical)[1] <- "patients" ## rename the first column
write.csv(clinical, file = "clinical.csv", row.names = FALSE)

patients <- clinical$patients

ls <- list.files()[-1]  ### file names in the current directory, dropping the first one
ls <- sub(".txt","",ls,fixed=TRUE)  ### strip the .txt extension

head(ls)

included <- clinical[match(ls, clinical$patients),] #### match ls against clinical$patients and take the matching rows of clinical

aa <- na.omit(included$patients) ### drop NA entries (files with no clinical record) to get the matched patient IDs

head(aa)

included <- included[match(aa,included$patients),]

match(aa,ls)

First download the data, then match the gene expression files to the patient IDs in the clinical data.

setwd("E:/scRNA seq/osteosarcoma/target/gene copy")

filenames <- list.files()
for (file in filenames) {
  # read the 4th column of each file
  new.data <- read.table(file, sep = "\t",
                         header = TRUE)[,4]
  if (!exists("all.data")) {
    all.data <- new.data                   # the first file starts the table
  } else {
    all.data <- cbind(all.data, new.data)  # later files are appended as columns
  }
}

setwd("G:/down")
memory.limit(size=100000)  # Windows only; no longer supported in R >= 4.2
library(tidyverse)
gc()
filenames <- list.files()
for (file in filenames) {
  # read the ID column (1) and the value column (4), named after the file
  new.data <- read.table(file, sep = "\t",
                         header = TRUE)[,c(1,4)]
  colnames(new.data)[2] <- file
  if (!exists("all.data")) {
    all.data <- new.data                                  # the first file starts the table
  } else {
    all.data <- merge(all.data, new.data, by = "Name")   # join later files on the Name column
  }
}



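dim(all.data)
counts <- all.data[,-1]  # drop the Name column, keep one column per file
dim(counts)
ls <- sub(".txt","",filenames,fixed=TRUE)
colnames(counts) <- ls
rownames(counts) <- all.data$Name  # gene IDs from the merge key column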
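The same batch merge can be written without exists() by reading all files into a list and folding merge() over it with Reduce(). This is a sketch under assumptions: the toy files, their columns (Name, Length, EffLen, TPM) and the sample names are invented so that the block is self-contained.

```r
# Create three small tab-separated files in a temporary directory (toy data)
tmp_dir <- tempfile("expr_")
dir.create(tmp_dir)
for (s in c("S1", "S2", "S3")) {
  toy <- data.frame(Name = c("g1", "g2"), Length = 100, EffLen = 90,
                    TPM = runif(2))
  write.table(toy, file.path(tmp_dir, paste0(s, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Read columns 1 and 4 of every file, naming the value column after the sample
files <- list.files(tmp_dir, full.names = TRUE)
tables <- lapply(files, function(f) {
  d <- read.table(f, sep = "\t", header = TRUE)[, c(1, 4)]
  colnames(d)[2] <- sub(".txt", "", basename(f), fixed = TRUE)
  d
})

# Merge all tables on the shared Name column, one pair at a time
all.data <- Reduce(function(a, b) merge(a, b, by = "Name"), tables)

counts <- as.matrix(all.data[, -1])
rownames(counts) <- all.data$Name
dim(counts)
```

Because nothing depends on whether all.data already exists, rerunning the block always starts from a clean result.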

The loop above batch-reads the expression files; this time the TPM values are taken.