R语言包应用_r语言 gensa-CSDN博客

本文链接：https://blog.csdn.net/lltanwh/article/details/137187772

本文介绍了R语言的用途，详细阐述了如何安装和使用R包，特别是处理大数据集和创建用户自定义函数。此外，文章深入探讨了模拟退火算法的实现，包括在R中的应用示例，展示了如何找寻函数的最大值。最后，提到了遗传算法和遗传编程在R中的应用，如mcga和genalg包，以及它们在全局优化问题中的作用。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

R是一个程序设计语言

R的书籍：《数据挖掘与R语言》《R语言实战》《R语言编程艺术》

R的安装

设置工作目录：setwd()
查看工作目录：getwd（）
Install.packages():安装包
R处理大数据集
R的用户自定义函数
R访问MySQL数据库：（1）安装RODBC包（2）在http://dev.mysql.com/downioads/connector/odbc下载connectorsODBC

Windows：控制面板->管理工具->数据源->双击->添加->选中mysqlODBC driver

模拟退火算法

模拟退火是一种通用的概率算法，用来在一个大的搜索空间内找到命题的最优解

举例

在【0，12.55】上寻找f(x)=x*sin(x)在给定区域的最大值

#自定义目标函数

C = function(x)

{

1/(x*sin(x)+12)

}

#初始化

stmp = NULL

for (i in 1:100) {stmp = c(stmp,runif(1,0,12.55))}

t0 = var(C(stmp)) #设定初始温度 var求方差

s0 = runif(1,0,12.55) #设定初始解状态

iters = 3000 #设定迭代次数

ccnt = 200 #设定终止条件，连续ccnt个新解都没有接收时终止算法

hisbest = 12.55 #保存历史最好状态，默认取上边界值

ccntVc = NULL

for (t in 1:iters)

{

#在s0附近，产生新解，但又能包括定义内的所有值

s1 = rnorm(1,mean = s0, sd = 2)

while(s1<0 || s1>12.55)

{

s1 = rnorm(1,mean = s0, sd = 3)

}

#计算能量增量

delta_t = C(s1) - C(s0)

if(delta_t < 0)

{

s0 = s1

ccntVc = c(ccntVc,1)

}

else

{

p = exp(-delta_t/t0)

if(runif(1,0,1) < p)

{

s0 = s1

ccntVc = c(ccntVc,1)

}

else

{

ccntVc = c(ccntVc,0)

}

hisbest = ifelse(C(s1)<C(hisbest),s1,hisbest)

hisbest = ifelse(C(s0)<C(hisbest),s0,hisbest)

#更新温度

t0 = t0/log(1+t)

#检查终止条件

if(NROW(ccntVc)>ccnt && sum(ccntVc[(NROW(ccntVc)-ccnt+1):NROW(ccntVc)])==0)

{

print(paste("连续",ccnt,"次没有接受新解，算法终止！")) break;

}

print(s0)

print(t)

print(hisbest)

R语言说明

#用R语言实现

#R语言程序包GenSA和stats包括有模拟退火算法的实现。GenSA包可以执行非常复杂的非线性目标函数的全局最小化搜索

#而stats包中的optim函数是基于Nelder-Mead、拟牛顿法、共轭梯度算法的通用优化、它包括一个用于箱约束优化和模拟

#退火的选型

library(GenSA)

gensa0 = GenSA(par = runif(1,0,12.55),fn = function(s)1/(s*sin(s)+12),lower = c(0),upper = 12.55)

str(gensa0)

#List of 4

#$ value : num 0.0502

#$ par : num 7.98

#$ trace.mat: num [1:9992, 1:4] 1 1 2 2 3 3 4 4 5 5 ...

#..- attr(*, "dimnames")=List of 2

#.. ..$ : NULL

#.. ..$ : chr [1:4] "nb.steps" "temperature" "function.value" "current.minimum"

#$ counts : int 35895

遗传算法

R语言中有程序包实现了遗传算法，通常使用mcga\genalg\rgenoud

mcga包

mcga主要包括两个函数，mcga与multi_mcga,其中mcga适用于单目标函数最小化问题

而multi_mcga可以使用mcga一样的逻辑实现多目标化的优化

计算f(x) = x&sin(x)

library(GA)

library(foreach)

library(mcga)

getAdjust <- function(x)

{

if(x>=0 && x<=12.55)

{

return(-(x*sin(x)))

}else{

return(exp(100))

}

m = mcga(popsize = 20,chsize = 1,minval = 0,maxval = 12.55,maxiter = 1000,evalFunc = getAdjust)

str(m) #List of 10

#$ population: num [1:20, 1] 7.96 7.96 7.96 7.96 7.96 ...

#$ costs : num [1:20] -7.91 -7.91 -7.91 -7.91 -7.91 ...

#$ popsize : num 20

#$ chsize : num 1

#$ crossprob : num 1

#$ mutateprob: num 0.01

#$elitism : num 1

#$ minval : num 0

#$ maxval : num 12.6

#$ maxiter : num 1000

#从population可知，我们求得的最优解为7.98，需要注意的是，在计算适应度的函数中，需要限制参数

#的范围，由于评估函数的是针对最小化问题的，所以要求最大值，需要加个负号

模型参数介绍

#mcga(popsize, chsize, crossprob = 1.0, mutateprob = 0.01,

# elitism = 1, minval, maxval, maxiter = 10, evalFunc)

#popsize 种群规模

#chsize 参数数量

#crossprob 交叉概率，默认为1.0

#mutateprob 变异概率，默认为0.01

#elitism 直接复制到子代的最佳个体数目，默认为1个

#minval 随机生成种群的下边界值

#maxval 随机生成种群的上边界值

#maxiter 最大世代次数，即繁殖次数，默认为10

#evalFunc 一个R函数，用来计算个体适应度，每一个问题都默认是最少化问题

genalg包

genalg包是基于R语言用于二元和浮点染色体的遗传算法，它主要包括了两个函数 rbga.bin与rbga

rbga.bin实现了基于二元染色体的遗传算法，可用于特征选择，其结果最优时对应的染色体的评估结果是最小的

rbga实现了基于浮点染色体的遗传算法，采用待优化的浮点值的最大最小值作为输入，对应最佳染色体的评估结果是最小的

它们的实现过程，都需要自定义评估函数evalFunc,rbga.bin的函数evalFunc以二元染色体为参数，而在rbga函数中，需要一个浮点向量作为参数,它们都可以通过设置monitorFunc对遗传算法的实现过程进行监控，monitorFunc需要rbga对象作为参数

与mcga包不同，genalg包中的rbga对象，可以调用plot函数进行进行可视化，展现遗传算法运行过程中的特征，默认显示最小值和平均的评估值，指示遗传算法执行的进度。直方图用于呈现二元染色体基因的选择频率，即一个基因在当前种群被选择的次数如果是浮点染色体，它将为每个变量绘制直方图来说明当前种群中被选择的值。

参数图用于呈现评估函数与变量值，这对查看变量与评估值之间的相关关系是很有用

library(genalg)

#定义适应度函数

getAdjust <- function(x)

{

if(x>=0 && x<=12.55)

{

return(-(x*sin(x)))

}else{

return(exp(100))

}

#定义监控函数

monitor = function(rbga0)

{

#打印种群中的第一个个体的值population[1,]

print(rbga0$population[1,])

}

rbgaObj = rbga(stringMin = c(0),stringMax = c(12.55),popSize = 50,iters = 1000,

mutationChance = 0.01,monitorFunc = monitor,evalFunc = getAdjust,verbose = TRUE)

str(rbgaObj)

#List of 12

#$ type : chr "floats chromosome"

#$ stringMin : num 0

#$ stringMax : num 12.6

#$ popSize : num 50

#$ iters : num 1000

#$ suggestions : NULL

#$ population : num [1:50, 1] 7.98 7.98 7.98 7.98 7.98 ...

#$ elitism : num 10

#$ mutationChance: num 0.01

#$ evaluations : num [1:50] -7.92 -7.92 -7.92 -7.92 -7.92 ...

#$ best : num [1:1000] -7.92 -7.92 -7.92 -7.92 -7.92 ...

#$ mean : num [1:1000] 1.291 -0.251 -1.625 -2.506 -3.273 ...

#- attr(*, "class")= chr "rbga"

#从population可知，我们求得的最优解为7.98，这与我们上文捉到的最优解相同，对rbgaObj调用，可知，迭代次数的增加。

#评估先是骤降，经过一段不稳定的变化之后，在1000次附近趋于稳定。

plot(rbgaObj)

#当设置参数type = "hist"时，可得直方图，如图可知，可知变量在7.98附近取值的频率最高，可见算法在此处收敛，并得到最优解

plot(rbgaObj,type = "hist",breaks = 50)

#当设置参数type = "vars"时，可得参数图，从图中可以看出，当变量值在7.98处时，评估值最低，对应全局最优

plot(rbgaObj,type = "vars")

模型参数介绍

rbga(stringMin=c(), stringMax=c(),

suggestions=NULL,

popSize=200, iters=100,

mutationChance=NA,

elitism=NA,

monitorFunc=NULL, evalFunc=NULL,

showSettings=FALSE, verbose=FALSE)

#stringMin 含有每个基因最小值的向量

#stringMax 含有每个基因最大值的向量

#suggestions 建议染色体可选列表

#popSize 种群规模，个体数量，也是染色体数量，默认200

#iters 迭代次数，默认为200

#mutationChance 染色体的基因突变机会，默认为 1/（size+1）,它影响收敛速度和搜索空间的探测，低的突变率收敛更加快，然而高的突变率增加搜索空间的跨度

#elitism 保留到子代的染色体的数目，默认为为种群规模的20%

#monitorFunc 监控函数，每产生一代后运行

#evalFunc 用户自定义方法，计算给定染色体的适应度

#showSettings 如果为TRUE，设置信息会打到屏幕上，默认FALSE

#verbose 如果为TRUE，算法将会打印更多的动态信息，默认FALSE

rbga函数功能

function (stringMin = c(), stringMax = c(), suggestions = NULL,

popSize = 200, iters = 100, mutationChance = NA, elitism = NA,

monitorFunc = NULL, evalFunc = NULL, showSettings = FALSE,

verbose = FALSE)

{

if (is.null(evalFunc)) {

stop("A evaluation function must be provided. See the evalFunc parameter.")

}

vars = length(stringMin)

if (is.na(mutationChance)) {

mutationChance = 1/(vars + 1)

}

if (is.na(elitism)) {

elitism = floor(popSize/5)

}

if (verbose)

cat("Testing the sanity of parameters...\n")

if (length(stringMin) != length(stringMax)) {

stop("The vectors stringMin and stringMax must be of equal length.")

}

if (popSize < 5) {

stop("The population size must be at least 5.")

}

if (iters < 1) {

stop("The number of iterations must be at least 1.")

}

if (!(elitism < popSize)) {

stop("The population size must be greater than the elitism.")

}

if (showSettings) {

if (verbose)

cat("The start conditions:\n")

result = list(stringMin = stringMin, stringMax = stringMax,

suggestions = suggestions, popSize = popSize, iters = iters,

elitism = elitism, mutationChance = mutationChance)

class(result) = "rbga"

cat(summary(result))

}

else {

if (verbose)

cat("Not showing GA settings...\n")

}

if (vars > 0) {

if (!is.null(suggestions)) {

if (verbose)

cat("Adding suggestions to first population...\n")

population = matrix(nrow = popSize, ncol = vars)

suggestionCount = dim(suggestions)[1]

for (i in 1:suggestionCount) {

population[i, ] = suggestions[i, ]

}

if (verbose)

cat("Filling others with random values in the given domains...\n")

for (var in 1:vars) {

population[(suggestionCount + 1):popSize, var] = stringMin[var] +

runif(popSize - suggestionCount) * (stringMax[var] -

stringMin[var])

}

else {

if (verbose)

cat("Starting with random values in the given domains...\n")

population = matrix(nrow = popSize, ncol = vars)

for (var in 1:vars) {

population[, var] = stringMin[var] + runif(popSize) *

(stringMax[var] - stringMin[var])

}

bestEvals = rep(NA, iters)

meanEvals = rep(NA, iters)

evalVals = rep(NA, popSize)

for (iter in 1:iters) {

if (verbose)

cat(paste("Starting iteration", iter, "\n"))

if (verbose)

cat("Calucating evaluation values... ")

for (object in 1:popSize) {

if (is.na(evalVals[object])) {

evalVals[object] = evalFunc(population[object,

])

if (verbose)

cat(".")

}

bestEvals[iter] = min(evalVals)

meanEvals[iter] = mean(evalVals)

if (verbose)

cat(" done.\n")

if (!is.null(monitorFunc)) {

if (verbose)

cat("Sending current state to rgba.monitor()...\n")

result = list(type = "floats chromosome",

stringMin = stringMin, stringMax = stringMax,

popSize = popSize, iter = iter, iters = iters,

population = population, elitism = elitism,

mutationChance = mutationChance, evaluations = evalVals,

best = bestEvals, mean = meanEvals)

class(result) = "rbga"

monitorFunc(result)

}

if (iter < iters) {

if (verbose)

cat("Creating next generation...\n")

newPopulation = matrix(nrow = popSize, ncol = vars)

newEvalVals = rep(NA, popSize)

if (verbose)

cat(" sorting results...\n")

sortedEvaluations = sort(evalVals, index = TRUE)

sortedPopulation = matrix(population[sortedEvaluations$ix,

], ncol = vars)

if (elitism > 0) {

if (verbose)

cat(" applying elitism...\n")

newPopulation[1:elitism, ] = sortedPopulation[1:elitism,

]

newEvalVals[1:elitism] = sortedEvaluations$x[1:elitism]

}

if (vars > 1) {

if (verbose)

cat(" applying crossover...\n")

for (child in (elitism + 1):popSize) {

parentProb = dnorm(1:popSize, mean = 0, sd = (popSize/3))

parentIDs = sample(1:popSize, 2, prob = parentProb)

parents = sortedPopulation[parentIDs, ]

crossOverPoint = sample(0:vars, 1)

if (crossOverPoint == 0) {

newPopulation[child, ] = parents[2, ]

newEvalVals[child] = sortedEvaluations$x[parentIDs[2]]

}

else if (crossOverPoint == vars) {

newPopulation[child, ] = parents[1, ]

newEvalVals[child] = sortedEvaluations$x[parentIDs[1]]

}

else {

newPopulation[child, ] = c(parents[1, ][1:crossOverPoint],

parents[2, ][(crossOverPoint + 1):vars])

}

else {

if (verbose)

cat(" cannot crossover (#vars=1), using new randoms...\n")

newPopulation[(elitism + 1):popSize, ] = sortedPopulation[sample(1:popSize,

popSize - elitism), ]

}

population = newPopulation

evalVals = newEvalVals

if (mutationChance > 0) {

if (verbose)

cat(" applying mutations... ")

mutationCount = 0

for (object in (elitism + 1):popSize) {

for (var in 1:vars) {

if (runif(1) < mutationChance) {

dempeningFactor = (iters - iter)/iters

direction = sample(c(-1, 1), 1)

mutationVal = stringMax[var] - stringMin[var] *

0.67

mutation = population[object, var] +

direction * mutationVal * dempeningFactor

if (mutation < stringMin[var])

mutation = stringMin[var] + runif(1) *

(stringMax[var] - stringMin[var])

if (mutation > stringMax[var])

mutation = stringMin[var] + runif(1) *

(stringMax[var] - stringMin[var])

population[object, var] = mutation

evalVals[object] = NA

mutationCount = mutationCount + 1

}

if (verbose)

cat(paste(mutationCount, "mutations applied\n"))

}

result = list(type = "floats chromosome", stringMin = stringMin,

stringMax = stringMax, popSize = popSize, iters = iters,

suggestions = suggestions, population = population, elitism = elitism,

mutationChance = mutationChance, evaluations = evalVals,

best = bestEvals, mean = meanEvals)

class(result) = "rbga"

return(result)

}

<bytecode: 0x0000004874420a80>

<environment: namespace:genalg>

mcga函数功能

function (popsize, chsize, crossprob = 1, mutateprob = 0.01,

elitism = 1, minval, maxval, maxiter = 10, evalFunc)

{

par <- as.double(rep(0, chsize))

best <- as.double(rep(0, chsize))

population <- as.double(rep(0, chsize * popsize))

costs <- as.double(rep(0, popsize))

envv <- .GlobalEnv

result <- .Call("mcga", popsize, chsize, crossprob,

mutateprob, elitism, minval, maxval, maxiter, par, best,

evalFunc, population, costs, envv, PACKAGE = "mcga")

resmat <- matrix(population, ncol = chsize, nrow = popsize)

return(list(population = matrix(population, popsize, chsize,

byrow = TRUE), costs = costs, popsize = popsize, chsize = chsize,

crossprob = crossprob, mutateprob = mutateprob, elitism = elitism,

minval = minval, maxval = maxval, maxiter = maxiter))

}

<bytecode: 0x0000004874c077e0>

<environment: namespace:mcga>

RBF神经网络

library(dplyr)

library(data.table)

rbf <- NULL

#Guassian径向基函数

Green <- function(x, c, delta){

greenValue <- exp(-1.0 * sum((x - c)^2) / (2 * delta^2))

}

hiddenSize <- 2

# cols <- 5

# rows <- 7

# train.x <- matrix(runif(cols * rows), ncol = 1)

# train.y <- matrix(runif( cols * rows), ncol = 1)

x <- seq(0,3.14*2, by = 0.01)

y <- sin(x) + runif(length(x))

train.x <- x %>% data.matrix()

train.y <- y %>% data.matrix()

kmeans.parameters <- kmeans(train.x, hiddenSize)

init.centers <- kmeans.parameters$centers

init.delta <- kmeans.parameters$withinss/kmeans.parameters$size + 0.2

rbf$hiddenSize <- hiddenSize

rbf$inputSize <- ncol(train.x)

rbf$outputSize <- ncol(train.y)

rbf$numSample <- nrow(train.x)

rbf$c <- matrix(init.centers, ncol = ncol(train.x))

rbf$delta <- matrix(init.delta, nrow = 1) #delta of RBF

rbf$wts <- matrix(rnorm(rbf$hiddenSize * rbf$outputSize, 0,2), ncol = rbf$outputSize) *2 - 1 #weight of RBF

rbf$cost <- 0

rbf$alpha <- 0.1 # learning rate (should not be large!)

#Core Functions

TrainRBF <- function(rbf, train.x, train.y){

## step 1: calculate gradient

#size : 1*n

delta_delta <- rep(0,rbf$hiddenSize) %>% matrix(nrow = 1)

delta_center <- rep(0, rbf$inputSize * rbf$hiddenSize) %>% matrix(ncol = rbf$inputSize)

delta_weight <- rep(0,rbf$outputSize*rbf$hiddenSize) %>% matrix(ncol = rbf$outputSize)

rbf$cost <- 0

num.sample <- nrow(train.x)

for( i in 1:rbf$numSample){

#size : 1*n

green <- matrix(rep(0, rbf$hiddenSize), nrow = 1)

## Feed forward

y = 0

for(j in 1:rbf$hiddenSize){

green[1, j] <- Green(train.x[i, ], rbf$c[j, ], rbf$delta[j])

}

output <- green %*% rbf$wts

if(is.nan(output))

output <- 0

# cat(output, "\n")

## Back propagation

# error = y - output

error <- -(output - train.y[i, ])

# sum E

rbf$cost <- rbf$cost + sum(error * error)

#updata weight hiddensize * 1 * 1 * outputSize

delta_weight <- delta_weight + t(green) %*% error

# delta2[j] = w[j, ] * (y - output) delta2 = w * e

delta2 = error %*% t(rbf$wts) * green

for( j in 1:rbf$hiddenSize){

delta_center[j, ] = delta_center[j, ] + delta2[j] * (train.x[i, ] - rbf$c[j, ])/rbf$delta[j]^2

delta_delta[j] = delta_delta[j] + delta2[j] *sum((train.x[i, ] - rbf$c[j, ])^2)/rbf$delta[j]^3

}

## Step2 : update parameters

rbf$cost <- 0.5 * rbf$cost/num.sample

rbf$delta <- rbf$delta + rbf$alpha * delta_delta/num.sample

rbf$c <- rbf$c + rbf$alpha * delta_center/num.sample

rbf$wts <- rbf$wts + rbf$alpha * delta_weight/num.sample

return (rbf)

}

## Start Train

cat("Step 2: Start training...\n")

maxIter <- 50

preCost <- 0

for(i in 1:maxIter) {

cat("Iteration ", i, " ")

rbf <- TrainRBF(rbf, train.x, train.y)

cat("The cost is ", rbf$cost, "\n")

curCost <- rbf$cost

if(abs(curCost - preCost) < 1e-8 ) {

Cat('Reached iteration termination condition and Termination now! \n')

break

}

###Visuliazation

green <- matrix(rep(0, rbf$numSample * rbf$hiddenSize) , nrow = rbf$numSample)

for( i in 1: rbf$numSample){

for(j in 1:rbf$hiddenSize){

green[i, j] <- Green(train.x[i, ], rbf$c[j, ], rbf$delta[j])

}

z <- green %*% rbf$wts

plot(train.x,train.y)

lines(train.x, z, col = "red")

决策树

install.packages("rpart")#安装库

library("rpart")

dt<-function(data,n,form){

require(rpart)

accu<-0

for(i in 1:1000){

ind<-sample(1:nrow(data),round(0.7*nrow(data)),replace=F)

treemodel<-rpart(form,data[ind,],method="class")

#决策树 method表示决策树的类型，我们这里是分类问题所以选择class class表示分类树

#得到模型

treepre<-predict(treemodel,data[-ind,-n],type="class")

#根据模型预测

t<-table(treepre,data[-ind,n])

accu[i]<-sum(diag(t)/sum(t))

}

return (c(mean(accu),sd(accu)))#统计1000次模型准确率和标准差

}

data<-iris

n<-5

form<-(as.formula("Species~."))

dt(data,n,form)

install.packages("rpart.plot")

library("rpart.plot")

ind<-sample(1:nrow(iris),round(0.7*nrow(iris)),replace=F)

treemodel<-rpart(as.formula("Species~."),iris[ind,],method="class")

rpart.plot(treemodel,uniform=T)

formula指定参与分析的变量公式，类似于y~x1+x2；

data指定分析的数据集；

subset为一个索引向量，指定分析的样本数据；

na.action针对缺失值的处理方法，默认会删除缺失值所在的行；

scale逻辑参数，是否标准化变量，默认标准化处理；

x可以是矩阵，可以是向量，也可以是稀疏矩阵；

y为分类变量；

type指定建模的类别，支持向量机通常用于分类、回归和异常值检测，默认情况下，svm模型根据因变量y是否为因子，type选择C-classification或eps-regression。

kernel指定建模过程中使用的核函数，目的在于解决支持向量机线性不可分问题。有四类核函数可选，即线性核函数、多项式核函数、径向基核函数（高斯核函数）和神经网络核函数。研究人员发现，识别率最高，性能最好的是径向基核函数（默认的kernel值），其次是多项式核函数，最差的是神经网络核函数；

degree用于多项式核函数的参数，默认为3；

gamma用于除线性核函数之外的所有核函数参数，默认为1；

coef0用于多项式核函数和神经网络核函数的参数，默认为0；

nu用于nu-classification、nu-regression和one-classification回归类型中的参数

class.weights指定类权重；

cachesize默认缓存大小为40M;

cross可为训练集数据指定k重交叉验证；

probability逻辑参数，指定模型是否生成各类的概率预测，默认不产生概率值；

fitted逻辑参数，是否将分类结果包含在模型中，默认生成拟合值。

1、线性模型~回归分析：

【包】：stats

【函数】：lm(formula, data, ...)

逐步回归：step(lm(formula, data, ...))

回归诊断：influence.measure(lm(formula, data, ...))

多重共线性：kappa(XX,exact=T), eigen(XX)

自相关检验：一阶：dwtest(y~x) 多阶：bgtest(y~x,order=2,type=”Chisq”)

【备注】：1）stats包里的lm()可做多元线形模型，

anova.mlm()比较多个多元线形模型，

manova()做多元方差分析（MANOVA）。

2）sn包的msn.mle()和 and mst.mle()可拟合多元偏正态和偏t分布模型。

3）pls包提供偏最小二乘回归（PLSR）和主成分回归；

4）ppls包可做惩罚偏最小二乘回归；

5）dr包提供降维回归方法，

如：片逆回归法(Sliced Inverse Regression)、片平均方差估计(sliced average variance estimation)。

6）plsgenomics包做基于偏最小二乘回归的基因组分析。

7）relaimpo包可评估回归参数的相对重要性。

2、logistic回归：

【包】：stats

【函数】：glm(formula, family=gaussian,data, ...)

注：family

binomial(link = "logit")

gaussian(link = "identity")

Gamma(link = "inverse")

inverse.gaussian(link = "1/mu^2")

poisson(link = "log")

quasi(link = "identity", variance = "constant")

quasibinomial(link = "logit")

quasipoisson(link = "log")

3、无监督分类~决策树：

【包】：rpart

【函数】：rpart(formula,data, method="class",control=ct,parms=list(prior=c(p,1-p),split="information"))

rpart.plot(fit,branch=1,branch.type=2,type=1,extra=102,shadow.col=”gray”,box.col=”green”,

split.cex=1.2,main=”Kyphosis决策树”) #提供了复杂度损失修剪的修剪方法

printcp(fit):告诉分裂到哪一层，CP，nsplit，rel，error，交叉验证的估计误差（xerror），标准误差（xstd）

prune(fit,cp=fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])：剪枝函数

【备注】：1）CRAN的 MachineLearning任务列表有对树方法的细节描述。

2）分类树也常常是重要的多元方法，rpart包正是这样的包，

3）rpart.permutation包还可以做rpart()模型的置换（permutation）检验。

4）TWIX包的树可以外部剪枝。

5）hier.part包分割多元数据集的方差。

6）mvpart包可做多元回归树，

7）party包实现了递归分割（recursive partitioning），

8）rrp包实现了随机递归分割。

9）caret包可做分类和回归训练，进而caretLSF包实现了并行处理。

10）kknn包的k-近邻法可用于回归，也可用于分类。

4、支持向量机：

【包】：e1071，kernlab

【函数】：svm(x_train,y_train,type="C-classification",cost=10,kernel="radial",probability=TRUE,scale=FALSE)

svp=ksvm(x,y,type="C-svc",kernel="rbf",kpar=list(sigma=1),C=1)

5、无监督分类~聚类分析：

【包】：stats

【函数】：系统聚类：hclust(d,method=”complete”,members=NULL)

快速聚类：kmeans(x,centers,iter.max=10,nstart=1,algorithm=“Hartigan-Wong”)

距离函数：dist(x,method=”euclidean”,diag=FALSE,upper=FALSE,p=2)

【备注】：1）CRAN的Cluster任务列表全面的综述了R实现的聚类方法。

2）stats里提供等级聚类hclust()和k-均值聚类kmeans()。

3）cluster包里有大量的聚类和可视化技术，

4）clv包里则有一些聚类确认程序，

5）e1071包的classAgreement()可计算Rand index比较两种分类结果。

6）Trimmed k-means聚类分析可由trimcluster包实现，

7）聚类融合方法（Cluster Ensembles）由clue包实现，

8）clusterSim包能帮助选择最佳的聚类，

9）hybridHclust包提供一些混合聚类方法。

10）energy包里有基于E统计量的距离测度函数edist()和等级聚类方法hclust.energy()。

11）LLAhclust包提供基于似然（likelihood linkage）方法的聚类，也有评定聚类结果的指标。

12）fpc包里有基于Mahalanobis距离的聚类。

13）clustvarsel包有多种基于模型的聚类。

14）模糊聚类（fuzzy clustering）可在cluster包和hopach包里实现。

15）Kohonen包提供用于高维谱（spectra）或模式（pattern）的有监督和无监督的SOM算法。

16）clusterGeneration包帮助模拟聚类。

17）CRAN的Environmetrics任务列表里也有相关的聚类算法的综述。

18）mclust包实现了基于模型的聚类，

19）MFDA包实现了功能数据的基于模型的聚类。

6、关联分析：

【包】：arules，Matrix，lattice，arulesViz

【函数】：apriori(Groceries,parameter=list(support=0.01,confidence=0.2))

eclat(Groceries, parameter = list(support = 0.05),control = list(verbose=FALSE))

7、主成分分析：

【包】：stats

【函数】：princomp(data,cor=False,scores=TRUE,covmat=NULL,subset=rep(TRUE,nrow(as.matrix(x))) ,…)

prcomp(data,cor=False,scores=TRUE,covmat=NULL,subset=rep(TRUE,nrow(as.matrix(x))) ,…)

prcomp：采用观测值的奇异值分解方法；princomp：采用相关系数阵的特征值分解方法

【备注】：1）stats 包的prcomp()（基于svd()）和princomp()（基于eigen()）能计算主成分。

2）sca包做单分量分析。

3）nFactors可评价碎石图(Scree plot)，

4）paran包可评估主成分分析得到的主成分和因子分析得到的因子。

5）pcurve包做主曲线（Principal Curve）分析和可视化。

6）gmodels包提供适合大矩阵的fast.prcomp()和fast.svd()。

7）kernlab包里的kpca()用核方法做非线性的主成分分析。

8）pcaPP包用投影寻踪（projection pursuit）法计算稳健/鲁棒（robust）主成分。

9）amap包的acpgen()和acprob()函数分别针对广义（generalized）和稳健（robust）主成分分析。

8、对应分析：

【包】：ca，MASS，vegan，FactoMineR

【函数】：简单对应分析：ca(data,...)

多重对应分析：mjca(data,...)

plot3d.ca(ca(data,nd=3))

plot(mjca(data,lambda="Burt"))

【备注】：1）MASS包的corresp（）和mca（）可以做简单和多重对应分析；

2）ca包提供单一、多重和联合对应分析；

3）FactoMineR包的CA（）和MCA（）函数也能做类似的简单和多重对应分析，还有画图函数，

homals可执行同质分析。

9、因子分析：

【包】：psycho，stats

【函数】：Bartlett球形检验：cortest.bartlett(cor(data),n=length(data))

factanal(~X1 + X2 + X3 + X4,data=freeny,factors=1)

10、神经网络

【包】：nnet

【备注】：nnet包执行单隐层前馈神经网络，nnet是VR包的一部分。

11、随机森林：（回归和分类）

【包】：randomForest，ipred，varSelRF

【备注】：1）ipred包用bagging的思想做回归，分类和生存分析，组合多个模型；

2）party包也提供了基于条件推断树的随机森林法；

3）varSelRF包用随机森林法做变量选择。

12、递归拆分：（回归，分类，生存分析）

【包】：rpart，tree

【备注】：1）递归拆分利用树形结构模型，来做回归、分类和生存分析，主要在rpart包和tree里执行，尤其推荐rpart包。

2）Weka里也有这样的递归拆分法，如：J4.8, C4.5, M5，包Rweka提供了R与Weka的函数的接口

13、Boosting：（提高给定任意学习算法精确度的方法）

【包】：gbm，boost

【备注】：1）gbm包做基于树的梯度下降boosting；

2）boost包包括LogitBoost和L2Boost；

3）GAMMoost包提供基于boosting 的广义相加模型(generalized additive models)的程序；

4）mboost包做基于模型的boosting。

14、模型确认和选择：

【包】：e1071，ipred，svmpath，ROCR，caret，caretLSF

【函数】：tune，errorest，cost，

【备注】：1）e1071 包的tune()函数在指定的范围内选取合适的参数；

2）svmpath包里的函数可用来选取支持向量机的cost参数C；

3）ROCR包提供了可视化分类器执行效果的函数，如画ROC曲线；

4）caret包供了各种建立预测模型的函数，包括参数选择和重要性量度；

15、缺失数据

【包】：mitools，mice，mvnlme，norm，cat，mix，pan，VIM，Hmisc，EMV，monomvn

【备注】：1）mitools包里有缺失数据的多重估算（multiple imputation)的函数；

2）mice包用chained equations实现了多重估算；

3）mvnmle包可以为多元正态数据的缺失值做最大似然估计（ML Estimation）；

4）norm包提供了适合多元正态数据的估计缺失值的期望最大化算法（EM algorithm）；

5）cat包允许分类数据的缺失值的多重估算，mix包适用于分类和连续数据的混合数据；

6）pan包可为面版数据(panel data)的缺失值做多重估算；

7）VIM包做缺失数据的可视化和估算；

8）Hmisc包的aregImpute()和transcan()提供了其它的估算缺失值方法；

9）EMV包提供了knn方法估计缺失数据；

10）monomvn包估计单调多元正态数据的缺失值。

16、隐变量方法

【包】：stats，MCMCpack，GPArotation，FAiR，ifa，sem，ltm，eRm，FactoMineR，tsfa，polCA

【备注】：1）stats包的factanal()执行最大似然因子分析，

2）MCMCpack包可做贝叶斯因子分析。

3）GPArotation包提供投影梯度（Gradient Projection）旋转因子法。

4）FAiR包用遗传算法作因子分析。ifa包可用于非正态的变量。

5）sem包拟合线形结构方程模型。

6）ltm包可做隐含式语义分析 (Latent semantic analysis)，

7）eRm包则可拟合Rasch模型(Rasch models)。

8）FactoMineR包里有很多因子分析的方法，

包括：MFA()多元因子分析，HMFA()等级多元因子分析，ADFM()定量和定性数据的多元因子分析。

9）tsfa包执行时间序列的因子分析。

10）poLCA包针对多分类变量(polytomous variable)做潜类别分析（Latent Class Analysis）。

17、有监督分类和判别分析

【包】：MASS ，mda，

【备注】：1）MASS 包里的lda()和qda()分别针对线性和二次判别分析。

2）mda包的mda() and fda()允许混合和更灵活的判别分析，

3）mars()做多元自适应样条回归（multivariate adaptive regression splines），

4）bruto()做自适应样条后退拟合（adaptive spline backfitting）。

5）earth包里也有多元自适应样条回归的函数。

6）rda包可用质心收缩法（shrunken centroids regularized discriminant analysis）实现高维数据的分类。

7）VR的class包的knn()函数执行k-最近邻算法，

8）knncat包里有针对分类变量的k-最近邻算法。

9）SensoMineR包的FDA()用于因子判别分析。

10）许多包结合了降维（dimension reduction）和分类。

klaR包可以做变量选择，可处理多重共线性，还有可视化函数。

11）superpc包利用主成分做有监督的分类，

12）classPP 包则可为其做投影寻踪（projection pursuit），

13）gpls包用广义偏最小二乘做分类。

14）hddplot包用交叉验证的线性判别分析决定最优的特征个数。

15）supclust包可以根据芯片数据做基因的监督聚类。

16）ROCR提供许多评估分类执行效果的方法。

17）predbayescor包可做朴素贝叶斯（na?ve Bayes）分类。

18、典型相关分析

【包】：stats，kernlab

【备注】：1）stats包里的cancor()是做典型相关的函数。

2）kernlab包提供更稳健的核方法kcca()。

3）concor包提供了许多concordance methods。

4）calibrate包里的rda()函数可做冗余度分析和典型相关。