A Data Analysis Example from Coursera: Classifying Spam Email with R

 

Structure of a Data Analysis

  1. Steps in a data analysis

-  Define the question

-  Define the ideal data set

-  Determine what data you can access

-  Obtain the data

-  Clean the data

-  Exploratory data analysis

-  Statistical prediction/model

-  Interpret results

-  Challenge results

-  Synthesize/write up results

-  Create reproducible code

  2. A worked example

1)    The question

Can I automatically detect emails that are SPAM or not?

2)    Making the question concrete

Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?

3)    Obtaining the data

http://search.r-project.org/library/kernlab/html/spam.html

4)    Subsampling

# If kernlab isn't installed, install it first with install.packages("kernlab").

library(kernlab)

data(spam)

 

#perform the subsampling

set.seed(3435)

trainIndicator = rbinom(4601, size = 1, prob = 0.5)

table(trainIndicator)

 

 

trainSpam = spam[trainIndicator == 1, ]

testSpam = spam[trainIndicator == 0, ]
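The split above flips a fair coin for each row, so the two halves are only approximately equal in size. A minimal sketch on a made-up data frame (`toy` and its columns are assumptions for illustration, not part of the spam data) shows that the two subsets partition the rows, and how `sample()` could be used instead if an exact 50/50 split were wanted:

```r
# Toy data standing in for spam (hypothetical columns)
set.seed(3435)
toy = data.frame(x = rnorm(100),
                 type = factor(rep(c("nonspam", "spam"), 50)))

# Coin-flip split, as in the text: each row goes to training with prob 0.5
trainIndicator = rbinom(nrow(toy), size = 1, prob = 0.5)
trainToy = toy[trainIndicator == 1, ]
testToy = toy[trainIndicator == 0, ]

# Every row lands in exactly one of the two subsets
nrow(trainToy) + nrow(testToy) == nrow(toy)

# Alternative: an exact half/half split by sampling row indices
trainRows = sample(nrow(toy), size = nrow(toy) / 2)
trainExact = toy[trainRows, ]
testExact = toy[-trainRows, ]
c(nrow(trainExact), nrow(testExact))
```

Either approach works for this example; the coin-flip version is the one the walkthrough uses.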

5)    Exploratory analysis

a)      Names: look at the column names

names(trainSpam)

 

b)      Head: look at the first six rows

head(trainSpam)

 

c)      Summaries: tabulate the outcome

table(trainSpam$type)

 

d)      Plots: plot the distribution for spam and non-spam email

plot(trainSpam$capitalAve ~ trainSpam$type)

 

The difference between the two groups is hard to see in the plot above, so we take the logarithm and look again.

plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
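A small sketch of why the transform is written as `log10(x + 1)` rather than `log10(x)`: many of the predictors can be zero, and the logarithm of zero is not finite. The values in `x` below are made up for illustration.

```r
# Made-up, strongly right-skewed values (assumption: illustration only)
x = c(0, 1, 2, 5, 10, 100, 1000)

# log10(x) is -Inf at zero, which a plot cannot show
log10(x)

# Adding 1 first maps zero to zero and keeps every value finite
logged = log10(x + 1)
logged

# The raw values span three orders of magnitude; the logged ones span about 3 units
range(x)
range(logged)
```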

 

e)      Relationships between predictors

plot(log10(trainSpam[, 1:4] + 1))

 

f)      Trying hierarchical clustering

hCluster = hclust(dist(t(trainSpam[, 1:57])))

plot(hCluster)

 

That is too cluttered to reveal anything. As before, take the log and look again.

hClusterUpdated = hclust(dist(t(log10(trainSpam[, 1:55] + 1))))

plot(hClusterUpdated)
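The `t()` call is what makes this a clustering of variables rather than of emails: `dist()` measures distances between rows, so the data frame is transposed first. A minimal sketch on made-up data (the columns `a`, `b`, and `c` are hypothetical) illustrates the idea:

```r
set.seed(1)
# Made-up data: columns a and b move together, c is unrelated and shifted
a = rnorm(50)
toy = data.frame(a = a,
                 b = a + rnorm(50, sd = 0.1),
                 c = rnorm(50, mean = 5))

# t() turns columns into rows, so dist() compares variables, not observations
hToy = hclust(dist(t(toy)))

# Cut the tree into two groups; a and b should end up together
groups = cutree(hToy, k = 2)
groups
```

Calling `plot(hToy)` would draw the corresponding dendrogram, in the same way as the plots above.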

 

 

 

6)    Statistical prediction and modeling

trainSpam$numType = as.numeric(trainSpam$type) - 1

costFunction = function(x, y) sum(x != (y > 0.5))

cvError = rep(NA, 55)

library(boot)

for (i in 1:55) {

    lmFormula = reformulate(names(trainSpam)[i], response = "numType")

    glmFit = glm(lmFormula, family = "binomial", data = trainSpam)

    cvError[i] = cv.glm(trainSpam, glmFit, costFunction, 2)$delta[2]

}

## Which predictor has minimum cross-validated error?

names(trainSpam)[which.min(cvError)]
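The loop above fits one single-predictor logistic regression per column and keeps the predictor with the lowest cross-validated misclassification cost. The sketch below shows the same idea with a hand-rolled two-fold split on made-up data. It is a stand-in for illustration, not a reimplementation of `cv.glm`, and the `toy` data frame and its column names are assumptions.

```r
set.seed(42)
n = 400
# Made-up data: only `signal` is related to the outcome
toy = data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
toy$numType = rbinom(n, size = 1, prob = plogis(3 * toy$signal))

# Same cost as in the text: number of misclassifications at a 0.5 cutoff
costFunction = function(x, y) sum(x != (y > 0.5))

# Randomly assign each row to one of two folds
fold = sample(rep(1:2, length.out = n))
predictors = c("signal", "noise1", "noise2")
cvError = rep(NA, length(predictors))

for (i in seq_along(predictors)) {
    f = reformulate(predictors[i], response = "numType")
    errors = 0
    for (k in 1:2) {
        # Fit on the other fold, predict probabilities on fold k
        fit = glm(f, family = "binomial", data = toy[fold != k, ])
        p = predict(fit, toy[fold == k, ], type = "response")
        errors = errors + costFunction(toy$numType[fold == k], p)
    }
    cvError[i] = errors
}

## Which predictor has minimum cross-validated error?
predictors[which.min(cvError)]
```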

 

7)    Testing on the held-out set

## Use the best model from the group

predictionModel = glm(numType ~ charDollar, family = "binomial", data = trainSpam)

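## Get predicted probabilities on the test set; type = "response" returns probabilities rather than log-odds

predictionTest = predict(predictionModel, testSpam, type = "response")

predictedSpam = rep("nonspam", dim(testSpam)[1])

## Classify as `spam' for those with prob > 0.5, using the test-set predictions (predictionModel$fitted holds the training-set fits)

predictedSpam[predictionTest > 0.5] = "spam"

## Classification table: inspect the classification results

table(predictedSpam, testSpam$type)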

 

Classification error rate, from the table the original run produced: (61 + 458)/(1346 + 458 + 61 + 449) = 0.2243
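The same arithmetic can be written against a confusion matrix, which avoids retyping counts by hand. The four counts are those quoted above. Which of the two error counts is the false positives is an assumption made here only to fill in the matrix, and it does not affect the error rate.

```r
# Counts from the text; diagonal = correct, off-diagonal = errors.
# The placement of 61 vs 458 among the off-diagonal cells is an assumption.
confusion = matrix(c(1346, 61, 458, 449), nrow = 2,
                   dimnames = list(predicted = c("nonspam", "spam"),
                                   actual = c("nonspam", "spam")))

# Error rate = misclassified emails / all emails
errorRate = (sum(confusion) - sum(diag(confusion))) / sum(confusion)
round(errorRate, 4)
```

With a real result, `confusion` would simply be the output of the `table()` call.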

8)    Interpret results

The fraction of characters that are dollar signs can be used to predict whether an email is spam.

Anything with more than 6.6% dollar signs is classified as spam.

Under our prediction, more dollar signs always means a higher chance of spam.

Our test set error rate was 22.4%.
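A cutoff such as the 6.6% above can be read off a single-predictor logistic model: the predicted probability exceeds 0.5 exactly when the linear predictor is positive, so with a positive slope the cutoff is the point where intercept + slope × x = 0. The sketch below illustrates this on made-up data. The variables `x` and `y` are hypothetical, and the numbers it produces are not those of the spam model.

```r
set.seed(7)
n = 500
# Made-up predictor and outcome (assumption: illustration only)
x = runif(n, 0, 1)
y = rbinom(n, size = 1, prob = plogis(-2 + 6 * x))
fit = glm(y ~ x, family = "binomial")

# P(y = 1) > 0.5 exactly when intercept + slope * x > 0,
# so with a positive slope the rule is x > -intercept / slope
b = coef(fit)
cutoff = -b[[1]] / b[[2]]
cutoff

# The fitted probability at the cutoff is 0.5
predict(fit, data.frame(x = cutoff), type = "response")
```

A positive slope is also what makes the rule monotone: a larger value of the predictor can only raise the predicted probability.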

9)    Challenge results

10)  Synthesize/write up results

11)   Create reproducible code

Reposted from: https://www.cnblogs.com/colinqin/p/6939981.html
