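#P4 Project Notes
##P4 Project Overview
##R Language Notes
##Data Analysis Case Studies
#P4 Project Overview
Use R and EDA (exploratory data analysis: examining the characteristics and relationships of the data numerically and graphically before applying formal, rigorous statistical analysis) to explore a single variable or the relationships among several variables, and to look at distributions, outliers, and anomalies in a chosen data set.
#R Language Notes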
1. Overview of R
R is a powerful, free, highly extensible open-source programming language for statistical computing. Because it is driven by command-line scripting, you can store a series of complex data-analysis steps in R. That lets you re-use your data analysis work and makes it easier for others to validate research results and check your work for errors.
The language is actually fairly simple, but it is unconventional.
2. Data processing:
###ggplot2 - Multiple Plots in One graph using gridExtra
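Note the difference: facet_wrap and facet_grid split one data set into facets, drawing the same plot in a separate panel for each subset, while gridExtra places several different plots together in a single figure.
###Creating ordered variables (factor variables)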
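
The difference between faceting and gridExtra can be sketched as follows. This is a minimal illustration, assuming the ggplot2 and gridExtra packages are installed; the built-in `mtcars` data set and the variable names are my own choices for the example, not part of the course material.

```r
library(ggplot2)
library(gridExtra)

# Faceting: one plot specification, split into one panel per value of cyl.
p_facet <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ cyl)

# gridExtra: two different plots arranged next to each other on one page.
p_hist <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
combined <- arrangeGrob(p_hist, p_scatter, ncol = 2)

# grid.arrange(p_hist, p_scatter, ncol = 2) builds the same layout and draws it.
```

facet_grid works like facet_wrap but lays the panels out on a row-by-column grid, for example `facet_grid(am ~ cyl)`.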
http://statistics.ats.ucla.edu/stat/r/modules/factor_variables.htm
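
A short sketch of an ordered factor in base R. The response values below are hypothetical and only serve as an illustration.

```r
# Hypothetical survey responses, stored as plain strings.
responses <- c("medium", "low", "high", "low")

# Declare the level order explicitly; ordered = TRUE enables comparisons.
sizes <- factor(responses,
                levels = c("low", "medium", "high"),
                ordered = TRUE)

levels(sizes)               # level order as declared above
codes <- as.integer(sizes)  # underlying integer codes, following the level order
below_high <- sizes < "high" # element-wise comparison against a level
table(sizes)                # counts are reported in level order
```

Without the `levels` argument R sorts the levels alphabetically, which would put "high" before "low".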
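Basic data work:
extracting key statistics out of a data set
exploring a data set with basic graphics
reshaping data to make it easier to analyze
3. Data transformation
Log transformation is useful for:
1. Monetary amounts: incomes, customer value, account or purchase sizes
2. Data that spans several orders of magnitude
3. Data that changes multiplicatively. For example, a 2% price increase scales with the original price, so the absolute change could be on the order of 2, 200, or 20000.
A signed variant that also handles zero and negative values: `signedlog10 <- function(x) { ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x))) }`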
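
The log transformation notes above can be tried out in base R. This is a minimal sketch: `signedlog10` is the function from the notes, while the `amounts` and `balances` vectors are hypothetical values chosen for illustration.

```r
# Signed log10: values with |x| <= 1 map to 0, the sign is preserved otherwise.
signedlog10 <- function(x) {
  ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x)))
}

# Hypothetical prices spanning several orders of magnitude.
amounts <- c(2, 200, 20000)
log_amounts <- log10(amounts)

# On the log scale a 2% increase is the same constant shift for every price.
shift <- log10(amounts * 1.02) - log_amounts

# Hypothetical account balances that include losses and zero.
balances <- c(-1000, -0.5, 0, 10, 100000)
signed <- signedlog10(balances)
```

On the raw scale the 2% increase changes the three prices by very different absolute amounts. On the log scale the shift is identical, which is why multiplicative data is easier to compare after the transformation.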
R has 4,400+ packages and a LinkedIn group with 18,000+ members.
R's syntax is different from that of many other languages.
#Data Analysis Case Studies
Netflix Prize
The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films.
Training data set: <user, movie, date of grade, grade>
RMSE (root mean squared error): a measure of the differences between the values (sample or population values) predicted by a model or an estimator and the values actually observed.
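
RMSE as defined above can be computed in one line of base R. The function name `rmse` and the rating vectors are hypothetical and used only for illustration; they are not Netflix data.

```r
# Root mean squared error between predicted and observed values.
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

# Hypothetical grades on a 1 to 5 scale.
observed  <- c(4, 3, 5, 1)
predicted <- c(3, 4, 4, 2)

error_all_off_by_one <- rmse(predicted, observed)
error_perfect <- rmse(observed, observed)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.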
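The qualifying set is the data set containing only the three variables user, movie, and date of grade; within it, the quiz set is used to test the prediction algorithm.
Goal: improve the accuracy of the recommendation algorithm.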
Foodborne Chicago finds dodgy restaurants with tweets, and R
http://blog.revolutionanalytics.com/2013/08/foodborne-chicago.html