A Beginner's Guide to Exploratory Data Analysis in R

When I started on my journey to learn data science, I read through multiple articles that stressed the importance of understanding your data. It didn’t make sense to me. I was naive enough to think that we are handed data, push it through an algorithm, and hand over the results.

Yes, I wasn’t exactly the brightest. But I’ve learned my lesson, and today I want to pass on what I picked up during sleepless nights spent trying to figure out my data. I am going to use the R language to demonstrate exploratory data analysis (EDA).

Why R?

Because it was built from the get-go with data science in mind. It’s easy to pick up and get your hands dirty with, and it doesn’t have a steep learning curve, *cough* Assembly *cough*.

Before I start: this article is a guide for people who fall under the tag of ‘data science infants.’ I believe both Python and R are great languages, and what matters most is the story you tell from your data.

Why this dataset?

Well, it’s where I think most aspiring data scientists would start. This dataset is a good place to warm up your engines and start thinking like a data scientist, and because it is novice-friendly, you can breeze through the exercise.

How do we approach this data?

  • Will this variable help us predict house prices?
  • Is there a correlation between these variables?
  • Univariate Analysis
  • Multivariate Analysis
  • A bit of Data Cleaning
  • Conclude by proving the relevance of our selected variables.
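
The correlation question in the list above can be checked numerically before any plotting. The following is a minimal sketch in base R, not the approach this article necessarily takes later: the data frame `toy` and its columns `living_area` and `sale_price` are hypothetical names with made-up values, used only so the snippet runs without train.csv.

```r
# Made-up values for illustration only; not taken from train.csv.
toy <- data.frame(
  living_area = c(1000, 1200, 1500, 1800, 2200),
  sale_price  = c(120000, 150000, 170000, 210000, 260000)
)

# Pearson correlation coefficient: values near 1 or -1 indicate a strong
# linear relationship, values near 0 indicate a weak one.
r <- cor(toy$living_area, toy$sale_price)
print(round(r, 3))
```

A coefficient close to 1 here only says the two toy columns move together linearly; it says nothing about cause, which is why the later steps still look at each variable individually.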

Best of luck on your journey to master data science!

Now we start by importing packages. I’ll explain why each of these packages is here along the way…

easypackages::libraries("dplyr", "ggplot2", "tidyr", "corrplot", "corrr", "magrittr", "e1071", "RColorBrewer", "viridis")
options(scipen = 5) # Bias R against printing numbers in scientific notation
dataset <- read.csv("train.csv")
str(dataset)

In the snippet above, we set scipen to discourage scientific notation when numbers are printed. We then import our data and use the str() function to get the gist of the variables the dataset offers and their respective data types.
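
To see what these two calls do without needing train.csv, here is a small sketch in base R. The data frame `toy` is a hypothetical stand-in with made-up values; only the column name SalePrice comes from the dataset.

```r
# Toy stand-in for train.csv; the values are made up for illustration.
toy <- data.frame(
  Id = 1:3,
  SalePrice = c(208500, 181500, 223500)
)

# With the default scipen = 0, R prints whichever notation is shorter.
options(scipen = 0)
default_print <- format(1e5)

# A positive scipen penalizes scientific notation, so fixed notation wins.
options(scipen = 5)
fixed_print <- format(1e5)

print(c(default = default_print, with_scipen = fixed_print))

# str() lists each column with its type and its first few values.
str(toy)
```

Note that scipen is a penalty rather than a hard switch: very large or very small numbers can still be printed in scientific notation if the fixed form is much wider.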

SalePrice is the dependent variable, and we are going to build all our assumptions and hypotheses around it. So it’s good to first understand this variable better. For this, we’ll plot a histogram with a density curve to get a visual sense of its distribution. You’ll notice another function, summary(), which serves the same purpose without any visualization: it reports the minimum, quartiles, median, mean, and maximum. With experience, you’ll be able to understand and interpret this form of information better.

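ggplot(dataset, aes(x = SalePrice)) +
  theme_bw() +
  # Scale bars to density; after_stat() replaces ..density.. (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)), color = 'black', fill = 'white', binwidth = 50000) +
  geom_density(alpha = .2, fill = 'blue') +
  labs(title = "Sales Price Density", x = "Price", y = "Density")

# Numeric summary of the same variable
summary(dataset$SalePrice)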

So it is pretty evident that many properties fall in the sub-$200,000 range. There are also properties over $600,000, and we could try to understand why that is and what makes these homes so ridiculously expensive. That can be another fun exercise…
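
The impression from the plot can also be backed with a few numbers. This is a minimal sketch in base R, assuming a hypothetical vector `prices` of made-up sale prices in place of dataset$SalePrice; the 50,000 bin width matches the histogram, though the bin edges may not line up exactly with the ones ggplot2 chooses.

```r
# Hypothetical sale prices, made up for illustration only.
prices <- c(129500, 143000, 181500, 208500, 223500, 250000, 307000, 625000)

# Minimum, quartiles, median, mean, and maximum in one call.
print(summary(prices))

# Frequency distribution using 50,000-wide bins.
breaks <- seq(0, 650000, by = 50000)
bins <- cut(prices, breaks = breaks, dig.lab = 6)
print(table(bins))

# Share of homes under 200,000 and count of homes over 600,000.
share_under_200k <- mean(prices < 200000)
count_over_600k <- sum(prices > 600000)
print(c(share_under_200k = share_under_200k, count_over_600k = count_over_600k))
```

Running the same three steps on dataset$SalePrice would show how heavily the real distribution leans toward the lower price range.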

Which variables do you think are most influential when deciding a price for a house you are looking to buy?

Now that we have a basic idea about SalePrice, we will try to visualize this variable in terms of some other variables. Please note that it is very important to understand what type of variable you are working with. I would like to refer you to this amazing article, which covers the topic in more detail here.

Moving on, we will be dealing with two kinds of variables.

  • Categorical Variable
  • Numeric Variable

Looking back at our dataset, we can tell these two kinds of variables apart. For starters, we run a coarse comb across the dataset and pick some variables that have the highest chance of being relevant. Note that these are just assumptions, and we are exploring the dataset to test them. The variables I selected are:

  • GrLivArea
  • TotalBsmtSF
  • YearBuilt
  • OverallQual
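
One way to inspect these four columns before deciding how to treat them is to look at their storage type and how many distinct values each holds. The sketch below uses base R on a hypothetical data frame `toy` with made-up values; only the column names come from the dataset.

```r
# Toy rows using the four selected column names; the values are made up.
toy <- data.frame(
  GrLivArea   = c(1710, 1262, 1786, 1717),
  TotalBsmtSF = c(856, 1262, 920, 756),
  YearBuilt   = c(2003, 1976, 2001, 1915),
  OverallQual = c(7, 6, 7, 7)
)

# Storage type of each column as R reads it.
col_types <- sapply(toy, class)
print(col_types)

# Number of distinct values per column: a small count relative to the
# number of rows hints at a rating scale or category, not a measurement.
distinct_values <- sapply(toy, function(col) length(unique(col)))
print(distinct_values)
```

The storage type alone does not settle the question, since a column stored as a number can still represent categories, so the distinct-value count is a useful second check.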

So which ones are quantitative and which ones are qualitative out of the lot? If you look closely the OverallQual and
