Data Normalization in Python


Opening Day

Well, it’s that time of the year again in the United States. The 162-game marathon of an MLB season is officially underway. In honor of the opening of another season of America’s Pastime, I was working on a post that uses data from the MLB. As I was writing it, I found that I kept struggling with inconsistent data across different seasons. It was really annoying, and finally it hit me: this is what I should be writing about! Why not just dedicate an entire post to normalizing data!


So that’s what I’ve done. In this post we’ll be digging into some MLB payroll data. In particular, I’m going to show you how you can use normalization techniques to compare seemingly incomparable data! Sounds like magic? Well, it’s actually really simple, but I think these little Python scripts will really help you out 🙂


Our Data

The data I’m using is a collection of MLB standings and attendance data from the past 70 years. You can read more about how I collected it in this post.


I’m sure a lot of you saw the news last week about feather, the brainchild from Wes McKinney and Hadley Wickham. As both a Python and an R user, I think it’s a really compelling idea. It’ll be interesting to see how the project progresses over time. Can’t wait to see what else they cook up!


In any event, I thought I’d give it a try for this post. I did my data collection using R (comes from a previous post on the MLB), but I wanted to do the analysis in Rodeo. After running my data collection script in R, I sent the output to a .feather file using the feather R package.


library(feather)
write_feather(standings, "standings.feather")
write_feather(attendance, "attendance.feather")


I then read that data back into Python.


(Preview of the standings frame, which includes columns such as g, home, road, lg, luck, pythwl, r, radiff, rk, sos, srs, tm, vlhp, vrhp, vs_teams_above_500, vs_teams_below_500, w, wins_losses, and year — e.g. the 1950 NYY row shows w = 98.0 and wins_losses = 0.636.)

   attend_per_game  attendance  batage  bpf  est_payroll           managers  ...  ppf  time   tm  year
0          10708.0    535418.0    26.5  103    5571200.0                Cox  ...  103  2:37  ATL  1981
1          18623.0   1024247.0    30.2  100          NaN             Weaver  ...   99  2:42  BAL  1981
2          20007.0   1060379.0    29.2  106          NaN               Houk  ...  106  2:40  BOS  1981
3          26695.0   1441545.0    30.5   99    3828834.0  Fregosi and Mauch  ...   99  2:40  CAL  1981
4           9752.0    565637.0    28.2  104          NaN         Amalfitano  ...  106  2:42  CHC  1981

Wow! Really easy. Great work Wes and Hadley! 🙂


Now that we’ve got our data, it’s time to do some munging.


The Problem

I’m looking to compare payrolls over time. There are a couple of tricky things about this.


First off (and probably most obviously), the value of the dollar has changed over the past 70 years. So there will be obvious differences between a payroll from 1970 and a payroll from 2010.


payrolls = attendance[['year', 'est_payroll']].groupby('year').mean() / 1000
payrolls[(payrolls.index==1970) | (payrolls.index==2010)]
Out[24]:
       est_payroll (1000s)
year
1970    434.565455
2010  91916.006567


Yikes! When adjusted for inflation, that $434k becomes $2.5M. Compare that to the actual average payroll in 2010, $92M, and things don’t quite add up.


That’s because the value of baseball players has ALSO been increasing over time. As teams have been able to make more money through TV revenue and other means, ballplayers’ salaries have gone up…way up! As a result, normalizing our data isn’t as simple as just adjusting for inflation. Darn!


Brief Aside: While on the subject, a super interesting factoid is the “Bobby Bonilla Mets contract”. Despite having been retired for 15 years, the Mets still pay him over $1M per year, thanks to an interesting negotiation and Mets owner Fred Wilpon’s involvement in Bernie Madoff’s Ponzi scheme. Full story here.


Bobby Bonilla still makes over $1M / year despite not having played baseball since 2001


Basic Normalization

Not to worry! We can still get an apples to apples comparison of payrolls over time. In order to make that comparison, we need our payrolls to be on the same numerical scale.


We’re going to use a really simple approach for this. For each year we’re going to calculate the mean salary for the league as a whole, and then create a derived field which compares a given team’s payroll to the mean payroll for the entire league.


Lucky for us, Python and pandas make this super easy to do. Here goes…

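A minimal sketch of that derived field, using a toy stand-in for the attendance frame (the column names `year` and `est_payroll` match the data above; the payroll numbers and the `norm_payroll` name are just for illustration — it's the ratio of a team's payroll to its year's league mean):

```python
import pandas as pd

# Toy stand-in for the attendance frame above (est_payroll in dollars).
attendance = pd.DataFrame({
    "year":        [1970, 1970, 2010, 2010],
    "tm":          ["NYY", "BOS", "NYY", "BOS"],
    "est_payroll": [500_000.0, 400_000.0, 200_000_000.0, 160_000_000.0],
})

# For each year, compute the league-wide mean payroll...
mean_payrolls = attendance[["year", "est_payroll"]].groupby("year").mean().reset_index()
mean_payrolls.columns = ["year", "league_mean_payroll"]
attendance = pd.merge(attendance, mean_payrolls, on="year")

# ...then express each team's payroll relative to that year's mean.
attendance["norm_payroll"] = attendance.est_payroll / attendance.league_mean_payroll
print(attendance[["year", "tm", "norm_payroll"]])
```

Because each payroll is divided by its own year's mean, a value of 1.1 means "10% above that season's league average", regardless of what decade the season is from.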

Let’s take a look at what our norm_payroll field looks like. Ahh there we go!


Getting the 0 to 1 Value

But what if we wanted to do something a little different? For instance, what if you wanted norm_payroll to be a standardized value between 0 and 1 (instead of on an uncapped scale as in the previous example)?


This is actually something that’s really common. Many machine learning algorithms perform much better on scaled data (support vector machines come to mind). Again, lucky for us, doing this in Python is super easy.


To do this we’ll use the same approach as before (as in, normalizing by year) but instead of using the mean, we’re going to use the max and min values for each year.


# Per-year minimum and maximum payrolls
min_payrolls = attendance[['year', 'est_payroll']].groupby('year').min().reset_index()
min_payrolls.columns = ['year', 'league_min_payroll']
max_payrolls = attendance[['year', 'est_payroll']].groupby('year').max().reset_index()
max_payrolls.columns = ['year', 'league_max_payroll']
# Join them back on, then min-max scale each payroll into [0, 1]
attendance = pd.merge(attendance, min_payrolls, on='year')
attendance = pd.merge(attendance, max_payrolls, on='year')
attendance['norm_payroll_0_1'] = (attendance.est_payroll - attendance.league_min_payroll) / (attendance.league_max_payroll - attendance.league_min_payroll)


As you can see, things actually look a bit different than they did with the first method. Keep this in mind: your normalization strategy can impact your results! Please don’t forget this!


There You Have It

There are lots more ways to normalize your data (really, whatever strategy you can think of!). These are just two approaches that work a lot of the time and can be nice starting points. By no means is this the end-all be-all of data normalization (there are many books on the subject), but hopefully this gives you a quick intro to a very important topic.

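For example, one more strategy you'll run into a lot is per-year z-score standardization: subtract each year's mean and divide by that year's standard deviation. This isn't from the post above, so treat it as an illustrative sketch on a toy frame (same `year` / `est_payroll` column names as before):

```python
import pandas as pd

# Toy frame standing in for the attendance data above.
attendance = pd.DataFrame({
    "year":        [1970, 1970, 1970, 2010, 2010, 2010],
    "est_payroll": [400_000.0, 450_000.0, 500_000.0,
                    80e6, 92e6, 120e6],
})

# Z-score within each year: (x - year mean) / year std.
grouped = attendance.groupby("year")["est_payroll"]
attendance["z_payroll"] = (
    attendance["est_payroll"] - grouped.transform("mean")
) / grouped.transform("std")
print(attendance)
```

Unlike min-max scaling, the z-score isn't bounded to [0, 1], but it tells you how many standard deviations a team sits above or below its own season's average, which is handy when outlier payrolls would squash everything else into a narrow band.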

Translated from: https://www.pybloggers.com/2016/04/data-normalization-in-python/
