Data Normalization in Python


Opening Day

Well, it’s that time of the year again in the United States. The 162-game marathon of an MLB season is officially underway. In honor of the opening of another season of America’s Pastime, I was working on a post that uses data from the MLB. As I was writing it, I found that I kept struggling with inconsistent data across different seasons. It was really annoying, and finally it hit me: this is what I should be writing about! Why not just dedicate an entire post to normalizing data!


So that’s what I’ve done. In this post we’ll be digging into some MLB payroll data. In particular, I’m going to show you how you can use normalization techniques to compare seemingly incomparable data! Sounds like magic? Well, it’s actually really simple, but I think these little Python scripts will really help you out 🙂


Our Data

The data I’m using is a collection of MLB standings and attendance data from the past 70 years. You can read more about how I collected it in this post.


I’m sure a lot of you saw the news last week about feather, the brainchild from Wes McKinney and Hadley Wickham. As both a Python and an R user, I think it’s a really compelling idea. It’ll be interesting to see how the project progresses over time. Can’t wait to see what else they cook up!


In any event, I thought I’d give it a try for this post. I did my data collection using R (comes from a previous post on the MLB), but I wanted to do the analysis in Rodeo. After running my data collection script in R, I sent the output to a .feather file using the feather R package.


library(feather)
write_feather(standings, "standings.feather")
write_feather(attendance, "attendance.feather")


I then read that data back into Python.


(Preview of the standings frame, which includes columns such as g, home, road, lg, luck, pythwl, r, radiff, rk, sos, srs, tm, vlhp, vrhp, vs_teams_above_500, vs_teams_below_500, w, wins_losses, and year — e.g. the 1950 NYY row shows w = 98.0 and wins_losses = 0.636.)

   attend_per_game  attendance  batage  bpf  est_payroll           managers  ...  ppf  time   tm  year
0          10708.0    535418.0    26.5  103    5571200.0                Cox  ...  103  2:37  ATL  1981
1          18623.0   1024247.0    30.2  100          NaN             Weaver  ...   99  2:42  BAL  1981
2          20007.0   1060379.0    29.2  106          NaN               Houk  ...  106  2:40  BOS  1981
3          26695.0   1441545.0    30.5   99    3828834.0  Fregosi and Mauch  ...   99  2:40  CAL  1981
4           9752.0    565637.0    28.2  104          NaN         Amalfitano  ...  106  2:42  CHC  1981

Wow! Really easy. Great work Wes and Hadley! 🙂


Now that we’ve got our data, it’s time to do some munging.


The Problem

I’m looking to compare payrolls over time. There are a couple of tricky things about this.


First off (and probably most obviously), the value of the dollar has changed over the past 70 years. So there will be obvious differences between a payroll from 1970 and a payroll from 2010.


payrolls = attendance[['year', 'est_payroll']].groupby('year').mean() / 1000
payrolls[(payrolls.index==1970) | (payrolls.index==2010)]
Out[24]:
       est_payroll (1000s)
year
1970    434.565455
2010  91916.006567


Yikes! When adjusted for inflation, that $434k becomes $2.5M. Compare that to the actual average payroll in 2010, $92M, and things don’t quite add up.


That’s because the value of baseball players has ALSO been increasing over time. As teams have been able to make more money through TV revenue and other means, ballplayers’ salaries have gone up…way up! As a result, normalizing our data isn’t as simple as just adjusting for inflation. Darn!


Brief Aside: While on the subject, a super interesting factoid is the “Bobby Bonilla Mets contract”. Despite having been retired for 15 years, the Mets still pay him over $1M per year, thanks to an interesting negotiation and Mets owner Fred Wilpon’s involvement in Bernie Madoff’s Ponzi scheme. Full story here.


Bobby Bonilla still makes over $1M / year despite not having played baseball since 2001


Basic Normalization

Not to worry! We can still get an apples to apples comparison of payrolls over time. In order to make that comparison, we need our payrolls to be on the same numerical scale.


We’re going to use a really simple approach for this. For each year we’re going to calculate the mean salary for the league as a whole, and then create a derived field which compares a given team’s payroll to the mean payroll for the entire league.


Lucky for us, Python and pandas make this super easy to do. Here goes…

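A minimal sketch of that derived field, using a toy stand-in for the attendance frame (the column names `year` and `est_payroll` match the data above; the payroll numbers and the `norm_payroll` name are just for illustration — it's the ratio of a team's payroll to its year's league mean):

```python
import pandas as pd

# Toy stand-in for the attendance frame above (est_payroll in dollars).
attendance = pd.DataFrame({
    "year":        [1970, 1970, 2010, 2010],
    "tm":          ["NYY", "BOS", "NYY", "BOS"],
    "est_payroll": [500_000.0, 400_000.0, 200_000_000.0, 160_000_000.0],
})

# For each year, compute the league-wide mean payroll...
mean_payrolls = attendance[["year", "est_payroll"]].groupby("year").mean().reset_index()
mean_payrolls.columns = ["year", "league_mean_payroll"]
attendance = pd.merge(attendance, mean_payrolls, on="year")

# ...then express each team's payroll relative to that year's mean.
attendance["norm_payroll"] = attendance.est_payroll / attendance.league_mean_payroll
print(attendance[["year", "tm", "norm_payroll"]])
```

Because each payroll is divided by its own year's mean, a value of 1.1 means "10% above that season's league average", regardless of what decade the season is from.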

Let’s take a look at what our norm_payroll field looks like. Ahh there we go!


Getting the 0 to 1 Value

But what if we wanted to do something a little different? For instance, what if you wanted norm_payroll to be a standardized value between 0 and 1 (instead of on an uncapped scale as in the previous example)?


This is actually something that’s really common. Many machine learning algorithms perform much better on scaled data (support vector machines come to mind). Again, lucky for us, doing this in Python is super easy.


To do this we’ll use the same approach as before (as in, normalizing by year) but instead of using the mean, we’re going to use the max and min values for each year.


# Per-year minimum and maximum payrolls
min_payrolls = attendance[['year', 'est_payroll']].groupby('year').min().reset_index()
min_payrolls.columns = ['year', 'league_min_payroll']
max_payrolls = attendance[['year', 'est_payroll']].groupby('year').max().reset_index()
max_payrolls.columns = ['year', 'league_max_payroll']
# Join them back on, then min-max scale each payroll into [0, 1]
attendance = pd.merge(attendance, min_payrolls, on='year')
attendance = pd.merge(attendance, max_payrolls, on='year')
attendance['norm_payroll_0_1'] = (attendance.est_payroll - attendance.league_min_payroll) / (attendance.league_max_payroll - attendance.league_min_payroll)


As you can see, things actually look a bit different than they did with the first method. Keep this in mind: your normalization strategy can impact your results! Please don’t forget this!


There You Have It

There are lots more ways to normalize your data (really, whatever strategy you can think of!). These are just two approaches that work a lot of the time and can be nice starting points. By no means is this the end-all be-all of data normalization (there are many books on the subject), but hopefully this gives you a quick intro to a very important topic.

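For example, one more strategy you'll run into a lot is per-year z-score standardization: subtract each year's mean and divide by that year's standard deviation. This isn't from the post above, so treat it as an illustrative sketch on a toy frame (same `year` / `est_payroll` column names as before):

```python
import pandas as pd

# Toy frame standing in for the attendance data above.
attendance = pd.DataFrame({
    "year":        [1970, 1970, 1970, 2010, 2010, 2010],
    "est_payroll": [400_000.0, 450_000.0, 500_000.0,
                    80e6, 92e6, 120e6],
})

# Z-score within each year: (x - year mean) / year std.
grouped = attendance.groupby("year")["est_payroll"]
attendance["z_payroll"] = (
    attendance["est_payroll"] - grouped.transform("mean")
) / grouped.transform("std")
print(attendance)
```

Unlike min-max scaling, the z-score isn't bounded to [0, 1], but it tells you how many standard deviations a team sits above or below its own season's average, which is handy when outlier payrolls would squash everything else into a narrow band.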

Translated from: https://www.pybloggers.com/2016/04/data-normalization-in-python/
