Data Analysis on IPL Data

weixin_26750511

Original article: https://medium.com/@prolaybanik/data-analysis-on-ipl-data-acd319a313d9

As a cricket lover, I was waiting for the start of IPL 2020, which, as we all know, is the best tournament in the world. So I decided to try my hand at IPL data analysis, using some IPL match data that I found on Kaggle.

What’s within?

This data set consists of IPL matches and their details up to season 10. The analysis covers the following:

The number of matches per season
The team that won by the most runs
The team that won by the most wickets
The cities that hosted the most matches
The team with the most wins
Whether the toss winner is also the match winner
The team with the most toss wins
The player with the most Man of the Match awards
A visual representation of the number of matches won by runs with respect to the toss winner

So I will try to explore and categorize the data by analysing the IPL match records.

First, I opened a Jupyter notebook (Google Colab works too), imported the pandas, numpy, matplotlib, and seaborn libraries, and loaded the data set into a variable named details. This reads a copy of the whole data set into memory and leaves the original file unchanged.

Here, details.head() shows the first five rows of the data frame, and details.tail() shows the last five; both default to 5 rows. The column named id has been used as the index of the data frame.
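The loading and preview steps can be sketched as follows. This is a minimal illustration, not the original notebook: the inline CSV text and its rows are invented stand-ins so the snippet runs on its own, and the file name matches.csv in the comment is an assumption about how the Kaggle download is named.

```python
import io

import pandas as pd

# Invented stand-in rows (not real match records). With the real download,
# the call would look like: pd.read_csv("matches.csv", index_col="id"),
# where the file name is an assumption.
csv_text = """id,season,city,winner
1,2017,Hyderabad,Sunrisers Hyderabad
2,2017,Pune,Rising Pune Supergiant
3,2017,Rajkot,Kolkata Knight Riders
4,2017,Indore,Kings XI Punjab
5,2017,Bangalore,Royal Challengers Bangalore
6,2017,Hyderabad,Sunrisers Hyderabad
7,2017,Mumbai,Mumbai Indians
"""

# index_col="id" makes the id column the index of the data frame.
details = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(details.head())   # first 5 rows (the default)
print(details.tail(2))  # last 2 rows; pass n to override the default
```

Both methods return a new DataFrame, so the preview never modifies details.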

We can check the shape of our dataset with details.shape (an attribute, so it takes no parentheses): we have 636 rows and 18 columns. We can also get summary information with details.info(). As the snapshot below shows, it is a pandas DataFrame with 636 entries, indexed 0 to 635, and 18 columns, listed with their data types and non-null counts, with a memory size of 89.6 KB.

Figure 2: Shape and information
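A small sketch of the shape and information check, using an invented three-row frame in place of the real 636-row dataset (the column names are assumed to match the Kaggle file):

```python
import pandas as pd

# Invented toy frame; the real dataset has 636 rows and 18 columns.
details = pd.DataFrame({
    "season": [2008, 2008, 2009],
    "city": ["Bangalore", "Chandigarh", None],
    "win_by_runs": [140, 0, 0],
})

# shape is an attribute holding (rows, columns); calling it raises a TypeError.
n_rows, n_cols = details.shape
print(n_rows, n_cols)

# info() prints the index range, column dtypes, non-null counts, and memory use.
details.info()
```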

We can describe the data frame to check the count, mean, min, max, standard deviation, and the 25%, 50%, and 75% quartile values. Here it has been done in two ways: first with the default summary of the key figures, and second across all columns. In the second summary, NaN marks a value that does not apply: there is no mean or std for umpire3 because it is categorical (qualitative) data, and the same holds for city, date, team1, team2, and so on. The top row of that summary also tells us that the most matches were played in Mumbai, the most frequent toss winner is Mumbai Indians, the most common toss decision is to field first, most results are normal, the Duckworth-Lewis (D/L) method was mostly not applied, and so on.
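The two kinds of summary can be sketched like this, on invented rows with assumed column names. The first call covers numeric columns only; include="all" adds the categorical rows (unique, top, freq) and fills in NaN where a statistic does not apply.

```python
import pandas as pd

# Invented toy rows for illustration.
details = pd.DataFrame({
    "city": ["Mumbai", "Mumbai", "Delhi", "Kolkata"],
    "win_by_runs": [0, 35, 10, 0],
})

# Default: count, mean, std, min, quartiles, max for numeric columns only.
numeric_summary = details.describe()

# include="all": also count, unique, top, freq for non-numeric columns.
full_summary = details.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In full_summary, the top row for city is the most frequent value and the mean row for city is NaN, which is how the observations above are read off.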

We can also check the standard deviation, since outliers inflate it. Next, compare the mean and the median (50%) of each column. If the mean and median are equal or nearly equal, the distribution is roughly symmetric and strong outliers are unlikely. If mean > median, the distribution tends to be positively skewed; if mean < median, it tends to be negatively skewed. We can also check the quartiles for skewness and outliers:

IQR (interquartile range) = Q3 - Q1 = 75% - 25%
Lower limit = Q1 - 1.5 * IQR
Upper limit = Q3 + 1.5 * IQR

Any value beyond these limits is an outlier. We can also look at the differences 25% - min, 50% - 25%, 75% - 50%, and max - 75% to understand the symmetry of the distribution.
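The IQR rule can be worked through on a small example. The numbers below are invented, and the series name win_by_runs is only an assumed column name:

```python
import pandas as pd

# Invented values with one deliberately extreme entry.
runs = pd.Series([0, 0, 5, 10, 12, 15, 20, 146], name="win_by_runs")

q1 = runs.quantile(0.25)   # 25% quartile
q3 = runs.quantile(0.75)   # 75% quartile
iqr = q3 - q1

lower = q1 - 1.5 * iqr     # lower limit
upper = q3 + 1.5 * iqr     # upper limit

# Values outside [lower, upper] are flagged as outliers.
outliers = runs[(runs < lower) | (runs > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, limits=({lower}, {upper})")
print(outliers.tolist())
print(runs.mean() > runs.median())  # True here: the extreme value pulls the mean up
```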

As we have seen, there are a few null values. To locate them we can use details.isnull(), and to count the null values in each column we can use details.isnull().sum(). We can also visualize them with a heatmap, which is an excellent way to inspect data when the dataset is large.
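A minimal sketch of the null check on invented rows (column names assumed to match the Kaggle file):

```python
import pandas as pd

# Invented toy rows; umpire3 is entirely missing here.
details = pd.DataFrame({
    "city": ["Mumbai", None, "Delhi"],
    "winner": ["Mumbai Indians", "Delhi Daredevils", None],
    "umpire3": [None, None, None],
})

null_mask = details.isnull()    # True wherever a value is missing
null_counts = null_mask.sum()   # missing values per column

print(null_counts.sort_values(ascending=False))
```

The boolean null_mask is also what a heatmap would be drawn from, for example by passing it to seaborn's heatmap function if seaborn is installed.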

The umpire3 column has the most null values, so for the analysis we can remove it by executing details.drop('umpire3', axis=1, inplace=True). Here axis=1 refers to columns, meaning we want to delete the umpire3 column, and inplace=True applies the change directly to the data frame instead of returning a copy. Similarly, we can delete the rows that contain null values by executing details.dropna(axis=0, inplace=True), which is useful for our data analysis.
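The two cleanup calls can be sketched on invented rows. The order matters in this toy example: if the rows were dropped first, the all-null umpire3 column would cause every row to be removed.

```python
import pandas as pd

# Invented toy rows for illustration.
details = pd.DataFrame({
    "city": ["Mumbai", None, "Delhi"],
    "winner": ["Mumbai Indians", "Delhi Daredevils", None],
    "umpire3": [None, None, None],
})

# axis=1 targets columns; inplace=True modifies details itself.
details.drop("umpire3", axis=1, inplace=True)

# axis=0 targets rows; any row with at least one null is removed.
details.dropna(axis=0, inplace=True)

print(details)
```

An equivalent style without inplace is to reassign the result, as in details = details.drop(columns="umpire3").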

To check the number of matches played per season, we can use details['season'].value_counts(). We can also sort the values as needed for interpretation, and the same result has been shown graphically. Because we analyse a single column (season) here, this kind of analysis is called univariate analysis.
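A short sketch of the per-season count, using invented season values:

```python
import pandas as pd

# Invented toy rows: one row per match.
details = pd.DataFrame({"season": [2008, 2008, 2009, 2010, 2010, 2010]})

# value_counts() sorts by count, largest first.
per_season = details["season"].value_counts()

# sort_index() puts the seasons in chronological order instead.
by_year = per_season.sort_index()

print(per_season)
print(by_year)
```

With matplotlib installed, by_year.plot(kind="bar") would draw the same counts as a bar chart.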

Next, we find the team that won by the most runs and the team that won by the most wickets, along with the season in which each happened.
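One way to look this up, sketched on invented rows. The column names win_by_runs and win_by_wickets are assumed to match the Kaggle file, and the margins shown are not real results.

```python
import pandas as pd

# Invented toy rows for illustration.
details = pd.DataFrame({
    "season": [2008, 2009, 2010],
    "winner": ["Kolkata Knight Riders", "Mumbai Indians", "Chennai Super Kings"],
    "win_by_runs": [140, 0, 31],
    "win_by_wickets": [0, 9, 0],
})

# idxmax() returns the index label of the row holding the largest value.
top_runs = details.loc[details["win_by_runs"].idxmax(),
                       ["season", "winner", "win_by_runs"]]
top_wickets = details.loc[details["win_by_wickets"].idxmax(),
                          ["season", "winner", "win_by_wickets"]]

print(top_runs)
print(top_wickets)
```

If several matches share the largest margin, idxmax() reports only the first of them.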

The cities that hosted the most matches:
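A minimal sketch of the city ranking, on invented rows with an assumed city column:

```python
import pandas as pd

# Invented toy rows: one row per match.
details = pd.DataFrame({
    "city": ["Mumbai", "Bangalore", "Mumbai", "Kolkata", "Bangalore", "Mumbai"],
})

# Count matches per city and keep the two busiest venues.
top_cities = details["city"].value_counts().head(2)

print(top_cities)
```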

The teams with the most wins across all seasons up to 2017:
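The win tally can be sketched with a groupby on the winner column. The rows are invented, so the totals say nothing about the real standings.

```python
import pandas as pd

# Invented toy rows: one row per match.
details = pd.DataFrame({
    "winner": [
        "Mumbai Indians", "Chennai Super Kings", "Mumbai Indians",
        "Kolkata Knight Riders", "Mumbai Indians", "Chennai Super Kings",
    ],
})

# size() counts the rows in each group, i.e. wins per team.
wins = details.groupby("winner").size().sort_values(ascending=False)
best_team = wins.idxmax()

print(wins)
print(best_team)
```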

Now we check whether the toss winner is also the match winner by comparing the toss winner with the winner recorded in the dataset.
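The comparison can be sketched as an element-wise equality test between the two columns (assumed here to be named toss_winner and winner), on invented rows:

```python
import pandas as pd

# Invented toy rows for illustration.
details = pd.DataFrame({
    "toss_winner": ["Mumbai Indians", "Chennai Super Kings",
                    "Kolkata Knight Riders", "Mumbai Indians"],
    "winner": ["Mumbai Indians", "Mumbai Indians",
               "Kolkata Knight Riders", "Chennai Super Kings"],
})

# True where the team that won the toss also won the match.
same_team = details["toss_winner"] == details["winner"]

print(same_team.value_counts())
share = same_team.mean()   # fraction of matches where both coincide
print(share)
```

Taking the mean of a boolean series gives the proportion of True values, which is a compact way to report how often the toss winner goes on to win.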

The teams that won the most tosses:
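For toss wins, a share of all tosses can be more readable than a raw count. A sketch on invented rows, assuming a toss_winner column:

```python
import pandas as pd

# Invented toy rows: one row per match.
details = pd.DataFrame({
    "toss_winner": ["Mumbai Indians", "Mumbai Indians",
                    "Kolkata Knight Riders", "Chennai Super Kings"],
})

# normalize=True returns proportions instead of counts.
toss_share = details["toss_winner"].value_counts(normalize=True)

print(toss_share)
```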

The players with the most Man of the Match awards:
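A sketch of the award ranking. The player names are placeholders and player_of_match is an assumed column name.

```python
import pandas as pd

# Placeholder player names, one row per match.
details = pd.DataFrame({
    "player_of_match": ["Player A", "Player B", "Player A",
                        "Player C", "Player A", "Player B"],
})

awards = details["player_of_match"].value_counts()

# nlargest(n) keeps the n players with the highest award counts.
top_players = awards.nlargest(2)

print(top_players)
```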

Finally, we analyse one team, Kolkata Knight Riders: the number of matches it won by runs, broken down by toss winner, represented in different plots.
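One possible reading of this step, sketched on invented rows: keep the matches that Kolkata Knight Riders won by runs, then group them by toss winner. The column names are assumed to match the Kaggle file, and the original notebook may have filtered differently.

```python
import pandas as pd

kkr = "Kolkata Knight Riders"

# Invented toy rows for illustration.
details = pd.DataFrame({
    "winner": [kkr, kkr, "Mumbai Indians", kkr, kkr],
    "toss_winner": [kkr, "Mumbai Indians", "Mumbai Indians",
                    kkr, "Chennai Super Kings"],
    "win_by_runs": [10, 0, 20, 5, 7],
})

# Matches KKR won by runs (a win by wickets has win_by_runs == 0).
kkr_run_wins = details[(details["winner"] == kkr) & (details["win_by_runs"] > 0)]

# Per toss winner: how many such wins, and the total run margin.
by_toss = kkr_run_wins.groupby("toss_winner")["win_by_runs"].agg(["count", "sum"])

print(by_toss)
```

With matplotlib installed, by_toss["count"].plot(kind="bar") would give one of the plots; seaborn offers alternatives such as count plots on kkr_run_wins.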

Conclusion:

We have analysed the IPL match data with the help of the explanations and visualizations above, and can conclude that Mumbai Indians has done a great job so far. This kind of analysis can be useful to cricket statisticians and to all cricket lovers.

Reference:

https://www.kaggle.com/manasgarg/ipl