How All-Rounder Ratings Are Calculated in Cricket

Latest recommended article published on 2023-01-01 17:34:52

李_涛

Original article: https://towardsdatascience.com/how-all-rounder-ratings-are-calculated-in-cricket-38176e06ce30

If you have been following cricket at all in the last year, you will have heard of Ben Stokes. He has been a match winner for England and has had the comeback story of a lifetime.

Now, he is statistically the world’s best Test all-rounder, having recently overtaken the captain of the West Indies, Jason Holder.

I want to better understand what that title means.

Today, I will use web scraping and basic machine learning in Python to uncover how all-rounders are rated in Test cricket. Although many of my conclusions are already outlined in ranking overviews, the setup and approach I use can be applied to more complex problems.

Background

This piece is not intended to teach cricket. While I will provide a basic overview of Test cricket, I would recommend checking out these videos if you would like to learn more.

Photo by Chirayu Trivedi on Unsplash

In a Test cricket match, each team gets to participate in batting and bowling. Batsmen try to score “runs”, and bowlers try to limit these runs. To win a match, a team must score more runs than their opponent.

Typically, teams have some players who are “specialist bowlers” and some players who are “specialist batsmen”. Specialist players have one main responsibility — batting or bowling.

There are also all-rounders like Ben Stokes. These are players who are good enough at batting and bowling to be selected to do both.

The International Cricket Council (ICC) has a somewhat opaque way of calculating a player’s batting or bowling rating. Many considerations go into how a player’s batting or bowling rating changes after a game, including the quality of the other team’s players, the performance of other players on the same team, and whether the player’s team wins or not.

Batting and bowling ratings are designed to be entirely derivable from match statistics alone — there is no panel of judges that decides how a player’s rating changes.

Likewise, all-rounder ratings are supposed to be entirely derivable from batting and bowling ratings. To find the exact relationship between batting, bowling, and all-rounder ratings, I turned to the data.

Scraping the Data

I started off by gathering the ICC-maintained Test cricket ratings of players. The Python scraping library I used is called BeautifulSoup. I chose it because it handles custom scraping solutions without requiring me to use a specific post-processing solution. BeautifulSoup only takes HTML strings as input, so I first have to use a tool like requests to retrieve the page at the URL in question. An example is shown below:

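import requests
from bs4 import BeautifulSoup

# batting_url holds the address of the ICC Test batting rankings page (set elsewhere).
batting_request = requests.get(batting_url)
batting_request.raise_for_status()  # stop early if the page could not be fetched
# Parsing with "lxml" requires the lxml package to be installed.
batting_soup = BeautifulSoup(batting_request.content, "lxml")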

I knew that scraping would be doable after I saw the form of the HTML code that displays the rankings for batting, bowling, or all-rounders. In the batting rankings specifically, we can see that each row of the table corresponds to a <tr class="table-body"> tag that contains a player’s name, ranking, and rating. All the rankings (batting, bowling, and all-rounder) happen to follow this format.

<tr class="table-body">
   <td class="table-body__cell table-body__cell--position u-text-right">
      2
      <span class="ranking-pos no-change"></span>
   </td>
   <td class="table-body__cell rankings-table__name name">
      <a href="/rankings/mens/player-rankings/164">Virat Kohli</a>
   </td>
   <td class="table-body__cell nationality-logo rankings-table__team">
      <span class="flag-15 table-body_logo IND"></span>
      <span class="table-body__logo-text">IND</span>
   </td>
   <td class="table-body__cell rating">886</td>
   <td class="table-body__cell u-text-right u-hide-phablet">937 v England, 22/08/2018</td>
</tr>

As shown, a sample row contains several <td> tags that themselves store the information we are looking for. I noticed two types of <td> tags: the ones that contain relevant text at the outermost tag level like <td class="table-body__cell table-body__cell--position u-text-right">, and the ones that contain relevant text within their innermost nested tags like <td class="table-body__cell rankings-table__name name">. To avoid duplication of code, I created two functions to handle these common scenarios:

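def get_outer_data_from_tag(tag, data_tag_name, data_class_name):
  # Return only the text that sits directly inside the matched tag, ignoring nested tags.
  return tag.find(name=data_tag_name, attrs={'class' : data_class_name}).find(recursive=False, text=True).strip()


def get_inner_data_from_tag(tag, data_tag_name, data_class_name):
  # Return all text inside the matched tag, including text held by nested tags.
  return tag.find(data_tag_name, {'class' : data_class_name}).text.strip()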

Both functions use BeautifulSoup’s find function, which finds and returns the first tag with the specified tag name and attributes. Each function then returns the text of the found tag after stripping any surrounding whitespace.

get_outer_data_from_tag passes the additional parameters “recursive=False” and “text=True” to its second find call. These parameters ensure that we only look for text directly inside the outermost tag rather than looking for tags or text in any nested tags.
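
To see the two helpers at work, here is a minimal, self-contained sketch that runs them on a trimmed copy of the sample row above. The class names passed in (`table-body__cell--position`, `name`, `rating`) are my reading of the sample HTML and may not be the ones used in the original script. The sketch uses Python's built-in `html.parser` so that the bare `<tr>` fragment parses without a surrounding table, and it spells the text filter `string=True`, the newer name BeautifulSoup accepts for the `text=True` argument.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample row shown earlier.
SAMPLE_ROW = """
<tr class="table-body">
  <td class="table-body__cell table-body__cell--position u-text-right">
    2
    <span class="ranking-pos no-change"></span>
  </td>
  <td class="table-body__cell rankings-table__name name">
    <a href="/rankings/mens/player-rankings/164">Virat Kohli</a>
  </td>
  <td class="table-body__cell rating">886</td>
</tr>
"""


def get_outer_data_from_tag(tag, data_tag_name, data_class_name):
    # Text that is a direct child of the matched cell (nested tags are skipped).
    cell = tag.find(name=data_tag_name, attrs={"class": data_class_name})
    return cell.find(string=True, recursive=False).strip()


def get_inner_data_from_tag(tag, data_tag_name, data_class_name):
    # All text inside the matched cell, including text in nested tags.
    return tag.find(data_tag_name, {"class": data_class_name}).text.strip()


soup = BeautifulSoup(SAMPLE_ROW, "html.parser")
row = soup.find("tr", {"class": "table-body"})

ranking = get_outer_data_from_tag(row, "td", "table-body__cell--position")
player = get_inner_data_from_tag(row, "td", "name")
rating = get_outer_data_from_tag(row, "td", "rating")
print(ranking, player, rating)
```

The ranking cell is the case that needs the "outer" helper: its number sits next to an empty `<span>`, so asking only for the cell's direct text keeps the lookup independent of whatever the nested tag contains.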

Storing the Data

Ultimately, we need the data in an easy-to-manipulate format. I decided to use Pandas DataFrames since they not only display data easily, but also allow for fast and intuitive data transformations.

Building the DataFrame involves calling the parsing functions I discussed earlier. For every row in a ranking table, I create a dictionary with a player’s name, ranking, and rating. I append this dictionary to a list of dictionaries and, after traversing every row, create a DataFrame from this list.

import pandas as pd


# The wrapper name is illustrative. The *_column_name and *_class_name variables
# are assumed to be defined elsewhere in the script.
def build_ranking_df(ranking_div):
  trs = ranking_div.find_all("tr", {'class' : "table-body"})
  rows_list = []
  for tr in trs:
    # One dictionary per table row: player name, ranking, and rating.
    row_dict = {}
    row_dict[player_column_name] = get_inner_data_from_tag(tr, "td", name_class_name)
    row_dict[ranking_column_name] = get_outer_data_from_tag(tr, "td", rank_class_name)
    row_dict[rating_column_name] = get_outer_data_from_tag(tr, "td", rating_class_name)
    rows_list.append(row_dict)
  return pd.DataFrame(rows_list, columns=[player_column_name, ranking_column_name, rating_column_name])

As an example, the produced Pandas DataFrame for batting is shown below.

I constructed similar tables for bowling and all-rounder rankings, and I merged all the tables using the Pandas merge function. The result was a complete table with all of our data.
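
The merge step might look roughly like the sketch below. The column names, the `Player` join key, and the choice of an inner join are my assumptions, and the ratings are made-up numbers for hypothetical players, used only to show the shape of the result.

```python
import pandas as pd

# Made-up ratings for hypothetical players; these are not real ICC figures.
batting_df = pd.DataFrame([
    {"Player": "Player A", "Batting Rating": 800},
    {"Player": "Player B", "Batting Rating": 450},
    {"Player": "Player C", "Batting Rating": 300},
])
bowling_df = pd.DataFrame([
    {"Player": "Player B", "Bowling Rating": 400},
    {"Player": "Player C", "Bowling Rating": 700},
])
all_rounder_df = pd.DataFrame([
    {"Player": "Player B", "All-Rounder Rating": 180},
    {"Player": "Player C", "All-Rounder Rating": 210},
])

# An inner join keeps only the players who appear in all three tables.
complete_df = (
    batting_df
    .merge(bowling_df, on="Player", how="inner")
    .merge(all_rounder_df, on="Player", how="inner")
)
print(complete_df)
```

With an inner join, a player who appears only in the batting table (Player A here) drops out, which suits the later regression because every remaining row has all three ratings.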

Findings

Once I had the data in a suitable form, I needed to find the equation that produced all-rounder ratings.

To do this, we define a new “X” matrix that contains the batting and bowling rating columns from our complete DataFrame. Our “Y” matrix contains the corresponding all-rounder rating column.

Now we need to find an equation that maps X to Y. In the simplest linear regression, this involves finding a coefficient for each variable in the X matrix. For our problem, this would mean finding a coefficient for the batting rating and one for the bowling rating. This resembles a linear combination of variables:

a * (batting rating) + b * (bowling rating) = all-rounder rating

An approach that gives much better results assumes that the equation is partially quadratic:

a * (batting rating) + b * (bowling rating) + c * (batting rating) * (bowling rating) + d = all-rounder rating

This means that the equation has degree two, but the only second-degree term we include is the interaction term, the product of the two ratings. The squared terms are left out.

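from sklearn.preprocessing import PolynomialFeatures

# Column order matters: bowling rating first, batting rating second.
X = complete_df[[bowling_rating, batting_rating]].values.astype(int)
Y = complete_df[all_rounder_rating].to_numpy().astype(int)

# degree defaults to 2; interaction_only=True drops the squared terms.
poly = PolynomialFeatures(interaction_only=True)

# Transform [x1, x2] to [1, x1, x2, x1 * x2]
X_ = poly.fit_transform(X)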

The fit_transform function expands the simple 2-column “X” matrix we defined into 4 columns: one for a constant term, one for each of the two ratings (in the order they appear in “X”, so bowling first and then batting in the code above), and one for the product of the batting rating and the bowling rating (the “interaction term”).
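
A tiny example makes the expansion concrete. The input values below are arbitrary small numbers chosen so the products are easy to check by hand; they are not ratings.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two rows with two features each: [x1, x2].
X_demo = np.array([[2, 3],
                   [4, 5]])

poly = PolynomialFeatures(interaction_only=True)
X_expanded = poly.fit_transform(X_demo)

# Each row becomes [1, x1, x2, x1 * x2].
print(X_expanded)
```

The first row expands to 1, 2, 3, 6 and the second to 1, 4, 5, 20. No squared columns appear because `interaction_only=True` excludes them.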

Then we can run a linear regression on our new 4-column “X” matrix and the previously defined “Y” matrix.

from sklearn import linear_model

clf = linear_model.LinearRegression()
clf.fit(X_, Y)

If we examine our model’s coefficients, we get an array of 4 values. Although we didn’t have quite enough data to get a completely noise-free formula, the fitted coefficients for every term except the (batting rating) * (bowling rating) term are close to zero. The coefficient for the interaction term ends up being roughly 1/1000, which matches formulas discussed elsewhere and keeps the error of our model very low.
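
As a sanity check on the method rather than on the real data, the sketch below fits the same kind of model to synthetic ratings generated from an assumed rule, product of the two ratings divided by 1000. The variable names and the random inputs are mine. Because the synthetic target follows the rule exactly, the fit recovers it almost perfectly, whereas the scraped ratings would carry some rounding noise.

```python
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic ratings between 0 and 900; these are not real ICC figures.
X_synth = rng.integers(0, 901, size=(50, 2))

# Assumed rule for the synthetic target: product of the two ratings over 1000.
Y_synth = X_synth[:, 0] * X_synth[:, 1] / 1000

# Same expansion as in the article: [1, x1, x2, x1 * x2].
X_synth_ = PolynomialFeatures(interaction_only=True).fit_transform(X_synth)
model = linear_model.LinearRegression().fit(X_synth_, Y_synth)

# coef_ follows the expanded column order.
print(model.coef_)
print(model.intercept_)
```

The last coefficient should come out very close to 0.001 and the others close to zero, which is the pattern described above.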

Conclusion

The all-rounder rating’s calculation is unusual because of what it conveys. We can’t simply average a player’s batting and bowling ratings to find their all-rounder rating, because then a pure batsman or bowler would have a high all-rounder rating even though that player doesn’t have the skills of an all-rounder.

Instead, by multiplying the batting and bowling ratings and then dividing by 1000, we avoid promoting stellar specialist players to the top of the all-rounder rankings. In this sense, the all-rounder rating does a good job of reflecting how well a player can bat and bowl.
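
A quick comparison with made-up numbers shows the difference between the two rules. The function names and the two hypothetical players below are purely illustrative.

```python
def average_rating(batting, bowling):
    # The rejected rule: a plain average of the two ratings.
    return (batting + bowling) / 2


def product_rating(batting, bowling):
    # The rule the regression points to: product of the ratings over 1000.
    return batting * bowling / 1000


# Hypothetical players as (batting rating, bowling rating).
specialist_batsman = (900, 50)
balanced_all_rounder = (450, 400)

print(average_rating(*specialist_batsman), average_rating(*balanced_all_rounder))
print(product_rating(*specialist_batsman), product_rating(*balanced_all_rounder))
```

Under averaging, the specialist scores 475 against the all-rounder's 425 and would rank higher. Under the product rule, the specialist drops to 45 while the all-rounder scores 180.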

However, the all-rounder rating falls short because it doesn’t evaluate true “all-rounder” capabilities. For example, it doesn’t incorporate wicket-keeping or fielding skills, which are crucial aspects of a cricket match. The reason usually given for this deficiency is that there is no good way to measure these abilities. I suspect that will change as cricket becomes increasingly statistics-driven.

If you are interested in seeing or running the code, you can find it here. The approach I went over can be used to scrape cricket data for a different purpose or to analyze a different aspect of the rating system. I am curious to know what benefited you!
