Descriptive Statistics in Python
The field of statistics is often misunderstood, but it plays an essential role in our everyday lives. Statistics, done correctly, allows us to extract knowledge from the vague, complex, and difficult real world. Wielded incorrectly, statistics can be used to harm and mislead. A clear understanding of statistics and the meanings of various statistical measures is important for distinguishing between truth and misdirection.
We will cover the following in this article:
- defining statistics
- descriptive statistics
- measures of central tendency
- measures of spread
Prerequisites:
This article assumes no prior knowledge of statistics, but does require at least a general knowledge of Python. If you are uncomfortable with `for` loops and lists, I recommend covering them briefly before progressing.
Loading in our data
We will root our discussion of statistics in real-world data, taken from Kaggle’s Wine Reviews data set. The data itself comes from a scraper that scoured the Wine Enthusiast site.
For the sake of this article, let’s say that you are a sommelier-in-training, a new wine taster. You found this interesting data set on wines, and you would like to compare and contrast different wines. You’ll use statistics to describe the wines in the data set and derive some insights for yourself. Perhaps we can start our training with a cheap set of wines, or the most highly rated ones?
The code below loads the data set `wine-data.csv` into a variable `wines` as a list of lists. We’ll perform statistics on `wines` throughout the article. You can use this code to follow along on your own computer.
```python
import csv

# Read the whole file into a list of rows; each row is a list of strings.
with open("wine-data.csv", "r", encoding="latin-1") as f:
    wines = list(csv.reader(f))
```
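To check that a file loaded this way has the shape you expect, a short sketch like the one below can help. It is only an illustration: it reads a tiny in-memory sample instead of `wine-data.csv`, the names `sample_csv`, `sample_wines`, and `split_header` are hypothetical, and it assumes that the first row of the file holds the column names.

```python
import csv
import io

# Hypothetical stand-in for wine-data.csv so the sketch runs anywhere.
# The values are taken from a few columns of the first rows of the data set.
sample_csv = io.StringIO(
    "index,country,points,price\n"
    "0,US,96,235\n"
    "1,Spain,96,110\n"
    "2,US,96,90\n"
)


def split_header(rows):
    """Return (header, data_rows), assuming the first row holds column names."""
    return rows[0], rows[1:]


sample_wines = list(csv.reader(sample_csv))
header, data = split_header(sample_wines)

# Peek at the column names and up to the first five data rows.
print(header)
for row in data[:5]:
    print(row)

# csv.reader yields every field as a string, so numbers must be converted
# before doing any arithmetic on them.
first_price = float(data[0][header.index("price")])
```

Because every value arrives as a string, any later calculation on points or prices would need a conversion step like the last line above.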
Let’s have a brief look at the first five rows of the data in a table, so we can see what kinds of values we’re working with.
index | country | description | designation | points | price | province | region_1 | region_2 | variety | winery |
---|---|---|---|---|---|---|---|---|---|---|
0 | US | “This tremendous 100%…” | Martha’s Vineyard | 96 | 235 | California | Napa Valley | Napa | Cabernet Sauvignon | Heitz |
1 | Spain | “Ripe aromas of fig… | Carodorum Selecci Especial Reserva | 96 | 110 | Northern Spain | Toro | | Tinta de Toro | Bodega Carmen Rodriguez |
2 | US | “Mac Watson honors… | Special Selected Late Harvest | 96 | 90 | California | Knights Valley | Sonoma | Sauvignon Blanc | Macauley |
3 | US | “This spent 20 months… | Reserve | 96 | 65 | Oregon | Willamette Valley | Willamette Valley | Pinot Noir |