30 Days of Python: Day 25 - Pandas

Article link: https://blog.csdn.net/qq_62599387/article/details/129768147

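Pandas is a powerful Python library for data analysis that provides the Series and DataFrame data structures. A Series is one-dimensional data, and a DataFrame is two-dimensional tabular data. This article covers installing Pandas, creating Series and DataFrames, and working with DataFrames, including reading CSV files, exploring data, and modifying it. It also walks through practical examples such as calculating BMI, handling data types, and filling in outliers.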


<< Day 24 || Day 26 >>

Day 25

Pandas

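Pandas is an open-source, high-performance, easy-to-use library of data structures and data analysis tools for the Python programming language. It adds data structures designed for working with table-like data: Series and DataFrames. Pandas provides tools for data manipulation:

reshaping
merging
sorting
slicing
aggregation
imputation

If you are using Anaconda, you do not need to install pandas.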

Installing Pandas

For Mac:

pip install pandas
# or, if you manage packages with conda:
conda install pandas

For Windows:

pip install pandas

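Pandas data structures are based on Series and DataFrames.

A Series is a single column, and a DataFrame is a two-dimensional table made up of a collection of Series. To create a pandas Series, we can use a one-dimensional NumPy array or a Python list. Let us look at some examples of a Series: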

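[Figure: names pandas Series]

[Figure: countries Series]

[Figure: cities Series]

As you can see, a pandas Series is just one column of data. If we want multiple columns, we use a DataFrame.

Let us look at an example of a pandas DataFrame:

[Figure: pandas DataFrame]

A DataFrame is a collection of rows and columns. A DataFrame can hold more columns than the example above:

[Figure: DataFrame with more columns]

Next, we will see how to import pandas and how to create Series and DataFrames with it.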

Importing Pandas

import pandas as pd # importing pandas as pd
import numpy  as np # importing numpy as np

Creating a Pandas Series with the Default Index

nums = [1, 2, 3, 4,5]
s = pd.Series(nums)
print(s)

    0    1
    1    2
    2    3
    3    4
    4    5
    dtype: int64

Creating a Pandas Series with a Custom Index

nums = [1, 2, 3, 4, 5]
s = pd.Series(nums, index=[1, 2, 3, 4, 5])
print(s)

    1    1
    2    2
    3    3
    4    4
    5    5
    dtype: int64

fruits = ['Orange','Banana','Mango']
fruits = pd.Series(fruits, index=[1, 2, 3])
print(fruits)

    1    Orange
    2    Banana
    3    Mango
    dtype: object
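
With a custom index, label-based and position-based lookups no longer coincide. The sketch below rebuilds the fruits Series from above and reads the same element both ways; the variable names first_by_label and first_by_position are only illustrative.

```python
import pandas as pd

# Same Series as above: labels run 1..3 instead of the default 0..2
fruits = pd.Series(['Orange', 'Banana', 'Mango'], index=[1, 2, 3])

first_by_label = fruits.loc[1]      # .loc looks up by index label
first_by_position = fruits.iloc[0]  # .iloc looks up by integer position

print(first_by_label, first_by_position)
```

Using .loc and .iloc explicitly avoids ambiguity when the index labels are themselves integers.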

Creating a Pandas Series from a Dictionary

dct = {'name':'Asabeneh','country':'Finland','city':'Helsinki'}

s = pd.Series(dct)
print(s)

    name       Asabeneh
    country     Finland
    city       Helsinki
    dtype: object

Creating a Constant Pandas Series

s = pd.Series(10, index = [1, 2, 3])
print(s)

    1    10
    2    10
    3    10
    dtype: int64

Creating a Pandas Series Using Linspace

s = pd.Series(np.linspace(5, 20, 10)) # linspace(starting, end, items)
print(s)

    0     5.000000
    1     6.666667
    2     8.333333
    3    10.000000
    4    11.666667
    5    13.333333
    6    15.000000
    7    16.666667
    8    18.333333
    9    20.000000
    dtype: float64

DataFrames

Pandas DataFrames can be created in different ways.

Creating a DataFrame from a Dictionary

data = {'Name': ['Asabeneh', 'David', 'John'], 'Country':[
    'Finland', 'UK', 'Sweden'], 'City': ['Helsinki', 'London', 'Stockholm']}
df = pd.DataFrame(data)
print(df)
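
A DataFrame can also be built from a list of lists, where each inner list is one row. The following is a minimal sketch that reuses the same sample people; because the rows carry no column names, they are passed separately through the columns argument.

```python
import pandas as pd

# Each inner list is one row of the table
data = [
    ['Asabeneh', 'Finland', 'Helsinki'],
    ['David', 'UK', 'London'],
    ['John', 'Sweden', 'Stockholm'],
]

# Column names must be supplied explicitly for row-oriented data
df = pd.DataFrame(data, columns=['Name', 'Country', 'City'])
print(df)
```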

Creating a DataFrame from a List of Dictionaries

data = [
    {'Name': 'Asabeneh', 'Country': 'Finland', 'City': 'Helsinki'},
    {'Name': 'David', 'Country': 'UK', 'City': 'London'},
    {'Name': 'John', 'Country': 'Sweden', 'City': 'Stockholm'}]
df = pd.DataFrame(data)
print(df)

Reading a CSV File Using Pandas

To download the CSV file needed for this example, the console/command line is enough:

curl -O https://raw.githubusercontent.com/Asabeneh/30-Days-Of-Python/master/data/weight-height.csv

Place the downloaded file in your working directory.

import pandas as pd

df = pd.read_csv('weight-height.csv')
print(df)
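
If the file is not at hand, read_csv can be tried on an in-memory string instead. The sketch below assumes the same three column names as the downloaded file; the three rows and the Gender labels are illustrative stand-ins, not the real dataset.

```python
import io
import pandas as pd

# A tiny stand-in for weight-height.csv (illustrative rows only)
csv_text = """Gender,Height,Weight
Male,73.847017,241.893563
Male,68.781904,162.310473
Female,61.944246,113.649103
"""

# read_csv accepts any file-like object, not just a path
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)
print(sample_df.dtypes)
```

The header row becomes the column names, and numeric columns are parsed as floats automatically.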

Data Exploration

Let us read only the first 5 rows using head()

print(df.head()) # gives the first five rows; pass a number to head() to change how many rows are returned

Let us also explore the last records of the DataFrame using the tail() method.

print(df.tail()) # gives the last five rows; pass a number to tail() to change how many rows are returned

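As you can see, the CSV file has three columns: Gender, Height, and Weight. If a DataFrame had many columns, it would be hard to see them all at a glance, so we should use an attribute that lists them. We also do not yet know the number of rows. Let us use the shape attribute.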

print(df.shape) # as you can see 10000 rows and three columns

(10000, 3)

Let us get all the columns using columns.

print(df.columns)

Index(['Gender', 'Height', 'Weight'], dtype='object')

Now, let us get a specific column using the column key

heights = df['Height'] # this is now a series

print(heights)

    0       73.847017
    1       68.781904
    2       74.110105
    3       71.730978
    4       69.881796
              ...    
    9995    66.172652
    9996    67.067155
    9997    63.867992
    9998    69.034243
    9999    61.944246
    Name: Height, Length: 10000, dtype: float64

weights = df['Weight'] # this is now a series

print(weights)

    0       241.893563
    1       162.310473
    2       212.740856
    3       220.042470
    4       206.349801
               ...    
    9995    136.777454
    9996    170.867906
    9997    128.475319
    9998    163.852461
    9999    113.649103
    Name: Weight, Length: 10000, dtype: float64

print(len(heights) == len(weights))

True

The describe() method provides descriptive statistics of a dataset.

print(heights.describe()) # gives statistical information about the height data

    count    10000.000000
    mean        66.367560
    std          3.847528
    min         54.263133
    25%         63.505620
    50%         66.318070
    75%         69.174262
    max         78.998742
    Name: Height, dtype: float64

print(weights.describe())

    count    10000.000000
    mean       161.440357
    std         32.108439
    min         64.700127
    25%        135.818051
    50%        161.212928
    75%        187.169525
    max        269.989699
    Name: Weight, dtype: float64

print(df.describe())  # describe can also give statistical information from a dataFrame

Similar to describe(), the info() method also provides information about the dataset.
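
As a rough illustration of info(), the sketch below builds a small stand-in DataFrame (the rows are illustrative, not the real dataset) and captures the report in a string buffer so it can be inspected. Calling info() with no arguments prints the same report directly.

```python
import io
import pandas as pd

# Small stand-in DataFrame with the same column names as the CSV file
df_info = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female'],
    'Height': [73.8, 68.8, 61.9],
    'Weight': [241.9, 162.3, 113.6],
})

# info() writes its report to stdout by default; buf redirects it
buffer = io.StringIO()
df_info.info(buf=buffer)
report = buffer.getvalue()
print(report)
```

Where describe() summarizes the values, info() summarizes the structure: column names, non-null counts, and data types.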

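Modifying a DataFrame

We can modify a DataFrame in several ways:
- We can create a new DataFrame
- We can create a new column and add it to the DataFrame
- We can remove an existing column from the DataFrame
- We can modify an existing column in the DataFrame
- We can change the data type of column values in the DataFrame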
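
Removing a column, one of the operations listed above, can be sketched with drop(). The DataFrame name people and the choice of dropping City are illustrative assumptions.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Asabeneh', 'David', 'John'],
    'Country': ['Finland', 'UK', 'Sweden'],
    'City': ['Helsinki', 'London', 'Stockholm'],
})

# drop() returns a new DataFrame; the original is left unchanged
without_city = people.drop(columns=['City'])
print(without_city.columns.tolist())
```

Because drop() returns a copy by default, assign the result back to the same name if the original should be replaced.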

Creating a DataFrame

As always, first we import the necessary packages. Now, let us import pandas and numpy, two best friends.

import pandas as pd
import numpy as np
data = [
    {"Name": "Asabeneh", "Country":"Finland","City":"Helsinki"},
    {"Name": "David", "Country":"UK","City":"London"},
    {"Name": "John", "Country":"Sweden","City":"Stockholm"}]
df = pd.DataFrame(data)
print(df)

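Adding a column to a DataFrame is like adding a key to a dictionary.

First, let us use the previous example to create a DataFrame. After we create the DataFrame, we will start modifying the columns and column values.

Adding a New Column

Let us add a weight column to the DataFrame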

weights = [74, 78, 69]
df['Weight'] = weights
df

Let us also add a height column to the DataFrame

heights = [173, 175, 169]
df['Height'] = heights
print(df)

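As you can see in the DataFrame above, we did add new columns, Weight and Height. Let us add one more column called BMI (Body Mass Index) by calculating each person's BMI from their mass and height. BMI is mass in kilograms divided by the square of height in meters: Weight / (Height * Height).

As you can see, the height is in centimeters, so we should change it to meters. Let us modify the height column.

Modifying Column Values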

df['Height'] = df['Height'] * 0.01
df

# Using functions makes our code clean, but you can calculate the bmi without one
def calculate_bmi():
    weights = df['Weight']
    heights = df['Height']
    bmi = []
    for w,h in zip(weights, heights):
        b = w/(h*h)
        bmi.append(b)
    return bmi
    
bmi = calculate_bmi()

df['BMI'] = bmi
df
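
The same BMI column can also be computed without an explicit loop, since arithmetic on pandas Series is applied element by element. This is a minimal sketch using the weights and the heights (already in meters) from above; the name df_bmi is illustrative.

```python
import pandas as pd

df_bmi = pd.DataFrame({
    'Weight': [74, 78, 69],        # kilograms
    'Height': [1.73, 1.75, 1.69],  # meters
})

# Element-wise division: each weight is divided by its own height squared
df_bmi['BMI'] = df_bmi['Weight'] / df_bmi['Height'] ** 2
print(df_bmi)
```

The vectorized form is shorter and usually faster on large DataFrames than looping over rows.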

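Formatting DataFrame Columns

The BMI column values of the DataFrame are floats with many digits after the decimal point. Let us round them to one digit after the decimal point.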

df['BMI'] = round(df['BMI'], 1)
print(df)

The information in the DataFrame does not seem complete yet, so let us add birth year and current year columns.

birth_year = ['1769', '1985', '1990']
current_year = pd.Series(2020, index=[0, 1,2])
df['Birth Year'] = birth_year
df['Current Year'] = current_year
df

Checking the Data Types of Column Values

print(df.Weight.dtype)

    int64

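df['Birth Year'].dtype # the values are strings, so we should change them to numbers

df['Birth Year'] = df['Birth Year'].astype('int')
print(df['Birth Year'].dtype) # let's check the data type now

    int32

Note that astype('int') uses the platform's default integer type, so this may print int64 instead of int32 on other systems.

Now the same for the current year: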

df['Current Year'] = df['Current Year'].astype('int')
df['Current Year'].dtype

    dtype('int32')
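
To get the same integer width on every platform, the target type can be named explicitly. This is a small sketch using the birth years from above; the variable name as_int64 is illustrative.

```python
import pandas as pd

birth_year = pd.Series(['1769', '1985', '1990'])  # years stored as strings

# Naming the width explicitly avoids the platform-dependent default of 'int'
as_int64 = birth_year.astype('int64')
print(as_int64.dtype)
```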

Now the column values of birth year and current year are integers. We can calculate the ages.

ages = df['Current Year'] - df['Birth Year']
ages

0    251
1     35
2     30
dtype: int32

df['Ages'] = ages
print(df)

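The person in the first row appears to have lived 251 years. It is not realistic for someone to live that long. Either it is a typo or the data is cooked. So let us fill in that value with the mean of the column, computed without the outlier.

Mean = (35 + 30) / 2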

mean = (35 + 30)/ 2
print('Mean: ',mean)	#it is good to add some description to the output, so we know what is what

   Mean:  32.5

print(df[df['Ages'] < 120])
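
The filter above keeps only the rows with a plausible age. One way to actually replace the outlier with the mean of the remaining ages is sketched below; it reuses the threshold of 120 from above, and the names ages_df, is_outlier, and mean_age are illustrative.

```python
import pandas as pd

ages_df = pd.DataFrame({
    'Name': ['Asabeneh', 'David', 'John'],
    'Ages': [251, 35, 30],
})

# Boolean mask: True for rows whose age is implausibly large
is_outlier = ages_df['Ages'] > 120

# Mean of the remaining ages, computed without the outlier
mean_age = ages_df.loc[~is_outlier, 'Ages'].mean()

# Convert to float first so the column can hold a fractional mean
ages_df['Ages'] = ages_df['Ages'].astype('float64')
ages_df.loc[is_outlier, 'Ages'] = mean_age
print(ages_df)
```

Computing the mean from the data rather than typing the two values by hand keeps the step correct if more rows are added later.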

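Exercises: Day 25

Read the hacker_news.csv file from the data directory
Get the first five rows
Get the last five rows
Get the title column as a pandas Series
Count the number of rows and columns
- Filter the titles which contain python
- Filter the titles which contain JavaScript
- Explore the data and make sense of it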

🎉 Congratulations! 🎉
