Gitmodel|数据分析 Task1

Pandas 动手学入门

本课程基于DataWhale开源教程《Joyful-Pandas》开源项目 ajcr/100-pandas-puzzles中译版开发. 在数据分析的实践中,Pandas是一个非常重要的库,它集成了非常多的统计分析功能,《Joyful-Pandas》是一份非常详尽,并且适合新手的Pandas中文教程,同时它也是Pandas官方认证的中文教程!但笔者学习过程中,始终觉得编程学习“纸上得来终觉浅”,对于刚接触一个新的库的“躬行”笔者认为最好的方式就是对照着官方文档以及QuickStart吧例子自己尝试一遍,并且在实践中学会如何查文档,这是一个开发者必备的技能.因此本教程就是基于这样的思想,给出对应的操作以及对应的文档地址,学习者需要自行查阅文档,完成下面的实践.

餐前准备

在这部分中,我们主要学习如何导入Pandas以及查看版本信息,难度是比较简单的.

请你阅读Joyful-Pandas第二章,完成下方练习.
http://joyfulpandas.datawhale.club/Content/ch2.html

1.pd 别名导入 pandas 库

import pandas as pd

2. 打印出pandas 库的版本信息

print('pandas库的版本信息是:{}'.format(pd.__version__))
pandas库的版本信息是:1.2.4

3. 打印 pandas 依赖包及其版本信息

pd.show_versions(True)
{'system': {'commit': '2cb96529396d93b46abab7bbc73a208e708c642e', 'python': '3.8.8.final.0', 'python-bits': 64, 'OS': 'Windows', 'OS-release': '10', 'Version': '10.0.19041', 'machine': 'AMD64', 'processor': 'Intel64 Family 6 Model 165 Stepping 5, GenuineIntel', 'byteorder': 'little', 'LC_ALL': None, 'LANG': None, 'LOCALE': {'language-code': 'Chinese (Simplified)_China', 'encoding': '936'}}, 'dependencies': {'pandas': '1.2.4', 'numpy': '1.22.3', 'pytz': '2021.1', 'dateutil': '2.8.1', 'pip': '21.0.1', 'setuptools': '52.0.0.post20210125', 'Cython': '0.29.23', 'pytest': '6.2.3', 'hypothesis': None, 'sphinx': '4.0.1', 'blosc': None, 'feather': None, 'xlsxwriter': '1.3.8', 'lxml.etree': '4.6.3', 'html5lib': '1.1', 'pymysql': None, 'psycopg2': None, 'jinja2': '2.11.3', 'IPython': '7.22.0', 'pandas_datareader': None, 'bs4': '4.9.3', 'bottleneck': '1.3.2', 'fsspec': '0.9.0', 'fastparquet': None, 'gcsfs': None, 'matplotlib': '3.3.4', 'numexpr': '2.7.3', 'odfpy': None, 'openpyxl': '3.0.7', 'pandas_gbq': None, 'pyarrow': None, 'pyxlsb': None, 's3fs': None, 'scipy': '1.6.2', 'sqlalchemy': '1.4.7', 'tables': '3.6.1', 'tabulate': None, 'xarray': None, 'xlrd': '2.0.1', 'xlwt': '1.3.0', 'numba': '0.53.1'}}

备齐食材

在这部分中,主要介绍了Pandas一个重要的数据结构DataFrame

请你阅读Joyful-Pandas第二章,完成下方练习.
http://joyfulpandas.datawhale.club/Content/ch2.html

4. 使用数据 data 和行索引 labels 创建一个 DataFrame df

import numpy as np

有下面这样的一个数据字典 data 以及列表格式的标签数据 labels:

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df
animalagevisitspriority
acat2.51yes
bcat3.03yes
csnake0.52no
ddogNaN3yes
edog5.02no
fcat2.03no
gsnake4.51no
hcatNaN1yes
idog7.02no
jdog3.01no

5. 显示该 DataFrame 及其数据相关的基本信息(提示:DataFrame 直接调用的方法

df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   animal    10 non-null     object 
 1   age       8 non-null      float64
 2   visits    10 non-null     int64  
 3   priority  10 non-null     object 
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes
df.describe()
agevisits
count8.00000010.000000
mean3.4375001.900000
std2.0077970.875595
min0.5000001.000000
25%2.3750001.000000
50%3.0000002.000000
75%4.6250002.750000
max7.0000003.000000

6. 返回 DataFrame df 的前4行数据

df.head(4)
animalagevisitspriority
acat2.51yes
bcat3.03yes
csnake0.52no
ddogNaN3yes

7. 从 DataFrame df 选择标签为 animalage 的列

df[['animal','age']]
animalage
acat2.5
bcat3.0
csnake0.5
ddogNaN
edog5.0
fcat2.0
gsnake4.5
hcatNaN
idog7.0
jdog3.0

8.[3, 4, 8] 行中,列为 ['animal', 'age'] 的数据

df.iloc[[3,4,8]][['animal','age']]
animalage
ddogNaN
edog5.0
idog7.0
df.loc[df.index[[3,4,8]],['animal','age']]
animalage
ddogNaN
edog5.0
idog7.0

9. 选择列visits大于 1 的行

df[df['visits']>1]
animalagevisitspriority
bcat3.03yes
csnake0.52no
ddogNaN3yes
edog5.02no
fcat2.03no
idog7.02no
df.loc[df['visits']>1]
animalagevisitspriority
bcat3.03yes
csnake0.52no
ddogNaN3yes
edog5.02no
fcat2.03no
idog7.02no

10. 选择 age 为缺失值的行

df[df['age'].isnull()]
animalagevisitspriority
ddogNaN3yes
hcatNaN1yes
df[df['age'].isna()]
animalagevisitspriority
ddogNaN3yes
hcatNaN1yes

11. 选择 animal 是cat且age 小于 3 的行

df[(df['animal']=='cat') & (df['age']<3)]
animalagevisitspriority
acat2.51yes
fcat2.03no
df.query('animal=="cat" and age<3')
animalagevisitspriority
acat2.51yes
fcat2.03no

12. 选择 age 在 2 到 4 之间的数据(包含边界值)

df[df['age'].between(2,4)]
animalagevisitspriority
acat2.51yes
bcat3.03yes
fcat2.03no
jdog3.01no
df.query('2<=age<=4')
animalagevisitspriority
acat2.51yes
bcat3.03yes
fcat2.03no
jdog3.01no

13. 将 ‘f’ 行的 age 改为 1.5

df.loc['f','age']=1.5
df
animalagevisitspriority
acat2.51yes
bcat3.03yes
csnake0.52no
ddogNaN3yes
edog5.02no
fcat1.53no
gsnake4.51no
hcatNaN1yes
idog7.02no
jdog3.01no

14.visits 列的数据求和

df['visits'].sum()
19

15. 计算每种 animal age 的平均值

df.groupby('animal')['age'].mean()
animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64

16. 新增一行数据 k,数据自定义,然后再删除新追加的 k 行

df.loc['k'] = [6.0,'snake',2,'no']
df
animalagevisitspriority
acat2.51yes
bcat3.03yes
csnake0.52no
ddogNaN3yes
edog5.02no
fcat1.53no
gsnake4.51no
hcatNaN1yes
idog7.02no
jdog3.01no
k6.0snake2no
df.drop('k',inplace=True)

17. 统计每种 animal 的个数

df['animal'].value_counts()
cat      4
dog      4
snake    2
Name: animal, dtype: int64
df.groupby('animal')['animal'].count()
animal
cat      4
dog      4
snake    2
Name: animal, dtype: int64

18. 先根据 age 降序排列,再根据 visits 升序排列(结果 i 列在前面,d 列在最后面)

df.sort_values(by=['age','visits'], ascending=[False, True], inplace=True)
df
animalagevisitspriority
idog7.02no
edog5.02no
gsnake4.51no
jdog3.01no
bcat3.03yes
acat2.51yes
fcat1.53no
csnake0.52no
hcatNaN1yes
ddogNaN3yes

19.priority 列的 yesnoTrueFalse 替换

df['priority'] = df['priority'].map({'yes':True,'no':False})
df
animalagevisitspriority
idog7.02False
edog5.02False
gsnake4.51False
jdog3.01False
bcat3.03True
acat2.51True
fcat1.53False
csnake0.52False
hcatNaN1True
ddogNaN3True

20.animal 列的 snakepython 替换

df['animal'] = df['animal'].replace('snake','python')

21. 对于每种 animalvisit,求出平均年龄。换句话说,每一行都是动物,每一列都是访问次数,其值是平均年龄(提示:使用数据透视表)

df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')
visits123
animal
cat2.5NaN2.5
dog3.06.0NaN
python4.50.5NaN

22. 读取data文件夹下的boston.csv文件,并尝试利用上面的方法自行分析,得出一些简单的结论.

df2 = pd.read_csv('./data/boston.csv')
df2
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
00.0063218.02.3100.5386.57565.24.09001296.015.3396.904.9824.0
10.027310.07.0700.4696.42178.94.96712242.017.8396.909.1421.6
20.027290.07.0700.4697.18561.14.96712242.017.8392.834.0334.7
30.032370.02.1800.4586.99845.86.06223222.018.7394.632.9433.4
40.069050.02.1800.4587.14754.26.06223222.018.7396.905.3336.2
.............................................
5010.062630.011.9300.5736.59369.12.47861273.021.0391.999.6722.4
5020.045270.011.9300.5736.12076.72.28751273.021.0396.909.0820.6
5030.060760.011.9300.5736.97691.02.16751273.021.0396.905.6423.9
5040.109590.011.9300.5736.79489.32.38891273.021.0393.456.4822.0
5050.047410.011.9300.5736.03080.82.50501273.021.0396.907.8811.9

506 rows × 14 columns

df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
df2.describe()
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATMEDV
count506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000
mean3.61352411.36363611.1367790.0691700.5546956.28463468.5749013.7950439.549407408.23715418.455534356.67403212.65306322.532806
std8.60154523.3224536.8603530.2539940.1158780.70261728.1488612.1057108.707259168.5371162.16494691.2948647.1410629.197104
min0.0063200.0000000.4600000.0000000.3850003.5610002.9000001.1296001.000000187.00000012.6000000.3200001.7300005.000000
25%0.0820450.0000005.1900000.0000000.4490005.88550045.0250002.1001754.000000279.00000017.400000375.3775006.95000017.025000
50%0.2565100.0000009.6900000.0000000.5380006.20850077.5000003.2074505.000000330.00000019.050000391.44000011.36000021.200000
75%3.67708312.50000018.1000000.0000000.6240006.62350094.0750005.18842524.000000666.00000020.200000396.22500016.95500025.000000
max88.976200100.00000027.7400001.0000000.8710008.780000100.00000012.12650024.000000711.00000022.000000396.90000037.97000050.000000

从df2.info()可以看出14列数据种不存在缺失值,从de2.describe()中可以看出数据最小值为0,也就是数据不存在负值,另外诸如ZH列、TAX列以及B列数据分布范围较广且方差较大,可能存在异常值,需要通过统计分析进一步进行处理。

至此,相信你已经可以通过阅读文档,掌握了pandas中DataFrame的一些基本操作,下面我们来讲点更进一步的操作,它是上述这些操作的一些巧妙结合

食材搭配

在这个部分中,我们主要介绍对于DataFrame中的数据的一些操作:去重、数据计算、取值······

请你阅读Joyful-Pandas第二章,完成下方练习. http://joyfulpandas.datawhale.club/Content/ch2.html

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df
A
01
12
22
33
44
55
65
75
86
97
107

23. 请用pandas操作输出上面的dfA列出现的元素的唯一值(即:出现过的所有元素的集合)

df['A'].unique()
array([1, 2, 3, 4, 5, 6, 7], dtype=int64)

24. 输出df中的唯一值及其对应出现的频数

df['A'].value_counts()
5    3
2    2
7    2
1    1
3    1
4    1
6    1
Name: A, dtype: int64

25.df进行数据降重

df.drop_duplicates('A')
A
01
12
33
44
55
86
97

26. 给定一组随机数据

df = pd.DataFrame(np.random.random(size=(5, 3))) 
df
012
00.5802410.6320420.354775
10.1225040.7848290.920090
20.2317890.5972950.149788
30.5680110.1359240.011110
40.3534840.8352160.300393

使每个元素减去所在行的平均值?

df.sub(df.mean(1),axis=0)
012
00.0578880.109689-0.167577
1-0.4866370.1756880.310949
2-0.0945020.271004-0.176502
30.329663-0.102424-0.227239
4-0.1428810.338852-0.195971

27. 返回下列df数字总和最小那列的标签

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
df
abcdefghij
00.6814720.3754330.1578660.8446100.5151490.8431950.4828670.0193930.1468550.819014
10.7181390.8290710.2409660.9955750.3883200.1786180.3797830.4181370.5495390.939299
20.2991620.3364600.4455970.9353200.9647800.6695800.1500170.9835470.0287380.398523
30.1555460.8609100.7193350.9891130.0326040.1804920.8674430.0213310.5593670.466884
40.1811800.0625310.8130160.9820590.1136490.4121840.8570990.5643380.4204190.322087
df.sum(axis=0).sort_values().index[0]
'i'
df.sum(axis=0).idxmin()
'i'

28. 计算一个 DataFrame 有多少不重复的行?

df = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
df
012
0100
1101
2101
3001
4011
5111
6010
7111
8011
9111
df.drop_duplicates().shape[0]
6

29. DataFrame 数据如下,A 和 B 都是 0-100 之间(包括边界值)的数值,对 A 进行分段分组(i.e. (0, 10], (10, 20], …),求每组内 B 的和。输出应该和下述一致:

A
(0, 10]      635
(10, 20]     360
(20, 30]     315
(30, 40]     306
(40, 50]     750
(50, 60]     284
(60, 70]     424
(70, 80]     526
(80, 90]     835
(90, 100]    852
df = pd.DataFrame(np.random.RandomState(8765).randint(1, 101, 
                  size=(100, 2)), 
                  columns = ["A", "B"])
df
AB
04629
17522
24963
33343
47175
.........
956087
965740
978619
985056
999794

100 rows × 2 columns

df.groupby(pd.cut(df['A'], bins=range(0, 110, 10), right=True))['B'].sum()
A
(0, 10]      635
(10, 20]     360
(20, 30]     315
(30, 40]     306
(40, 50]     750
(50, 60]     284
(60, 70]     424
(70, 80]     526
(80, 90]     835
(90, 100]    852
Name: B, dtype: int32

用pandas进行数据清洗

下面的df是我们用到的数据集

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})
df
From_ToFlightNumberRecentDelaysAirline
0LoNDon_paris10045.0[23, 47]KLM(!)
1MAdrid_miLANNaN[]<Air France> (12)
2londON_StockhOlm10065.0[24, 43, 87](British Airways. )
3Budapest_PaRisNaN[13]12. Air France
4Brussels_londOn10085.0[67, 32]"Swiss Air"

30. FlightNumber列中的某些值缺失(它们是NaN)。这些数字是有规律的,即每行增加 10,因此NaN需要放置 10055 和 10075。修改df以填充这些缺失的数字并使该列成为整数列(而不是浮点列)

df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)  # 插值函数为linear
df
From_ToFlightNumberRecentDelaysAirline
0LoNDon_paris10045[23, 47]KLM(!)
1MAdrid_miLAN10055[]<Air France> (12)
2londON_StockhOlm10065[24, 43, 87](British Airways. )
3Budapest_PaRis10075[13]12. Air France
4Brussels_londOn10085[67, 32]"Swiss Air"

31. From_To列作为两个单独的列会更好!拆分下划线分隔符_前后的每个字符串. 将其拆分成两列,存放在一个名为“temp”的临时 DataFrame,将列名 ‘From’ 和 ‘To’ 分配给这个临时DataFrame.

temp = df['From_To'].str.split('_',expand=True)
temp.columns = ['From','To']
temp
FromTo
0LoNDonparis
1MAdridmiLAN
2londONStockhOlm
3BudapestPaRis
4BrusselslondOn

32. 注意城市名称的大小写是混合在一起的。标准化字符串,以便只有第一个字母是大写的(例如“londON”应该变成“London”。)

temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()
temp
FromTo
0LondonParis
1MadridMilan
2LondonStockholm
3BudapestParis
4BrusselsLondon

33.From_To列从df中删去,将temp处理好的数据合并到df

df = df.drop('From_To', axis=1)
df2 = pd.concat([df, temp], axis=1)
df2
FlightNumberRecentDelaysAirlineFromTo
010045[23, 47]KLM(!)LondonParis
110055[]<Air France> (12)MadridMilan
210065[24, 43, 87](British Airways. )LondonStockholm
310075[13]12. Air FranceBudapestParis
410085[67, 32]"Swiss Air"BrusselsLondon

**34.**在AirLine列中,您可以看到航空公司名称周围出现了一些额外的符号。只提取航空公司名称。例如’(British Airways. )‘应该变成’British Airways’.

df2['Airline'] = df2['Airline'].str.extract('([a-zA-Z\s]+)', expand=False).str.strip()

df2
FlightNumberRecentDelaysAirlineFromTo
010045[23, 47]KLMLondonParis
110055[]Air FranceMadridMilan
210065[24, 43, 87]British AirwaysLondonStockholm
310075[13]Air FranceBudapestParis
410085[67, 32]Swiss AirBrusselsLondon

35. 在 RecentDelays 列中,值已作为列表输入到 DataFrame 中。我们希望每个第一个值在它自己的列中,每个第二个值在它自己的列中,依此类推。如果没有第 N 个值,则该值应为 NaN。

将 Series 列表展开为名为 的 DataFrame delays,重命名列delay_1,delay_2等等,并将不需要的 RecentDelays 列替换df为delays

df3=df2['RecentDelays'].apply(pd.Series)
df3
012
023.047.0NaN
1NaNNaNNaN
224.043.087.0
313.0NaNNaN
467.032.0NaN
df3.columns = ['delay_{}'.format(n) for n in range(1, len(df3.columns)+1)]  
df2 = df2.drop('RecentDelays', axis=1).join(df3) 
df2
FlightNumberAirlineFromTodelay_1delay_2delay_3
010045KLMLondonParis23.047.0NaN
110055Air FranceMadridMilanNaNNaNNaN
210065British AirwaysLondonStockholm24.043.087.0
310075Air FranceBudapestParis13.0NaNNaN
410085Swiss AirBrusselsLondon67.032.0NaN

43, 87]
British Airways
London
Stockholm


3
10075
[13]
Air France
Budapest
Paris


4
10085
[67, 32]
Swiss Air
Brussels
London

35. 在 RecentDelays 列中,值已作为列表输入到 DataFrame 中。我们希望每个第一个值在它自己的列中,每个第二个值在它自己的列中,依此类推。如果没有第 N 个值,则该值应为 NaN。

将 Series 列表展开为名为 的 DataFrame delays,重命名列delay_1,delay_2等等,并将不需要的 RecentDelays 列替换df为delays

df3=df2['RecentDelays'].apply(pd.Series)
df3
012
023.047.0NaN
1NaNNaNNaN
224.043.087.0
313.0NaNNaN
467.032.0NaN
df3.columns = ['delay_{}'.format(n) for n in range(1, len(df3.columns)+1)]  
df2 = df2.drop('RecentDelays', axis=1).join(df3) 
df2
FlightNumberAirlineFromTodelay_1delay_2delay_3
010045KLMLondonParis23.047.0NaN
110055Air FranceMadridMilanNaNNaNNaN
210065British AirwaysLondonStockholm24.043.087.0
310075Air FranceBudapestParis13.0NaNNaN
410085Swiss AirBrusselsLondon67.032.0NaN
  • 1
    点赞
  • 22
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值