9 Pandas: DataFrame & Data Visualization

@一夜看尽长安花

Published 2024-08-11 11:10:02

Article link: https://blog.csdn.net/ta683280/article/details/141102869

Overview: an introduction to the Pandas DataFrame and to data visualization with pandas.

Keywords: Pandas, DataFrame, data visualization

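Contents:

DataFrames

DataFrame Selection and Slicing

indexing, selecting, slicing

conditional selection (boolean arrays)

Dropping data

Broadcasting in operations

DataFrame Operations: Adding Columns, Renaming Rows and Columns, the inplace Parameter

renaming columns

DataFrame Operations: Adding Rows, Creating New Columns from Existing Ones, Setting a Column as the Index, head, tail, describe

Adding rows

Creating a new column from other columns

Viewing the head and tail

Summary statistics

Reading a Local CSV File with Pandas

reading external data

Adding your own column names

Data Visualization with Pandas

Plotting with pandas

Adjusting the figure size

Restricting rows and columns and adding a title

Other plot types

Sorting a DataFrame by Value or Index, and the apply Function

sorting and functions

Combining DataFrames with merge, and the dropna and fillna Functions

join, merge

Handling missing values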

DataFrames

import numpy as np
import pandas as pd

df = pd.DataFrame({
  'Population': [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523],
  'GDP': [
    1785387,
    2833687,
    3874437,
    2167744,
    4602367,
    2950039,
    17348075
   ],
  'Surface Area': [
    9984670,
    640679,
    357114,
    301336,
    377930,
    242495,
    9525067
   ],
  'HDI': [
    0.913,
    0.888,
    0.916,
    0.873,
    0.891,
    0.907,
    0.915
   ],
  'Continent': [
    'America',
    'Europe',
    'Europe',
    'Europe',
    'Asia',
    'Europe',
    'America'
   ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])


print(df)  # Rows and columns, like a table. Each column of a DataFrame is a Series, so a DataFrame can be seen as a collection of Series that share one index


# As with a Series, we can assign an index
df.index = [
  'canada',
  'France',
  'Germany',
  'Italy',
  'Japan',
  'United Kingdom',
  'United States'
]

print("-----------------------------------------------------------------------------------------")
print(df)
print("-----------------------------------------------------------------------------------------")
print(df.columns)
print("-----------------------------------------------------------------------------------------")
print(df.index)
print("-----------------------------------------------------------------------------------------")
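df.info()  # reports each column's dtype and non-null count, which helps with data cleaning; info() prints by itself and returns None, so it is not wrapped in print()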
print("-----------------------------------------------------------------------------------------")
print(df.size)
print("-----------------------------------------------------------------------------------------")
print(df.shape)
print("-----------------------------------------------------------------------------------------")
print(df.describe())  # summary statistics for the numeric columns
print("-----------------------------------------------------------------------------------------")
print(df.dtypes)  # dtype of each column
print("-----------------------------------------------------------------------------------------")
print(df.dtypes.value_counts())
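
The remark above that every column is a Series can be checked directly. The following is a minimal sketch, assuming a small two-row frame (the names `population`, `hdi` and `small` are illustrative and not part of the original code):

```python
import pandas as pd

# Two illustrative Series that share the same index labels.
population = pd.Series([35.467, 63.951], index=['Canada', 'France'], name='Population')
hdi = pd.Series([0.913, 0.888], index=['Canada', 'France'], name='HDI')

# A DataFrame can be built from a dict of Series; each Series becomes a column.
small = pd.DataFrame({'Population': population, 'HDI': hdi})

column = small['HDI']                      # selecting one column gives back a Series
print(type(column).__name__)               # Series
print(column.index.equals(small.index))    # True: the column shares the frame's index
```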

DataFrame Selection and Slicing

indexing, selecting, slicing

print(df)
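print(df.loc['canada'])  # select a whole row by label
print(df.iloc[-1])  # select the last row by position
print(df['Population'])  # select a whole column
# Whether we select a single row or a single column, the result is a Series
print(df['Population'].to_frame())  # convert the one-dimensional Series back into a table (DataFrame)
# multiple indexing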
print(df[['Population', 'GDP']])
print(df[1:3])
print(df.loc['Italy'])
print(df.loc['France': 'Italy'])
print("---------------------------------------------------------")
# loc and iloc are the usual tools
# rows and columns:
# both dimensions can be selected at once
print(df.loc['France':'Italy', 'Population'])
print(df.loc['France':'Italy', ['Population','GDP']])

# The same works with iloc
print(df)
print(df.iloc[0])
print(df.iloc[-1])

print(df.iloc[[0, 1, -1]])
print(df.iloc[1:3])
print(df.iloc[1:3, 3])
print(df.iloc[1:3, [0,3]])
print(df.iloc[1:3, 1:3])
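
One detail of the slices above is easy to miss: a label slice with loc includes its end label, while a position slice with iloc excludes its end position. A minimal sketch, assuming a hypothetical four-row frame named `demo`:

```python
import pandas as pd

# Illustrative frame with string labels as the index.
demo = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = demo.loc['b':'c']          # label slice: both 'b' and 'c' are included
by_position = demo.iloc[1:3]          # position slice: rows 1 and 2, position 3 is excluded
by_position_short = demo.iloc[1:2]    # only the row at position 1

print(by_label.index.tolist())            # ['b', 'c']
print(by_position.index.tolist())         # ['b', 'c']
print(by_position_short.index.tolist())   # ['b']
```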

DataFrame Operations: Boolean Selection, Dropping Data, Broadcasting

conditional selection (boolean arrays)

print(df)
print(df['Population'] > 70)  # a boolean Series, True where the condition holds
print(df.loc[df['Population'] > 70])  # keep only the rows where it is True
print(df.loc[df['Population'] > 70, 'Population'])
print(df.loc[df['Population'] > 70, ['Population', 'GDP']])
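
Several conditions can be combined in one selection. The sketch below assumes a reduced copy of the country data called `demo` (an illustrative name); the operators `&`, `|` and `~` work element by element, and each comparison needs its own parentheses:

```python
import pandas as pd

demo = pd.DataFrame(
    {'Population': [35.467, 80.940, 127.061, 318.523],
     'Continent': ['America', 'Europe', 'Asia', 'America']},
    index=['Canada', 'Germany', 'Japan', 'United States'])

# & means "and", | means "or", ~ means "not".
big_american = demo.loc[(demo['Population'] > 70) & (demo['Continent'] == 'America')]
small_or_asian = demo.loc[(demo['Population'] < 70) | (demo['Continent'] == 'Asia')]
not_european = demo.loc[~(demo['Continent'] == 'Europe')]

print(big_american.index.tolist())    # ['United States']
print(small_or_asian.index.tolist())  # ['Canada', 'Japan']
print(not_european.index.tolist())    # ['Canada', 'Japan', 'United States']
```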

Dropping data

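print(df)
print(df.drop('canada'))  # drop one row
print(df.drop(['canada', 'Japan']))  # drop two rows
print(df.drop(columns=['Population', 'HDI']))  # drop columns
print(df.drop(['Italy', 'canada'], axis=0))  # drop rows, with the axis spelled out
print(df.drop(['Population', 'HDI'], axis=1))  # drop columns
print(df.drop(['Italy', 'canada'], axis='rows'))  # drop rows
print(df.drop(['Population', 'HDI'], axis='columns'))  # drop columns

# inplace=True removes the row from df itself instead of returning a new DataFrame.
# The 'China' row is only added in a later section, so errors='ignore' avoids a KeyError when it is not there yet.
df.drop('China', inplace=True, errors='ignore')
print(df)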

Broadcasting in operations

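print(df)
print(df[['Population', 'GDP']])
print(df[['Population', 'GDP']] / 100)  # the scalar is applied to every cell
print("-----------------------------------------")
# Broadcasting: the Series is matched to the columns by label and applied to every row
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])
print(crisis)
print(df[['GDP','HDI']] + crisis)
print(df)  # df itself is unchanged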
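
The broadcasting step relies on label alignment: each value in the Series is matched to the column with the same name. A minimal sketch, assuming a two-row frame named `demo` (illustrative), which also shows what happens when a column has no matching entry:

```python
import pandas as pd

demo = pd.DataFrame({'GDP': [1785387, 2833687], 'HDI': [0.913, 0.888]},
                    index=['Canada', 'France'])
crisis = pd.Series([-1_000_000, -0.3], index=['GDP', 'HDI'])

# Each Series value is added to the column carrying the same label, row by row.
adjusted = demo + crisis
print(adjusted)

# A column with no matching label in the Series cannot be computed and becomes NaN.
partial = demo + pd.Series([-0.3], index=['HDI'])
print(partial['GDP'].isna().all())   # True
```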

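Modifying the DataFrame: the selection, drop, and arithmetic operations above all return a new DataFrame and leave the original unchanged, unless inplace=True is passed.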

DataFrame Operations: Adding Columns, Renaming Rows and Columns, the inplace Parameter

Adding a column

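# Add a new column
langs = pd.Series(['French', 'German', 'Italian'], index=['France', 'Germany', 'Italy'], name='Language')
df['Language'] = langs  # create the column; values are matched to rows by index label
print(df)  # only three countries have a language; the other rows get NaN (missing)
df['Language'] = 'English'  # assigning a scalar overwrites every value in the column
print(df)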

renaming columns

print(df)
print("-----------------------------------------")
print(df.rename(columns={'HDI': 'Human Development Index', 'Annual Popcorn Consumption': 'APC'},
                index={'United States': 'USA', 'United Kingdom': 'UK', 'Argentina': 'AR'}))
print("-----------------------------------------")
print(df)  # labels that do not exist are silently skipped; rename returns a new DataFrame and the original is unchanged
print("-----------------------------------------")
print(df.rename(index=str.upper))
print("-----------------------------------------")
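
The inplace parameter named in the section title changes where the result of rename ends up. A minimal sketch, assuming a small illustrative frame called `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'HDI': [0.913, 0.915]}, index=['Canada', 'United States'])

# Default behaviour: a new DataFrame is returned, demo keeps its old labels.
renamed = demo.rename(columns={'HDI': 'Human Development Index'})
print(demo.columns.tolist())      # ['HDI']
print(renamed.columns.tolist())   # ['Human Development Index']

# With inplace=True the call modifies demo itself and returns None.
result = demo.rename(index={'United States': 'USA'}, inplace=True)
print(result)                     # None
print(demo.index.tolist())        # ['Canada', 'USA']
```

Because the in-place call returns None, writing `demo = demo.rename(..., inplace=True)` would throw the data away; use either reassignment or inplace=True, not both.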

DataFrame Operations: Adding Rows, Creating New Columns from Existing Ones, Setting a Column as the Index, head, tail, describe

Adding rows

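# Append a row; this returns a new DataFrame.
# Note: _append is a private pandas method (the public DataFrame.append was removed in pandas 2.0); pd.concat is the public alternative.
print(df._append(pd.Series({
    'Population': 3,
    'GDP': 5
}, name='China')))
print(df)
print("-----------------------------------------------")
# Assigning to a new label with loc adds the row to df itself
df.loc['China'] = pd.Series({'Population': 1_400_000_000, 'Continent': 'Asia'})
print(df)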

Output (this run used the default integer index 0-6 rather than the country names):

       Population       GDP  Surface Area    HDI Continent
0          35.467   1785387     9984670.0  0.913   America
1          63.951   2833687      640679.0  0.888    Europe
2          80.940   3874437      357114.0  0.916    Europe
3          60.665   2167744      301336.0  0.873    Europe
4         127.061   4602367      377930.0  0.891      Asia
5          64.511   2950039      242495.0  0.907    Europe
6         318.523  17348075     9525067.0  0.915   America
China       3.000         5           NaN    NaN       NaN

   Population       GDP  Surface Area    HDI Continent
0      35.467   1785387       9984670  0.913   America
1      63.951   2833687        640679  0.888    Europe
2      80.940   3874437        357114  0.916    Europe
3      60.665   2167744        301336  0.873    Europe
4     127.061   4602367        377930  0.891      Asia
5      64.511   2950039        242495  0.907    Europe
6     318.523  17348075       9525067  0.915   America

-----------------------------------------------

         Population         GDP  Surface Area    HDI Continent
0      3.546700e+01   1785387.0     9984670.0  0.913   America
1      6.395100e+01   2833687.0      640679.0  0.888    Europe
2      8.094000e+01   3874437.0      357114.0  0.916    Europe
3      6.066500e+01   2167744.0      301336.0  0.873    Europe
4      1.270610e+02   4602367.0      377930.0  0.891      Asia
5      6.451100e+01   2950039.0      242495.0  0.907    Europe
6      3.185230e+02  17348075.0     9525067.0  0.915   America
China  1.400000e+09         NaN           NaN    NaN      Asia
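
Since _append is a private method, a version of the same row addition using the public pd.concat function may be safer across pandas releases. This is a sketch under stated assumptions, with a reduced two-column frame named `demo` and an illustrative `new_row`; it is not the author's original code:

```python
import pandas as pd

demo = pd.DataFrame({'Population': [35.467, 63.951], 'GDP': [1785387, 2833687]},
                    index=['Canada', 'France'])

# Build the new row as a one-row DataFrame whose index label is the new row name.
new_row = pd.DataFrame({'Population': [3.0], 'GDP': [5]}, index=['China'])

# concat stacks the frames vertically and returns a new frame; demo is unchanged.
extended = pd.concat([demo, new_row])

print(extended.index.tolist())   # ['Canada', 'France', 'China']
print(demo.shape)                # (2, 2)
```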

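In pandas, reset_index and set_index both return a new DataFrame; the original is not modified unless inplace=True is passed. To make an index change stick, either assign the result to a variable (a new name, or df itself) or pass inplace=True to modify the original DataFrame directly.

The following example uses both methods and prints the results: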

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'Population': [1, 2, 3],
    'GDP': [1000, 2000, 3000]
}, index=['A', 'B', 'C'])

print("Original DataFrame:")
print(df)

# reset_index moves the current index into a column and restores the default integer index
df_reset = df.reset_index()
print("\nDataFrame after reset_index:")
print(df_reset)

# set_index makes a column the new index
df_set_index = df_reset.set_index('Population')
print("\nDataFrame after set_index:")
print(df_set_index)

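In this example:

df_reset is the DataFrame after the index reset: the old index (A, B, C) becomes a regular column named index, and the rows get the default integer index.

df_set_index is the DataFrame with the 'Population' column set as the new index.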

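To modify the original DataFrame directly instead:

df.reset_index(inplace=True)
df.set_index('Population', inplace=True)
# Both changes are applied to df itself; no new DataFrame is created.
# 'Population' is now the index rather than a column, so call df.reset_index() again before using df['Population'].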

Creating a new column from other columns

print(df)
print(df[['Population', 'GDP']])
print(df['GDP']/df['Population'])
print("-----------------------------------------")
# Assigning with = stores the result as a new column in df itself
df['GDP Per capita'] = df['GDP'] / df['Population']

print(df)

Viewing the head and tail

print(df.head(n=3))
print(df.tail(n=10))

Summary statistics

print(df)
print(df.describe())

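population = df['Population']
print(population.min())
print(population.max())
print(population.mean())
print(population.std())  # standard deviation
print(population.median())  # median
print(population.describe())
print(population.quantile(.25))  # first quartile (25th percentile)
print(population.quantile([.2, .4, .6, .8, 1]))  # several quantiles at once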

Reading a Local CSV File with Pandas

reading external data

import numpy as np
import pandas as pd
df = pd.read_csv('BTC-Daily.csv')
print(df.head())

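The output shows that, by default, pandas treated the first row of the file (the 0301 row) as the column names.

That is wrong for this file, so the call needs to be adjusted.

header=None: tells pandas that the first row of the file does not contain column names. Pandas then treats the first row as data and numbers the columns 0, 1, 2, and so on.
Default behavior: if header is not specified, pandas uses the first row of the file as the column names. This is the right choice when the first row really does contain the column labels.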

import numpy as np
import pandas as pd
df = pd.read_csv('BTC-Daily.csv', header=None)
print(df.head())
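
The effect of header=None can be seen without the real file. The sketch below assumes a made-up two-line CSV held in memory through io.StringIO (the values are invented for illustration and are not taken from BTC-Daily.csv):

```python
import io
import pandas as pd

# Hypothetical CSV text without a header row; the numbers are made up.
raw = "2022-03-01,43194.7\n2022-02-28,37706.0\n"

with_default = pd.read_csv(io.StringIO(raw))                  # first line becomes the column names
without_header = pd.read_csv(io.StringIO(raw), header=None)   # every line is treated as data

print(with_default.shape)                # (1, 2): one data row was lost to the header
print(without_header.shape)              # (2, 2)
print(without_header.columns.tolist())   # [0, 1]
```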

Adding your own column names

df.columns = ['unix','date','symbol','open','high','low','close','Volumn BTC','Volumn USD']
print(df.head())
print(df.shape)

df.info()  # the date column has dtype object (plain strings), not a datetime type; info() prints by itself, so no print() is needed

# pandas has its own datetime type, so convert the column (storing the result in a new column works as well)
df['date'] = pd.to_datetime(df['date'])
print(df.head())
print("--------------------------------------")
df.info()

# Convert the datetime values back to formatted strings
df['date_str'] = df['date'].dt.strftime('%Y-%m-%d %H:%M:%S')

# Print the string-formatted column
print("\nString format:")
print(df['date_str'].head())

# Add another new column
df['Timestamp'] = pd.to_datetime(df['date_str'])
print(df.head())
print(df.dtypes)
df.set_index('Timestamp', inplace=True)
print(df.head())
# A datetime index makes it easy to look up rows by date

print(df.loc['2022-02-28'])

# A better way: do all of the above in a single read_csv call
df = pd.read_csv(
  'BTC-Daily.csv',
  header=None,
  names=['unix','date','symbol','open','high','low','close','Volumn BTC','Volumn USD'],
  parse_dates=True,  # parse the index (the date column) as datetime
  index_col=1  # use the second column (date) as the index
)
print(df.head())
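
The same combination of names, index_col and parse_dates can be tried on in-memory text. This is a minimal sketch, assuming three invented rows with only three columns (`unix`, `date`, `close`); it is not the layout of the real file:

```python
import io
import pandas as pd

# Hypothetical rows in newest-first order; the prices are made up.
raw = (
    "1646092800,2022-03-01 00:00:00,43194.7\n"
    "1646006400,2022-02-28 00:00:00,37706.0\n"
    "1645920000,2022-02-27 00:00:00,39146.6\n"
)

prices = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=['unix', 'date', 'close'],
    index_col='date',     # a column name works as well as a position
    parse_dates=True,     # parse the index as datetime
).sort_index()            # oldest first

# With a datetime index, a partial date string selects every row in that period.
february = prices.loc['2022-02']
print(february)
```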

Data Visualization with Pandas

Plotting with pandas

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

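# Load the data with the single read_csv call from the previous section
df = pd.read_csv(
  'BTC-Daily.csv',
  header=None,
  names=['unix','date','symbol','open','high','low','close','Volumn BTC','Volumn USD'],
  parse_dates=True,  # parse the index (the date column) as datetime
  index_col=1  # use the second column (date) as the index
)
print(df.head())
# Choose which columns to draw as lines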
df[['open','high','low','close']].plot()
plt.show()

Adjusting the figure size

df[['open','high','low','close']].plot(figsize=(12,6))  # figsize is (width, height) in inches

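Restricting rows and columns and adding a title

# df.index.is_monotonic_increasing checks whether the index is sorted in ascending order; sort first so the date-range slice below selects the expected rows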
if not df.index.is_monotonic_increasing:
    df = df.sort_index()

df.loc['2019-01-01':'2022-12-31',['open','high','low','close']].plot(figsize=(12,6),
  title='BitCoin Price 2019-2022')
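
DataFrame.plot returns a matplotlib Axes object, which can be used for further adjustments such as axis labels. A minimal sketch, assuming five invented daily rows in a frame named `demo` and the non-interactive Agg backend so that it also runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; an assumption made for this sketch
import pandas as pd

# Illustrative data: five days of made-up open and close prices.
idx = pd.date_range('2022-01-01', periods=5, freq='D')
demo = pd.DataFrame({'open': [1.0, 2.0, 3.0, 2.5, 3.5],
                     'close': [1.5, 2.5, 2.8, 3.0, 3.2]}, index=idx)

# Restrict the rows by date range and the columns by name, then plot.
ax = demo.loc['2022-01-02':'2022-01-04', ['open', 'close']].plot(
    figsize=(12, 6), title='Demo prices')

# The returned Axes accepts the usual matplotlib calls.
ax.set_xlabel('date')
ax.set_ylabel('price')
```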

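Other plot types

df.plot.hist()  # histogram (frequency distribution)
df.plot.pie(y='close')  # pie chart; on a DataFrame, pie needs y=<column> or subplots=True
df.plot.bar()  # bar chart
df.plot.barh()  # horizontal bar chart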

Sorting a DataFrame by Value or Index, and the apply Function

sorting and functions

import pandas as pd
import numpy as np
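# Load the data with the single read_csv call shown earlier
df = pd.read_csv(
  'BTC-Daily.csv',
  header=None,
  names=['unix','date','symbol','open','high','low','close','Volumn BTC','Volumn USD'],
  parse_dates=True,  # parse the index (the date column) as datetime
  index_col=1  # use the second column (date) as the index
)


# Apply a function to the selected columns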
print(df[['open','high','low','close']].apply(np.sqrt))
print("--------------------------------------------------------")
print(df[['open','high','low','close']].apply(np.sum,axis=0))
print("--------------------------------------------------------")
print(df[['open','high','low','close']].apply(lambda x:x/10))
print("--------------------------------------------------------")
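# sort
df.sort_index(inplace=True)  # sort by index, modifying df itself
print(df)
print("--------------------------------------------------------")
print(df.sort_values(by=['open','close']))  # sort by 'open', then by 'close'; returns a new DataFrame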
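
Two options used above deserve a closer look: the sort direction and the axis argument of apply. A minimal sketch, assuming a three-row frame named `demo` with made-up values:

```python
import pandas as pd

demo = pd.DataFrame({'open': [3.0, 1.0, 2.0], 'close': [2.0, 2.5, 4.0]},
                    index=['c', 'a', 'b'])

by_index = demo.sort_index()                                     # rows ordered by index label
by_close_desc = demo.sort_values(by='close', ascending=False)    # largest close first

# axis=0 (the default) passes each column to the function; axis=1 passes each row.
column_range = demo.apply(lambda col: col.max() - col.min(), axis=0)
row_change = demo.apply(lambda row: row['close'] - row['open'], axis=1)

print(by_index.index.tolist())        # ['a', 'b', 'c']
print(by_close_desc.index.tolist())   # ['b', 'a', 'c']
print(column_range.tolist())          # [2.0, 2.0]
print(row_change.tolist())            # [-1.0, 1.5, 2.0]
```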

Combining DataFrames with merge, and the dropna and fillna Functions

join, merge

import pandas as pd

names = {
  'SSN': [ 2,5,7,8],
  'Name': ['Anna','Bob','John','Mike']
}
df1 = pd.DataFrame(names)

ages = {
'SSN': [1,2,3,4],
'Age': [28, 34, 45, 62]
}

df2 = pd.DataFrame(ages)
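df = pd.merge(df1, df2, on='SSN', how='outer')  # how defaults to 'inner' (intersection of keys); 'left' keeps every row of the first frame, 'right' every row of the second, 'outer' keeps the union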
df.set_index('SSN', inplace=True)
print(df)
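
To compare the four join types side by side, the same two frames can be merged in a loop. This sketch reuses the df1 and df2 data from above; the `rows` and `keys` dictionaries are illustrative helpers:

```python
import pandas as pd

df1 = pd.DataFrame({'SSN': [2, 5, 7, 8], 'Name': ['Anna', 'Bob', 'John', 'Mike']})
df2 = pd.DataFrame({'SSN': [1, 2, 3, 4], 'Age': [28, 34, 45, 62]})

rows = {}
keys = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(df1, df2, on='SSN', how=how)
    rows[how] = len(merged)                  # number of rows in the result
    keys[how] = sorted(merged['SSN'])        # which SSN values survived
    print(how, rows[how], keys[how])
```

Only SSN 2 appears in both frames, so the inner join has a single row, while the outer join keeps all seven distinct SSN values and fills the missing Name or Age with NaN.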

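Handling missing values

import numpy as np
import pandas as pd

# isna
print(df.isna())  # True wherever a value is NaN

# dropna
print(df.dropna())  # drop every row that contains a NaN
print(df.dropna(axis='columns'))  # drop every column that contains a NaN
print(df.dropna(how='all'))  # drop only rows where all values are NaN; the default is how='any'

# fillna
print(df.fillna(-999))  # replace every NaN with -999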
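
A single fill value such as -999 is rarely what every column needs. The sketch below assumes a small frame named `demo` with gaps in both columns and shows two common refinements: restricting dropna to selected columns with subset, and giving fillna a different value per column through a dict:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Name': ['Anna', np.nan, 'John'], 'Age': [34, 28, np.nan]},
                    index=[2, 1, 7])

# Drop a row only when 'Name' is missing; a missing 'Age' is tolerated.
only_named = demo.dropna(subset=['Name'])

# Fill each column with its own replacement: a placeholder text and the mean age.
filled = demo.fillna({'Name': 'unknown', 'Age': demo['Age'].mean()})

print(only_named.index.tolist())   # [2, 7]
print(filled)
```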