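Overview
pandas is a tool built on top of NumPy, created for data analysis tasks. It incorporates a number of libraries and some standard data models, and it provides the tools needed to work with large datasets efficiently. pandas offers a large set of functions and methods for processing data quickly and conveniently, and it is one of the main reasons Python is a powerful and productive environment for data analysis.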
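Data structures
Series: a one-dimensional labelled array, similar to a one-dimensional NumPy array and also close to Python's built-in list. A Series can hold strings, booleans, numbers, and other types; when the values are of mixed types it is stored with the generic object dtype.
Time-Series: a Series indexed by time.
DataFrame: a two-dimensional tabular data structure. Much of its functionality resembles data.frame in R. A DataFrame can be thought of as a container of Series.
Panel: a three-dimensional array that can be thought of as a container of DataFrames. Panel was deprecated in pandas 0.20 and removed in 0.25, so it exists only in older versions.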
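The structures above can be sketched in a few lines. This is a minimal illustration, not part of the original notes; the variable names (`s`, `mixed`, `ts`, `frame`) and the sample values are assumptions chosen for the example.

```python
import pandas as pd

# A Series is a labelled one-dimensional array.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Mixed values are accepted, but the Series falls back to the generic object dtype.
mixed = pd.Series(["text", True, 3.14])

# A time series is a Series whose index is a DatetimeIndex.
ts = pd.Series([1.0, 2.0, 3.0],
               index=pd.date_range("2024-01-01", periods=3, freq="D"))

# A DataFrame behaves like a dict of Series that share one index.
frame = pd.DataFrame({"x": s, "y": s * 2})

print(mixed.dtype)                  # object
print(type(ts.index).__name__)      # DatetimeIndex
print(type(frame["x"]).__name__)    # Series
```

Selecting a single column of `frame` gives back a Series, which is the sense in which a DataFrame is a container of Series.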
Quick reference
- DataFrame() ---- creates a DataFrame object
- df.values ---- returns the data as an ndarray
- df.index ---- returns the row index
- df.columns ---- returns the column index
- df.axes ---- returns both the row and the column index
- df.T ---- transposes rows and columns
- df.info() ---- prints a summary of the DataFrame
- df.head(i) ---- shows the first i rows
- df.tail(i) ---- shows the last i rows
- df.describe() ---- shows per-column summary statistics
- df.drop() ---- drops columns (or rows) from the DataFrame
- df.copy() ---- returns a copy of the DataFrame
- df["feature_1"].value_counts() ---- counts how often each value occurs in a Series
- df.sample(frac=0.33) / df.sample(n=1) ---- draws a random sample of rows, by fraction or by count
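A few of the attributes listed above, tried on a small frame. This is a sketch for orientation only; the variable names are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["one", "two", "three"])

values = df.values              # 2-D numpy.ndarray holding the cell values
row_labels = list(df.index)     # row labels
col_labels = list(df.columns)   # column labels
transposed = df.T               # rows and columns swapped

print(values.shape)             # (3, 3)
print(row_labels)               # [0, 1, 2]
print(col_labels)               # ['one', 'two', 'three']
print(transposed.loc["two", 0]) # 2, the same cell as df.loc[0, "two"]
print(df.head(2))               # first two rows
print(df.tail(1))               # last row
```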
Detailed notes follow:
import pandas as pd
import numpy as np
df = pd.DataFrame(
[[1,2,3],
[4,5,6],
[7,8,9]],columns=['one','two','three'])
print(df)
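'''
   one  two  three
0    1    2      3
1    4    5      6
2    7    8      9
'''
# Delete DataFrame columns
# del df['one']                            # removes the column in place
# print(df.drop(columns=['one','two']))    # returns a new DataFrame; df itself is unchanged
'''
   three
0      3
1      6
2      9
'''
# Rename DataFrame columns
# df.columns = ['a','b','c']   # blunt: overwrites every column name at once
# a = df.rename(columns={'one':1,'two':2,'three':3})   # usually better: maps only the names listed
# print(a)
'''
   1  2  3
0  1  2  3
1  4  5  6
2  7  8  9
'''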
# Inspect the DataFrame's column information
# print(df.info())   # info() prints its report and returns None, hence the trailing "None"
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
one      3 non-null int64
two      3 non-null int64
three    3 non-null int64
dtypes: int64(3)
memory usage: 152.0 bytes
None
'''
# Change a column's dtype
# df['one'] = df['one'].astype(str)
# print(df.info())
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
one      3 non-null object
two      3 non-null int64
three    3 non-null int64
dtypes: int64(2), object(1)
memory usage: 152.0+ bytes
None
'''
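A small sketch of what the conversion changes in practice: after `astype(str)` the values behave as text, so `+` concatenates instead of adding. Passing a dict to `astype` converts only the named column. The variable names are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 4, 7], "two": [2, 5, 8]})

as_text = df.astype({"one": str})        # only column "one" is converted
doubled = as_text["one"] + as_text["one"]  # string concatenation, not addition
back = as_text.astype({"one": "int64"})  # convert back to integers

print(doubled.tolist())        # ['11', '44', '77']
print(back["one"].tolist())    # [1, 4, 7]
```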
# Show summary statistics for the DataFrame
# print(df.describe())
'''
       one  two  three
count  3.0  3.0    3.0
mean   4.0  5.0    6.0
std    3.0  3.0    3.0
min    1.0  2.0    3.0
25%    2.5  3.5    4.5
50%    4.0  5.0    6.0
75%    5.5  6.5    7.5
max    7.0  8.0    9.0
'''
# Select a subset of columns
# print(df['one'])   # a single label returns a one-dimensional Series
'''
0    1
1    4
2    7
Name: one, dtype: int64
'''
# print(df[['one']])   # a list of labels returns a two-dimensional DataFrame
'''
   one
0    1
1    4
2    7
'''
# print(df[['one','two']])
'''
   one  two
0    1    2
1    4    5
2    7    8
'''
# print(df.loc[:,'one'])   # loc selects by label only
'''
0    1
1    4
2    7
Name: one, dtype: int64
'''
# print(df.iloc[:,1])   # iloc selects by integer position only
'''
0    2
1    5
2    8
Name: two, dtype: int64
'''
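The difference between `loc` and `iloc` is easier to see when the row labels are not integers. The sketch below assumes a frame with string row labels `a`, `b`, `c`; the variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["one", "two", "three"], index=["a", "b", "c"])

by_label = df.loc["b", "two"]          # row label "b", column label "two"
by_position = df.iloc[1, 1]            # second row, second column
label_slice = df.loc["a":"b", "one"]   # label slices include the end label
position_slice = df.iloc[0:2, 0]       # position slices exclude the end position

print(by_label, by_position)           # 5 5
print(label_slice.tolist())            # [1, 4]
print(position_slice.tolist())         # [1, 4]
```

Note that `"a":"b"` and `0:2` select the same two rows here, because label slicing is inclusive while positional slicing is not.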
# Iterate over the DataFrame's column names
# for i in df.columns:
#     print(i)
'''
one
two
three
'''
# Concatenate DataFrames
# Side by side (axis=1)
# print(pd.concat([df,df],axis=1))
'''
   one  two  three  one  two  three
0    1    2      3    1    2      3
1    4    5      6    4    5      6
2    7    8      9    7    8      9
'''
# Stacked vertically (axis=0); the original row labels are kept, so they repeat
# print(pd.concat([df,df],axis=0))
'''
   one  two  three
0    1    2      3
1    4    5      6
2    7    8      9
0    1    2      3
1    4    5      6
2    7    8      9
'''
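The repeated row labels in the vertical result can be surprising, since `loc[0]` then matches two rows. One common way around this, shown here as a sketch, is the `ignore_index=True` option of `pd.concat`, which renumbers the rows. The variable names are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=["one", "two", "three"])

stacked = pd.concat([df, df], axis=0)                        # keeps labels 0,1,2,0,1,2
renumbered = pd.concat([df, df], axis=0, ignore_index=True)  # fresh labels 0..5

print(list(stacked.index))      # [0, 1, 2, 0, 1, 2]
print(list(renumbered.index))   # [0, 1, 2, 3, 4, 5]
print(len(stacked.loc[0]))      # 2, because two rows carry the label 0
```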
import pandas as pd
df1 = pd.DataFrame(
[[1,2,3],
['%',5,6],
[1,2,3]],columns=['one','two','three'])
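# Drop duplicate rows (row 2 repeats row 0)
# df1.drop_duplicates(inplace=True)
# print(df1)
'''
   one  two  three
0    1    2      3
1    %    5      6
'''
# df1.replace('%','',inplace = True, regex = True)   # regex=True removes '%' anywhere inside string cells; inplace=True modifies df1 directly
# print(df1)
'''
   one  two  three
0    1    2      3
1         5      6
'''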
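What `regex=True` changes is easiest to see on a value such as `'50%'`, which does not appear in the original data and is assumed here purely for illustration. Without `regex=True`, only cells whose whole value equals `'%'` are replaced.

```python
import pandas as pd

s = pd.Series(["50%", "%", 7])

whole = s.replace("%", "")                 # only the cell equal to "%" changes
partial = s.replace("%", "", regex=True)   # every "%" inside string cells is removed

print(whole.tolist())     # ['50%', '', 7]
print(partial.tolist())   # ['50', '', 7]
```

Non-string cells such as the integer `7` are left untouched in both cases.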
df = pd.DataFrame([[1,2,3],
[4,2,6],
[1,8,9]],columns = ["feature_1", "feature_2", "label"],index=["a","b","c"])
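# df.loc[df['feature_1']==1,'feature_1']=0   # set feature_1 to 0 wherever it equals 1 (df[0] would raise KeyError: no column is labelled 0)
# df.loc[df.feature_2==2,'feature_2']=0     # set feature_2 to 0 wherever it equals 2
print(df)
# The output below is what print(df) shows with only the first assignment uncommented
'''
   feature_1  feature_2  label
a          0          2      3
b          4          2      6
c          0          8      9
'''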
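b = df.copy()
b.drop(columns=['feature_1'],inplace=True)
print(df)   # df is unchanged: copy() returns an independent copy, whereas plain assignment (b = df) only binds a second name to the same object
'''
   feature_1  feature_2  label
a          1          2      3
b          4          2      6
c          1          8      9
'''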
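The contrast between assignment and `copy()` can be checked directly. In this sketch the names `alias` and `snapshot` are hypothetical, picked only to make the two cases easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame({"feature_1": [1, 4, 1], "label": [3, 6, 9]})

alias = df            # plain assignment: both names refer to the same object
snapshot = df.copy()  # copy(): an independent DataFrame (deep copy by default)

alias.loc[0, "label"] = 100     # also visible through df
snapshot.loc[1, "label"] = -1   # does not touch df

print(alias is df)                  # True
print(df["label"].tolist())         # [100, 6, 9]
print(snapshot["label"].tolist())   # [3, -1, 9]
```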
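# Count how often each value occurs in a Series
print(df["feature_1"].value_counts())
# Output as printed by pandas 1.x; pandas 2.x names the result "count" instead
'''
1    2
4    1
Name: feature_1, dtype: int64
'''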
df = pd.DataFrame([[1,2,3],
[4,2,6],
['?',8,9]],columns = ["feature_1", "feature_2", "label"],index=["a","b","c"])
# Handling anomalous values
# Drop: turn the '?' placeholder into NaN, then drop every row that contains a NaN
# print(df.replace('?', np.nan).dropna(how = 'any'))
'''
   feature_1  feature_2  label
a        1.0          2      3
b        4.0          2      6
'''
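When the placeholder is not known in advance, a different technique from the one above is `pd.to_numeric` with `errors="coerce"`, which turns anything that cannot be parsed as a number into NaN. This is offered as an alternative sketch, not as the approach the notes use; the name `cleaned` is an assumption.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 2, 6], ["?", 8, 9]],
                  columns=["feature_1", "feature_2", "label"],
                  index=["a", "b", "c"])

cleaned = df.copy()
# Values that cannot be parsed as numbers (here "?") become NaN.
cleaned["feature_1"] = pd.to_numeric(cleaned["feature_1"], errors="coerce")
# Drop every row that still contains a NaN.
cleaned = cleaned.dropna(how="any")

print(list(cleaned.index))             # ['a', 'b']
print(cleaned["feature_1"].tolist())   # [1.0, 4.0]
```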
a = pd.DataFrame([[1,2,3],
[4,5,6],
[1,8,9]],columns = ["feature_1", "feature_2", "label"])
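# Random sampling of DataFrame rows
# df = a.sample(frac=0.33)   # by fraction: about one third of the rows (1 of 3 here)
df = a.sample(n=1)   # by count: exactly one row
print(df)
# The row drawn is random, so the output varies between runs; for example:
'''
   feature_1  feature_2  label
0          1          2      3
'''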
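Because the sample is random, results differ from run to run. If reproducibility matters, `sample` accepts a `random_state` seed; the sketch below assumes a seed of 0, which is an arbitrary choice, and the variable names are illustrative.

```python
import pandas as pd

a = pd.DataFrame([[1, 2, 3], [4, 5, 6], [1, 8, 9]],
                 columns=["feature_1", "feature_2", "label"])

one_row = a.sample(n=1, random_state=0)
again = a.sample(n=1, random_state=0)          # same seed, same row
third = a.sample(frac=0.33, random_state=0)    # round(0.33 * 3) gives 1 row
shuffled = a.sample(frac=1, random_state=0)    # all rows, in random order

print(one_row.equals(again))      # True
print(len(third))                 # 1
print(sorted(shuffled.index))     # [0, 1, 2]
```

Sampling with `frac=1` is a common way to shuffle a DataFrame without dropping any rows.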