An unexpected gotcha: writing Excel (.xlsx) files with pandas requires the openpyxl library, and to_excel overwrites the target file rather than appending, so you have to assemble everything in the DataFrame first and then write it all out at once. Also pass engine='openpyxl' explicitly, otherwise pandas may fall back to xlrd, which has some pitfalls.
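A minimal sketch of that workflow (the file name, columns, and values here are made up for illustration; assumes openpyxl is installed):

```python
import os
import tempfile

import pandas as pd

# Build the whole table in the DataFrame first, then write it out in one go;
# to_excel overwrites the target file rather than appending to an existing sheet.
df = pd.DataFrame({"name": ["a", "b"], "score": [90, 85]})

path = os.path.join(tempfile.mkdtemp(), "out.xlsx")  # placeholder path
df.to_excel(path, engine="openpyxl", index=False)

# Passing engine explicitly on the read side as well avoids fallback surprises.
df_back = pd.read_excel(path, engine="openpyxl")
```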
https://pandas.pydata.org/pandas-docs/version/0.22.0/10min.html
Series: similar to a Python list
s = pd.Series([1,3,5,np.nan,6,8])
DataFrame: similar to a table; the most important data type.
First way: from an np.array, with index as the row labels and columns as the column labels.
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Second way
A DataFrame can also be defined column by column, passing a dict: each key is a column label, each value is that column's contents.
df2 = pd.DataFrame({
'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo' })
Note: if an entire column holds the same value, you can write a single scalar, as with A, or an array, as with D.
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
dtypes: The main types stored in pandas objects are float, int, bool, datetime64[ns] and datetime64[ns, tz] (time-zone-aware, in >= 0.17.0), timedelta[ns], category and object. In addition these dtypes have item sizes, e.g. int64 and int32.
Viewing Data: a DataFrame has attributes index, columns, values.
It has these methods: df.head(x) first x rows; df.tail(x) last x rows; df.describe() summary statistics per column (count, mean, std, and so on).
df.T transposes.
Sort by row/column labels: df.sort_index(axis=1, ascending=False)
Sort by values: df.sort_values(by='B')
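A runnable sketch of these inspection and sorting methods, on the same kind of random frame as above:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

head3 = df.head(3)       # first 3 rows
tail2 = df.tail(2)       # last 2 rows
stats = df.describe()    # count / mean / std / min / quartiles / max per column
by_label = df.sort_index(axis=1, ascending=False)  # columns reordered D, C, B, A
by_value = df.sort_values(by="B")                  # rows ordered by column B
```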
Operating on a DataFrame
df
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
Select a column: df['A'] or df.A, matching the dict-style definition above.
Select rows: df[0:3] or df['20130102':'20130104'], matching the np.array-style definition above.
Select a row by label: df.loc[dates[0]] or df.loc['2013-01-01']
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
Select columns by label: df.loc[:,['A','B']]
Select by both labels: print(df.loc[dates[0:3],'A':'C']) or df.loc['20130102':'20130104',['A','B']]
In fact, with loc the expressions inside the brackets read much like Python list slicing (though label slices include both endpoints).
If exactly one of the row/column selectors is a single label, the result is a Series; if neither is single, a DataFrame; if both are single, a scalar whose type follows the dtype stored in the DataFrame, neither a DataFrame nor a Series.
Both single: df.at[dates[0],'A'] or df.loc[dates[0],'A']
The loc method above is still not exactly Python list slicing.
For a closer match there is .iloc (the key is the added i), which takes plain integer positions:
df.iloc[3] / df.iloc[3:5,0:2] / df.iloc[[1,2,4],[0,2]] / df.iloc[1:3,:] / df.iat[1,1]
Boolean indexing
df[df.A > 0]
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
df[df > 0]
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988
df2 = df.copy()
df2['E'] = ['one', 'one','two','three','four','three']  # defined like the dict style earlier
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three
df2[df2['E'].isin(['two','four'])]
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four
Setting values in a DataFrame:
df.loc[:,'D'] = np.array([5] * len(df))  # len(df) is the number of rows
Just assign directly on top of a loc/iloc selection.
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])  # reindex reshapes df, adding/dropping rows/columns
Handling NaN:
df1.dropna(how='any')  # drop any row containing NaN
df1.fillna(value=5)  # fill NaN cells with 5
pd.isna(df1)  # a boolean mask, True where NaN; conceptually like df[df == nan], which of course cannot actually be written that way
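The three NaN operations above, sketched on a small hand-built frame (df1 here is constructed directly for illustration):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

dropped = df1.dropna(how="any")  # keeps only rows with no NaN at all
filled = df1.fillna(value=5)     # every NaN replaced by 5
mask = pd.isna(df1)              # boolean DataFrame, True exactly where NaN
```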
Statistics on the data (NaN is excluded by default):
df.mean()   # mean of each column
df.mean(1)  # mean of each row
In [63]: s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)  # shift moves the whole series down 2 slots

In [64]: s
Out[64]:
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [65]: df.sub(s, axis='index')  # alongside sub there is also add
Out[65]:
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -1.861849 -3.104569 -1.494929  4.0  1.0
2013-01-04 -2.278445 -3.706771 -4.039575  2.0  0.0
2013-01-05 -5.424972 -4.432980 -4.723768  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN
df.apply(func) applies the function func to the DataFrame.
df:
                   A         B         C         D
2013-01-01 -1.292577  0.292979 -1.036204 -0.032626
2013-01-02 -2.083671 -0.053030  0.281592  1.412437
2013-01-03  1.019190 -0.922485 -0.571476  1.914043
2013-01-04  0.754306 -0.673051 -0.252833  0.509505
2013-01-05  0.289606  0.908545  0.227111  0.738909
2013-01-06 -2.157600 -0.213822 -0.324352  0.691240

df.apply(np.cumsum):
                   A         B         C         D
2013-01-01 -1.292577  0.292979 -1.036204 -0.032626
2013-01-02 -3.376248  0.239949 -0.754612  1.379810
2013-01-03 -2.357057 -0.682536 -1.326088  3.293853
2013-01-04 -1.602751 -1.355587 -1.578920  3.803358
2013-01-05 -1.313145 -0.447041 -1.351810  4.542267
2013-01-06 -3.470745 -0.660863 -1.676162  5.233507
This accumulates down each column; functions passed to apply operate column-wise.
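apply also accepts a lambda; following the same column-wise convention, this sketch computes each column's spread (max minus min):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))

cum = df.apply(np.cumsum)                       # running sum down each column
spread = df.apply(lambda x: x.max() - x.min())  # one number per column
```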
Counting frequencies:
s.value_counts()
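For example (the values here are made up):

```python
import pandas as pd

s = pd.Series([0, 3, 3, 7, 3, 0])
counts = s.value_counts()  # index holds the distinct values, values the frequencies
```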
String methods:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
The concat function:
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
The merge function:
In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [84]: left
Out[84]:
   key  lval
0  foo     1
1  bar     2

In [85]: right
Out[85]:
   key  rval
0  foo     4
1  bar     5

In [86]: pd.merge(left, right, on='key')
Out[86]:
   key  lval  rval
0  foo     1     4
1  bar     2     5
Rows that share the same value in the key column are joined.
The append function
df.append(s, ignore_index=True)  # appends s as a new row at the bottom
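One caveat beyond the 0.22 docs: DataFrame.append was later deprecated and removed in pandas 2.0, so on current versions pd.concat is the way to append a row. A sketch of the concat equivalent (the frame here is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 4), columns=list("ABCD"))
s = df.iloc[1]  # a row to re-append

# Equivalent of the older df.append(s, ignore_index=True):
out = pd.concat([df, s.to_frame().T], ignore_index=True)
```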
The groupby function: groups rows by shared values in the given column(s)
df.groupby(['A','B']).sum()  # the sum() is required; groupby alone returns a GroupBy object, not printable output
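A tiny self-contained sketch (the frame is made up): without a trailing aggregation, groupby returns a lazy GroupBy object; sum() materializes a DataFrame whose rows are indexed by the (A, B) group keys.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1, 2, 3, 4],
})

# Rows sharing the same (A, B) pair are collapsed; C is summed within each group.
grouped = df.groupby(["A", "B"]).sum()
```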
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two',
'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
                      A         B
first second
bar   one     -0.583097  0.652977
      two      0.495036 -1.013380
baz   one     -0.316055  0.245251
      two      0.245004  0.642539
foo   one     -0.170401 -1.944883
      two      0.244160 -0.245936
qux   one     -1.033248  0.601506
      two     -1.223583  0.416897
stack() compresses the columns down one level into the index; unstack() restores them.
first  second
bar    one     A   -0.583097
               B    0.652977
       two     A    0.495036
               B   -1.013380
baz    one     A   -0.316055
               B    0.245251
       two     A    0.245004
               B    0.642539
foo    one     A   -0.170401
               B   -1.944883
       two     A    0.244160
               B   -0.245936
qux    one     A   -1.033248
               B    0.601506
       two     A   -1.223583
               B    0.416897
dtype: float64
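The stack/unstack round trip above can be exercised end to end; a sketch with a smaller index (random values, so only the shapes matter):

```python
import numpy as np
import pandas as pd

tuples = list(zip(["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df2 = pd.DataFrame(np.random.randn(4, 2), index=index, columns=["A", "B"])

stacked = df2.stack()         # column labels folded into a third index level: a Series
restored = stacked.unstack()  # the last index level pops back out as columns
```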
Pivot tables
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])  # values is the column used as the "values"; index gives the row grouping; columns the column grouping. These match the three panes of Excel's pivot-table dialog.
Plotting:
df.plot()  # draws a matplotlib figure
Reading and writing files:
1. csv:
df.to_csv('foo.csv')  # write
pd.read_csv('foo.csv')  # read
2. HDF5:
df.to_hdf('foo.h5','df')
pd.read_hdf('foo.h5','df')
3. Excel:
df.to_excel('foo.xlsx', sheet_name='Sheet1')
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Finally, a reference chart from the web: https://blog.csdn.net/qq_34617032/article/details/80433939