Pandas Library Study Notes for Data Analysis (2)

4. File I/O
(1) CSV files

CSV stands for comma-separated values (sometimes character-separated values); a CSV file stores tabular data as plain text.

pd.read_csv(filepath_or_buffer,sep=',',*args,**kwargs)
"""
parameters:
filepath_or_buffer: file path or file-like object
sep: field delimiter, ',' by default
header: integer row number to use as the column labels
skiprows: rows to skip at the start of the file; skiprows=1 skips the first line (a list such as skiprows=[0] skips specific row numbers)
na_values: additional strings to recognize as NA/NaN
index_col: column (an integer or string, or a list of them) to use as the row index
nrows: number of rows to read
names: list of column labels to use when header=None
chunksize: number of rows per chunk; returns an iterable TextFileReader for chunked reading
"""
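A minimal, self-contained sketch of a few of these parameters, reading from an in-memory buffer (io.StringIO) instead of a real file so it runs anywhere; the data values are made up for illustration:

```python
import io
import pandas as pd

csv_text = "name,age,grade\nzx,20,3\nhack,19,2\nmary,21,1\n"

# Default: the first line becomes the column labels
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())   # ['name', 'age', 'grade']

# header=None + names: skip the original header row and supply labels ourselves
df2 = pd.read_csv(io.StringIO(csv_text), header=None, skiprows=1,
                  names=['Name', 'Age', 'Grade'])
print(df2.shape)             # (3, 3)

# chunksize: returns an iterable TextFileReader instead of one DataFrame
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
sizes = [len(chunk) for chunk in reader]
print(sizes)                 # [2, 1]
```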
(2) Excel files

Reading an Excel file

pd.read_excel(io,sheet_name=0,header=0,names=None,**kwargs)
parameters:
io: file path or file-like object
sheet_name: which sheet to read; either an integer index (0 means the first sheet) or the sheet name

read_excel returns a DataFrame:

data=pd.read_excel('test.xls')
print(type(data))   #<class 'pandas.core.frame.DataFrame'>
print(data.columns) # Index(['name', 'age', 'grade'], dtype='object')
print(data.iloc[0])
"""
name     zx
age      20
grade     3
"""
print(data.iloc[0,0])   #zx
5. Handling missing data

Missing data are values that are absent or undefined in a dataset.

(1) Missing data in NumPy

None: Python's null object

np.nan: a floating-point value used to mark missing data

a=np.array([1,np.nan,5,7,9])
print(a.dtype)
print(np.sum(a))
print(np.nansum(a))

Result:

float64
nan
22.0
(2) Handling missing data in Pandas

NaN: Not a Number, marks undefined or unrepresentable data; pandas also skips NaN in computations (e.g. sum)

s=pd.Series([1,np.nan,3,None,7,9])
print(s)
print(s.sum())
print(s.dropna())
#drops all NaN entries and returns a new object without them

Result:

0    1.0
1    NaN
2    3.0
3    NaN
4    7.0
5    9.0
dtype: float64
20.0
0    1.0
2    3.0
4    7.0
5    9.0
dtype: float64

pandas provides four functions for handling missing data:

s.isnull()
s.notnull()
s.dropna(axis=0): drop rows or columns containing missing data along the given axis
s.fillna(value,method=None,axis=0): fill missing entries with value (or propagate neighboring values via method)
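A short sketch exercising all four on one Series (values made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 3, None])   # None is converted to NaN

print(s.isnull().tolist())    # [False, True, False, True]
print(s.notnull().tolist())   # [True, False, True, False]
print(s.dropna().tolist())    # [1.0, 3.0]
print(s.fillna(0).tolist())   # [1.0, 0.0, 3.0, 0.0]
```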
6. Tidying data
1. Concatenation along an axis
pd.concat(objs,axis=0,join='outer',ignore_index=False,keys=None,levels=None,names=None,verify_integrity=False,sort=False,copy=True)
"""
objs : a sequence or mapping of Series or DataFrame objects
If a mapping is passed, the sorted keys will be used as the `keys` argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.

axis : {0/'index', 1/'columns'}, default 0
The axis to concatenate along.

join : {'inner', 'outer'}, default 'outer' -- take the union (outer) or intersection (inner) of the indexes
How to handle indexes on other axis (or axes).

ignore_index : bool, default False -- reset the index
If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.

keys : sequence, default None -- build a hierarchical (multi-level) index
If multiple levels passed, should contain tuples. Construct
hierarchical index using the passed keys as the outermost level.

levels : list of sequences, default None
Specific levels (unique values) to use for constructing a
MultiIndex. Otherwise they will be inferred from the keys.

names : list, default None -- names for the row/column index levels
Names for the levels in the resulting hierarchical index.

verify_integrity : bool, default False
Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.

sort : bool, default False
Sort non-concatenation axis if it is not already aligned when `join` is 'outer'.
This has no effect when ``join='inner'``, which already preserves the order of the non-concatenation axis.
"""

Summary:

objs: sequence of objects to concatenate; if a mapping is passed, its keys become the outermost index level (as if passed via keys)
axis: axis to concatenate along, 0 by default
keys: builds a hierarchical (multi-level) index
names: names for the levels of the hierarchical index
join: 'outer' (default) takes the union of the other axes' indexes; 'inner' takes the intersection
ignore_index: default False; if True, the original index along the concatenation axis is discarded and relabeled 0, ..., n-1

Example:

df1=pd.DataFrame([[110,120,130],[210,220,230],[310,320,330]])
df2=pd.DataFrame([[11,12,13],[21,22,23],[31,32,33]])
print(pd.concat([df1,df2]))
df1.index=['a','c','d']
df2.index=['b','c','d']
print(pd.concat([df1,df2],axis=0,keys=['df1','df2'])) #hierarchical index
print(pd.concat([df1,df2],axis=1)) #'outer': union of the indexes
print(pd.concat([df1,df2],axis=1,join='inner')) #'inner': intersection of the indexes

Result:

     0    1    2
0  110  120  130
1  210  220  230
2  310  320  330
0   11   12   13
1   21   22   23
2   31   32   33
         0    1    2
df1 a  110  120  130
    c  210  220  230
    d  310  320  330
df2 b   11   12   13
    c   21   22   23
    d   31   32   33
       0      1      2     0     1     2
a  110.0  120.0  130.0   NaN   NaN   NaN
c  210.0  220.0  230.0  21.0  22.0  23.0
d  310.0  320.0  330.0  31.0  32.0  33.0
b    NaN    NaN    NaN  11.0  12.0  13.0
     0    1    2   0   1   2
c  210  220  230  21  22  23
d  310  320  330  31  32  33
'''
Series.append : Concatenate Series.
DataFrame.append : Concatenate DataFrames.
Only concatenates along axis=0 and does not modify the original objects.
Note: append was deprecated in pandas 1.4 and removed in 2.0; use pd.concat([df1,df2]) instead.
'''
print(df1.append(df2))
'''
     0    1    2
a  110  120  130
c  210  220  230
d  310  320  330
b   11   12   13
c   21   22   23
d   31   32   33
'''
2. Merging data
pd.merge(left,right,how='inner',on=None,left_on=None,right_on=None,left_index=False,right_index=False,sort=False,suffixes=('_x','_y'),indicator=False)
'''
left, right: the two DataFrame objects to merge
how: 'inner' by default (keep only keys present in both frames); 'outer' keeps the union of keys; 'left'/'right' keep the keys of the left/right frame
on: column name (string) or list of names to join on; if not specified, defaults to the intersection of the two frames' columns
indicator: adds a _merge column annotating the source of each row
left_on, right_on: column(s) from the left/right frame to use as join keys (useful when the key columns are named differently)
left_index/right_index: default False; if True, use the index of the left/right frame as its join key
'''

Difference between merge and concat:

merge performs a database-style join: rows from the two DataFrames are matched by the values of key columns (or indexes).

concat glues objects together along an axis, aligning only on index labels; it does not match rows by value.
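A small sketch of the join behavior, using made-up frames: merge matches rows by the values in a 'key' column, and indicator records where each row came from:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# Default how='inner': only keys present in both frames survive
inner = pd.merge(left, right, on='key')
print(inner['key'].tolist())        # ['b', 'c']

# how='outer' keeps the union of keys; indicator annotates each row's origin
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
print(outer['key'].tolist())        # ['a', 'b', 'c', 'd']
print(outer['_merge'].tolist())     # ['left_only', 'both', 'both', 'right_only']
```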

3. Combining data

df1.combine_first(df2)

df1, df2: DataFrame objects

Patches missing values in df1 with the corresponding values from df2.

inf1=pd.DataFrame({"name":['zx','hack',np.nan],
                   "age":[np.nan,19,20]})
inf2=pd.DataFrame({"name":['zx','hack','mary','aaron'],
                   "age":[18,np.nan,21,20]})
print(inf1.combine_first(inf2))
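The frames above align on their default integer index; rebuilding them and asserting the patched values (rather than eyeballing the printout):

```python
import numpy as np
import pandas as pd

inf1 = pd.DataFrame({"name": ['zx', 'hack', np.nan],
                     "age": [np.nan, 19, 20]})
inf2 = pd.DataFrame({"name": ['zx', 'hack', 'mary', 'aaron'],
                     "age": [18, np.nan, 21, 20]})

result = inf1.combine_first(inf2)

# Holes in inf1 are filled from inf2; rows present only in inf2 are appended.
# Note row 2: inf1 already has age 20, so inf2's 21 is NOT used.
print(result['name'].tolist())  # ['zx', 'hack', 'mary', 'aaron']
print(result['age'].tolist())   # [18.0, 19.0, 20.0, 20.0]
```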
4. Transforming data

Pivoting between rows and columns

stack() and unstack() methods

The dropna parameter defaults to True, i.e. NaN entries produced by the reshape are not kept.
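A minimal sketch of the round trip (frame contents made up): stack() pivots the columns into the inner index level, and unstack() reverses it:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['r1', 'r2'], columns=['c1', 'c2'])

stacked = df.stack()           # returns a Series with a two-level index
print(stacked.index.tolist())  # [('r1','c1'), ('r1','c2'), ('r2','c1'), ('r2','c2')]
print(stacked.tolist())        # [1, 2, 3, 4]

# unstack() pivots the inner index level back into columns
print(stacked.unstack().equals(df))  # True
```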

Pivot tables

df.pivot(index=None,columns=None,values=None)
"""
Return reshaped DataFrame organized by given index / column values.

Reshape data (produce a "pivot" table) based on column values. Uses unique values from specified `index` / `columns` to form axes of the resulting DataFrame. This function does not support data  aggregation, multiple values will result in a MultiIndex in the columns. See the :ref:`User Guide <reshaping>` for more on reshaping.
Reshapes the data according to the given index/columns/values; does not support aggregation
"""
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                    'baz': [1, 2, 3, 4, 5, 6],
                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
print(df.pivot(index='foo',columns='bar',values='baz'))
print(df.pivot(index='foo',columns='bar')['baz'])#same effect as the previous statement
print(df.pivot(index='foo',columns='bar',values=['baz','zoo']))

Result:

bar  A  B  C
foo         
one  1  2  3
two  4  5  6

bar  A  B  C
foo         
one  1  2  3
two  4  5  6

  baz       zoo      
bar   A  B  C   A  B  C
foo                    
one   1  2  3   x  y  z
two   4  5  6   q  w  t

pivot raises a ValueError when the same (index, columns) position would receive two different values (duplicate entries); use pivot_table with an aggregation function in that case

df.pivot_table(values=None,index=None,columns=None,aggfunc='mean',dropna=True)
"""
aggfunc : function, list of functions, dict, default numpy.mean
"""
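A sketch of the difference on made-up data: with a duplicated (index, columns) pair, pivot raises, while pivot_table aggregates the duplicates (here with the default mean):

```python
import pandas as pd

# 'one'/'A' appears twice: pivot cannot place two values in one cell
df = pd.DataFrame({'foo': ['one', 'one', 'two'],
                   'bar': ['A', 'A', 'B'],
                   'baz': [1, 3, 5]})

try:
    df.pivot(index='foo', columns='bar', values='baz')
except ValueError as e:
    print('pivot failed:', e)

# pivot_table resolves the conflict by aggregating
table = df.pivot_table(index='foo', columns='bar', values='baz', aggfunc='mean')
print(table.loc['one', 'A'])   # 2.0, the mean of the two duplicate entries
```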