Notes on Python for Data Analysis: pandas Basics

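This post introduces the basics of the pandas library, including how to use Series and DataFrame. It covers handling duplicate values, summary statistics, index-based selection, data alignment, sorting, and converting between DataFrame and NumPy. It walks through the key data-manipulation features, such as reindexing, drop, arithmetic operations, function application, and sorting, to lay a foundation for data analysis in Python.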
import pandas as pd
from pandas import Series, DataFrame
import numpy as np

Series basics

obj = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj
d    4
b    7
a   -5
c    3
dtype: int64
# Name the Series and its index
obj.name = 'Series Name'
obj.index.name = 'index'
obj
index
d    4
b    7
a   -5
c    3
Name: Series Name, dtype: int64
print('index (left side):', obj.index)
print('values (right side):', obj.values)
print("value at label 'a':", obj['a'])
print("is 'b' in the index:", 'b' in obj)

print('\nlabels b/d/c:\n', obj[['b','d','c']])
print('\nvalues greater than 0:\n', obj[obj > 0])
index (left side): Index(['d', 'b', 'a', 'c'], dtype='object', name='index')
values (right side): [ 4  7 -5  3]
value at label 'a': -5
is 'b' in the index: True

labels b/d/c:
 index
b    7
d    4
c    3
Name: Series Name, dtype: int64

values greater than 0:
 index
d    4
b    7
c    3
Name: Series Name, dtype: int64
# isnull and notnull detect missing values
obj[obj>0].isnull()
index
d    False
b    False
c    False
Name: Series Name, dtype: bool
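
The isnull() call above returns only False because obj has no missing values. As a minimal sketch with data that does contain a gap, the snippet below builds a Series from a dict and asks for an index label that has no matching key. The dict `sdata`, the name `obj3`, and the label 'California' are illustrative assumptions, not part of the notes above.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch)
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000}

# 'California' has no entry in sdata, so its value becomes NaN
obj3 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

missing = obj3.isnull()    # True only where the value is NaN
present = obj3.notnull()   # element-wise opposite of isnull()

print(missing)
print(present)
```

Only the 'California' entry is flagged as missing; the other three labels are found in the dict and keep their values.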

DataFrame basics

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003
# Specify the column order
df = pd.DataFrame(frame, columns=['year', 'state', 'pop'])
# Return the first n rows; head() defaults to 5 rows
df.head(3)
   year state  pop
0  2000  Ohio  1.5
1  2001  Ohio  1.7
2  2002  Ohio  3.6
# Return the last n rows; tail() defaults to 5 rows
df.tail(3)
   year   state  pop
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2
# Passing a column name that is not in the data produces a column of missing values (NaN)
# The index argument replaces the default integer index
df2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five', 'six'])
df2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
# Select a column, method 1: dictionary-style indexing
df2['year'].head(3)
one      2000
two      2001
three    2002
Name: year, dtype: int64
# Select a column, method 2: attribute access
df2.year
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64


Note: df2[column] works for any column name, but df2.column only works when the column name is a valid Python variable name.
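
A small sketch of where attribute access falls short. Besides names that are not valid identifiers, a column name that matches an existing DataFrame method is also a problem: `pop`, which is a column in this post's data, is one such name. The frame `df_demo` and its 'total debt' column are hypothetical, made up for illustration.

```python
import pandas as pd

# Hypothetical frame: 'pop' collides with the DataFrame.pop method,
# and 'total debt' is not a valid Python identifier
df_demo = pd.DataFrame({
    'year': [2000, 2001],
    'pop': [1.5, 1.7],
    'total debt': [0.0, 1.0],
})

year_attr = df_demo.year          # fine: 'year' is a valid, unused name
pop_col = df_demo['pop']          # bracket indexing returns the column
pop_attr = df_demo.pop            # attribute access returns the method instead
debt_col = df_demo['total debt']  # only bracket indexing can reach this column

print(type(pop_col).__name__)     # Series
print(callable(pop_attr))         # True: this is a bound method, not data
```

Bracket indexing is therefore the safer default in scripts, with attribute access kept for quick interactive work.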

# Select a row by its index label
df2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
# Select rows by integer position
df2.iloc[2:5]
       year   state  pop debt
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
# Select every second row
df2.iloc[::2]
       year   state  pop debt
one    2000    Ohio  1.5  NaN
three  2002    Ohio  3.6  NaN
five   2002  Nevada  2.9  NaN
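
One detail worth a quick sketch when moving between loc and iloc: a label slice with loc includes both endpoints, while a positional slice with iloc excludes the stop position, as with ordinary Python slices. The frame `df_demo` below is a hypothetical stand-in with the same style of index labels as df2.

```python
import pandas as pd

# Hypothetical frame with string index labels
df_demo = pd.DataFrame(
    {'year': [2000, 2001, 2002, 2001]},
    index=['one', 'two', 'three', 'four'],
)

by_label = df_demo.loc['two':'four']   # label slice: 'four' is included
by_position = df_demo.iloc[1:3]        # position slice: position 3 is excluded

print(by_label)
print(by_position)
```

Here the label slice returns three rows (two, three, four) and the positional slice returns two rows (two, three).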
# Assign values to the debt column
df2['debt'] = np.arange(6.)
df2
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0
# Assigning to a column that does not exist creates a new column
df2['eastern'] = df2.state == 'Ohio'
df2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   0.0     True
two    2001    Ohio  1.7   1.0     True
three  2002    Ohio  3.6   2.0     True
four   2001  Nevada  2.4   3.0    False
five   2002  Nevada  2.9   4.0    False
six    2003  Nevada  3.2   5.0    False
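# Delete a column
del df2['eastern']
# Note: a column selected from a DataFrame (e.g. df2['debt']) may be a view on the underlying data
# rather than a copy, so in-place changes to that Series can show up in the DataFrame.
# Call .copy() on the Series when an independent copy is needed.
df.columns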
Index(['year', 'state', 'pop'], dtype='object')
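
Whether a selected column behaves as a view or as a copy can depend on the pandas version and its settings, so a defensive sketch is to take an explicit copy before modifying a column that should not affect the original frame. The names `df_demo` and `debt_copy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration
df_demo = pd.DataFrame({'debt': np.arange(3.)}, index=['one', 'two', 'three'])

# An explicit copy is independent of the DataFrame's data
debt_copy = df_demo['debt'].copy()
debt_copy['one'] = 99.0

print(df_demo)     # the 'debt' column still holds 0.0, 1.0, 2.0
print(debt_copy)   # only the copy has the new value
```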
# A DataFrame can be transposed like a NumPy array
df2.T
        one   two three    four    five     six
year   2000  2001  2002    2001    2002    2003
state  Ohio  Ohio  Ohio  Nevada  Nevada  Nevada
pop     1.5   1.7   3.6     2.4     2.9     3.2
debt      0     1     2       3       4       5
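
A short sketch of one side effect of transposing: when the columns hold different types, each column of the transposed frame has to hold a mix of them, so the result falls back to the generic object dtype. The frame `df_demo` is a hypothetical two-row version of the data above.

```python
import pandas as pd

# Hypothetical frame mixing integer, string, and float columns
df_demo = pd.DataFrame(
    {'year': [2000, 2001], 'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]},
    index=['one', 'two'],
)

transposed = df_demo.T

print(df_demo.dtypes)      # one dtype per original column
print(transposed.dtypes)   # every transposed column is object
```

This is a reason to treat .T mainly as a display or reshaping aid on mixed-type data rather than as a step before numeric computation.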
# Name the index and the columns of the DataFrame
df2.index.name = 'index name'
df2.columns.name = 'column name'
df2
column name  year state  pop  debt
index name
one          2000  Ohio  1.5   0.0
two          2001