pandas Basics

helloMkzhang

Category: Data Analysis. Tags: python

Article link: https://blog.csdn.net/nefax/article/details/79345807

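First, a look at Python's basic data structures.

1. Dictionary (dict)

A dict is Python's hash table implementation. It stores data as key:value pairs, and lookups are fast. Most JSON APIs return data in this shape, and Python can read it directly. Here are the basics of using a dict.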

Creating a dict and printing its contents

d = {'01': 'Normal', '02': 'Special mention', '03': 'Substandard', '04': 'Doubtful', '05': 'Loss'}
print(d['01'], d['02'], d['03'], d['04'], d['05'])

Adding an entry to a dict

d = {'01': 'Normal', '02': 'Special mention', '03': 'Substandard', '04': 'Doubtful', '05': 'Loss'}
d['06'] = 'Abnormal'
print(d)

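Indexing a key that does not exist raises a KeyError. Use the dict's get method instead.

print(d['08'])  # raises KeyError: '08'

get returns the value for a key. If the key is missing, it returns the second argument (here -1), or None when no default is given.

print(d.get('08', -1))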

Deleting a key

d.pop('06')  # '06' was added above; popping a missing key such as '08' raises KeyError
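
To make the missing-key behaviour concrete, here is a minimal sketch. The two-entry dict is a made-up stand-in for the lookup table above; the calls themselves (indexing, get, pop with a fallback) are standard dict methods.

```python
# Hypothetical two-entry lookup table, standing in for the dict above.
d = {'01': 'Normal', '02': 'Special mention'}

# Indexing a missing key raises KeyError.
try:
    d['08']
    missing_raised = False
except KeyError:
    missing_raised = True

# get() returns None by default, or the fallback passed as the second argument.
default_none = d.get('08')
default_custom = d.get('08', -1)

# pop() also accepts a fallback, which avoids KeyError for a missing key.
popped = d.pop('08', None)

print(missing_raised, default_none, default_custom, popped)
```

Passing a fallback to pop is handy in cleanup code where the key may or may not be present.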

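Iterating over a dict

Before Python 3.6, dicts were unordered. CPython 3.6 preserves insertion order as an implementation detail, and Python 3.7 guarantees it in the language. If your code depends on ordering, collections.OrderedDict makes that explicit. See: https://stackoverflow.com/questions/39980323/dictionaries-are-ordered-in-python-3-6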

for k,v in d.items():
    print(k,v)
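
As a rough illustration of the ordering point, the sketch below compares a plain dict with collections.OrderedDict on Python 3.7 or later. The keys and values are arbitrary placeholders.

```python
from collections import OrderedDict

# A plain dict keeps insertion order on Python 3.7+.
plain = {}
plain['b'] = 1
plain['a'] = 2
keys_plain = list(plain)

# OrderedDict adds reordering helpers that a plain dict lacks.
od = OrderedDict([('b', 1), ('a', 2)])
od.move_to_end('b')
keys_od = list(od)

# Equality ignores order for plain dicts but not for two OrderedDicts.
same_plain = {'a': 1, 'b': 2} == {'b': 2, 'a': 1}
same_od = OrderedDict(a=1, b=2) == OrderedDict(b=2, a=1)

print(keys_plain, keys_od, same_plain, same_od)
```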

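2. List (list)

Basic list operations

# create a list
l = [101, 8, 0, -100, 'elements can be of mixed types', 'aaa', ['can be', 'another list']]
print(l)
# append to the end
l.append(1)
print(l)
# delete the element at index 2
l.pop(2)
print(l)
# insert at index 2
l.insert(2, 'new')
print(l)
# iterate
for item in l:
    print(item)
# iterate with the index; enumerate yields (index, item) pairs
for index, item in enumerate(l):
    print(index, item)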

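3. Tuple (tuple)

Like a list, a tuple is an ordered collection. A list is mutable and a tuple is not, so a tuple is safer against accidental modification.

t = ('low risk', 'medium risk', 'high risk')
for item in t:
    print(item)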
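
A short sketch of what "immutable" means in practice. The limits dict and its 0.5 value are invented for illustration only.

```python
t = ('low risk', 'medium risk', 'high risk')

# Item assignment on a tuple raises TypeError.
try:
    t[0] = 'no risk'
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Tuples are hashable when their elements are, so they can serve as dict keys.
# (The 0.5 here is a made-up value.)
limits = {('low risk', 'medium risk'): 0.5}

# Immutability is shallow: a list stored inside a tuple can still be changed.
nested = (1, [2, 3])
nested[1].append(4)

print(assignment_failed, limits[('low risk', 'medium risk')], nested)
```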

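4. Set (set)

Defining a set: passing a list to set() filters out duplicate elements.

in_d = ['low risk', 'medium risk', 'high risk', 'high risk']
s = set(in_d)
print(s)

A set can be thought of as a dict with keys but no values. Set operators (-, ^, &, |) make it efficient to compare collections of unique items.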

def file2minusfile1(file1, file2, fileout):
    """Write the lines that appear in file2 but not in file1 to fileout."""
    # read both files, stripping surrounding whitespace from each line
    with open(file1, encoding='GBK') as file_in_1:
        f1 = set(line.strip() for line in file_in_1)
    with open(file2, encoding='GBK') as file_in_2:
        f2 = set(line.strip() for line in file_in_2)

    # set difference; use f2 ^ f1 instead for lines found in only one of the two files
    g = f2 - f1

    with open(fileout, 'w', encoding='GBK') as file_out:
        for x in g:
            file_out.write(x + '\n')
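
For reference, here is how the four common set operators behave on two small in-memory sets. The letters are placeholder data standing in for the lines of two files.

```python
# Placeholder data standing in for the stripped lines of two files.
f1 = {'a', 'b', 'c'}
f2 = {'b', 'c', 'd'}

only_in_f2 = f2 - f1        # difference: in f2 but not in f1
in_exactly_one = f2 ^ f1    # symmetric difference: in one set but not both
in_both = f1 & f2           # intersection
in_either = f1 | f2         # union

print(only_in_f2, in_exactly_one, in_both, in_either)
```

Note that - and ^ answer different questions, so the choice between them changes which lines end up in the output file.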

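pandas data structures

1. Series

An index can be specified. If it is not, the default is an integer sequence starting at 0.

import numpy as np
import pandas as pd

S = pd.Series(np.random.randn(5))
S = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])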

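A pandas index may contain duplicate labels. An operation that requires a unique index raises an error only when it is attempted. This lazy checking is a performance choice.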

S = pd.Series(np.random.randn(5), index=['a', 'a', 'c', 'd', 'e'])
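
A minimal sketch of that lazy behaviour, using fixed values instead of random ones so the result is predictable. It assumes a reasonably recent pandas, where reindexing a Series with duplicate labels raises ValueError.

```python
import numpy as np
import pandas as pd

# Construction succeeds even though the label 'a' appears twice.
s = pd.Series(np.arange(5.0), index=['a', 'a', 'c', 'd', 'e'])

is_unique = s.index.is_unique   # False
dup_rows = s['a']               # label lookup returns both matching rows

# The error only appears when an operation needs unique labels.
try:
    s.reindex(['a', 'c'])
    reindex_failed = False
except ValueError:
    reindex_failed = True

print(is_unique, len(dup_rows), reindex_failed)
```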

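The data passed in must have the same length as the index, otherwise a ValueError is raised (the exact message varies by pandas version).

S = pd.Series(np.random.randn(4), index=['a', 'a', 'c', 'd', 'e'])

ValueError: Wrong number of items passed 4, placement implies 5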

However, when the data holds a single value, it is repeated to match the length of the index.

S = pd.Series(np.random.randn(1), index=['a', 'a', 'c', 'd', 'e'])
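
The documented way to get the same effect is to pass a scalar together with an index. A small sketch, with 5.0 as an arbitrary value:

```python
import pandas as pd

# A scalar is repeated once per index label.
s = pd.Series(5.0, index=['a', 'b', 'c', 'd', 'e'])

print(s)
```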

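Once constructed, a Series supports NumPy-style operations such as indexing and slicing.

S = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(S)
# position-based access; S[0] on a labeled index is deprecated in recent pandas
print("S.iloc[0]\n"+str(S.iloc[0]))
print("S['a']\n"+str(S['a']))
print("S[:3]\n"+str(S[:3]))
# a boolean mask must have the same length as the Series
print("S[[True,True,False,True,False]]\n"+str(S[[True,True,False,True,False]]))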
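
Beyond indexing, a Series also accepts vectorized arithmetic and NumPy functions. The sketch below uses fixed values rather than random ones so the results can be checked by eye.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, -2.0, 3.0, -4.0, 5.0], index=['a', 'b', 'c', 'd', 'e'])

doubled = s * 2           # element-wise arithmetic
absolute = np.abs(s)      # NumPy ufuncs return a Series with the same index
positives = s[s > 0]      # a boolean mask built from a comparison

# Arithmetic aligns on labels: 'a' is missing from the right-hand side,
# so the result for 'a' is NaN.
aligned = s + s.iloc[1:]

print(doubled, absolute, positives, aligned, sep="\n")
```

Label alignment is the main difference from a plain NumPy array, where the same addition would fail because the shapes differ.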

2. DataFrame

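A DataFrame can be built from a dict whose values are numeric Series. pandas places values that share an index label on the same row and fills the gaps with NaN. The index parameter selects which rows to keep, and columns defines the column headers.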

d = { 
    'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']) 
    }
df = pd.DataFrame(d)
print(df)
df = pd.DataFrame(d, index = ['c', 'd'])
print(df)
df = pd.DataFrame(d, index = ['c', 'd'], columns=['one', 'two'])
print(df)

A DataFrame can also be built directly from lists and tuples. See the pandas docs: http://pandas.pydata.org/pandas-docs/stable/dsintro.html
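
Two quick sketches of those alternatives: a list of dicts and a list of tuples. The column names and values are invented for illustration.

```python
import pandas as pd

# A list of dicts: keys become columns, missing keys become NaN.
records = [{'one': 1.0, 'two': 1.0}, {'one': 2.0, 'two': 2.0}, {'two': 3.0}]
df_records = pd.DataFrame(records, index=['a', 'b', 'c'])

# A list of tuples: each tuple is a row, and columns supplies the headers.
rows = [(1, 'low risk'), (2, 'high risk')]
df_rows = pd.DataFrame(rows, columns=['id', 'risk'])

print(df_records)
print(df_rows)
```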

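pandas data operations

Reading data

Note that a separator longer than one character makes pandas fall back to the Python parsing engine, which reads more slowly. If the content contains Chinese text, specify the encoding, such as UTF-8 or GBK.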

df_my = pd.read_csv("input/TM_LOAN_REG_HST.csv",
                    sep="::",
                    engine="python",  # set explicitly; a multi-character sep needs the Python engine
                    na_filter=False,
                    dtype=str,
                    index_col=False,
                    encoding='UTF-8')
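
To try the multi-character separator without the original file, the sketch below reads from an in-memory string. The three columns and two rows are made up, and 'MCEF' is a hypothetical loan type. A separator longer than one character is treated as a regular expression, which is why the Python engine is needed.

```python
import io
import pandas as pd

# Made-up sample in the same "::"-separated layout.
raw = "ID::LOAN_TYPE::MCHT_CD\n1::MCEI::M001\n2::MCEF::M002\n"

df_demo = pd.read_csv(io.StringIO(raw),
                      sep="::",
                      engine="python",  # multi-character separators need this engine
                      dtype=str)        # keep every column as text

print(df_demo)
```

When reading from a file path instead of a string, the encoding argument controls how the bytes are decoded.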

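Filtering

This is the equivalent of a SQL WHERE clause. The expression inside the brackets evaluates to a boolean mask, much like the [True, True, False, True] list shown earlier.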

df[df['LOAN_TYPE']=='MCEI']
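
A small sketch of what the mask looks like, on an invented three-row frame. Conditions can be combined with & and |, with each condition wrapped in parentheses.

```python
import pandas as pd

# Invented sample data; 'OTHER' is a placeholder loan type.
df = pd.DataFrame({'LOAN_TYPE': ['MCEI', 'OTHER', 'MCEI'], 'ID': [1, 2, 3]})

mask = df['LOAN_TYPE'] == 'MCEI'    # boolean Series, one value per row
filtered = df[mask]

# Like WHERE LOAN_TYPE = 'MCEI' AND ID > 1
combined = df[(df['LOAN_TYPE'] == 'MCEI') & (df['ID'] > 1)]

print(mask.tolist())
print(filtered)
print(combined)
```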

Aggregation

T.groupby('mcht_cd')['login_mobile'].sum()
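
The same pattern on a tiny made-up frame, roughly equivalent to SELECT mcht_cd, SUM(...) GROUP BY mcht_cd. The login_cnt column is a hypothetical numeric column chosen so the sum is easy to verify.

```python
import pandas as pd

# Invented sample data with a hypothetical numeric column.
T = pd.DataFrame({'mcht_cd': ['M1', 'M1', 'M2'], 'login_cnt': [1, 2, 5]})

# One row per merchant code, with the summed count as the value.
totals = T.groupby('mcht_cd')['login_cnt'].sum()

print(totals)
```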

Window-style operation 1, using groupby and apply

max_lbd_records = my_data[my_data['LOAN_TYPE']=='MCEI'].groupby('MCHT_CD').apply(lambda t: t[t['ID'] == t['ID'].max()])
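
The line above keeps, for each merchant, the rows with the largest ID. As an alternative sketch (not the approach used above), groupby with idxmax gets a similar result without apply. The sample data is invented. One difference: idxmax keeps a single row per group, while the apply version keeps every row tied for the maximum.

```python
import pandas as pd

# Invented sample data.
my_data = pd.DataFrame({
    'LOAN_TYPE': ['MCEI', 'MCEI', 'MCEI', 'OTHER'],
    'MCHT_CD':   ['M1',   'M1',   'M2',   'M2'],
    'ID':        [1,      3,      2,      9],
})

mcei = my_data[my_data['LOAN_TYPE'] == 'MCEI']

# idxmax returns the row label of the largest ID within each group.
latest = mcei.loc[mcei.groupby('MCHT_CD')['ID'].idxmax()]

print(latest)
```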

Window-style operation 2

gp_res = c_m.groupby('cid').apply(lambda x: pd.Series(dict(
        cnt=(1),
        had_return_cnt=(x.debit_amt.sum() > 0),
        had_return_cnt_0_30=(x[x.dys < 30].debit_amt.sum() > 0),
        had_return_cnt_30_60=(x[(x.dys >= 30) & (x.dys < 60)].debit_amt.sum() > 0),
        had_return_cnt_60_90=(x[(x.dys >= 60) & (x.dys < 90)].debit_amt.sum() > 0),
        had_return_cnt_90_up=(x[x.dys >= 90].debit_amt.sum() > 0),
        setl_cnt=(x.setl_flag.sum() > 0),
        setl_cnt_0_30=(x[x.dys < 30].setl_flag.sum() > 0),
        setl_cnt_30_60=(x[(x.dys >= 30) & (x.dys < 60)].setl_flag.sum() > 0),
        setl_cnt_60_90=(x[(x.dys >= 60) & (x.dys < 90)].setl_flag.sum() > 0),
        setl_cnt_90_up=(x[(x.dys >= 90)].setl_flag.sum() > 0),
        sum_all=x.debit_amt.sum(),
        sum_0_30=x[x.dys < 30].debit_amt.sum(),
        sum_30_60=(x[(x.dys >= 30) & (x.dys < 60)].debit_amt).sum(),
        sum_60_90=(x[(x.dys >= 60) & (x.dys < 90)].debit_amt).sum(),
        sum_90_up=(x[x.dys >= 90].debit_amt).sum(),
    )))
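
A cut-down, runnable version of the same idea: the function passed to apply returns a Series, and each key of that Series becomes a column in the result. The data is invented, and only three of the original statistics are reproduced. Selecting the needed columns before apply is a choice made here to keep the grouping column out of each group.

```python
import pandas as pd

# Invented sample data: customer id, days elapsed, debit amount.
c_m = pd.DataFrame({
    'cid':       ['c1', 'c1', 'c2'],
    'dys':       [10,   45,   95],
    'debit_amt': [100.0, 0.0, 50.0],
})

def summarize(x):
    # x holds the rows of one customer
    first_30 = x[x.dys < 30]
    return pd.Series({
        'sum_all': x.debit_amt.sum(),
        'sum_0_30': first_30.debit_amt.sum(),
        'had_return_cnt_0_30': first_30.debit_amt.sum() > 0,
    })

gp_res = c_m.groupby('cid')[['dys', 'debit_amt']].apply(summarize)

print(gp_res)
```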

Joining

res_detail = pd.merge(ori_res, gp_df, how='left', left_on='cid', right_on='cid')
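
A small sketch of a left join on made-up frames. When the key column has the same name on both sides, on='cid' is a shorter spelling of left_on and right_on. Rows on the left with no match keep their place and get NaN in the right-hand columns.

```python
import pandas as pd

# Invented sample data.
ori_res = pd.DataFrame({'cid': ['c1', 'c2', 'c3'], 'CORP': ['A', 'A', 'B']})
gp_df = pd.DataFrame({'cid': ['c1', 'c2'], 'sum_all': [100.0, 50.0]})

# Like SQL: ori_res LEFT JOIN gp_df USING (cid)
res_detail = pd.merge(ori_res, gp_df, how='left', on='cid')

print(res_detail)
```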

Pivot table

res_stat = pd.pivot_table(res_detail, index=['CORP', 'aip_corp', 'update_dt'],
                              aggfunc='sum', margins=False)
res_stat = res_stat.reset_index()
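
The same call on an invented frame with two index columns instead of three. When values is not given, pivot_table aggregates every remaining column, and reset_index turns the index levels back into ordinary columns.

```python
import pandas as pd

# Invented sample data.
res_detail = pd.DataFrame({
    'CORP':      ['A', 'A', 'B'],
    'update_dt': ['2018-02-01', '2018-02-01', '2018-02-01'],
    'sum_all':   [100.0, 50.0, 7.0],
    'cnt':       [1, 1, 1],
})

res_stat = pd.pivot_table(res_detail, index=['CORP', 'update_dt'],
                          aggfunc='sum', margins=False)
res_stat = res_stat.reset_index()

print(res_stat)
```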
