Common numpy and pandas operations

Preface

import numpy as np
import pandas as pd

I. numpy operations

1. Element-wise addition, subtraction, multiplication and division on a numpy array:

v = np.array([1,2,3,4,5])
Addition: v+1
Result: array([2, 3, 4, 5, 6])
Subtraction: v-1
Result: array([0, 1, 2, 3, 4])
Multiplication: v*2
Result: array([ 2,  4,  6,  8, 10])
Division: v/10
Result: array([0.1, 0.2, 0.3, 0.4, 0.5])
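
The same operators also work element-wise between two arrays of the same shape; a minimal sketch (the arrays a and b here are just illustrative):

import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

print(a + b)   # array([11, 22, 33, 44, 55]) -- element-wise addition
print(a * b)   # array([ 10,  40,  90, 160, 250]) -- element-wise multiplication
print(b / a)   # array([10., 10., 10., 10., 10.]) -- element-wise division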

2. Creating a numpy array of a given shape

https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.rand.html

Create a 3×2 array of random numbers:
np.random.rand(3, 2)
Result:
array([[0.98776843, 0.33816671],
       [0.2125091 , 0.63436048],
       [0.54017721, 0.37424256]])

Create an array of a given shape and dtype filled with ones (here a 3×1 array):
np.ones((3, 1), dtype=np.float32)
Result:
array([[1.],
       [1.],
       [1.]], dtype=float32)

Specify different data types for the fields when declaring an array (a structured array):
import numpy as np
dtype = [('Name', 'S10'), ('Height', float), ('Age', int)]
values = [('Li', 1.8, 41), ('Wang', 1.9, 38),('Duan', 1.7, 38)]
# [('Li', 1.8, 41), ('Wang', 1.9, 38), ('Duan', 1.7, 38)]
a = np.array(values, dtype=dtype)
# array([(b'Li', 1.8, 41), (b'Wang', 1.9, 38), (b'Duan', 1.7, 38)], dtype=[('Name', 'S10'), ('Height', '<f8'), ('Age', '<i4')])
Sorting can then be done by a named field: np.sort(a, order='Height')
Reference: https://blog.csdn.net/haiyang_duan/article/details/79221458

3. Matrix multiplication
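
A minimal sketch of matrix multiplication with the @ operator / np.dot (as opposed to the element-wise * shown above):

import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A @ B)   # matrix product, same as np.dot(A, B)
# [[19 22]
#  [43 50]]

print(A * B)   # element-wise product, NOT matrix multiplication
# [[ 5 12]
#  [21 32]]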

4. Finding the positions where array values equal a given value

label1=np.array([1,2,3,4,5,6,7,1,2,3,4,5,6,7])
label1==2
The returned data is:
array([False,  True, False, False, False, False, False, False,  True,
       False, False, False, False, False])
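
The comparison above only gives a boolean mask; to get the actual index positions, pass the mask to np.where (a minimal sketch):

import numpy as np

label1 = np.array([1,2,3,4,5,6,7,1,2,3,4,5,6,7])

# np.where returns a tuple with one index array per dimension
idx = np.where(label1 == 2)
print(idx)      # (array([1, 8]),)
print(idx[0])   # array([1, 8]) -- positions where the value equals 2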

5. Row/column sums of a matrix

https://blog.csdn.net/lwplwf/article/details/55506896

test1 = np.array([[5, 10, 15],
            [20, 25, 30],
            [35, 40, 45]])
test1.sum(axis=1)
# Output: array([30, 75, 120])   (axis=1 sums across each row)

test1 = np.array([[5, 10, 15],
            [20, 25, 30],
            [35, 40, 45]])
test1.sum(axis=0)
# Output: array([60, 75, 90])   (axis=0 sums down each column)

II. pandas operations

1. Reading csv/excel and building a dataframe from a dict

1) Reading csv/excel

https://blog.csdn.net/sinat_29957455/article/details/79054126

df = pd.read_csv('XXX.csv')
df = pd.read_excel('excel.xls')
df = pd.DataFrame.from_dict({'source':keysarr,'target':valuesarr})

2) Writing csv

df5.to_csv("C:/Users/17646/Desktop/result5.csv",sep=',')

2. Reading MongoDB data into a dataframe

1) Reading data from MongoDB

import pandas as pd
import pymongo
client = pymongo.MongoClient("127.0.0.1:27017")
# client = pymongo.MongoClient('localhost',27017)
db = client['lvsuo']
collection = db["total2"]
data = pd.DataFrame(list(collection.find()))

2) Inserting dataframe data into MongoDB

When I tried this, the _id field read back from MongoDB could not be serialized to JSON, so I ended up selecting only the fields I actually needed and saving just those back to MongoDB.

db["withIndex"].insert_many(json.loads(df1.T.to_json()).values())

3. Inspecting data

View the first 5 rows (pass an argument to view a different number of rows): df.head()
View how many rows and columns the data has: df.shape
View all column names: df.columns
View overall information: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
State                     3333 non-null object
Account length            3333 non-null int64
Area code                 3333 non-null int64
International plan        3333 non-null object
Voice mail plan           3333 non-null object
Number vmail messages     3333 non-null int64
Total day minutes         3333 non-null float64
Total day calls           3333 non-null int64
Total day charge          3333 non-null float64
Total eve minutes         3333 non-null float64
Total eve calls           3333 non-null int64
Total eve charge          3333 non-null float64
Total night minutes       3333 non-null float64
Total night calls         3333 non-null int64
Total night charge        3333 non-null float64
Total intl minutes        3333 non-null float64
Total intl calls          3333 non-null int64
Total intl charge         3333 non-null float64
Customer service calls    3333 non-null int64
Churn                     3333 non-null bool
dtypes: bool(1), float64(8), int64(8), object(3)
memory usage: 498.1+ KB
None

View the basic statistics of the data:

df.describe() shows statistics for the float64 and int64 features:

       Account length    Area code  Number vmail messages  ...  Total intl charge  Customer service calls
count     3333.000000  3333.000000            3333.000000  ...        3333.000000             3333.000000
mean       101.064806   437.182418               8.099010  ...           2.764581                1.562856
std         39.822106    42.371290              13.688365  ...           0.753773                1.315491
min          1.000000   408.000000               0.000000  ...           0.000000                0.000000
25%         74.000000   408.000000               0.000000  ...           2.300000                1.000000
50%        101.000000   415.000000               0.000000  ...           2.780000                1.000000
75%        127.000000   510.000000              20.000000  ...           3.270000                2.000000
max        243.000000   510.000000              51.000000  ...           5.400000                9.000000

View basic statistics of non-numeric data:

df.describe(include=['object', 'bool'])

        State International plan
count    3333               3333
unique     51                  2
top        WV                 No
freq      106               3010

For object and bool features, view how many times each distinct value appears in a given column:

df['Churn'].value_counts()
0    2850
1     483
Name: Churn, dtype: int64

Using df['Churn'].value_counts(normalize=True) converts the counts into fractions:
0    0.855086
1    0.144914
Name: Churn, dtype: float64

4. Data type operations

The main pandas dtypes you will encounter are int64, float64, bool, object (strings), datetime64 and category.

Convert a given column to int64: df['Churn'] = df['Churn'].astype('int64')
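
A minimal sketch of inspecting and converting dtypes (the columns here are illustrative):

import pandas as pd

df = pd.DataFrame({"Churn": [True, False, True], "Area code": ["408", "415", "510"]})

print(df.dtypes)   # Churn: bool, Area code: object

df["Churn"] = df["Churn"].astype("int64")            # bool -> 0/1
df["Area code"] = df["Area code"].astype("int64")    # numeric strings -> integers

print(df.dtypes)   # both columns are now int64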

5. Filtering data

Select specific columns: df[['col1','col2','col3']] — the result is still a dataframe

1) Numeric comparisons:

Note: before comparing, always check that the data types on both sides match.
df[df['dists']==123], df[df['dists']>=123], df[df['dists']>123], and so on.
Get the row(s) where the minimum value occurs: df[df['dists']==df['dists'].min()]

If you only want certain columns in the result:
df.loc[df['fromId']=='9c91082ab89ab3d157a4d0174b9c7ffa',['fromName','toName','money','minFormatDt']] (df.ix is deprecated in newer pandas; df.loc does the same job here)
or df[df['fromId']=='9c91082ab89ab3d157a4d0174b9c7ffa'][['fromName','toName','money','minFormatDt']]
For multiple conditions use & and |, wrapping each condition in parentheses, for example:
data1[(data1['fromId']=='004574fe986ed418066a581792a20368') & (data1['minFormatDt']>=1504546860)]

2) Removing duplicate rows

df.drop_duplicates(['name','address','area'])
df.drop_duplicates(subset=None, keep='first', inplace=False)
data.drop_duplicates()  # a row is removed only when all of its elements are identical to another row's
data.drop_duplicates(['a','b'])  # drop duplicates based on the combination of columns 'a' and 'b'; by default the first occurrence is kept, pass keep='last' to keep the last one
Source: https://www.cnblogs.com/zenan/p/8616508.html

3) Getting a specific row

Select by integer position: df.iloc[[2]] (df.iloc[2] also works, returning a Series instead of a dataframe)
Select by the actual index label: df.loc[[2]]
In [2]: df
Out[2]: 
          A         B
0  1.068932 -0.794307
2 -0.470056  1.192211
4 -0.284561  0.756029
6  1.037563 -0.267820
8 -0.538478 -0.800654

In [5]: df.iloc[[2]]
Out[5]: 
          A         B
4 -0.284561  0.756029

In [6]: df.loc[[2]]
Out[6]: 
          A         B
2 -0.470056  1.192211

4) Filling NaN or null values

Fill every NaN with 0:
data.fillna(0)
Fill different columns with different values:
data.fillna({'age':0,'money':1000})
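
A minimal sketch of locating missing values before filling them (the 'age' and 'money' columns are illustrative):

import numpy as np
import pandas as pd

data = pd.DataFrame({'age': [25, np.nan, 31], 'money': [1000.0, 2000.0, np.nan]})

print(data.isnull().sum())   # number of missing values per column

# fillna returns a new dataframe unless inplace=True is passed
filled = data.fillna({'age': 0, 'money': 1000})
print(filled)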

6. Computing statistics with pandas

1) Aggregating with groupby

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html

df.groupby(['fromId','toId']).agg({'money':['count','sum']}).reset_index()

Printing the result of df gives output of the following form:

  fromId toId  money
               count         sum
0    XXX  XXX     15  3000000.00

Another way of writing it gives the result below; the only difference from the above is that here only the money column is aggregated:

df2 = df.groupby(['fromId','toId'])['money'].agg(['count','sum']).reset_index()

  fromId toId  count         sum
0    XXX  XXX     15  3000000.00
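
In pandas 0.25 and later you can also use named aggregation, which gives the result flat, readable column names; a minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({
    'fromId': ['a', 'a', 'b'],
    'toId':   ['x', 'x', 'y'],
    'money':  [100.0, 200.0, 50.0],
})

out = (df.groupby(['fromId', 'toId'])
         .agg(money_count=('money', 'count'), money_sum=('money', 'sum'))
         .reset_index())
print(out)
#   fromId toId  money_count  money_sum
# 0      a    x            2      300.0
# 1      b    y            1       50.0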

7. Applying a function with apply

df.apply(function, axis=...)

import time
def appl(row):
    # 'jyrq' is a transaction-date string such as '20170904'; pad it to a
    # 14-character YYYYMMDDHHMMSS string, replacing stray 't' characters with '0'
    rq = row['jyrq'] + '000000'
    rq = rq.replace('t','0')[0:14]
    # convert to a Unix timestamp
    ts = int(time.mktime(time.strptime(rq, "%Y%m%d%H%M%S")))
    return ts

8. Building a dataframe

Create from a list of lists:

d1 = pd.DataFrame([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
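
You can also supply column names and an index, or build from a dict of columns; a minimal sketch:

import pandas as pd

d1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  columns=['a', 'b', 'c'],
                  index=['r1', 'r2', 'r3'])

d2 = pd.DataFrame({'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})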

9. Transposing rows and columns

dataframe.T is all it takes to transpose a dataframe.

10. Joining dataframes

Remember to call set_index on the right-hand dataframe, otherwise the join is done against its original index by default.

df7 = df6[["id","rank"]].join(df5.set_index("id"),how="left",on="id")
how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default: ‘left’

http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.join.html
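
Much the same result can be obtained with pd.merge, which joins on columns directly without set_index; a minimal sketch (df5/df6 and the 'score' column are made up to mirror the example above):

import pandas as pd

df6 = pd.DataFrame({'id': [1, 2, 3], 'rank': [10, 20, 30]})
df5 = pd.DataFrame({'id': [1, 2], 'score': [0.9, 0.8]})

df7 = pd.merge(df6[['id', 'rank']], df5, on='id', how='left')
print(df7)
#    id  rank  score
# 0   1    10    0.9
# 1   2    20    0.8
# 2   3    30    NaN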

11. Sorting

sort_values

ids.sort_values(by="index", axis=0, ascending=True)

https://blog.csdn.net/wendaomudong_l2d4/article/details/80648633

12. Sampling

# take k rows
df.sample(n=k)
# sample a fraction of the rows
df.sample(frac=0.5)

13. Converting a time string to a timestamp by applying a function across rows

df3['ndate'] = df3.apply(appl, axis=1)
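
A self-contained sketch of the whole flow, reusing the appl function from section 7 (the 'jyrq' values here are made up):

import time
import pandas as pd

def appl(row):
    # pad the date string to YYYYMMDDHHMMSS and convert to a Unix timestamp
    rq = (row['jyrq'] + '000000').replace('t', '0')[0:14]
    return int(time.mktime(time.strptime(rq, "%Y%m%d%H%M%S")))

df3 = pd.DataFrame({'jyrq': ['20170904', '20170905']})
df3['ndate'] = df3.apply(appl, axis=1)
print(df3)   # ndate holds Unix timestamps; the exact values depend on the local timezone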
