Pandas实用小功能

Pandas实用小功能

import pandas as pd
import numpy as np

1、读取时抽样百分比

data = pd.read_csv(r'..\test.csv',encoding='utf-8',
                  skiprows = lambda x: x>0 and np.random.rand() > 0.5)
data.head()
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
08942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQ
18963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NaNS
28973Svensson, Mr. Johan Cervinmale14.00075389.2250NaNS
38983Connolly, Miss. Katefemale30.0003309727.6292NaNQ
49003Abrahim, Mrs. Joseph (Sophie Halaut Easu)female18.00026577.2292NaNC

2、Pandas 空值检查

  • 哪列有空值
data.isnull().any()
PassengerId    False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare            True
Cabin           True
Embarked       False
dtype: bool
  • 有多少空值
data.isnull().sum()
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             39
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          146
Embarked         0
dtype: int64

3、replace 做清洗

data['Name'] = data['Name'].replace('[()]','',regex=True).astype('str')
data.head()
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
08942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQ
18963Hirvonen, Mrs. Alexander Helga E Lindqvistfemale22.011310129812.2875NaNS
28973Svensson, Mr. Johan Cervinmale14.00075389.2250NaNS
38983Connolly, Miss. Katefemale30.0003309727.6292NaNQ
49003Abrahim, Mrs. Joseph Sophie Halaut Easufemale18.00026577.2292NaNC

4、使用 apply(type) 做类型检查

data['Age'].apply(type)
0      <class 'float'>
1      <class 'float'>
2      <class 'float'>
3      <class 'float'>
4      <class 'float'>
            ...       
189    <class 'float'>
190    <class 'float'>
191    <class 'float'>
192    <class 'float'>
193    <class 'float'>
Name: Age, Length: 194, dtype: object

5、重新排序列

  • 方法1
data1 = data[['Name','Age','Sex']]
data1.head()
NameAgeSex
0Myles, Mr. Thomas Francis62.0male
1Hirvonen, Mrs. Alexander Helga E Lindqvist22.0female
2Svensson, Mr. Johan Cervin14.0male
3Connolly, Miss. Kate30.0female
4Abrahim, Mrs. Joseph Sophie Halaut Easu18.0female
  • 方法2
cols = data.columns[[2,4,3]]
data2 = data[cols]
data2.head()
NameAgeSex
0Myles, Mr. Thomas Francis62.0male
1Hirvonen, Mrs. Alexander Helga E Lindqvist22.0female
2Svensson, Mr. Johan Cervin14.0male
3Connolly, Miss. Kate30.0female
4Abrahim, Mrs. Joseph Sophie Halaut Easu18.0female

6、标签和位置选择数据

  • iloc
# 选择前两行
data.iloc[:2]
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
08942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQ
18963Hirvonen, Mrs. Alexander Helga E Lindqvistfemale22.011310129812.2875NaNS
# 选择两列
data.iloc[:,1:3]
PclassName
02Myles, Mr. Thomas Francis
13Hirvonen, Mrs. Alexander Helga E Lindqvist
23Svensson, Mr. Johan Cervin
33Connolly, Miss. Kate
43Abrahim, Mrs. Joseph Sophie Halaut Easu
.........
1893Riordan, Miss. Johanna Hannah""
1903Peacock, Miss. Treasteall
1913Naughton, Miss. Hannah
1921Minahan, Mrs. William Edward Lillian E Thorpe
1933Henriksson, Miss. Jenny Lovisa

194 rows × 2 columns

# 特定区域
data.iloc[2:5,2:5]
NameSexAge
2Svensson, Mr. Johan Cervinmale14.0
3Connolly, Miss. Katefemale30.0
4Abrahim, Mrs. Joseph Sophie Halaut Easufemale18.0

7、找到出现频率最高的值TOPn–nlargest(n)

data['Title'] = data['Name'].str.split(',',expand=True)[1].str.split('.',expand=True)[0]
data['Title'].value_counts().nlargest(3)
 Mr      107
 Miss     40
 Mrs      34
Name: Title, dtype: int64

8、结合使用 where 和 isin

称谓名称top3

top3 = data['Title'].value_counts().nlargest(3).index
top3
Index([' Mr', ' Miss', ' Mrs'], dtype='object')

其他的改为【others】

data1 = data.where(data['Title'].isin(top3),other='others')
data1['Title'].value_counts()
 Mr       107
 Miss      40
 Mrs       34
others     13
Name: Title, dtype: int64

9、小分类值的替换

  • 统计频次,并归一化
dt = data['Title'].value_counts(normalize=True)
dt
 Mr        0.551546
 Miss      0.206186
 Mrs       0.175258
 Master    0.051546
 Ms        0.005155
 Rev       0.005155
 Dr        0.005155
Name: Title, dtype: float64
flag = 0.05
dt_others = dt[dt<flag].index
dt_others
Index([' Ms', ' Rev', ' Dr'], dtype='object')
  • 替换频次低的类别
data['Title'] = data['Title'].replace(dt_others,'others')
data['Title'].value_counts()
 Mr        107
 Miss       40
 Mrs        34
 Master     10
others       3
Name: Title, dtype: int64

10、生成时间序列的数据集

pd.util.testing.makeTimeDataFrame(10)
ABCD
2000-01-03-1.6401772.2869630.6638830.730500
2000-01-04-0.9629630.4687830.420607-2.049439
2000-01-051.1761480.3363260.275701-1.546738
2000-01-06-0.3469420.9880840.6386902.542611
2000-01-071.464977-0.4253120.4528350.555940
2000-01-101.4685640.102388-1.6941360.758442
2000-01-11-0.741469-0.0372440.422585-0.300642
2000-01-12-0.7491060.286984-0.0236010.093908
2000-01-13-2.0533040.408734-0.6020612.245386
2000-01-140.2872701.0103021.3197850.667926

11、时间数据下采样

df = pd.DataFrame(np.random.randint(1,10,size=(100000,3)),columns = ['商品编码','商品销量','商品库存'])
df.index = pd.util.testing.makeDateIndex(100000,freq='H')
df.head()
商品编码商品销量商品库存
2000-01-01 00:00:00791
2000-01-01 01:00:00967
2000-01-01 02:00:00681
2000-01-01 03:00:00761
2000-01-01 04:00:00325

使用 resample 方法,合并为天(D):

day_df = df.resample("D")["商品销量"].sum().to_frame()
day_df.head()
商品销量
2000-01-01135
2000-01-02117
2000-01-03124
2000-01-04106
2000-01-05121
m_df = df.resample("M")['商品销量'].sum().to_frame()
m_df.head()
商品销量
2000-01-313767
2000-02-293473
2000-03-313818
2000-04-303538
2000-05-313674
y_df = df.resample("Y")['商品销量'].sum().to_frame()
y_df.head()
商品销量
2000-12-3143720
2001-12-3143705
2002-12-3143642
2003-12-3144268
2004-12-3143867

12、转 datetime

df['datetime'] = pd.to_datetime(df.index,format='%Y/%m/%d')
df.head()
商品编码商品销量商品库存datetime
2000-01-01 00:00:007912000-01-01 00:00:00
2000-01-01 01:00:009672000-01-01 01:00:00
2000-01-01 02:00:006812000-01-01 02:00:00
2000-01-01 03:00:007612000-01-01 03:00:00
2000-01-01 04:00:003252000-01-01 04:00:00
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 100000 entries, 2000-01-01 00:00:00 to 2011-05-29 15:00:00
Freq: H
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   商品编码      100000 non-null  int32         
 1   商品销量      100000 non-null  int32         
 2   商品库存      100000 non-null  int32         
 3   datetime  100000 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int32(3)
memory usage: 2.7 MB

13、map 做特征工程

d = {
"sex":["male", "female", "male","female"], 
"color":["red", "green", "blue","green"], 
"age":[25, 30, 15, 32]
}

df = pd.DataFrame(d)
df
sexcolorage
0malered25
1femalegreen30
2maleblue15
3femalegreen32
d = {"red": 0, "green": 1,'blue':2}
d
{'red': 0, 'green': 1, 'blue': 2}
df['color2'] = df['color'].map(d) 
df
sexcoloragecolor2
0malered250
1femalegreen301
2maleblue152
3femalegreen321

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值