Generating unique strings in Python: getting the list of unique string values for each column of a DataFrame

Here I go with another question.

I have a large dataframe of about 20 columns by 400,000 rows. The dataset cannot contain strings, since the software that will process the data only accepts numeric values and nulls.

So the way I think it might work is the following:

1. Go through each column

2. Get the list of unique strings

3. Replace each string with a value from 0 to X

4. Repeat the process for the next column

5. Repeat for the next dataframe
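
The steps above can be sketched as a plain loop over the columns. This is only a minimal illustration, assuming a small hypothetical sample frame that mirrors the columns in the question; the `mappings` dictionary is a name introduced here, not something from the original post.

```python
import pandas as pd

# Hypothetical sample frame mirroring the columns shown in the question
df = pd.DataFrame({
    "FNRHP306H": ["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"],
    "FNRHP306HC": ["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"],
    "FNRHP306_2MEC_MAX": [1050] * 6,
})

# column name -> unique strings, in order of first appearance
mappings = {}

for col in df.columns:
    # Leave columns that are already numeric untouched
    if pd.api.types.is_numeric_dtype(df[col]):
        continue
    # codes[i] is the position of df[col][i] within uniques
    codes, uniques = pd.factorize(df[col])
    df[col] = codes
    mappings[col] = list(uniques)

print(df)
print(mappings)
```

Here each column gets its own set of codes, so the same string may map to different numbers in different columns. Note that `pd.factorize` assigns the code -1 to missing values by default, so those would need to be turned back into nulls if the downstream software expects them.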

This is what the dataframe looks like:

DATE TIME FNRHP306H FNRHP306HC FNRHP306_2MEC_MAX

7-Feb-15 0:00:00 NORMAL NORMAL 1050

7-Feb-15 0:01:00 NORMAL NORMAL 1050

7-Feb-15 0:02:00 NORMAL HIGH 1050

7-Feb-15 0:03:00 HIGH NORMAL 1050

7-Feb-15 0:04:00 LOW NORMAL 1050

7-Feb-15 0:05:00 NORMAL LOW 1050

This is the expected result:

DATE TIME FNRHP306H FNRHP306HC FNRHP306_2MEC_MAX

7-Feb-15 0:00:00 0 0 1050

7-Feb-15 0:01:00 0 0 1050

7-Feb-15 0:02:00 0 1 1050

7-Feb-15 0:03:00 1 0 1050

7-Feb-15 0:04:00 2 0 1050

7-Feb-15 0:05:00 0 2 1050


I am using Python 3.5 and the latest version of pandas.

Thanks in advance

JV

Solution:

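import pandas as pd

# try to convert all columns to numbers; columns that cannot be converted are left unchanged
# (note: errors='ignore' is deprecated in recent pandas releases)

df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))

# select the FNR* columns that still hold strings (object dtype)

cols = df.filter(like='FNR').select_dtypes(include=['object']).columns

# stack them into one long column so that all values share a single set of codes

st = df[cols].stack().to_frame('name')

# factorize: each distinct string gets an integer code in order of first appearance

st['cat'] = pd.factorize(st.name)[0]

# reshape the codes back to the original layout and overwrite the string columns

df[cols] = st['cat'].unstack()

del st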
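
On newer pandas versions where `errors='ignore'` is deprecated, the first step can be written by catching the conversion error explicitly. The following is a sketch under that assumption; the helper name `to_numeric_if_possible` and the two-column sample frame are hypothetical, chosen only for illustration.

```python
import pandas as pd

def to_numeric_if_possible(col):
    """Return the column converted to numbers, or unchanged if conversion fails."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

# Hypothetical frame: "a" holds numeric strings, "b" holds labels
df = pd.DataFrame({"a": ["1", "2", "3"], "b": ["NORMAL", "HIGH", "LOW"]})
df = df.apply(to_numeric_if_possible)

print(df.dtypes)
```

Column `a` ends up numeric, while column `b` is returned as-is and can then be factorized as in the solution.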

Demo:

In [233]: df

Out[233]:

DATE TIME FNRHP306H FNRHP306HC FNRHP306_2MEC_MAX

0 7-Feb-15 0:00:00 NORMAL NORMAL 1050

1 7-Feb-15 0:01:00 NORMAL NORMAL 1050

2 7-Feb-15 0:02:00 NORMAL HIGH 1050

3 7-Feb-15 0:03:00 HIGH NORMAL 1050

4 7-Feb-15 0:04:00 LOW NORMAL 1050

5 7-Feb-15 0:05:00 NORMAL LOW 1050

First we stack all object (string) columns:

In [235]: cols = df.filter(like='FNR').select_dtypes(include=['object']).columns

In [236]: st = df[cols].stack().to_frame('name')

Now we can factorize the stacked column:

In [238]: st['cat'] = pd.factorize(st.name)[0]

In [239]: st

Out[239]:

name cat

0 FNRHP306H NORMAL 0

FNRHP306HC NORMAL 0

1 FNRHP306H NORMAL 0

FNRHP306HC NORMAL 0

2 FNRHP306H NORMAL 0

FNRHP306HC HIGH 1

3 FNRHP306H HIGH 1

FNRHP306HC NORMAL 0

4 FNRHP306H LOW 2

FNRHP306HC NORMAL 0

5 FNRHP306H NORMAL 0

FNRHP306HC LOW 2
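
`pd.factorize` returns two things: the integer codes and the unique values they refer to. The solution keeps only the codes (`[0]`), but the second element is the list of unique strings the question asks about. A minimal sketch, assuming a hypothetical frame with just the two string columns from the demo; `code_to_label` is a name introduced here:

```python
import pandas as pd

# Hypothetical frame holding only the two string columns from the demo
df = pd.DataFrame({
    "FNRHP306H": ["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"],
    "FNRHP306HC": ["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"],
})

# Stack into one long Series (row by row), then factorize it
stacked = df.stack()
codes, uniques = pd.factorize(stacked)

# Lookup table from integer code back to the original string
code_to_label = dict(enumerate(uniques))

print(list(uniques))
print(code_to_label)
```

Keeping this lookup table makes it possible to translate the numeric codes back to the original labels after the external software has processed the data.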

Finally, assign the unstacked result back to the original DataFrame, overwriting the object columns:

In [241]: df[cols] = st['cat'].unstack()

In [242]: df

Out[242]:

DATE TIME FNRHP306H FNRHP306HC FNRHP306_2MEC_MAX

0 7-Feb-15 0:00:00 0 0 1050

1 7-Feb-15 0:01:00 0 0 1050

2 7-Feb-15 0:02:00 0 1 1050

3 7-Feb-15 0:03:00 1 0 1050

4 7-Feb-15 0:04:00 2 0 1050

5 7-Feb-15 0:05:00 0 2 1050

Explanation:

In [248]: df.filter(like='FNR')

Out[248]:

FNRHP306H FNRHP306HC FNRHP306_2MEC_MAX

0 NORMAL NORMAL 1050

1 NORMAL NORMAL 1050

2 NORMAL HIGH 1050

3 HIGH NORMAL 1050

4 LOW NORMAL 1050

5 NORMAL LOW 1050

In [249]: df.filter(like='FNR').select_dtypes(include=['object'])

Out[249]:

FNRHP306H FNRHP306HC

0 NORMAL NORMAL

1 NORMAL NORMAL

2 NORMAL HIGH

3 HIGH NORMAL

4 LOW NORMAL

5 NORMAL LOW
