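Consider the dataframe df, in which each row holds a random number of comma-separated strings.

import numpy as np
import pandas as pd

np.random.seed([3,1415])
k = 10
# join each row of random characters into one string, trim edge commas, collapse comma runs
df = pd.DataFrame(
    np.random.choice(list('ABCD,'), (k, 20))
).sum(1).str.strip(',').str.replace(',+', ',', regex=True).to_frame('col1')
df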
                   col1
0  ADCDCCDCDACAA,ACCA,B
1      DC,DDD,DBDA,CCAC
2    A,B,CCAC,DB,C,CD,D
3   ADDBAA,DA,BD,C,AACA
4   DADBB,D,DBD,ADCAADB
5  CBCBA,CA,B,AA,CDCBDB
6  BD,D,DDB,AC,B,C,ABBA
7  C,CABBBADCD,DBCC,ACD
8    CC,A,BCAAAACBBA,BD
9  AC,A,ADBBD,BDCCDDABD
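
To make the cleanup step in the construction explicit, here is a minimal sketch of what `.str.strip(',')` followed by `.str.replace(',+', ',')` does to a few hand-written strings. The series `raw` is a hypothetical input, not data from the answer. I pass `regex=True` on the assumption of a recent pandas, where `str.replace` treats the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical raw strings with leading/trailing commas and comma runs.
raw = pd.Series([",AB,,C,", "D,,,A", "ABCD"])

# strip(',') trims commas at both ends; the regex ',+' collapses any
# run of commas into a single separator.
clean = raw.str.strip(",").str.replace(",+", ",", regex=True)

print(clean.tolist())
```

After this step every string can be split on ',' without producing empty pieces.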
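I like to use numpy's string functions for the split:
df.assign(col1=np.core.defchararray.split(df.col1.values.astype(str), ','))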
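
As a small illustration of the numpy-based split, the sketch below runs it on a three-row frame. The frame `demo` and its values are made up for the example. I use `np.char.split`, which I believe is the public alias of `np.core.defchararray.split`; treat that substitution as an assumption rather than the answer's exact spelling.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the random df.
demo = pd.DataFrame({"col1": ["AD,B", "C", "A,B,CD"]})

# The numpy string routines expect a unicode array, hence astype(str).
values = demo["col1"].to_numpy().astype(str)

# The result is an object array holding one Python list per row.
parts = np.char.split(values, ",")

# assign returns a new frame and leaves demo itself unchanged.
result = demo.assign(col1=parts)
print(result["col1"].tolist())
```

Each cell of `result.col1` is then a list of the pieces, the same shape of result that `Series.str.split(',')` gives.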
It is fast on small data:
%timeit df.assign(col1=np.core.defchararray.split(df.col1.values.astype(str), ','))
1000 loops, best of 3: 204 µs per loop
%timeit df.assign(col1=df['col1'].str.split(','))
1000 loops, best of 3: 327 µs per loop
%timeit df.assign(col1=[x.split(',') for x in df['col1'].values.tolist()])
1000 loops, best of 3: 210 µs per loop
It does not do as well on large data:
np.random.seed([3,1415])
k = 10000
df = pd.DataFrame(
    np.random.choice(list('ABCD,'), (k, 100))
).sum(1).str.strip(',').str.replace(',+', ',', regex=True).to_frame('col1')
%timeit df.assign(col1=np.core.defchararray.split(df.col1.values.astype(str), ','))
10 loops, best of 3: 19.6 ms per loop
%timeit df.assign(col1=df['col1'].str.split(','))
100 loops, best of 3: 13.5 ms per loop
%timeit df.assign(col1=[x.split(',') for x in df['col1'].values.tolist()])
100 loops, best of 3: 11.5 ms per loop
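
The `%timeit` lines above only work inside IPython. A rough way to repeat the comparison in a plain script is sketched below with the standard-library `timeit` module. The frame `big`, its size, the seed, and the names in `candidates` are all assumptions made for the sketch, so the numbers it prints will not match the ones quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, chosen for the sketch

# Hypothetical test data: 1000 rows, each with four 5-letter pieces.
rows = [
    ",".join("".join(rng.choice(list("ABCD"), 5)) for _ in range(4))
    for _ in range(1000)
]
big = pd.DataFrame({"col1": rows})

# The three approaches compared in the timings, without the assign call.
candidates = {
    "np.char.split": lambda: np.char.split(
        big["col1"].to_numpy().astype(str), ","
    ),
    "Series.str.split": lambda: big["col1"].str.split(","),
    "list comprehension": lambda: [x.split(",") for x in big["col1"].tolist()],
}

# Normalise every result to a list of lists so they can be compared.
outputs = {name: [list(x) for x in fn()] for name, fn in candidates.items()}

for name, fn in candidates.items():
    # Best of 3 repeats, 10 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{name}: {seconds * 1e3:.3f} ms per loop")
```

Checking `outputs` first confirms the three candidates produce the same pieces, so the timing compares equivalent work.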