gt.csv
301 234 ['ad','bd','cd']
301 235 ['a','b','c']
302 236 ['a1','b1','c1']
301 237 ['af','bf','cf']
303 238 ['a3','b3','c3']
301 239 ['a2','b2','c2']
303 2323 ['a7','b7','c7']
304 230 ['a9','b9','c9']
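Requirement:
Group the rows of gt.csv by the value in the first column, then pull out any chosen sets of those values together with their rows.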
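import pandas as pd

ground_truth = './gt.csv'
# gt.csv has no header row; its columns are separated by whitespace
ground_truth_data = pd.read_csv(ground_truth,
                                names=['queryID', 'termID', 'Context'],
                                sep=r'\s+')  # replaces delim_whitespace=True, deprecated since pandas 2.2
group = ground_truth_data.groupby('queryID')

def get_partData(group, *args):
    # Precondition: ground_truth_data has already been read and grouped with groupby.
    # Each positional argument is a list of queryIDs; one DataFrame is returned per argument.
    args_dict = {}
    for i in range(len(args)):
        args_dict[i] = args[i]
    all_lt = [[] for _ in range(len(args))]
    for g in group:
        queryID = g[0]  # g is a (group key, sub-DataFrame) pair
        for key, value in args_dict.items():
            if queryID in value:
                all_lt[key].append(g[1])
    return [pd.concat(x) for x in all_lt]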
train_lt = [301]
dev_lt = [302]
test_lt = [303, 304]
train_data = get_partData(group, train_lt)
print('First:')
print(train_data)
print('\n')
train_data, dev_data = get_partData(group, train_lt, dev_lt)
print('Second:')
print(train_data)
print(dev_data)
print('\n')
train_data, dev_data, test_data = get_partData(group, train_lt, dev_lt, test_lt)
print('Third:')
print(train_data)
print(dev_data)
print(test_data)
print('\n')
>>>
First:
[ queryID termID Context
0 301 234 ['ad','bd','cd']
1 301 235 ['a','b','c']
3 301 237 ['af','bf','cf']
5 301 239 ['a2','b2','c2']]
Second:
queryID termID Context
0 301 234 ['ad','bd','cd']
1 301 235 ['a','b','c']
3 301 237 ['af','bf','cf']
5 301 239 ['a2','b2','c2']
queryID termID Context
2 302 236 ['a1','b1','c1']
Third:
queryID termID Context
0 301 234 ['ad','bd','cd']
1 301 235 ['a','b','c']
3 301 237 ['af','bf','cf']
5 301 239 ['a2','b2','c2']
queryID termID Context
2 302 236 ['a1','b1','c1']
queryID termID Context
4 303 238 ['a3','b3','c3']
6 303 2323 ['a7','b7','c7']
7 304 230 ['a9','b9','c9']
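For comparison, the same split can be sketched without groupby by filtering with a boolean mask. This is an alternative, not the approach used above: split_by_ids is a hypothetical helper name, and the data is built inline (queryID/termID pairs from gt.csv, Context column omitted) so the snippet runs without the file.

```python
import pandas as pd

# queryID/termID pairs from gt.csv, built inline; Context is left out for brevity.
df = pd.DataFrame({
    'queryID': [301, 301, 302, 301, 303, 301, 303, 304],
    'termID': [234, 235, 236, 237, 238, 239, 2323, 230],
})

def split_by_ids(data, *id_lists):
    # Hypothetical helper: one boolean mask per list of queryIDs.
    return [data[data['queryID'].isin(ids)] for ids in id_lists]

train, dev, test = split_by_ids(df, [301], [302], [303, 304])
print(len(train), len(dev), len(test))  # 4 1 3
```

One behavioural difference: if a list matches no queryID, this version returns an empty DataFrame, whereas pd.concat on an empty list raises a ValueError.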
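Notes:
1. To create several independent empty lists, use
all_lt = [[] for _ in range(len(args))]
which gives [[], [], []] when there are three arguments,
rather than
all_lt = [[] * len(args)]
which gives [[]], because [] * n is still an empty list.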
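A short check of the point above, with n standing in for len(args). The third form, [[]] * n, is not mentioned in the note but is a closely related pitfall: it has the right length, yet every slot refers to the same list.

```python
n = 3

independent = [[] for _ in range(n)]  # three distinct lists
collapsed = [[] * n]                  # [] * n is still [], so only one inner list
aliased = [[]] * n                    # three references to the SAME list

independent[0].append('x')
aliased[0].append('x')

print(independent)  # [['x'], [], []]
print(collapsed)    # [[]]
print(aliased)      # [['x'], ['x'], ['x']]
```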
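2. In the return statement, if the function is always called with more than one list and the result is always unpacked, a generator expression also works:
return (pd.concat(x) for x in all_lt)
With a single list, however, the caller receives a generator object rather than the actual value, which is why the function above returns a list.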
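The difference can be seen in a minimal sketch that uses sum as a stand-in for pd.concat so it runs without pandas. The names as_list and as_generator are hypothetical, chosen only for this illustration.

```python
def as_list(*groups):
    # Mirrors the list-returning version of the function.
    return [sum(g) for g in groups]

def as_generator(*groups):
    # Mirrors the generator-expression version.
    return (sum(g) for g in groups)

a, b = as_generator([1, 2], [3])  # unpacking consumes the generator
single = as_generator([1, 2])     # no unpacking: this is a generator object
(only,) = as_generator([1, 2])    # unpacking one item still yields the value
lst = as_list([1, 2])             # a list holding one value

print(a, b)    # 3 3
print(only)    # 3
print(lst)     # [3]
print(single)  # a generator object, not 3
```

If the generator form is kept, a single-list call would need to be unpacked, for example with (only,) = ..., or wrapped in list(...).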