I'm trying to remove entries from a data frame that occur fewer than 100 times.
The data frame data looks like this:
pid tag
1 23
1 45
1 62
2 24
2 45
3 34
3 25
3 62
Now I count the number of tag occurrences like this:
bytag = data.groupby('tag').aggregate(np.count_nonzero)
But then I can't figure out how to remove the entries that have a low count.
Solution
Edit: Thanks to @WesMcKinney for showing this much more direct way:
data[data.groupby('tag').pid.transform(len) > 1]
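The one-liner above keeps tags that occur at least twice, which suits the small sample. A minimal sketch of the same idea with the threshold as a parameter is shown here. The helper name `keep_frequent` and its arguments are hypothetical, and it uses `transform("size")` in place of `transform(len)`, which should give the same group sizes:

```python
import pandas as pd


def keep_frequent(df, column, min_count):
    """Return the rows of df whose value in `column` occurs at least min_count times.

    Hypothetical helper for illustration, not part of pandas.
    """
    # Size of the group each row belongs to, aligned with df's index.
    group_sizes = df.groupby(column)[column].transform("size")
    return df[group_sizes >= min_count]


data = pd.DataFrame({
    "pid": [1, 1, 1, 2, 2, 3, 3, 3],
    "tag": [23, 45, 62, 24, 45, 34, 25, 62],
})

# Threshold of 2 for the sample; the question's real data would use 100.
filtered = keep_frequent(data, "tag", 2)
print(filtered)
```

For the data in the question, the call would be `keep_frequent(data, 'tag', 100)`.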
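# Original approach: count per tag, then select the matching rows with isin
import pandas
import numpy as np

data = pandas.DataFrame(
    {'pid' : [1,1,1,2,2,3,3,3],
     'tag' : [23,45,62,24,45,34,25,62],
     })

# Count rows per tag (count_nonzero matches the row count only if no pid is 0)
bytag = data.groupby('tag').aggregate(np.count_nonzero)
# Tags occurring at least twice; use >= 100 for the real data
tags = bytag[bytag.pid >= 2].index
# Keep only the rows whose tag is in that set
print(data[data['tag'].isin(tags)])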
yields
   pid  tag
1    1   45
2    1   62
4    2   45
7    3   62
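For comparison, here is a sketch of two other ways to select the same rows, assuming the same sample data. One uses `groupby(...).filter`, the other maps `value_counts` back onto the column. The variable `min_count` is an illustrative name; set it to 100 for the case in the question.

```python
import pandas as pd

data = pd.DataFrame({
    "pid": [1, 1, 1, 2, 2, 3, 3, 3],
    "tag": [23, 45, 62, 24, 45, 34, 25, 62],
})
min_count = 2  # illustrative threshold for the small sample

# Option A: keep whole groups whose size meets the threshold.
by_filter = data.groupby("tag").filter(lambda group: len(group) >= min_count)

# Option B: count each tag once, then look up every row's count.
counts = data["tag"].value_counts()
by_counts = data[data["tag"].map(counts) >= min_count]

print(by_filter)
print(by_counts)
```

Both options keep the original row order and index. Option B avoids calling a Python function per group, which may matter when there are many distinct tags.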