I have created a co-occurrence matrix as follows using pandas.
import pandas as pd
import numpy as np
lst = [
    ['a', 'b'],
    ['b', 'c', 'd', 'e', 'e'],
    ['a', 'd', 'e'],
    ['b', 'e']
]
u = (pd.get_dummies(pd.DataFrame(lst), prefix='', prefix_sep='')
       .groupby(level=0, axis=1)
       .sum())
v = u.T.dot(u)
v.values[(np.r_[:len(v)], ) * 2] = 0
print(v)
The output is as follows.
   a  b  c  d  e
a  0  1  0  1  1
b  1  0  1  1  3
c  0  1  0  1  2
d  1  1  1  0  3
e  1  3  2  3  0
I would like to convert the above dataframe into (x, y) pairs. As you can see, the output matrix is symmetric (i.e., the part above the diagonal mirrors the part below it). Therefore, I am happy to get the (x, y) pairs from only one half (e.g., the upper part).
So, for the above matrix the output should be the following (i.e., the (x, y) pairs whose value is greater than zero):
[('a','b'), ('a', 'd'), ('a','e'), ('b', 'c'), ('b', 'd'), ('b', 'e'),
('c', 'd'), ('c', 'e'), ('d', 'e')]
Is it possible to perform this in pandas?
I am happy to provide more details if needed.
Solution
You can try np.where:
arr = np.where(v>=1)
corrs = [(v.index[x], v.columns[y]) for x, y in zip(*arr)]
corrs
[('a', 'b'),
('a', 'd'),
('a', 'e'),
('b', 'a'),
('b', 'c'),
('b', 'd'),
('b', 'e'),
('c', 'b'),
('c', 'd'),
('c', 'e'),
('d', 'a'),
('d', 'b'),
('d', 'c'),
('d', 'e'),
('e', 'a'),
('e', 'b'),
('e', 'c'),
('e', 'd')]
Then you can filter the list:
final_arr = []
for x, y in corrs:
    # keep (x, y) only if its mirror image (y, x) has not been kept already
    if (y, x) not in final_arr:
        final_arr.append((x, y))
final_arr
[('a', 'b'),
('a', 'd'),
('a', 'e'),
('b', 'c'),
('b', 'd'),
('b', 'e'),
('c', 'd'),
('c', 'e'),
('d', 'e')]
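As an alternative to filtering afterwards, the lower half can be masked out before looking for non-zero entries. The following is only a sketch, not part of the original answer: it rebuilds the matrix `v` by hand from the output shown in the question so that it runs on its own, and it uses `np.triu` with `k=1` to keep just the entries strictly above the diagonal. The names `labels`, `upper` and `pairs` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: v is the symmetric co-occurrence matrix printed in the question,
# re-created here by hand so the example is self-contained.
labels = ['a', 'b', 'c', 'd', 'e']
v = pd.DataFrame(
    [[0, 1, 0, 1, 1],
     [1, 0, 1, 1, 3],
     [0, 1, 0, 1, 2],
     [1, 1, 1, 0, 3],
     [1, 3, 2, 3, 0]],
    index=labels, columns=labels)

# Zero out the diagonal and everything below it.
upper = np.triu(v.to_numpy(), k=1)

# Positions of the remaining non-zero entries, in row-major order.
rows, cols = np.nonzero(upper)

# Map the integer positions back to the row and column labels.
pairs = [(v.index[r], v.columns[c]) for r, c in zip(rows, cols)]
print(pairs)
```

Because each pair is read from one half of the matrix only, no duplicate check is needed, which avoids the repeated list lookups of the loop above when the matrix is large.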