I represent each XML document as a feature matrix in csr_matrix format. With around 3000 XML documents, I now have a list of csr_matrices. I want to flatten each of these matrices into a feature vector, then combine all of the feature vectors into one csr_matrix that represents all the XML documents, where each row is a document and each column is a feature.
One way to achieve this is with the following code:
X = csr_matrix([a.toarray().ravel().tolist() for a in ls])
where ls is the list of csr_matrices. However, this is highly inefficient: with 3000 documents, it simply crashes!
In other words, my question is: how can I flatten each csr_matrix in the list 'ls' without converting it to a dense array, and how can I stack the flattened csr_matrices into a single csr_matrix?
Please note that I am using Python with SciPy.
Thanks in advance!
Solution
Why use csr_matrix for each XML document? It may be better to use lil_matrix, which supports the reshape method. Here is an example:
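import numpy as np
from scipy import sparse

N, M, K = 100, 200, 300
matrixs = [sparse.rand(N, M, format="csr") for i in range(K)]
# convert each matrix to LIL and flatten it into a single 1 x (N*M) row
matrixs2 = [m.tolil().reshape((1, N*M)) for m in matrixs]
# stack the flattened rows and convert the result to CSR
m1 = sparse.vstack(matrixs2).tocsr()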
# test with dense array
#m2 = np.vstack([m.toarray().reshape(-1) for m in matrixs])
#np.allclose(m1.toarray(), m2)
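If the per-matrix conversion to LIL turns out to be a bottleneck, another option is to build the combined matrix from COO coordinates directly. The following is only a sketch, not a tested drop-in replacement: the helper name flatten_and_stack is made up for illustration, and it assumes that the list is non-empty and that every matrix has the same shape. Each nonzero entry at (row, col) is mapped to the column row * n_cols + col, which matches the row-major order used by ravel(), and the document index becomes the row.

```python
import numpy as np
from scipy import sparse


def flatten_and_stack(matrices):
    """Flatten each sparse matrix in row-major order and stack the results as rows.

    Hypothetical helper: assumes a non-empty list of equally shaped matrices.
    """
    n_rows, n_cols = matrices[0].shape
    doc_idx, flat_idx, values = [], [], []
    for i, m in enumerate(matrices):
        coo = m.tocoo()
        # Row-major flat position of each stored entry, as in ndarray.ravel()
        flat_idx.append(coo.row.astype(np.int64) * n_cols + coo.col)
        # Every entry of document i lands in row i of the result
        doc_idx.append(np.full(len(coo.data), i, dtype=np.int64))
        values.append(coo.data)
    stacked = sparse.coo_matrix(
        (np.concatenate(values),
         (np.concatenate(doc_idx), np.concatenate(flat_idx))),
        shape=(len(matrices), n_rows * n_cols),
    )
    return stacked.tocsr()


# Small random example standing in for the list 'ls' from the question
ls = [sparse.random(100, 200, density=0.01, format="csr") for _ in range(30)]
X = flatten_and_stack(ls)
```

Only the stored entries are touched here, so no dense array is created at any point. As with the LIL version, the result can be checked on a small input against np.vstack([a.toarray().ravel() for a in ls]).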