PCA or LSA on a very large dataset

Eugene Kogan on 30 Aug 2012

https://la.mathworks.com/matlabcentral/answers/47020-pca-or-lsa-on-a-very-large-dataset

I am trying to run LSA or PCA on a very large dataset, 50k docs by 300k terms, to reduce the dimensionality of the words.

My system runs out of memory and grinds to a halt. I am using this code on the TF-IDF matrix to compute the LSA, and it gets stuck on the SVD call:

   % SVD decomposition of tf-idf matrix
   [ U S V ] = svd(tfidfmatrix);
   % Generate new rank reduced matrix of rank k
   Sk = S(1:K,1:K);
   Uk = U(:,1:K);
   output = inv(Sk)*Uk';

For PCA, I am using PRINCOMP. In both cases, it seems I have too many terms for my system to handle.

Is there a more efficient way to do dimensionality reduction for this many words? The end goal is to visualize the documents in 2D or 3D, grouped by their similarity to each other.
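
For reference, here is a minimal sketch of one common way to avoid the full decomposition: a truncated SVD computed with svds on a sparse matrix. It assumes the TF-IDF matrix is stored as sparse with documents in rows and terms in columns. The sizes, the random sprand data, and the names docCoords, newDoc and newCoords are hypothetical stand-ins so the sketch runs on its own; they are not taken from the original code.

```matlab
% Hypothetical stand-in for the real data: a sparse docs-by-terms matrix.
% sprand is used here only so that the sketch is self-contained.
nDocs  = 2000;
nTerms = 5000;
tfidfmatrix = sprand(nDocs, nTerms, 0.001);

K = 3;  % number of latent dimensions to keep (3 for a 3-D plot)

% svds computes only the K largest singular triplets and accepts sparse
% input, so the full U, S and V are never formed.
[Uk, Sk, Vk] = svds(tfidfmatrix, K);

% Document coordinates in the K-dimensional latent space (nDocs-by-K).
docCoords = Uk * Sk;

% Projecting a document (a 1-by-nTerms TF-IDF row) into the same space.
newDoc    = tfidfmatrix(1, :);
newCoords = full(newDoc * Vk);   % 1-by-K

% Plot the documents in 3-D.
scatter3(docCoords(:,1), docCoords(:,2), docCoords(:,3), 5, 'filled');
```

This is LSA rather than PCA: the matrix is not mean-centered, because subtracting the column means would turn the sparse matrix into a dense one. For scale, a dense 50k-by-300k matrix of doubles holds 1.5e10 entries, about 120 GB, before any decomposition is attempted.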