Fit a Latent Semantic Analysis (LSA) model to a collection of documents.
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by spaces. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
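To confirm the tokenization, you can preview a few of the documents by indexing into the array, for example:
documents(1:3)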
Create a bag-of-words model using bagOfWords.
bag = bagOfWords(documents)
bag =
  bagOfWords with properties:

          Counts: [154×3092 double]
      Vocabulary: [1×3092 string]
        NumWords: 3092
    NumDocuments: 154
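Before fitting the LSA model, you can explore the bag-of-words model. For example, view the five most frequent words using the topkwords function:
topkwords(bag,5)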
Fit an LSA model with 20 components.
numComponents = 20;
mdl = fitlsa(bag,numComponents)
mdl =
  lsaModel with properties:

              NumComponents: 20
           ComponentWeights: [1×20 double]
             DocumentScores: [154×20 double]
                 WordScores: [3092×20 double]
                 Vocabulary: [1×3092 string]
    FeatureStrengthExponent: 2
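The WordScores and Vocabulary properties let you examine which words load most strongly on a given component. As a minimal sketch (the choice of the first component and the top ten words is arbitrary), list the words with the highest scores in component 1:
% Sort the word scores for component 1 in descending order
[~,idx] = sort(mdl.WordScores(:,1),'descend');
% Words most strongly associated with the first component
mdl.Vocabulary(idx(1:10))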
Transform new documents into the lower-dimensional space using the LSA model.
newDocuments = tokenizedDocument([
    "what's in a name? a rose by any other name would smell as sweet."
    "if music be the food of love, play on."]);
dscores = transform(mdl,newDocuments)
dscores = 2×20
0.1338 0.1623 0.1680 -0.0541 -0.2464 -0.0134 -0.2604 -0.0205 -0.1127 0.0627 0.3311 -0.2327 0.1689 -0.2695 0.0228 0.1241 0.1198 0.2535 -0.0607 0.0305
0.2547 0.5576 -0.0095 0.5660 -0.0643 -0.1236 0.0082 0.0522 0.0690 -0.0330 0.0385 0.0803 -0.0373 0.0384 -0.0005 0.1943 0.0207 0.0278 0.0001 -0.0469
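The rows of dscores are vector representations of the new documents, so you can compare documents using standard vector measures. As a minimal sketch, compute the cosine similarity between the two new documents from their score vectors:
% Cosine similarity between the two document score vectors
similarity = dot(dscores(1,:),dscores(2,:)) / ...
    (norm(dscores(1,:))*norm(dscores(2,:)))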