How classification is used
Text classification is used for a number of purposes:
- Spam detection
- Authorship attribution
- Sentiment analysis
- Age and gender identification
- Determining the subject of a document
- Language identification
Understanding sentiment analysis
With sentiment analysis, we are concerned with who holds what type of feeling about a specific product or topic.
Sentiment analysis can be applied to a sentence, a clause, or an entire document.
Further complicating the process, a single sentence or document can express different sentiments about different topics.
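To make the sentence-versus-document point concrete, here is a minimal sketch that scores text against a tiny word list. The class name, the word lists, and the naive sentence splitter are all illustrative assumptions, not part of any NLP library, and real sentiment analyzers are far more sophisticated.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical toy example: the lexicon and splitter below are assumptions
// made up for illustration only.
class SentenceSentimentSketch {
    private static final Set<String> POSITIVE = Set.of("great", "love", "excellent", "good");
    private static final Set<String> NEGATIVE = Set.of("poor", "hate", "terrible", "bad");

    // Scores a unit of text: +1 for each positive word, -1 for each negative word.
    static int score(String text) {
        int score = 0;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (POSITIVE.contains(token)) score++;
            if (NEGATIVE.contains(token)) score--;
        }
        return score;
    }

    // Maps a numeric score to a coarse label.
    static String label(int score) {
        return score > 0 ? "positive" : score < 0 ? "negative" : "neutral";
    }

    // Splits on whitespace that follows '.', '!' or '?' and labels each sentence.
    static Map<String, String> labelSentences(String document) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String sentence : document.split("(?<=[.!?])\\s+")) {
            result.put(sentence, label(score(sentence)));
        }
        return result;
    }

    public static void main(String[] args) {
        String review = "The camera is excellent. The battery life is terrible.";
        labelSentences(review).forEach((s, l) -> System.out.println(l + ": " + s));
        System.out.println("whole document: " + label(score(review)));
    }
}
```

In this sketch the two sentences get opposite labels, while the document as a whole scores as neutral because the two words cancel out. That is one reason the level at which sentiment is measured matters.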
Text classification techniques
There are two basic techniques:
- Rule-based
- Supervised Machine Learning
Rule-based classification uses a combination of words and other attributes organized around expert-crafted rules. These rules can be very effective, but creating them is a time-consuming process.
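As a rough illustration of the rule-based approach, the sketch below scores a message against a handful of hand-written regular-expression rules for spam detection. The rules, weights, threshold, and class name are all made up for this example; a real rule set would be written and tuned by a domain expert.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical rule-based spam detector; every rule and weight here is an
// illustrative assumption, not taken from any library.
class RuleBasedSpamSketch {
    // One hand-written rule: a pattern plus the weight added when it matches.
    static final class Rule {
        final Pattern pattern;
        final int weight;

        Rule(String regex, int weight) {
            this.pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
            this.weight = weight;
        }
    }

    static final List<Rule> RULES = List.of(
            new Rule("\\bfree\\b", 2),
            new Rule("\\bwinner\\b", 3),
            new Rule("click here", 3),
            new Rule("!{2,}", 1),          // two or more exclamation marks in a row
            new Rule("\\bmeeting\\b", -2)  // a word that suggests legitimate mail
    );

    static final int THRESHOLD = 4;

    // Sums the weights of all rules that match somewhere in the text.
    static int score(String text) {
        int total = 0;
        for (Rule rule : RULES) {
            if (rule.pattern.matcher(text).find()) {
                total += rule.weight;
            }
        }
        return total;
    }

    // A message is labeled spam when its score reaches the threshold.
    static String classify(String text) {
        return score(text) >= THRESHOLD ? "spam" : "ham";
    }

    public static void main(String[] args) {
        System.out.println(classify("You are a WINNER!! Click here for a free prize"));
        System.out.println(classify("Agenda for the meeting on Monday"));
    }
}
```

Even in this tiny example, the cost mentioned above is visible: each rule and weight has to be chosen and maintained by hand.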
Supervised Machine Learning (SML) takes a collection of annotated training documents and uses them to create a model, normally called the classifier. There are many different machine learning techniques, including Naive Bayes, Support Vector Machines (SVM), and k-nearest neighbor.
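To show what "training documents in, classifier out" can look like, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written in plain Java. The class name, tokenizer, and training sentences are assumptions for illustration; it is not how OpenNLP, Stanford NLP, or LingPipe implement their classifiers.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical minimal multinomial Naive Bayes; a sketch, not a library API.
class NaiveBayesSketch {
    private final Map<String, Integer> docCounts = new HashMap<>();               // documents per category
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // word counts per category
    private final Map<String, Integer> totalWords = new HashMap<>();              // tokens per category
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercases the text and splits it on non-word characters.
    static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\W+");
    }

    // Adds one annotated training document to the counts.
    void train(String category, String text) {
        totalDocs++;
        docCounts.merge(category, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Returns the category with the highest log-probability, or null if untrained.
    String classify(String text) {
        String best = null;
        double bestLogProb = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            // Log prior: the share of training documents in this category
            double logProb = Math.log(docCounts.get(category) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                int count = counts.getOrDefault(token, 0);
                // Add-one smoothing keeps unseen words from zeroing the probability
                logProb += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (logProb > bestLogProb) {
                bestLogProb = logProb;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("dog", "the dog barks and fetches the ball");
        nb.train("dog", "a loyal dog wags its tail");
        nb.train("cat", "the cat purrs and chases the mouse");
        nb.train("cat", "a quiet cat naps in the sun");
        System.out.println(nb.classify("my dog fetches a ball"));
        System.out.println(nb.classify("the cat naps"));
    }
}
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together, which is why the sketch sums logarithms instead of multiplying raw probabilities.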
Process
// OpenNLP
// Train a classifier.
// Note: this follows the older OpenNLP API; later releases changed some of these signatures.
// Needs java.io.*, opennlp.tools.doccat.* and opennlp.tools.util.*
try (InputStream dataIn = new FileInputStream("en-animal.train");
     OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("en-animal.model")))
{
    // Each line of the training file holds one document: the category, then the text
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
    DoccatModel model = DocumentCategorizerME.train("en", sampleStream);
    // modelOut is closed, and so flushed, when the try block ends
    model.serialize(modelOut);
}
catch (IOException e)
{
    // Handle exceptions
}
// Use the model above to classify a document.
// inputText is assumed to be a String holding the text to classify.
try (InputStream modelIn = new FileInputStream(new File("en-animal.model")))
{
    DoccatModel model = new DoccatModel(modelIn);
    DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
    // One score per category, in category-index order
    double[] outcomes = categorizer.categorize(inputText);
    for (int i = 0; i < categorizer.getNumberOfCategories(); i++)
    {
        String category = categorizer.getCategory(i);
        System.out.println(category + " - " + outcomes[i]);
    }
}
catch (IOException ex)
{
    // Handle exceptions
}
// Stanford NLP
// Train a classifier
ColumnDataClassifier cdc = new ColumnDataClassifier("box.prop");
Classifier<String, String> classifier = cdc.makeClassifier(cdc.readTrainingExamples("box.train"));
// Test the classifier
for (String line : ObjectBank.getLineIterator("box.test", "utf-8"))
{
    Datum<String, String> datum = cdc.makeDatumFromLine(line);
    System.out.println("Datum: {" + line + "}\tPredicted Category: " + classifier.classOf(datum));
}
// Predict (the empty first field presumably stands in for the unknown class label)
String[] sample = {"", "6.90", "9.8", "15.69"};
Datum<String, String> datum = cdc.makeDatumFromStrings(sample);
System.out.println("Category: " + classifier.classOf(datum));
// Stanford NLP pipeline
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
Annotation annotation = new Annotation("Text String");
pipeline.annotate(annotation);
// Labels indexed by the predicted class (0 to 4)
String[] sentimentText = {"Very Negative", "Negative", "Neutral", "Positive", "Very Positive"};
for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class))
{
    // Note: newer CoreNLP releases name this key SentimentAnnotatedTree
    Tree tree = sentence.get(SentimentCoreAnnotations.AnnotatedTree.class);
    int score = RNNCoreAnnotations.getPredictedClass(tree);
    System.out.println(sentimentText[score]);
}
Using LingPipe to classify text
I have actually used LingPipe before. The API documentation on its website is quite clear. It worked well for me on English text, but not on Chinese.
// LingPipe
String[] categories = {"soc.religion.christian", "talk.religion.misc", "alt.atheism", "misc.forsale"};
int nGramSize = 6;
// Create the DynamicLMClassifier
DynamicLMClassifier<NGramProcessLM> classifier = DynamicLMClassifier.createNGramProcess(categories, nGramSize);
String directory = ".../demos";
File trainingDirectory = new File(directory
        + "/data/fourNewsGroups/4news-train");
for (int i = 0; i < categories.length; ++i)
{
    // Each category has its own subdirectory of training files
    File classDir = new File(trainingDirectory, categories[i]);
    String[] trainingFiles = classDir.list();
    // Inner for-loop: train on every file in this category
    for (int j = 0; j < trainingFiles.length; ++j)
    {
        try
        {
            File file = new File(classDir, trainingFiles[j]);
            String text = Files.readFromFile(file, "ISO-8859-1");
            Classification classification = new Classification(categories[i]);
            Classified<CharSequence> classified = new Classified<>(text, classification);
            classifier.handle(classified);
        }
        catch (IOException ex)
        {
            // Handle exceptions
        }
    }
}
// Compile and save the model once, after all categories have been trained
try
{
    AbstractExternalizable.compileTo((Compilable) classifier, new File("classifier.model"));
}
catch (IOException ex)
{
    // Handle exceptions
}
// LingPipe sentiment
// Reuses the variables declared in the previous snippet, so they are reassigned here, not redeclared
categories = new String[2];
categories[0] = "neg";
categories[1] = "pos";
nGramSize = 8;
classifier = DynamicLMClassifier.createNGramProcess(
        categories, nGramSize);
directory = "...";
trainingDirectory = new File(directory, "txt_sentoken");
for (int i = 0; i < categories.length; ++i)
{
    Classification classification = new Classification(categories[i]);
    // Each category ("neg", "pos") has its own subdirectory of reviews
    File file = new File(trainingDirectory, categories[i]);
    File[] trainingFiles = file.listFiles();
    for (int j = 0; j < trainingFiles.length; ++j)
    {
        try
        {
            String review = Files.readFromFile(
                    trainingFiles[j], "ISO-8859-1");
            Classified<CharSequence> classified =
                    new Classified<>(review, classification);
            classifier.handle(classified);
        }
        catch (IOException ex)
        {
            ex.printStackTrace();
        }
    }
}