Classifying Texts and Documents

How Classification is used

Text classification is used for a number of purposes:

  • Spam detection
  • Authorship attribution
  • Sentiment analysis
  • Age and gender identification
  • Determining the subject of a document
  • Language identification

Understanding sentiment analysis

With sentiment analysis, we are concerned with who holds what type of feeling about a specific product or topic.

Sentiment analysis can be applied to a sentence, a clause, or an entire document.

Further complicating the process, different sentiments may be expressed about different topics within a single sentence or document.

Text classifying techniques

There are two basic techniques:

  • Rule-based
  • Supervised Machine Learning

Rule-based classification uses a combination of words and other attributes organized around expert-crafted rules. These rules can be very effective, but creating them is a time-consuming process.
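As a minimal, self-contained sketch (the keyword list and the two-hit rule below are invented purely for illustration), a rule-based spam detector might look like this:

```java
import java.util.Arrays;
import java.util.List;

public class RuleBasedSpamDetector {
    // Expert-crafted keyword list (illustrative only)
    private static final List<String> SPAM_WORDS =
            Arrays.asList("free", "winner", "prize", "urgent");

    // Rule: flag a message as spam if it contains two or more spam keywords
    public static boolean isSpam(String text) {
        String lower = text.toLowerCase();
        long hits = SPAM_WORDS.stream().filter(lower::contains).count();
        return hits >= 2;
    }

    public static void main(String[] args) {
        System.out.println(isSpam("You are a winner! Claim your free prize now")); // true
        System.out.println(isSpam("Meeting moved to 3 pm tomorrow"));              // false
    }
}
```

Real rule sets combine many more signals (headers, punctuation, regular expressions), but the shape is the same: hand-written conditions instead of learned weights.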

Supervised Machine Learning (SML) takes a collection of annotated training documents to create a model. The model is normally called the classifier. There are many different machine learning techniques, including Naive Bayes, Support Vector Machines (SVM), and k-nearest neighbor.
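The supervised workflow — annotated examples in, classifier out — can be sketched without any library. The toy multinomial Naive Bayes below (the class name, method names, and the two training sentences are my own, for illustration) counts word frequencies per category with add-one smoothing and picks the category with the highest log-probability:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TinyNaiveBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Train on one annotated document
    public void train(String category, String text) {
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String word : text.toLowerCase().split("\\W+")) {
            counts.merge(word, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Classify: argmax over log P(category) + sum of log P(word | category)
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            double score = Math.log((double) docCounts.get(category) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int totalWords = counts.values().stream().mapToInt(Integer::intValue).sum();
            for (String word : text.toLowerCase().split("\\W+")) {
                int c = counts.getOrDefault(word, 0);
                // Add-one (Laplace) smoothing for unseen words
                score += Math.log((c + 1.0) / (totalWords + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("cat", "cats purr and chase mice");
        nb.train("dog", "dogs bark and fetch sticks");
        System.out.println(nb.classify("it likes to chase mice")); // prints "cat"
    }
}
```

The libraries below (OpenNLP, Stanford NLP, LingPipe) build on this same idea, adding feature engineering, better probability models, and model serialization.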

Process

// OpenNLP


//Train a classifier
DoccatModel model = null;
try (InputStream dataIn = new FileInputStream("en-animal.train");
OutputStream dataOut = new FileOutputStream("en-animal.model");) 
{
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
    model = DocumentCategorizerME.train("en", sampleStream);


    OutputStream modelOut = new BufferedOutputStream(dataOut);
    model.serialize(modelOut);
} 
catch (IOException e) 
{
// Handle exceptions
}
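DocumentSampleStream expects one training sample per line, with the category label first and the rest of the line as the document text. A file like en-animal.train would therefore look something like this (the two lines are invented examples):

```
dog dogs are loyal companions that bark and fetch
cat cats are independent animals that purr and climb
```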

//Use the model above to classify a document
try (InputStream modelIn = new FileInputStream(new File("en-animal.model"));) 
{
    DoccatModel model = new DoccatModel(modelIn);
    DocumentCategorizerME categorizer = new DocumentCategorizerME(model);

    // inputText holds the String to classify (declared elsewhere)
    double[] outcomes = categorizer.categorize(inputText);
    for (int i = 0; i < categorizer.getNumberOfCategories(); i++) 
    {
        String category = categorizer.getCategory(i);
        System.out.println(category + " - " + outcomes[i]);
    }
} 
catch (IOException ex) 
{
// Handle exceptions
}
// Stanford NLP
//Train a classifier
ColumnDataClassifier cdc = new ColumnDataClassifier("box.prop");
Classifier<String, String> classifier = cdc.makeClassifier(cdc.readTrainingExamples("box.train"));

//test the classifier
for (String line : ObjectBank.getLineIterator("box.test", "utf-8")) 
{
    Datum<String, String> datum = cdc.makeDatumFromLine(line);
    System.out.println("Datum: {" + line + "}\tPredicted Category: " + classifier.classOf(datum));

}

//predict
String[] sample = {"", "6.90", "9.8", "15.69"};
Datum<String, String> datum = cdc.makeDatumFromStrings(sample);
System.out.println("Category: " + classifier.classOf(datum));
// Stanford NLP pipeline
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, parse, sentiment");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

Annotation annotation = new Annotation("Text String");
pipeline.annotate(annotation);
String[] sentimentText = {"Very Negative", "Negative", "Neutral", "Positive", "Very Positive"};

for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) 
{
    Tree tree = sentence.get(SentimentCoreAnnotations.AnnotatedTree.class);

    int score = RNNCoreAnnotations.getPredictedClass(tree);

    System.out.println(sentimentText[score]);
}

Using LingPipe to classify text

I have actually used LingPipe before. The APIs on its website are quite clear. My experience with it was good for processing English, but not for Chinese.

// LingPipe

String[] categories = {"soc.religion.christian", "talk.religion.misc","alt.atheism","misc.forsale"};
int nGramSize = 6;
// Initialize the DynamicLMClassifier
DynamicLMClassifier<NGramProcessLM> classifier = DynamicLMClassifier.createNGramProcess(categories, nGramSize);

String directory = ".../demos";
File trainingDirectory = new File(directory
+ "/data/fourNewsGroups/4news-train");

for (int i = 0; i < categories.length; ++i) 
{
    File classDir = new File(trainingDirectory, categories[i]);
    String[] trainingFiles = classDir.list();
    // Inner for-loop
    for (int j = 0; j < trainingFiles.length; ++j) 
    {
        try 
        {
            File file = new File(classDir, trainingFiles[j]);
            String text = Files.readFromFile(file, "ISO-8859-1");
            Classification classification = new Classification(categories[i]);
            Classified<CharSequence> classified = new Classified<>(text, classification);
            classifier.handle(classified);
        } 
        catch (IOException ex) 
        {
           // Handle exceptions
        }
    }
}

// Compile the trained classifier to a model file once, after all categories are trained
try 
{
    AbstractExternalizable.compileTo((Compilable) classifier, new File("classifier.model"));
} 
catch (IOException ex) 
{
    // Handle exceptions
}
//LingPipe sentiment
categories = new String[] {"neg", "pos"};
nGramSize = 8;
classifier = DynamicLMClassifier.createNGramProcess(categories, nGramSize);


String directory = "...";
File trainingDirectory = new File(directory, "txt_sentoken");
for (int i = 0; i < categories.length; ++i) 
{
    Classification classification = new Classification(categories[i]);
    File file = new File(trainingDirectory,categories[i]);
    File[] trainingFiles = file.listFiles();
    for (int j = 0; j < trainingFiles.length; ++j) 
    {
        try 
        {
            String review = Files.readFromFile(trainingFiles[j], "ISO-8859-1");
            Classified<CharSequence> classified = new Classified<>(review, classification);
            classifier.handle(classified);
        } 
        catch (IOException ex) 
        {
            ex.printStackTrace();
        }
    }
}