Combined Approaches

HoiDev

Category: NLP

Article link: https://blog.csdn.net/qq_33938256/article/details/52763833

Preparing data

Text extraction is an early step in most NLP tasks. Here, we will quickly cover how text extraction can be performed for HTML, Word, and PDF documents. Although there are several APIs that support these tasks, we will use:
- Boilerpipe (https://code.google.com/p/boilerpipe/) for HTML
- POI (http://poi.apache.org/index.html) for Word
- PDFBox (http://pdfbox.apache.org/) for PDF

Some APIs support XML for input and output. For example, the Stanford XMLUtils class provides support for reading XML files and manipulating XML data, and LingPipe's XMLParser class parses XML text.

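XStream can also parse XML. I have handled XML myself during data cleaning and preprocessing, and with large inputs it is easy to run into an OutOfMemoryError (OOM). The cause is the copying of the backing byte[] array inside StringBuilder/StringBuffer as the buffer grows. I have not yet investigated whether Stanford XMLUtils and LingPipe XMLParser cope well with large inputs.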
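One way to avoid the memory growth described above is to stream the XML and hand each piece of text on as soon as it is complete, instead of accumulating the whole document in one buffer. The sketch below uses the JDK's StAX API (javax.xml.stream), which is not one of the libraries named in this post. The class name StreamingXmlText, the method forEachElementText, and the element name "text" are hypothetical and chosen only for illustration; it also assumes the target elements are not nested inside each other.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StreamingXmlText {

    // Passes the character content of each <elementName> element to the sink.
    // Only the text of the element currently being read is buffered.
    static void forEachElementText(Reader xml, String elementName, Consumer<String> sink)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        StringBuilder current = null; // non-null only while inside a target element
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    current = new StringBuilder();
                } else if ((event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) && current != null) {
                    current.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals(elementName) && current != null) {
                    sink.accept(current.toString());
                    current = null; // release the buffer before the next element
                }
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<docs><text>Hello</text><skip>no</skip><text>World</text></docs>";
        forEachElementText(new StringReader(xml), "text", System.out::println);
    }
}
```

For a large file, the StringReader would be replaced by a buffered file reader, and the sink would write each text fragment to its destination rather than keep it in memory.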

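//Boilerpipe API
// Imports needed by this fragment
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;

try
{
    // Fetch the page and convert it into Boilerpipe's text document model
    URL url = new URL("http://en.wikipedia.org/wiki/Berlin");
    HTMLDocument htmlDoc = HTMLFetcher.fetch(url);
    InputSource is = htmlDoc.toInputSource();
    TextDocument document = new BoilerpipeSAXInput(is).getTextDocument();
    // Both flags true: print content and non-content text blocks
    System.out.println(document.getText(true, true));
}
catch (MalformedURLException ex)
{
    // Handle exceptions
}
catch (BoilerpipeProcessingException | SAXException | IOException ex)
{
    // Handle exceptions
}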

//POI API
try
{
    // Extract the body text of the Word document
    FileInputStream fis = new FileInputStream("TestDocument.docx");
    POITextExtractor textExtractor = ExtractorFactory.createExtractor(fis);
    System.out.println(textExtractor.getText());

    // Metadata, first approach: ask the text extractor for its metadata extractor
    POITextExtractor metaExtractor = textExtractor.getMetadataTextExtractor();
    System.out.println(metaExtractor.getText());
    fis.close();

    // Metadata, second approach: read the document properties directly
    fis = new FileInputStream("TestDocument.docx");
    POIXMLPropertiesTextExtractor properties =
        new POIXMLPropertiesTextExtractor(new XWPFDocument(fis));
    System.out.println(properties.getText());
    fis.close();
}
catch (IOException ex)
{
    // Handle exceptions
}
catch (OpenXML4JException | XmlException ex)
{
    // Handle exceptions
}

//PDFBox API
// try-with-resources closes the document even if text extraction fails
try (PDDocument pdDocument = PDDocument.load(new File("TestDocument.pdf")))
{
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(pdDocument);
    System.out.println(text);
}
catch (IOException ex)
{
    // Handle exceptions
}

Creating a pipeline to search text

We need to:
1. Read the text from the file
2. Tokenize and find sentence boundaries
3. Remove stop words
4. Accumulate the index statistics
5. Write out the index file
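
The five steps above can be sketched with nothing but the JDK. This is only an illustration under stated assumptions, not the pipeline from any particular library: the class name SimpleIndexer, the tiny stop-word list, and the "word, tab, sentence numbers" output format are all hypothetical. Sentence boundaries come from java.text.BreakIterator, and tokens are lowercased runs of letters and digits.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.BreakIterator;
import java.util.*;

class SimpleIndexer {

    // Tiny illustrative stop-word list (an assumption, not a standard list).
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and", "to", "in"));

    // Step 2a: find sentence boundaries with the JDK's BreakIterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) out.add(sentence);
        }
        return out;
    }

    // Steps 2b to 4: tokenize, drop stop words, and record for each word
    // the (zero-based) numbers of the sentences it occurs in.
    static Map<String, SortedSet<Integer>> buildIndex(String text) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        List<String> sents = sentences(text);
        for (int i = 0; i < sents.size(); i++) {
            String lower = sents.get(i).toLowerCase(Locale.ENGLISH);
            for (String token : lower.split("[^\\p{L}\\p{Nd}]+")) {
                if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(i);
            }
        }
        return index;
    }

    // Step 5: write one line per word: word<TAB>comma-separated sentence numbers.
    static void writeIndex(Map<String, SortedSet<Integer>> index, Path out) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, SortedSet<Integer>> e : index.entrySet()) {
            StringJoiner numbers = new StringJoiner(",");
            for (Integer n : e.getValue()) numbers.add(n.toString());
            lines.add(e.getKey() + "\t" + numbers);
        }
        Files.write(out, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Step 1: read the text from the file given as the first argument.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        writeIndex(buildIndex(text), Paths.get(args[1]));
    }
}
```

Run it as `java SimpleIndexer input.txt index.txt`. Keeping each step in its own method makes it easy to swap in a library tokenizer or sentence detector later without touching the rest.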

There are several factors that influence the contents of an index file:

- Removal of stop words
- Case-sensitive searches
- Finding synonyms
- Using stemming and lemmatization
- Allowing searches across sentence boundaries
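
Several of these factors come down to how a token is turned into an index key. The sketch below shows one possible way to make those choices configurable. Everything in it is hypothetical: the class TermNormalizer, its flags, and the synonym map are invented for illustration, and naiveStem is a crude suffix stripper standing in for a real stemmer or lemmatizer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class TermNormalizer {
    private final boolean caseSensitive;
    private final boolean stem;
    private final Map<String, String> synonyms; // maps a word to its canonical form

    TermNormalizer(boolean caseSensitive, boolean stem, Map<String, String> synonyms) {
        this.caseSensitive = caseSensitive;
        this.stem = stem;
        this.synonyms = synonyms;
    }

    // Crude suffix stripping; a placeholder, not a real stemming algorithm.
    static String naiveStem(String word) {
        if (word.length() > 4 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Applies case folding, then stemming, then synonym lookup, in that order.
    String indexKey(String token) {
        String key = caseSensitive ? token : token.toLowerCase(Locale.ENGLISH);
        if (stem) key = naiveStem(key);
        return synonyms.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        Map<String, String> synonyms = new HashMap<>();
        synonyms.put("auto", "car");
        TermNormalizer loose = new TermNormalizer(false, true, synonyms);
        TermNormalizer strict =
            new TermNormalizer(true, false, Collections.<String, String>emptyMap());
        System.out.println(loose.indexKey("Autos"));   // car
        System.out.println(strict.indexKey("Autos"));  // Autos
    }
}
```

The same normalizer has to be applied to the query terms at search time; otherwise a query for "Autos" would not find entries stored under "car".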
