Stanford "Natural Language Processing" Video Lecture Notes: 2. Text Processing

This post covers the following text-processing topics:
  • Lemmatization
  • Stemming
  • Porter Stemming Algorithm
  • Deciding whether a word ends a sentence
  • Extended features for deciding whether a word ends a sentence
The notes in detail:
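  • Lemmatization: reduces an inflected word form to its dictionary base form (lemma), e.g. am, are, is → be.
  • Stemming: reduces words that share a stem to that stem by chopping off affixes; the result need not be a real word.
  • Porter Stemming Algorithm: a rule-based stemmer for English that undoes changes in part of speech and word form by stripping suffixes through a fixed sequence of rewrite rules.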

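    A minimal Python sketch of one suffix-stripping rule in this style, not the full Porter algorithm. The helper name strip_ing is hypothetical; it removes a trailing ing only when the remaining stem contains a vowel:

```python
import re

# Hypothetical helper illustrating a single Porter-style rule:
# strip "ing" only if the stem before it contains a vowel.
ING_RULE = re.compile(r"^(.*[aeiou].*)ing$")


def strip_ing(word: str) -> str:
    """Return the word without a trailing 'ing' when the stem has a vowel."""
    match = ING_RULE.match(word)
    return match.group(1) if match else word


for w in ["standing", "king", "sing", "living"]:
    print(w, "->", strip_ing(w))
# standing -> stand
# king -> king
# sing -> sing
# living -> liv
```

    The last case shows why a single rule is not enough: liv is not a real base form.
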
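    Example rule [aeiou].*ing$: a trailing ing is stripped only when a vowel (a, e, i, o, u) appears before it, so king is left unchanged while standing becomes stand.
    Limitation: for a word such as living, stripping yields liv, which is not a true base form.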
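  • Deciding whether a word ends a sentence:
    (1) It is followed by a large amount of blank space (e.g. blank lines).
    (2) Its final punctuation is ?, ! or :
    (3) Its final punctuation is a period, and the word is not an abbreviation such as etc.
    The lecture presents these checks as a decision tree, applied in the order above.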


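    The decision tree for sentence endings can be sketched in Python roughly as follows. The function name, the abbreviation list, and the blank-line threshold are all assumptions made for illustration; the notes do not specify them.

```python
# Assumed sample list; a real system would use a much larger abbreviation lexicon.
ABBREVIATIONS = {"etc.", "dr.", "mr.", "mrs.", "inc."}


def is_end_of_sentence(token: str, blank_lines_after: int = 0) -> bool:
    """Hand-written decision tree; `token` includes its trailing punctuation."""
    # (1) A lot of blank space after the token (threshold of 2 is an assumption).
    if blank_lines_after >= 2:
        return True
    # (2) Final punctuation is ?, ! or :
    if token.endswith(("?", "!", ":")):
        return True
    # (3) Without a final period the token does not end a sentence.
    if not token.endswith("."):
        return False
    # With a final period, it ends a sentence unless it is an abbreviation.
    return token.lower() not in ABBREVIATIONS


print(is_end_of_sentence("done."))  # True
print(is_end_of_sentence("etc."))   # False
print(is_end_of_sentence("why?"))   # True
```
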
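  • Extended features for deciding whether a word ends a sentence:
    (1) Shape of the word carrying the ".": initial letter upper or lower case, whether it is a number, etc.
    (2) Shape of the word after the ".": initial letter upper or lower case, whether it is a number, etc.
    (3) Length of the word carrying the "."
    (4) Probability that the word carrying the "." occurs at the end of a sentence
    (5) Probability that the word after the "." occurs at the start of a sentence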
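The five features above could be computed along the following lines. This is only a sketch: the function names, the shape labels, and the idea of estimating the two probabilities from a small pre-segmented corpus are assumptions, not something the notes prescribe.

```python
from collections import Counter


def case_shape(word: str) -> str:
    """Coarse shape label for features (1) and (2)."""
    if not word:
        return "none"
    if word[0].isdigit():
        return "number"
    if word.isupper():
        return "upper"
    if word[0].isupper():
        return "cap"
    return "lower"


def estimate_position_probs(sentences):
    """Estimate P(word ends a sentence) and P(word starts a sentence)
    from a list of already segmented sentences (lists of tokens)."""
    total, ends, starts = Counter(), Counter(), Counter()
    for sent in sentences:
        if not sent:
            continue
        total.update(w.lower() for w in sent)
        ends[sent[-1].lower()] += 1
        starts[sent[0].lower()] += 1
    p_end = {w: ends[w] / n for w, n in total.items()}
    p_start = {w: starts[w] / n for w, n in total.items()}
    return p_end, p_start


def period_features(word, next_word, p_end, p_start):
    """Feature dictionary for a token ending in '.' and the token after it."""
    return {
        "word_case": case_shape(word),                                # (1)
        "next_case": case_shape(next_word),                           # (2)
        "word_length": len(word.rstrip(".")),                         # (3)
        "p_word_ends_sentence": p_end.get(word.lower(), 0.0),         # (4)
        "p_next_starts_sentence": p_start.get(next_word.lower(), 0.0),  # (5)
    }


corpus = [["The", "end."], ["The", "Dr.", "left."]]
p_end, p_start = estimate_position_probs(corpus)
print(period_features("Dr.", "left.", p_end, p_start))
```

Features like these would typically be fed to a classifier rather than checked by hand-written rules.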
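Summary:
  • Stemming: the simplest approach is to strip suffixes outright; a step further is to analyze the structure of the word itself, e.g. whether the stem contains a vowel, or whether an e has to be appended after stripping.
  • Analyzing words and sentences requires looking not only at their own structure but also at features of the surrounding context.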
Graphs are ubiquitous. There is hardly any domain in which objects and their relations cannot be intuitively represented as nodes and edges in a graph. Graph theory is a well-studied sub-discipline of mathematics, with a large body of results and a large number of efficient algorithms that operate on graphs. Like many other disciplines, the fields of natural language processing (NLP) and information retrieval (IR) also deal with data that can be represented as a graph. In this light, it is somewhat surprising that only in recent years the applicability of graph-theoretical frameworks to language technology became apparent and increasingly found its way into publications in the field of computational linguistics. Using algorithms that take the overall graph structure of a problem into account, rather than characteristics of single objects or (unstructured) sets of objects, graph-based methods have been shown to improve a wide range of NLP tasks. In a short but comprehensive overview of the field of graph-based methods for NLP and IR, Rada Mihalcea and Dragomir Radev list an extensive number of techniques and examples from a wide range of research papers by a large number of authors. This book provides an excellent review of this research area, and serves both as an introduction and as a survey of current graph-based techniques in NLP and IR. Because the few existing surveys in this field concentrate on particular aspects, such as graph clustering (Lancichinetti and Fortunato 2009) or IR (Liu 2006), a textbook on the topic was very much needed and this book surely fills this gap. The book is organized in four parts and contains a total of nine chapters. The first part gives an introduction to notions of graph theory, and the second part covers natural and random networks. The third part is devoted to graph-based IR, and part IV covers graph-based NLP. 
Chapter 1 lays the groundwork for the remainder of the book by introducing all necessary concepts in graph theory, including the notation, graph properties, and graph representations. In the second chapter, a glimpse is offered into the plethora of graph-based algorithms that have been developed independently of applications in NLP and IR. Sacrificing depth for breadth, this chapter does a great job in touching on a wide variety of methods, including minimum spanning trees, shortest-path algorithms, cuts and flows, subgraph matching, dimensionality reduction, random walks, spreading activation, and more. Algorithms are explained concisely, using examples, pseudo-code, and/or illustrations, some of which are very well suited for classroom examples. Network theory is presented in Chapter 3. The term network is here used to refer to naturally occurring relations, as opposed to graphs being generated by an automated process. After presenting the classical Erdős-Rényi random graph model and showing its inadequacy to model power-law degree distributions following Zipf’s law, scale-free small-world networks are introduced. Further, several centrality measures, as well as other topics in network theory, are defined and exemplified. Establishing the connection to NLP, Chapter 4 introduces networks constructed from natural language. Co-occurrence networks and syntactic dependency networks are examined quantitatively. Results on the structure of semantic networks such as WordNet are presented, as well as a range of similarity networks between lexical units. This chapter will surely inspire the reader to watch out for networks in his/her own data. Chapter 5 turns to link analysis for the Web. The PageRank algorithm is described at length, variants for undirected and weighted graphs are introduced, and the algorithm’s application to topic-sensitive analysis and query-dependent link analysis is discussed.
This chapter is the only one that touches on core IR, and this is also the only chapter with content that can be found in other textbooks (e.g., Liu 2011). Still, this chapter is an important prerequisite for the chapter on applications. It would have been possible to move the description of the algorithms to Chapter 2, however, omitting this part. The topic of Chapter 6 is text clustering with graph-based methods, outlining the Fiedler method, the Kernighan–Lin method, min-cut clustering, betweenness, and random walk clustering. After defining measures on cluster quality for graphs, spectral and non-spectral graph clustering methods are briefly introduced. Most of the chapter is to be understood as a presentation of general graph clustering methods rather than their application to language. For this, some representative methods for different core ideas were selected. Part IV on graph-based NLP contains the chapters probably most interesting to readers working in computational linguistics. In Chapter 7, graph-based methods for lexical semantics are presented, including detection of semantic classes, synonym detection using random walks on semantic networks, semantic distance on WordNet, and textual entailment using graph matching. Methods for word sense and name disambiguation with graph clustering and random walks are described. The chapter closes with graph-based methods for sentiment lexicon construction and subjectivity classification. Graph-based methods for syntactic processing are presented in Chapter 8: an unsupervised part-of-speech tagging algorithm based on graph clustering, minimum spanning trees for dependency parsing, PP-attachment with random walks over syntactic co-occurrence graphs, and coreference resolution with graph cuts.
In the final chapter, many of the algorithms introduced in the previous chapters are applied to NLP applications as diverse as summarization, passage retrieval, keyword extraction, topic identification and segmentation, discourse, machine translation, cross-language IR, term weighting, and question answering. As someone with a background in graph-based NLP, I enjoyed reading this book. The writing style is concise and clear, and the authors succeed in conveying the most important points from an incredibly large number of works, viewed from the graph-based perspective. I also liked the extensive use of examples: throughout, almost half of the space is used for figures and tables illustrating the methods, which some readers might perceive as unbalanced, however. With just under 200 pages and a topic as broad as this, it necessarily follows that many of the presented methods are exemplified and touched upon rather than discussed in great detail. Although this sometimes leads to the situation that some passages can only be understood with background knowledge, it is noteworthy that every chapter includes a section on further reading. In this way, the book serves as an entry point to a deeper engagement with graph-based methods for NLP and IR, and it encourages readers to see their NLP problem from a graph-based view. For a future edition, however, I have a few wishes: It would be nice if the figures and examples were less detached from the text and explained more thoroughly. At times, it would be helpful to present deeper insights and to connect the methodologies, rather than just presenting them next to each other. Also, some of the definitions in Chapter 2 could be less confusing and structured better.
Because this book emphasizes graph-based aspects for language processing rather than aiming at exhaustively treating the numerous tasks that benefit from graph-based methods, it cannot replace a general introduction to NLP or IR: For students without prior knowledge in NLP and IR, a more guided and focused approach to the topic would be required. The target audience is, rather, NLP researchers and professionals who want to add the graph-based view to their arsenal of methods, and to become inspired by this rapidly growing research area. It is equally suited for people working in graph algorithms to learn about graphs in language as a field of application for their work. I will surely consult this volume in the future to supplement the preparation of lectures because of its comprehensive references and its richness in examples.