Notes - 2011 - A New Unsupervised Approach to Word Segmentation

A New Unsupervised Approach to Word Segmentation

Hanshi Wang, Jian Zhu, Shiping Tang, XiaoZhong Fan

Beijing Institute of Technology; published in CL (Computational Linguistics), 2011

Features: length, frequency, left/right entropy. Fully unsupervised.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

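The paper is long; its rough structure is:

Beginning: the introduction surveys a lot of prior work

Middle: the theoretical rationale for choosing this system architecture and these features

                The three parts of the ESA model: theory + evaluation function, selection principle, adjustment principle

                Initial experimental design, data, pseudocode, detailed experimental setup, results and analysis

                Key definitions used throughout the paper, time and space complexity analysis

End: the closing part reproduces prior work in detail and compares all of it against the authors' own method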

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ESA: Evaluation, Selection, Adjustment

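Take a substring A (possibly several characters) followed by a substring B (possibly several characters). There are two evaluation measures, IV and CV.

IV is the likelihood that AB forms a single word; CV is composed of IV(A), IV(B), and LRV(A, B), and measures the likelihood that A and B are separate words.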

IV(AB) = (frequency of AB / mean frequency of all strings with the same length as AB) ^ (length of AB);
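The IV formula above can be sketched in a few lines of Python. This is only an illustration of the formula as written in these notes, not the paper's implementation: the function names, the `max_len` limit, and the choice to average over distinct strings of each length are all assumptions.

```python
from collections import Counter

def substring_counts(sentences, max_len):
    """Count every contiguous substring of up to max_len characters."""
    counts = Counter()
    for s in sentences:
        for i in range(len(s)):
            for j in range(i + 1, min(len(s), i + max_len) + 1):
                counts[s[i:j]] += 1
    return counts

def mean_freq_by_length(counts):
    """Mean frequency over the distinct strings of each length (assumed reading)."""
    totals, kinds = Counter(), Counter()
    for sub, c in counts.items():
        totals[len(sub)] += c
        kinds[len(sub)] += 1
    return {n: totals[n] / kinds[n] for n in totals}

def iv(sub, counts, mean_freq):
    """IV(sub) = (freq(sub) / mean freq of same-length strings) ** len(sub)."""
    return (counts[sub] / mean_freq[len(sub)]) ** len(sub)

# Toy corpus: "ab" occurs twice, "ba" once, so IV("ab") = (2 / 1.5) ** 2.
counts = substring_counts(["abab"], max_len=2)
mean_freq = mean_freq_by_length(counts)
print(iv("ab", counts, mean_freq))
```

Raising the ratio to the power of the string length lets longer strings that are more frequent than average score much higher than short ones with the same ratio.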

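LRV(A, B) = (right entropy of the left string A * left entropy of the right string B) / (mean right entropy of all strings with the same length as A * mean left entropy of all strings with the same length as B)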
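A minimal sketch of the LRV computation, under several assumptions of my own: entropies are taken over the distribution of neighbouring characters, sentence boundaries count as neighbours through the hypothetical markers `<s>` and `</s>`, and the means are taken over distinct strings of each length. None of these details are confirmed by the notes.

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy (natural log) of a neighbour-count distribution."""
    total = sum(counter.values())
    return sum(c / total * math.log(total / c) for c in counter.values())

def branching_entropies(sentences, max_len):
    """Return (left_entropy, right_entropy) dicts for substrings up to max_len."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for s in sentences:
        for i in range(len(s)):
            for j in range(i + 1, min(len(s), i + max_len) + 1):
                sub = s[i:j]
                # Sentence boundaries are treated as neighbours (assumption).
                left[sub][s[i - 1] if i > 0 else "<s>"] += 1
                right[sub][s[j] if j < len(s) else "</s>"] += 1
    return ({k: entropy(v) for k, v in left.items()},
            {k: entropy(v) for k, v in right.items()})

def mean_by_length(values):
    """Mean of a per-substring value over the distinct strings of each length."""
    totals, kinds = defaultdict(float), Counter()
    for sub, v in values.items():
        totals[len(sub)] += v
        kinds[len(sub)] += 1
    return {n: totals[n] / kinds[n] for n in totals}

def lrv(a, b, left_h, right_h, mean_left, mean_right):
    """LRV(A, B) = H_right(A) * H_left(B) / (mean H_right * mean H_left)."""
    denom = mean_right[len(a)] * mean_left[len(b)]
    return right_h[a] * left_h[b] / denom if denom else 0.0

left_h, right_h = branching_entropies(["ab", "ac", "db"], max_len=1)
mean_left, mean_right = mean_by_length(left_h), mean_by_length(right_h)
print(lrv("a", "b", left_h, right_h, mean_left, mean_right))
```

A high LRV means A is followed by many different characters and B is preceded by many different characters, which is the usual sign of a word boundary between them.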

CV(A, B) = IV(A) * IV(B) * LRV(A, B);

If CV(A, B) > IV(AB), then AB is split into A and B.
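The decision rule is a single comparison. The sketch below takes the IV and LRV values as plain numbers; the function names and the sample values are made up for illustration and do not come from the paper.

```python
def cv(iv_a, iv_b, lrv_ab):
    """CV(A, B) = IV(A) * IV(B) * LRV(A, B)."""
    return iv_a * iv_b * lrv_ab

def should_split(iv_ab, iv_a, iv_b, lrv_ab):
    """Split AB into A and B only when CV(A, B) exceeds IV(AB)."""
    return cv(iv_a, iv_b, lrv_ab) > iv_ab

# Illustrative numbers only: CV = 1.5 * 2.0 * 2.0 = 6.0, which beats IV(AB) = 4.0.
print(should_split(iv_ab=4.0, iv_a=1.5, iv_b=2.0, lrv_ab=2.0))
```

With the same IV values but LRV = 1.0, CV drops to 3.0 and the string stays whole, so the boundary entropy term is what tips the decision.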

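Candidate substrings are pre-filtered: punctuation, digits, a length limit, and a manually chosen LRV threshold are used for an initial selection.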

A sentence of n characters has n(n+1)/2 contiguous substrings; according to the paper, dynamic programming is used to choose where to cut.
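The n(n+1)/2 count is easy to check by enumerating every contiguous substring. The helper name below is hypothetical, and the paper's dynamic-programming step is not reproduced here.

```python
def all_substrings(sentence):
    """Every contiguous substring, one per (start, end) pair with start < end."""
    n = len(sentence)
    return [sentence[i:j] for i in range(n) for j in range(i + 1, n + 1)]

text = "abcd"
subs = all_substrings(text)
n = len(text)
# There are n choices of start and, for each, n - start choices of end.
assert len(subs) == n * (n + 1) // 2
print(subs)
```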

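I did not fully follow the iteration procedure; my current understanding is as follows:

The outer loop runs round after round and stops once the segmentation result no longer changes. Each outer round contains N inner rounds, where N is set by hand.

In each inner round, only the most certain split point is chosen for each sentence (that is, one cut per sentence per round), so after N inner rounds a sentence carries fewer than N cuts.

The sentence is thereby broken into many new substrings, and the substring counts, frequencies, and left/right entropies are recomputed.

From the second outer round onward, single characters and substrings are no longer counted the way they were in the initial pass (which filtered by thresholds and counted any contiguous string).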

S* = argmax_i E(S_i), where E(S_i) = IV(S_0) for i = 0 and E(S_i) = CV(S_i) for i > 0;
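One way to read the argmax above: S_0 is the string left whole and S_i (i > 0) is the string split after its i-th character. That reading is my assumption. Under it, selection can be sketched as follows, with `iv` and `lrv` passed in as hypothetical scoring callables.

```python
def best_split(s, iv, lrv):
    """Return (position, score); position 0 means 'keep s whole'.

    E(S_0) = IV(s); E(S_i) = IV(a) * IV(b) * LRV(a, b) for a = s[:i], b = s[i:].
    """
    best_pos, best_score = 0, iv(s)
    for i in range(1, len(s)):
        a, b = s[:i], s[i:]
        score = iv(a) * iv(b) * lrv(a, b)
        if score > best_score:
            best_pos, best_score = i, score
    return best_pos, best_score

# Toy scores, made up for illustration only.
iv_table = {"abc": 1.0, "a": 2.0, "bc": 3.0, "ab": 0.5, "c": 1.0}
pos, score = best_split("abc", iv_table.__getitem__, lambda a, b: 1.0)
print(pos, score)
```

Ties keep the earlier candidate, so an unsplit string wins unless some split scores strictly higher, which matches the strict inequality in the CV > IV rule.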

The best recall was obtained on MSR: 0.831.
