COMP90049

COMP90049 Project 1 Report: Spelling Correction

  1. Introduction
    Misspellings have many causes. This report tests the hypothesis that one cause is reduplication: a single letter is written twice, or a doubled letter is written once. We use the 2-gram distance to test this hypothesis. Our aim is to show that this type of mistake does occur in practice.

  2. Data-set sketch
    This report uses three data-sets: dict, wiki_correct and wiki_misspell. Dict is a list of approximately 370K English entries, which serves as the dictionary for our approximate string search method. It is a slightly altered version of the data from https://github.com/dwyl/english-words and has one entry per line, in alphabetical order. Wiki_misspell is a list of 4453 tokens that have been identified as common errors made by Wikipedia editors. It was scraped from https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings and has one misspelling per line, in alphabetical order. Wiki_correct is a list of the intended spellings of the corresponding tokens in wiki_misspell, again one item per line.
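    The three files can be read with a few lines of Python. This is a minimal sketch: the file names dict.txt, wiki_misspell.txt and wiki_correct.txt and the helper names are assumptions for illustration, since the report does not give file extensions or loading code.

```python
from pathlib import Path


def load_lines(path):
    """Read one entry per line, stripping whitespace and skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_datasets(data_dir="."):
    """Load the dictionary, the misspellings and the intended spellings.

    The file names below are hypothetical; adjust them to the local copies.
    """
    base = Path(data_dir)
    dictionary = load_lines(base / "dict.txt")
    misspell = load_lines(base / "wiki_misspell.txt")
    correct = load_lines(base / "wiki_correct.txt")
    # wiki_misspell and wiki_correct are aligned line by line.
    if len(misspell) != len(correct):
        raise ValueError("wiki_misspell and wiki_correct must have the same length")
    return dictionary, misspell, correct
```

    The length check guards the line-by-line alignment between wiki_misspell and wiki_correct that the later evaluation relies on.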

  3. Brief summary of literature
    Language is one of the most important parts of human life, and it can be spoken or written. Written language underlies most documents, and any error in a document can convey incorrect information. Most documents are now written on computers, where human error can introduce mistakes. Such mistakes may come from pressing adjacent keys, mechanical failures, or slips of the hand or fingers. They are common when compiling documents, especially now that people increasingly put their ideas into articles, scientific journals, university assignments and other documents. Spelling correction is therefore needed to resolve writing errors. Mawardi, Susanto and Naga (2018) study spelling correction for Indonesian text in order to handle non-word errors. Their system takes a text document as input and outputs a new document in which the writing errors have been corrected.
    A finite state automaton (FSA) is used to determine which letter causes the error in a word. The Levenshtein distance measures the difference between the erroneous word and each suggested word. The suggestions are then ranked by their N-gram probability.

  4. Overview method
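    Before discussing the method, we first need the formula for the 2-gram distance between two words:
    2-gram distance = G1 + G2 - 2*|G1∩G2|
    Here G1 and G2 are the numbers of 2-grams in the first and second word, and |G1∩G2| is the number of 2-grams the two words share.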
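    The formula can be sketched in Python as follows. The function names, the use of "#" as a word-boundary padding character and the choice to count repeated 2-grams as a multiset are assumptions made for illustration; the report does not state these details.

```python
from collections import Counter


def bigrams(word, pad=True):
    """Return the multiset of 2-grams of `word` as a Counter.

    With pad=True the word is wrapped in '#' boundary markers (an assumption).
    """
    s = f"#{word}#" if pad else word
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def two_gram_distance(w1, w2, pad=True):
    """Compute G1 + G2 - 2*|G1∩G2| for the two words."""
    g1, g2 = bigrams(w1, pad), bigrams(w2, pad)
    shared = sum((g1 & g2).values())  # Counter '&' keeps the minimum counts
    return sum(g1.values()) + sum(g2.values()) - 2 * shared


print(two_gram_distance("comit", "commit"))  # 1: only "mm" is unshared
print(two_gram_distance("comit", "comet"))   # 4: "mi", "it", "me", "et" are unshared
```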
    First, assume that G1 is smaller than G2. When the error is a reduplication misspelling (a single letter written twice, or a doubled letter written once), every 2-gram of the shorter word also appears in the longer word, so |G1∩G2| = G1 and 2*|G1∩G2| = 2*G1. The formula then reduces to G2 - G1, and in this situation G2 - G1 = Length2 - Length1 = 1. So a reduplication error always gives a 2-gram distance of 1. For example, "comit" and "commit" differ only in the extra 2-gram "mm".
    Next we argue that a 2-gram distance of 1 occurs only for this type of error. If G1 = G2, the distance is 2*(G1 - |G1∩G2|), which is even and cannot be 1. So assume G1 < G2 and let G2 - G1 = a, where a is an integer and a >= 1. Let b = |G1∩G2|, so b <= G1. Then
    2-gram distance = G1 + (G1 + a) - 2*b = 2*(G1 - b) + a
    Because G1 - b >= 0, a >= 1 and both are integers, the result can be 1 only when a = 1 and G1 = b, that is, when G1 = |G1∩G2| and G2 - G1 = 1. This is the situation in which a single letter is written twice or a doubled letter is written once. This step holds only if the 2-grams include word-boundary padding; without padding, a single letter added at the end of a word (for example "abc" and "abcd") also gives a distance of 1. To sum up, under this assumption we conclude that a 2-gram distance of 1 indicates a reduplication misspelling.
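    The argument can be checked on a few word pairs. The sketch below compares the 2-gram distance with a direct string test for reduplication. The helper is_reduplication_pair, the "#" padding character and the sample words are illustrative assumptions, not part of the original experiment.

```python
from collections import Counter


def two_gram_distance(w1, w2, pad=True):
    """G1 + G2 - 2*|G1∩G2|, optionally with '#' boundary padding (an assumption)."""
    def grams(w):
        s = f"#{w}#" if pad else w
        return Counter(s[i:i + 2] for i in range(len(s) - 1))
    g1, g2 = grams(w1), grams(w2)
    return sum(g1.values()) + sum(g2.values()) - 2 * sum((g1 & g2).values())


def is_reduplication_pair(w1, w2):
    """True if the longer word equals the shorter one with one letter doubled."""
    short, long_ = sorted((w1, w2), key=len)
    if len(long_) != len(short) + 1:
        return False
    # Drop one letter of each doubled pair and compare with the shorter word.
    return any(
        long_[i] == long_[i + 1] and long_[:i] + long_[i + 1:] == short
        for i in range(len(long_) - 1)
    )


for a, b in [("comit", "commit"), ("occurr", "occur"), ("abc", "abcd")]:
    print(a, b,
          is_reduplication_pair(a, b),
          two_gram_distance(a, b),             # padded distance
          two_gram_distance(a, b, pad=False))  # unpadded distance
```

    For the first two pairs both distances are 1. For "abc" and "abcd" the padded distance is 3 while the unpadded distance is 1, which shows why the padding choice matters for the "only when" direction of the argument.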
    Our method is as follows. We first compute the 2-gram distance between each token in wiki_misspell and the entries in dict, and record each match together with its distance. We then keep only the matches whose distance is 1. Three situations arise: no dictionary word has distance 1, exactly one word has distance 1, or two words have distance 1. When two words match, we keep both, which gives two final match lists. Finally, we compare each list with wiki_correct and compute the accuracy for each.
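    The matching step described above might look like the following sketch. The function names and the toy word lists are hypothetical. How the second match list is filled when only one candidate exists is not stated in the report, so reusing the single candidate is an assumption.

```python
from collections import Counter


def two_gram_distance(w1, w2):
    """G1 + G2 - 2*|G1∩G2| with '#' boundary padding (padding is an assumption)."""
    def grams(w):
        s = f"#{w}#"
        return Counter(s[i:i + 2] for i in range(len(s) - 1))
    g1, g2 = grams(w1), grams(w2)
    return sum(g1.values()) + sum(g2.values()) - 2 * sum((g1 & g2).values())


def candidates_at_distance_one(misspelling, dictionary):
    """Dictionary words whose 2-gram distance to the misspelling is exactly 1."""
    return [
        w for w in dictionary
        # A distance of 1 requires the lengths to differ by one, so skip the rest.
        if abs(len(w) - len(misspelling)) == 1
        and two_gram_distance(misspelling, w) == 1
    ]


def build_match_lists(misspellings, dictionary):
    """Return two prediction lists: first candidate and second candidate."""
    first, second = [], []
    for m in misspellings:
        c = candidates_at_distance_one(m, dictionary)
        first.append(c[0] if c else None)
        # Assumption: fall back to the only candidate when there is no second one.
        second.append(c[1] if len(c) > 1 else (c[0] if c else None))
    return first, second


toy_dict = ["comet", "commit", "occur", "poll", "pool"]
toy_misspell = ["comit", "occurr", "pol", "teh"]
print(build_match_lists(toy_misspell, toy_dict))
```

    On the toy data, "pol" has two candidates ("poll" and "pool"), so the two lists differ only at that position, and "teh" has none.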
  5. Results and Discussion
                   2-gram (1)    2-gram (2)
    Accuracy       12.6%         15.1%
    Precision      7.3%          12.5%
    Recall         14.3%         17.6%
    Max Predict    73            73
    The table above shows the results of our matching. Precision concerns the first matching step, between wiki_misspell and dict. Because we keep only matches with a 2-gram distance of 1, precision is very low. Recall is also very low, because we consider only the error type in our hypothesis and ignore all other error types. Since a misspelling can have two candidate matches, we evaluated both match lists; both give non-zero scores, and the two sets of scores do not differ much. This supports our hypothesis: reduplication misspellings do exist in the data.
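    The report does not give the formulas behind the table, so the sketch below uses one common set of definitions for this kind of task and may not reproduce the reported numbers. The definitions, the function name evaluate and the toy data are all assumptions: accuracy is correct predictions over all misspellings, precision is correct predictions over attempted predictions, and recall is the share of misspellings whose intended word appears among the candidates.

```python
def evaluate(predictions, candidates, gold):
    """Score one match list against the intended spellings.

    predictions: one predicted word (or None) per misspelling
    candidates:  the full candidate list per misspelling
    gold:        the intended spelling per misspelling
    """
    total = len(gold)
    attempted = sum(p is not None for p in predictions)
    correct = sum(p == g for p, g in zip(predictions, gold))
    found = sum(g in c for c, g in zip(candidates, gold))
    return {
        "accuracy": correct / total,
        "precision": correct / attempted if attempted else 0.0,
        "recall": found / total,
    }


# Toy data for illustration only.
gold = ["commit", "occur", "pool", "the"]
candidates = [["commit"], ["occur"], ["poll", "pool"], []]
predictions = ["commit", "occur", "poll", None]
print(evaluate(predictions, candidates, gold))
```

    On the toy data, two of four predictions are correct, three predictions were attempted, and three intended words appear among the candidates.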
  6. Conclusion
    In conclusion, our test supports the hypothesis of this report: reduplication is one cause of misspelling, in which a single letter is written twice or a doubled letter is written once. The 2-gram distance method confirms that this mistake occurs. The next step is to improve the work and identify more error types using similar methods such as global edit distance and N-gram distance.

References
Wikipedia contributors. n.d. Wikipedia:Lists of common misspellings. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Wikipedia:Lists_of_common_misspellings&oldid=813410985
Mawardi, V. C., Susanto, N., & Naga, D. S. (2018, April 23). Spelling Correction for Text Documents in Bahasa Indonesia Using Finite State Automata and Levinshtein Distance Method. Retrieved from https://www.matec-conferences.org/articles/matecconf/abs/2018/23/matecconf_icesti2018_01047/matecconf_icesti2018_01047.html

https://mvnrepository.com/artifact/org.renjin.cran/ngram
