Can We Measure Language Difficulty by the Numbers?

weixin_26746401

Original article: https://towardsdatascience.com/can-we-measure-language-difficulty-by-the-numbers-3d591396934c

Without a doubt, the world is “growing smaller” in terms of our access to people and content from other countries and cultures. Even the COVID-19 pandemic, which has curtailed international travel, has led to increasing virtual interaction via the internet. Yet the barriers to fluent and proficient inter-language communications remain formidable.

Online Translation Versus Language Learning

The quality of machine translation has improved dramatically in recent years, thanks to the introduction of Artificial Intelligence methods such as neural networks to the task. The AI-driven optimization of translating has trickled down rapidly to consumer apps like Google Translate and Microsoft Translator, which simplify the usage of machine translators and improve the ability to convey meaning across linguistic frontiers.

There’s a huge difference between translating a language via software and learning a new language. For most adults, learning a new language is hard. But some people love linguistic challenges: for them, the hardest languages to learn may be the most enjoyable to conquer. The neuroplasticity of young brains, of course, makes new language acquisition a relative snap for children. But few adults have it so easy.

Online Language Learning and Its Challenges

Online language learning, now a $582 billion/year industry according to the ICEF, has made learning a new language easier and more convenient for millions. English language learning accounts for most of this total. While the popularity of English may not be surprising (it is the most spoken language and the main language of business worldwide), proficient English speakers are branching out to additional languages at a rapid pace.

Rosetta Stone, a leading provider of language courses, reports that Spanish topped the list of languages that British people were most eager to take on in 2018, with 23.1% of its UK learners studying it that year. Four other European languages (French, English, Italian and German) rounded off the top five. Perhaps surprisingly, Mandarin Chinese, the most widely spoken native language, with more than a billion speakers, was not in the next tier.

No doubt the perception of that language’s difficulty played a role in its relatively low popularity ranking. Mandarin Chinese poses major hardships for a non-Chinese speaker. And yet more than 1.1 billion people speak, read, write and understand it fluently. So is it really hard? Or is it just unfamiliar to an English speaker? The question raises a major challenge: isn’t the perception of difficulty a totally relative matter, differing to some degree for each language learner, depending on background and education?

The challenge that faces a data scientist, of course, is how language difficulty can be measured. If we wish to split hairs, there is a distinction between the difficulty of learning a language and its inherent difficulty of usage. But for the purposes of this article, we will focus on evaluating ways to measure a language’s degree of difficulty, to borrow a term from gymnastics and other competitive sports.

Approach A: Ask the Foreign Service

Nearly a decade ago, Voxy posted an infographic (shown below), sourced from the Foreign Service Institute, which breaks language difficulty for native English speakers into three neat categories: easy, medium, and hard. The basis for comparison was how long, in calendar weeks and class hours, it takes to attain “proficiency” in each language. The site did qualify its findings by noting that difficulty depends on the language’s complexity, how close it is to the learner’s own language (in this case, English), how many hours per week the learner studies, and the language resources available. The chart appears to assume roughly 25 hours of learning per week.

Easy (22–23 weeks, 575–600 class hours): The Romance languages (Spanish, Portuguese, French, Italian, and Romanian) all fell in this group, along with Dutch, Afrikaans, Norwegian and Swedish.
Medium (44 weeks, 1110 class hours): Russian, Polish, Serbian, Finnish, Thai, Vietnamese, Greek, Hebrew, and Hindi.
Hard (88 weeks, 2220 class hours): Chinese, Japanese, Korean, and Arabic.
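The weekly study load implied by these figures can be checked with a little arithmetic. The sketch below simply divides class hours by weeks for each category; pairing the low end of each range with the low end of the other (and high with high) is an assumption made for illustration, since the chart does not say how the ranges line up.

```python
# Weeks and class hours as quoted in the three categories above.
# Pairing the ends of the "Easy" ranges this way is an assumption.
FSI_CATEGORIES = {
    "Easy (low end)": (22, 575),
    "Easy (high end)": (23, 600),
    "Medium": (44, 1110),
    "Hard": (88, 2220),
}


def hours_per_week(weeks, class_hours):
    """Average class hours per calendar week."""
    return class_hours / weeks


for name, (weeks, hours) in FSI_CATEGORIES.items():
    print(f"{name}: {hours_per_week(weeks, hours):.1f} hours/week")
```

Every category comes out at roughly 25 to 26 hours per week, which supports the reading that the chart assumes a fixed weekly load and varies only the number of weeks.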

While Voxy clearly intends the chart to be a teaching tool or subject of discussion, it’s not hard to pick apart weaknesses in its analytical method. First, who is to set the bar of “proficiency”? How should the quality of instruction be measured? And how should factors like second-language knowledge be accounted for? For a data scientist, the results would appear disappointingly arbitrary.

Image: Voxy, What Are The Hardest Languages To Learn?

Approach B: Scoring Language Learning Difficulty: A Polyglot’s Approach

A more intriguing approach to the problem, at least from a data science perspective, is offered by linguist Michael Campbell at Glossika. In a detailed blog post aptly titled “Language Difficulty,” he devised a scoring system for answering, numerically, the precise questions which intrigue us:

Is there an objective method for measuring language difficulty?
What are the most difficult languages in the world?

What distinguishes Campbell’s approach is that it is relative and data-based. Language difficulty is scored by how similar any two languages are according to various criteria of linguistic complexity. Perhaps counter-intuitively, this relative approach actually makes an objective assessment of language learning difficulty possible, because it rests on numerical criteria that can be objectively assessed. Among the criteria he offers are:

Vocabulary Acquisition

He scores this according to how close the target language is to the learner’s own language.

Languages are divided into families, branches, and sub-branches. For example, English belongs to the Indo-European family, to which languages like Russian, Armenian, and Greek also belong. By contrast, Arabic, Chinese, and Japanese are each in a different family. Within the Indo-European family, English is a Germanic language with heavy Romance influence, and is therefore closer to languages like German and French. In terms of similarity, English is closest overall to German, despite grammatical differences. Similarly, Portuguese, Spanish and Italian belong to the same sub-branch, which makes learning one from another easier. Campbell assigns high importance to this criterion, with language-learning difficulty reflected in exponentially higher numbers:

Same sub-branch: 0 points.
Different sub-branch: 1 point.
Different branch: 10 points.
Different family: 100 points.
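The vocabulary rule can be sketched as a small lookup. The four point values are the ones quoted above; the `CLASSIFICATION` table and the function name `vocabulary_score` are hypothetical, simplified stand-ins for illustration and are not Campbell’s own data or code.

```python
# Illustrative (family, branch, sub-branch) labels. This table is a
# simplified assumption for the sketch, not Campbell's classification.
CLASSIFICATION = {
    "English": ("Indo-European", "Germanic", "West Germanic"),
    "German": ("Indo-European", "Germanic", "West Germanic"),
    "Swedish": ("Indo-European", "Germanic", "North Germanic"),
    "Polish": ("Indo-European", "Slavic", "West Slavic"),
    "Georgian": ("Kartvelian", "Kartvelian", "Karto-Zan"),
}


def vocabulary_score(native, target):
    """Apply the 0 / 1 / 10 / 100 rule quoted in the text."""
    family_a, branch_a, sub_a = CLASSIFICATION[native]
    family_b, branch_b, sub_b = CLASSIFICATION[target]
    if family_a != family_b:
        return 100  # different family
    if branch_a != branch_b:
        return 10   # same family, different branch
    if sub_a != sub_b:
        return 1    # same branch, different sub-branch
    return 0        # same sub-branch


for target in ("German", "Swedish", "Polish", "Georgian"):
    print(target, vocabulary_score("English", target))
```

Under these assumed labels an English speaker gets 0 for German, 1 for Swedish, 10 for Polish and 100 for Georgian. Campbell’s published figures may adjust individual pairs, so the output should be read as a demonstration of the rule rather than a reproduction of his table.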

Syntax and Grammar for Fluency

Campbell, a linguist by profession, broke syntax and grammar down into a list of factors, such as:

Language type
Subject-Verb-Object order
Adjective-Noun order
Genitive (possessor)-Noun order
Determiner-Noun order
Relative (clause)-Noun order
Noun Declension
Tenses
Conjugation
Adposition

For each of these criteria, Campbell assigns 1 point plus or minus if there is a difference between languages. The results of his calculation are rendered in a matrix:

By comparing rows in this matrix, he can assign a score to the syntactical and grammatical differences between two languages, and thus to the difficulty of learning one language starting from the other. The difficulty score for a German speaker learning French would be 6 points, for a Japanese speaker learning Spanish 13 points, and for a Chinese speaker learning Polish a whopping 34 points.
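Row comparison of this kind amounts to counting the features on which two languages differ. The sketch below assumes one point per differing feature; the two rows, their feature names and their values are made-up placeholders for two unnamed languages, not entries from Campbell’s matrix.

```python
# Hypothetical feature rows. Keys loosely follow the factor list above;
# the values are illustrative placeholders, not Campbell's data.
FEATURES = {
    "Language A": {
        "word_order": "SVO",
        "adjective_noun": "Adj-N",
        "adposition": "preposition",
        "noun_declension": False,
    },
    "Language B": {
        "word_order": "SOV",
        "adjective_noun": "Adj-N",
        "adposition": "postposition",
        "noun_declension": True,
    },
}


def grammar_score(native, target, features=FEATURES):
    """Count features whose values differ between the two rows."""
    row_a, row_b = features[native], features[target]
    return sum(1 for name in row_a if row_a[name] != row_b[name])


print(grammar_score("Language A", "Language B"))  # 3
```

Because the count only looks at whether two values differ, the score is symmetric: it comes out the same whichever language is the starting point. Whether Campbell’s actual scores are symmetric is not stated in the text.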

Phonology for Fluency

Campbell’s calculations account for differences in the total number of phonemes (the distinct sound units of a language) and allophones (the variants in which speakers actually pronounce them), considering 12 points of articulation and the number of vowels and intonations.

As with syntax and grammar, comparing rows in his phonology matrix yields a difficulty score for these phonological criteria. The score for a German speaker learning French would be 1 point, for a Japanese speaker learning Spanish 11 points, and for a Chinese speaker learning Polish 15 points.
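To see how the component scores compare across the three example pairs, they can be placed side by side and totalled. The numbers below are the syntax/grammar and phonology scores quoted in the text; adding them with equal weight is an assumption made here for illustration, since the text does not say how Campbell combines components, and the vocabulary component is left out because no figures for these pairs are quoted.

```python
# (syntax/grammar, phonology) scores as quoted in the text.
QUOTED_SCORES = {
    ("German", "French"): (6, 1),
    ("Japanese", "Spanish"): (13, 11),
    ("Chinese", "Polish"): (34, 15),
}


def combined_score(native, target):
    """Unweighted sum of the two quoted components (an assumption)."""
    grammar, phonology = QUOTED_SCORES[(native, target)]
    return grammar + phonology


for (native, target), (grammar, phonology) in QUOTED_SCORES.items():
    total = combined_score(native, target)
    print(f"{native} -> {target}: {grammar} + {phonology} = {total}")
```

Under this assumption the totals are 7, 24 and 49, so the ordering of the three pairs is the same whether one looks at grammar, phonology or their sum.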

Data scientists will note that the scores assigned for various parameters are arbitrary and subjective, but there is merit in the attempt to break down degrees of difficulty into component factors.

For example, for an English speaker, the following are the score assignments according to language family:

It is hard to reconcile a score of 0 for German (So einfach ist das?) with a score of 5 for French or Spanish. And is Georgian vocabulary really 10 times harder to acquire than Polish vocabulary? So the specific enumeration is certainly open to fine-tuning, though the method is intriguing, if a bit rough around the edges.

The Final Reckoning: What’s Unique About Ubykh?

His 2016 article concluded with a list of some of the most difficult languages. He mentioned, in this connection, the Romany language of European Gypsies, which is not even written down; Sentinelese, the language of the Indian Ocean island where would-be visitors are killed on arrival; polysynthetic languages like Greenlandic; and Ubykh, with no fewer than 84 consonants. Honorable mention goes to Bella Coola, a language that is written down only by linguists recording its grammar.

Two years later, Campbell wrote a follow-up piece applying his scoring system and setting it against the FSI rankings.

Non-linguists may be nonplussed by the dismissive way the author chalks up Thai, Vietnamese, Turkish and Finnish as “easy”, except, he hastens to say, for their utterly unfamiliar vocabularies. He confesses surprise that, per his ranking system, Korean beats out Taiwanese in difficulty. But he credits Ubykh, an extinct Northwest Caucasian language closely related to Circassian, with leaving even Korean in the dust.

Here you can learn Ubykh numbers and listen to a tale of futility that should appeal to every data scientist — in any language.
