乞力马扎罗山 海明威
I’ve been using the Hemingway App to try to improve my posts. At the same time I’ve been trying to find ideas for small projects. I came up with the idea of integrating a Hemingway style editor into a markdown editor. So I needed to find out how Hemingway worked!
Getting the Logic
I had no idea how the app worked when I first started. It could have sent the text to a server to calculate the complexity of the writing, but I expected it to be calculated client side.
Opening developer tools in Chrome ( Control + Shift + I or F12 on Windows/Linux, Command + Option + I on Mac) and navigating to Sources provided the answers. There, I found the file I was looking for: hemingway3-web.js.
This code is minified, which makes it a pain to read and understand. To solve this, I copied the file into VS Code and formatted the document (Control + Shift + I on Linux; Shift + Alt + F on Windows). This turned a 3-line file into a 4859-line file with everything formatted nicely.
Exploring the Code
I started to look through the file for anything that I could make sense of. The start of the file contained immediately invoked function expressions. I had little idea of what was happening.
!function(e) {
    function t(r) {
        if (n[r])
            return n[r].exports;
        var o = n[r] = {
            exports: {},
            id: r,
            loaded: !1
        };
        ...
This continued for about 200 lines before I decided that I was probably reading the code to make the page run (React?). I started skimming through the rest of the code until I found something I could understand. (I missed quite a lot that I would later find through finding function calls and looking at the function definition).
The first bit of code I understood was all the way at line 3496!
getTokens: function(e) {
    var t = this.getAdverbs(e),
        n = this.getQualifiers(e),
        r = this.getPassiveVoices(e),
        o = this.getComplexWords(e);
    return [].concat(t, n, r, o).sort(function(e, t) {
        return e.startIndex - t.startIndex
    })
}
And amazingly, all these functions were defined right below. Now I knew how the app defined adverbs, qualifiers, passive voice, and complex words. Some of them are very simple. The app checks each word against lists of qualifiers, complex words, and passive voice phrases. this.getAdverbs filters words based on whether they end in ‘ly’ and then checks that each one isn’t in the list of non-adverb words ending in ‘ly’.
The next bit of useful code was the implementation of highlighting words or sentences. In this code there is a line:
e.highlight.hardSentences += h
‘hardSentences’ was something I could understand, something with meaning. I then searched the file for hardSentences and got 13 matches. This led to a line that calculated the readability stats:
n.stats.readability === i.default.readability.hard && (e.hardSentences += 1),
n.stats.readability === i.default.readability.veryHard && (e.veryHardSentences += 1)
Now I knew that there was a readability parameter in both stats and i.default. Searching the file, I got 40 matches. One of those matches was a getReadabilityStyle function, where they grade your writing.
There are three levels: normal, hard and very hard.
t = e.words;
n = e.readingLevel;
return t < 14
    ? i.default.readability.normal
    : n >= 10 && n < 14
        ? i.default.readability.hard
        : n >= 14
            ? i.default.readability.veryHard
            : i.default.readability.normal;
A sentence with fewer than 14 words is always “normal”. For sentences of 14 words or more, a reading level of at least 10 but below 14 is “hard”, a level of 14 or above is “very hard”, and anything lower is “normal”.
Now to find how to calculate the reading level.
I spent a while here trying to work out how the reading level is calculated. I found it 4 lines above the getReadabilityStyle function.
// e = letters in paragraph
// t = words in paragraph
// n = sentences in paragraph
getReadingLevel: function(e, t, n) {
    if (0 === t || 0 === n) return 0;
    var r = Math.round(4.71 * (e / t) + 0.5 * (t / n) - 21.43);
    return r <= 0 ? 0 : r;
}
That means your score is 4.71 * average word length + 0.5 * average sentence length - 21.43, which is the Automated Readability Index formula. That’s it. That is how Hemingway grades each of your sentences.
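To make the formula concrete, here is a minimal sketch that counts letters, words and sentences in a piece of text and feeds them into the same equation. The `countStats` helper and its sentence-splitting rule are my own assumptions for illustration; they are not taken from the Hemingway source.

```javascript
// Hypothetical helper: rough letter, word and sentence counts for a text.
function countStats(text) {
  const words = text.split(/\s+/).filter(Boolean);
  // Count only letters and digits, ignoring punctuation.
  const letters = words.join("").replace(/[^a-z0-9]/gi, "").length;
  // Assumption: ., ! or ? followed by whitespace or end of text ends a sentence.
  const sentences = text.split(/[.!?]+(?:\s+|$)/).filter(Boolean).length;
  return { letters, words: words.length, sentences };
}

// Same arithmetic as the getReadingLevel function quoted above.
function getReadingLevel(letters, words, sentences) {
  if (words === 0 || sentences === 0) return 0;
  const level = Math.round(
    4.71 * (letters / words) + 0.5 * (words / sentences) - 21.43
  );
  return level <= 0 ? 0 : level;
}

const stats = countStats("The cat sat. It ran!");
console.log(stats, getReadingLevel(stats.letters, stats.words, stats.sentences));
```

Short words and short sentences push the score to zero, which is why the example text above is graded at the lowest level.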
Other Interesting Things I Found
- The highlight commentary (information about your writing on the right hand side) is a big switch statement. Ternary expressions are used to change the response based on how well you’ve written.
- The grading goes up to 16 before it’s classed as “Post-Graduate” level.
What I’m Going to Do with This
I am planning to make a basic website and apply what I’ve learned from deconstructing the Hemingway app. Nothing fancy, more as an exercise for implementing some logic. I’ve built a Markdown previewer before, so I might also try to create a writing application with the highlighting and scoring system.
Creating My Own Hemingway App
Having figured out how the Hemingway app works, I then decided to implement what I had learnt to make a much simplified version.
I wanted to make sure that I was keeping it basic, focusing on the logic more than the styling. I chose to go with a simple text entry box.
Challenges
1. How to ensure performance. Rescanning the whole document on every key press could be very computationally expensive. This could block the UI, which is obviously not what we want.
2. How to split up the text into paragraphs, sentences and words for highlighting.
Possible Solutions
- Only rescan the paragraphs that change. Do this by counting the paragraphs and comparing them with the document before the change. Use this to find the paragraph that has changed, or the new paragraph, and scan only that one.
- Have a button to scan the document. This massively reduces the number of calls to the scanning function.
2. Use what I learnt from Hemingway: every paragraph is a <p>, and any sentences or words that need highlighting are wrapped in an inner <span> with the necessary class.
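The "only rescan what changed" idea could look something like the sketch below. It is not what I ended up building (I went with the button), and `findChangedParagraphs` is a hypothetical name; it simply compares the paragraphs at each position before and after an edit.

```javascript
// Hypothetical helper: returns the indexes of paragraphs that differ
// from the previous version of the document and so need rescanning.
function findChangedParagraphs(previousText, currentText) {
  const before = previousText.split("\n");
  const after = currentText.split("\n");
  const changed = [];
  for (let i = 0; i < after.length; i++) {
    // A new paragraph has no counterpart in `before`, so it also counts.
    if (after[i] !== before[i]) changed.push(i);
  }
  return changed;
}

console.log(findChangedParagraphs("one\ntwo\nthree", "one\nTWO\nthree"));
```

One limitation of comparing by position: inserting a paragraph in the middle shifts everything after it, so all the later paragraphs are reported as changed.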
Building the App
Recently I’ve read a lot of articles about building a Minimum Viable Product (MVP), so I decided that I would run this little project the same way. This meant keeping everything simple. I decided to go with an input box, a button to scan and an output area.
This was all very easy to set up in my index.html file.
<link rel="stylesheet" href="index.css">
<title>Fake Hemingway</title>
<div>
    <h1>Fake Hemingway</h1>
    <textarea name="" id="text-area" rows="10"></textarea>
    <button onclick="format()">Test Me</button>
    <div id="output">
    </div>
</div>
<script src="index.js"></script>
Now for the interesting part: getting the JavaScript working.
The first thing to do was to render the text from the text box into the output area. This involves finding the input text and setting the output’s inner html to that text.
function format() {
    let inputArea = document.getElementById("text-area");
    let text = inputArea.value;
    let outputArea = document.getElementById("output");
    outputArea.innerHTML = text;
}
Next is getting the text split into paragraphs. This is accomplished by splitting the text by ‘\n’ and putting each of these into a <p> tag. To do this we can map over the array of paragraphs, putting them in between <p> tags. Using template strings makes doing this very easy.
let paragraphs = text.split("\n");
let inParagraphs = paragraphs.map(paragraph => `<p>${paragraph}</p>`);
outputArea.innerHTML = inParagraphs.join(" ");
Whilst I was working through that, I was becoming annoyed at having to copy and paste the test text into the text box. To solve this, I implemented an Immediately Invoked Function Expression (IIFE) to populate the text box when the web page renders.
(function start() {
    let inputArea = document.getElementById("text-area");
    let text = `The app highlights lengthy, …. compose something new.`;
    inputArea.value = text;
})();
Now the text box was pre-populated with the test text whenever you load or refresh the web page. Much simpler.
Highlighting
Now that I was rendering the text well and I was testing on a consistent text, I had to work on the highlighting. The first type of highlighting I decided to tackle was the hard and very hard sentence highlighting.
The first stage of this is to loop over every paragraph and split them into an array of sentences. I did this using a `split()` function, splitting on every full stop with a space after it.
let sentences = paragraph.split('. ');
From Hemingway I knew that I needed to calculate the number of words and the level of each of the sentences. The level of a sentence depends on the average length of its words and the average number of words per sentence. Here is how I calculated the number of words and the number of letters in each sentence.
let words = sentence.split(" ").length;
let letters = sentence.split(" ").join("").length;
Using these numbers, I could use the equation that I found in the Hemingway app.
// Each sentence is graded on its own, so words per sentence is just `words`.
let level = Math.round(4.71 * (letters / words) + 0.5 * words - 21.43);
With the level and number of words for each of the sentences, set their difficulty level.
if (words < 14) {
    return sentence;
} else if (level >= 10 && level < 14) {
    return `<span class="hardSentence">${sentence}</span>`;
} else if (level >= 14) {
    return `<span class="veryHardSentence">${sentence}</span>`;
} else {
    return sentence;
}
This code says that if a sentence has 14 words or more and a level of at least 10 but below 14, it’s hard; if it has 14 words or more and a level of 14 or up, it’s very hard. I used template strings again, this time including a class in the span tags. This is how I’m going to define the highlighting.
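Put together, the word count, letter count, level and grading steps fit into one small function. This is a sketch under a couple of assumptions of mine: the name `gradeSentence` is made up, each sentence is scored on its own (so the words-per-sentence term is just the word count), and punctuation is left out of the letter count.

```javascript
// Hypothetical helper: wraps a sentence in a span if it grades as hard or very hard.
function gradeSentence(sentence) {
  const wordList = sentence.split(" ").filter(Boolean);
  const words = wordList.length;
  if (words === 0) return sentence;
  // Letters and digits only, so commas and full stops do not inflate the score.
  const letters = wordList.join("").replace(/[^a-z0-9]/gi, "").length;
  const level = Math.round(4.71 * (letters / words) + 0.5 * words - 21.43);

  if (words < 14) return sentence;
  if (level >= 10 && level < 14) {
    return `<span class="hardSentence">${sentence}</span>`;
  }
  if (level >= 14) {
    return `<span class="veryHardSentence">${sentence}</span>`;
  }
  return sentence;
}

// Twenty five-letter words score 4.71 * 5 + 0.5 * 20 - 21.43, which rounds to 12.
console.log(gradeSentence(Array(20).fill("apple").join(" ")));
```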
The CSS file is really simple; it just has each of the classes (adverb, passive, hardSentence) and sets their background colour. I took the exact colours from the Hemingway app.
Once the sentences have been returned, I join them all together to make each of the paragraphs.
At this point, I realised that there were a few problems in my code.
- There were no full stops. When I split the paragraphs into sentences, I had removed all of the full stops.
- The number of letters in the sentence included the commas, dashes, colons and semi-colons.
My first solution was very primitive but it worked. I used split('symbol') and join('') to remove the punctuation and then appended '.' onto the end. Whilst it worked, I searched for a better solution. Although I don’t have much experience using regex, I knew that it would be the best solution. After some Googling I found a much more elegant solution.
let cleanSentence = sent.replace(/[^a-z0-9. ]/gi, "") + ".";
With this done, I had a partially working product.
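As a rough illustration of the split-then-clean step, the sketch below runs a whole paragraph through it. `splitIntoSentences` is a hypothetical name, and stripping any trailing full stops before appending one is my own addition so that the last sentence of a paragraph does not end up with two.

```javascript
// Hypothetical helper: split a paragraph into cleaned sentences, each ending in one full stop.
function splitIntoSentences(paragraph) {
  return paragraph
    .split(". ")
    .filter(sentence => sentence.trim().length > 0)
    .map(sentence =>
      sentence
        .replace(/[^a-z0-9. ]/gi, "") // keep letters, digits, full stops and spaces
        .replace(/\.+$/, "") + "."    // exactly one full stop at the end
    );
}

console.log(splitIntoSentences("Hello, world. It's fine; really."));
```

Note that the character class also removes apostrophes, so "It's" becomes "Its". That is fine for counting letters but worth knowing if the cleaned text is what gets displayed.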
The next thing I decided to tackle was the adverbs. To find an adverb, Hemingway just finds words that end in ‘ly’ and then checks that it isn’t on a list of non-adverb ‘ly’ words. It would be bad if ‘apply’ or ‘Italy’ were tagged as adverbs.
To find these words, I took the sentences and split them into an array of words. I mapped over this array and used an if statement.
if (word.match(/ly$/) && !lyWords[word]) {
    return `<span class="adverb">${word}</span>`;
} else {
    return word;
}
Whilst this worked most of the time, I found a few exceptions. If a word was followed by a punctuation mark then it didn’t match as ending with ‘ly’. For example, “The crocodile glided elegantly; its prey unaware” would have the word ‘elegantly;’ in the array. To solve this I reused the .replace(/[^a-z0-9. ]/gi, "") functionality to clean each of the words.
Another exception was if the word was capitalised, which was easily solved by calling toLowerCase() on the string.
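With both fixes in place, the adverb check might look like the sketch below. The `lyWords` set here is a tiny illustrative list I made up, not the list from the Hemingway app, and I strip full stops from each word as well so that an adverb at the end of a sentence still matches.

```javascript
// Illustrative exception list only; the real app uses a much longer one.
const lyWords = new Set(["apply", "italy", "family", "reply", "supply"]);

// Wraps each adverb in a span, keeping the original word (and its punctuation) in the output.
function highlightAdverbs(sentence) {
  return sentence
    .split(" ")
    .map(word => {
      // Clean a copy of the word for matching: no punctuation, lower case.
      const clean = word.replace(/[^a-z0-9]/gi, "").toLowerCase();
      return /ly$/.test(clean) && !lyWords.has(clean)
        ? `<span class="adverb">${word}</span>`
        : word;
    })
    .join(" ");
}

console.log(highlightAdverbs("The crocodile glided elegantly; its prey unaware"));
```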
Now I had a result that worked with adverbs and highlighting individual words. I then implemented a very similar method for complex and qualifying words. That was when I realised that I was no longer just looking for individual words, I was looking for phrases. I had to change my approach from checking if each word was in the list to seeing if the sentence contained each of the phrases.
To do this I used the .indexOf() function on the sentences. If there was an index of the word or phrase, I inserted an opening span tag at that index and then the closing span tag after the key length.
let qualifiers = getQualifyingWords();
let wordList = Object.keys(qualifiers);
wordList.forEach(key => {
    let index = sentence.toLowerCase().indexOf(key);
    if (index >= 0) {
        sentence =
            sentence.slice(0, index) +
            '<span class="qualifier">' +
            sentence.slice(index, index + key.length) +
            "</span>" +
            sentence.slice(index + key.length);
    }
});
With that working, it’s starting to look more and more like the Hemingway editor.
The last piece of the highlighting puzzle to implement was the passive voice. Hemingway used a 30 line function to find all of the passive phrases. I chose to use most of the logic that Hemingway implemented, but order the process differently. They looked to find any words that were in a list (is, are, was, were, be, been, being) and then checked whether the next word ended in ‘ed’.
I looped through each of the words in a sentence and checked if they ended in ‘ed’. For every ‘ed’ word I found, I checked whether the previous word was in the list of pre-words. This seemed much simpler, but may be less performant.
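The "find the ‘ed’ word, then look back one word" check can be sketched as follows. `findPassives` and `preWords` are names I have chosen for the example, and the function only reports the matching phrases rather than wrapping them in spans.

```javascript
// The pre-words listed above.
const preWords = new Set(["is", "are", "was", "were", "be", "been", "being"]);

// Returns every "pre-word + word ending in ed" pair found in the sentence.
function findPassives(sentence) {
  const words = sentence.split(" ");
  // Cleaned copies for matching; the originals are kept for the output.
  const clean = words.map(w => w.replace(/[^a-z0-9]/gi, "").toLowerCase());
  const matches = [];
  for (let i = 1; i < clean.length; i++) {
    if (/ed$/.test(clean[i]) && preWords.has(clean[i - 1])) {
      matches.push(`${words[i - 1]} ${words[i]}`);
    }
  }
  return matches;
}

console.log(findPassives("The report was marked and has been marked again."));
```

Because the rule only looks for an ‘ed’ ending, irregular participles such as "was written" slip through.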
With that working I had an app that highlighted everything I wanted. This is my MVP.
Then I Hit a Problem
As I was writing this post I realised that there were two huge bugs in my code.
// from getQualifier and getComplex
let index = sentence.toLowerCase().indexOf(key);
// from getPassive
let index = words.indexOf(match);
These will only ever find the first instance of the key or match. Here is an example of the results this code will produce.
‘Perhaps’ and ‘been marked’ should have been highlighted twice each but they aren’t.
To fix the bug in getQualifier and getComplex, I decided to use recursion. I created a findAndSpan function which uses .indexOf() to find the first instance of the word or phrase. It splits the sentence into 3 parts: before the phrase, the phrase, and after the phrase. The recursion works by passing the ‘after the phrase’ string back into the function. This continues until there are no more instances of the phrase, at which point the string is just passed back unchanged.
function findAndSpan(sentence, key, type) {
    // Find the first instance of the word or phrase, ignoring case.
    let index = sentence.toLowerCase().indexOf(key);
    if (index >= 0) {
        sentence =
            sentence.slice(0, index) +
            `<span class="${type}">` +
            sentence.slice(index, index + key.length) +
            "</span>" +
            // Recurse on the text after the match to catch repeats.
            findAndSpan(
                sentence.slice(index + key.length),
                key,
                type);
    }
    return sentence;
}
Something very similar had to be done for the passive voice. The recursion was in an almost identical pattern, passing the leftover array items instead of the leftover string. The result of the recursion call was spread into an array that was then returned. Now the app can deal with repeated adverbs, qualifiers, complex phrases and passive voice uses.
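For the array version, the same recursive pattern might look like this. It is a sketch rather than my exact code: `spanPassives`, `normalise` and `PRE_WORDS` are names chosen for the example.

```javascript
const PRE_WORDS = ["is", "are", "was", "were", "be", "been", "being"];

// Lower-case copy of a word with punctuation removed, used only for matching.
const normalise = word => word.replace(/[^a-z0-9]/gi, "").toLowerCase();

// Takes an array of words and returns a new array with each passive pair merged into one span.
function spanPassives(words) {
  // Index of the first "ed" word that directly follows a pre-word.
  const index = words.findIndex(
    (word, i) =>
      i > 0 &&
      /ed$/.test(normalise(word)) &&
      PRE_WORDS.includes(normalise(words[i - 1]))
  );
  if (index < 0) return words; // nothing left to highlight

  return [
    ...words.slice(0, index - 1),
    `<span class="passive">${words[index - 1]} ${words[index]}</span>`,
    // Recurse on the leftover words and spread the result into this array.
    ...spanPassives(words.slice(index + 1)),
  ];
}

console.log(spanPassives("It was marked and been marked".split(" ")).join(" "));
```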
Statistics Counter
The last thing that I wanted to get working was the nice line of boxes telling you how many adverbs or complex words you’d used.
To store the data I created an object with keys for each of the parameters I wanted to count. I started by having this variable as a global variable but knew I would have to change that later.
Now I had to populate the values. This was done by incrementing the value every time it was found.
data.sentences += sentences.length
or
data.adverbs += 1
The values needed to be reset every time the scan was run to make sure that values didn’t continuously increase.
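One way to get the reset for free, and to avoid the global variable, is to build a fresh counter object inside the scan itself. The sketch below assumes hypothetical `createStats` and `scan` functions and counts only a few of the values; the adverb test is the bare ‘ly’ check without the exception list.

```javascript
// Hypothetical factory: every scan starts from zeroed counters.
function createStats() {
  return { paragraphs: 0, sentences: 0, words: 0, adverbs: 0 };
}

function scan(text) {
  const data = createStats(); // fresh object, so counts never carry over
  text
    .split("\n")
    .filter(paragraph => paragraph.trim())
    .forEach(paragraph => {
      data.paragraphs += 1;
      const sentences = paragraph.split(". ").filter(s => s.trim());
      data.sentences += sentences.length;
      sentences.forEach(sentence => {
        const words = sentence.split(" ").filter(Boolean);
        data.words += words.length;
        data.adverbs += words.filter(w =>
          /ly$/i.test(w.replace(/[^a-z0-9]/gi, ""))
        ).length;
      });
    });
  return data;
}

console.log(scan("He ran quickly. She slept.\nThey left slowly."));
```

Running the scan twice on the same text gives the same numbers, which is the behaviour the reset is meant to guarantee.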
With the values I needed, I had to get them rendering on the screen. I altered the structure of the html file so that the input box and output area were in a div on the left, leaving a right div for the counters. These counters are empty divs with an appropriate id and class as well as a ‘counter’ class.
<div id="adverb" class="adverb counter"></div>
<div id="passive" class="passive counter"></div>
<div id="complex" class="complex counter"></div>
<div id="hardSentence" class="hardSentence counter"></div>
<div id="veryHardSentence" class="veryHardSentence counter"></div>
With these divs, I used document.querySelector to set the inner html for each of the counters using the data that had been collected. With a little bit of styling of the ‘counter’ class, the web app was complete. Try it out here or look at my code here.