How to build a math expression tokenizer using JavaScript (or any other language)

by Shalvah

Some time ago, I got inspired to build an app for solving specific kinds of math problems. I discovered I had to parse the expression into an abstract syntax tree, so I decided to build a prototype in JavaScript. While working on the parser, I realized the tokenizer had to be built first. I’ll walk you through how to build one yourself. (Warning: it’s easier than it looks at first.)

What is a Tokenizer?

A tokenizer is a program that breaks up an expression into units called tokens. For instance, if we have an expression like “I’m a big fat developer”, we could tokenize it in different ways, such as:

Using words as tokens,

0 => I’m
1 => a
2 => big
3 => fat
4 => developer

Using non-whitespace characters as tokens,

0 => I
1 => ’
2 => m
3 => a
4 => b
…
16 => p
17 => e
18 => r

We could also consider all characters as tokens, to get

0 => I
1 => ’
2 => m
3 => (space)
4 => a
5 => (space)
6 => b
…
20 => p
21 => e
22 => r
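
The three ways of splitting the sentence can be tried out directly in JavaScript. This is only a small sketch; the variable names are made up for illustration.

```javascript
var sentence = "I’m a big fat developer";

// 1. Words as tokens: split on runs of whitespace.
var words = sentence.split(/\s+/);

// 2. Non-whitespace characters as tokens: drop whitespace, then split into characters.
var nonSpaceChars = sentence.replace(/\s+/g, "").split("");

// 3. Every character (spaces included) as a token.
var allChars = sentence.split("");

console.log(words);                 // 5 tokens, indexes 0 to 4
console.log(nonSpaceChars.length);  // 19 tokens, indexes 0 to 18
console.log(allChars.length);       // 23 tokens, indexes 0 to 22
```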

You get the idea, right?

Tokenizers (also called lexers) are used in the development of compilers for programming languages. They help the compiler make structural sense out of what you are trying to say. In this case, though, we’re building one for math expressions.

Tokens

A valid math expression consists of mathematically valid tokens, which for the purposes of this project could be Literals, Variables, Operators, Functions or Function Argument Separators. A few notes on the above:

A Literal is a fancy name for a number (in this case). We’ll allow numbers in whole or decimal form only.
A Variable is the kind you’re used to in math: a, b, c, x, y, z. For this project, all variables are restricted to one-letter names (so nothing like var1 or price). This is so we can tokenize an expression like ma as the product of the variables m and a, and not one single variable ma.
Operators act on Literals and Variables and the results of functions. We’ll permit operators +, -, *, /, and ^.
Functions are “more advanced” operations. They include things like sin(), cos(), tan(), min(), max(), etc.
A Function Argument Separator is just a fancy name for a comma, used in a context like this: max(4, 5) (the larger of the two values). We call it a Function Argument Separator because it, well, separates function arguments (for functions that take two or more arguments, such as max and min).

We’ll also add two tokens that aren’t usually considered tokens, but will help us with clarity: Left and Right Parentheses. You know what those are.

A Few Considerations

Implicit Multiplication

Implicit multiplication simply means allowing the user to write “shorthand” multiplications, such as 5x, instead of 5*x. Taking it a step further, it also allows doing that with functions (5sin(x) = 5*sin(x)).

Even further, it allows for 5(x) and 5(sin(x)). We have the option of allowing it or not. Tradeoffs? Not allowing it would actually make tokenizing easier and would allow for multi-letter variable names (names like price). Allowing it makes the platform more intuitive to the user, and well, provides an added challenge to overcome. I chose to allow it.
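
To make the shorthand concrete, here is a rough string-level sketch of what “allowing implicit multiplication” means. It is not how the tokenizer in this article does it (that works on tokens, not on the raw string); `makeMultiplicationExplicit` is a hypothetical helper, and it assumes whitespace has already been removed and that a number always comes first.

```javascript
// Hypothetical helper: insert "*" after any digit that is directly
// followed by a letter or "(". Assumes the input has no whitespace.
function makeMultiplicationExplicit(expr) {
  return expr.replace(/(\d)(?=[a-z(])/gi, "$1*");
}

console.log(makeMultiplicationExplicit("5x"));         // 5*x
console.log(makeMultiplicationExplicit("5sin(x)"));    // 5*sin(x)
console.log(makeMultiplicationExplicit("5(sin(x))"));  // 5*(sin(x))
```

Note that this sketch does nothing for two adjacent variables such as xy; that case needs knowledge of where a function name ends, which is one reason to handle it while tokenizing.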

Syntax

While we aren’t creating a programming language, we need to have some rules about what makes a valid expression, so users know what to enter and we know what to plan for. In precise terms, math tokens must be combined according to these syntax rules for the expression to be valid. Here are my rules:

1. Tokens can be separated by 0 or more whitespace characters.

2+3, 2 +3, 2 + 3, 2 + 3 are all OK
5 x - 22, 5x-22, 5x- 22 are all OK

In other words, spacing doesn’t matter (except within a multi-character token like the Literal 22).

2. Function arguments have to be in parentheses (sin(y), cos(45), not sin y, cos 45). (Why? We’ll be removing all spaces from the string, so we want to know where a function starts and ends without having to do some “gymnastics”.)

3. Implicit multiplication is allowed only between Literals and Variables, or Literals and Functions, in that order (that is, Literals always come first), and can be with or without parentheses. This means:

a(4) will be treated as a function call rather than a*4
a4 is not allowed
4a and 4(a) are OK

Now, let’s get to work.

Data Modelling

It helps to have a sample expression in your head to test this on. We’ll start with something basic: 2y + 1

What we expect is an array that lists the different tokens in the expression, along with their types and values. So for this case, we expect:

0 => Literal (2)
1 => Variable (y)
2 => Operator (+)
3 => Literal (1)

First, we’ll define a Token class to make things easier:

function Token(type, value) {
  this.type = type;
  this.value = value;
}
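
As a quick illustration of the data model, the expected token list for 2y + 1 can be written out by hand. Storing each value as a string is an assumption here (the tokenizer reads characters, so strings are the natural fit); the type names match the ones used in the rest of the article.

```javascript
function Token(type, value) {
  this.type = type;
  this.value = value;
}

// The token list we expect for "2y + 1", built by hand.
// Values are kept as strings (an assumption for this sketch).
var expected = [
  new Token("Literal", "2"),
  new Token("Variable", "y"),
  new Token("Operator", "+"),
  new Token("Literal", "1")
];

expected.forEach(function (token, index) {
  console.log(index + " => " + token.type + " (" + token.value + ")");
});
```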

Algorithm

Next, let’s build the skeleton of our tokenizer function.

Our tokenizer will go through each character of the str array and build tokens based on the value it finds.

[Note that we’re assuming the user gives us a valid expression, so we’ll skip any form of validation throughout this project.]

function tokenize(str) {
  var result = []; // array of tokens

  // remove spaces; remember they don't matter?
  // (replace() returns a new string, so the result must be assigned back)
  str = str.replace(/\s+/g, "");

  // convert to array of characters
  str = str.split("");

  str.forEach(function (char, idx) {
    if (isDigit(char)) {
      result.push(new Token("Literal", char));
    } else if (isLetter(char)) {
      result.push(new Token("Variable", char));
    } else if (isOperator(char)) {
      result.push(new Token("Operator", char));
    } else if (isLeftParenthesis(char)) {
      result.push(new Token("Left Parenthesis", char));
    } else if (isRightParenthesis(char)) {
      result.push(new Token("Right Parenthesis", char));
    } else if (isComma(char)) {
      result.push(new Token("Function Argument Separator", char));
    }
  });

  return result;
}

The code above is fairly basic. For reference, the helpers isComma(), isDigit(), isLetter(), isOperator(), isLeftParenthesis(), and isRightParenthesis() are defined as follows (don’t be scared by the symbols; it’s called regex, and it’s really awesome):

function isComma(ch) {
  return (ch === ",");
}

function isDigit(ch) {
  return /\d/.test(ch);
}

function isLetter(ch) {
  return /[a-z]/i.test(ch);
}

function isOperator(ch) {
  return /\+|-|\*|\/|\^/.test(ch);
}

function isLeftParenthesis(ch) {
  return (ch === "(");
}

function isRightParenthesis(ch) {
  return (ch === ")");
}

[Note that there are no isFunction(), isLiteral() or isVariable() functions, because we’re testing characters individually.]

So now our tokenizer actually works. Try it out on these expressions: 2 + 3, 4a + 1, 5x+ (2y), 11 + sin(20.4).

All good?

Not quite.

You’ll observe that for the last expression, 11 is reported as two Literal tokens instead of one. Also sin gets reported as three tokens instead of one. Why is this?
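
To see this for yourself, here is a condensed, self-contained version of the character-by-character tokenizer run on the last expression. The name `naiveTokenize` and the inlined checks are stand-ins for the tokenize() function and helpers above, used only to keep the sketch short.

```javascript
function Token(type, value) {
  this.type = type;
  this.value = value;
}

// Character-by-character tokenizer, condensed for illustration.
function naiveTokenize(str) {
  var result = [];
  str.replace(/\s+/g, "").split("").forEach(function (ch) {
    if (/\d/.test(ch)) {
      result.push(new Token("Literal", ch));
    } else if (/[a-z]/i.test(ch)) {
      result.push(new Token("Variable", ch));
    } else if (/[+\-*\/^]/.test(ch)) {
      result.push(new Token("Operator", ch));
    } else if (ch === "(") {
      result.push(new Token("Left Parenthesis", ch));
    } else if (ch === ")") {
      result.push(new Token("Right Parenthesis", ch));
    } else if (ch === ",") {
      result.push(new Token("Function Argument Separator", ch));
    }
    // Anything else (such as ".") matches no branch and is silently dropped.
  });
  return result;
}

var summary = naiveTokenize("11 + sin(20.4)").map(function (t) {
  return t.type + "(" + t.value + ")";
});
console.log(summary.join(" "));
// Literal(1) Literal(1) Operator(+) Variable(s) Variable(i) Variable(n)
// Left Parenthesis(() Literal(2) Literal(0) Literal(4) Right Parenthesis())
```

Besides the split-up 11 and sin, notice that the decimal point in 20.4 disappears entirely, because no branch handles it.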

Let’s pause for a moment and think about this. We tokenized the array character by character, but actually, some of our tokens can contain multiple characters. For example, Literals can be 5, 7.9, .5. Functions can be sin, cos etc. Variables are only single-characters, but can occur together in implicit multiplication. How do we solve this?

Buffers

We can fix this by implementing a buffer. Two, actually. We’ll use one buffer to hold Literal characters (numbers and decimal point), and one for letters (which covers both variables and functions).

How do the buffers work? When the tokenizer encounters a digit, a decimal point or a letter, it pushes it into the appropriate buffer, and keeps doing so until it encounters a different kind of character. What it does next depends on what that character is.
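
Here is a minimal sketch of the buffering idea, limited to the number buffer. `collectLiterals` and `flush` are hypothetical names for illustration; the function ignores every other token type and simply shows how consecutive digits and decimal points end up joined into one Literal.

```javascript
// Hypothetical mini-example: collect only the Literals in a string,
// using a buffer for digits and decimal points.
function collectLiterals(str) {
  var literals = [];
  var numberBuffer = [];

  // Join the buffered characters into one Literal and reset the buffer.
  function flush() {
    if (numberBuffer.length) {
      literals.push(numberBuffer.join(""));
      numberBuffer = [];
    }
  }

  str.split("").forEach(function (ch) {
    if (/\d/.test(ch) || ch === ".") {
      numberBuffer.push(ch); // still inside a number: keep buffering
    } else {
      flush();               // a different kind of character ends the number
    }
  });
  flush();                   // don't forget what is left at the end

  return literals;
}

console.log(collectLiterals("456.7xy+6sin(7.04x)-min(a,7)"));
// → [ '456.7', '6', '7.04', '7' ]
```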

For instance, in the expression 456.7xy + 6sin(7.04x) - min(a, 7), it should go along these lines:

read 4 => numberBuffer
read 5 => numberBuffer
read 6 => numberBuffer
read . => numberBuffer
read 7 => numberBuffer
x is a letter, so put all the contents of numberBuffer together as a Literal 456.7 => result
read x => letterBuffer
read y => letterBuffer
+ is an Operator, so remove all the contents of letterBuffer separately as Variables x => result, y => result
+ => result
read 6 => numberBuffer
s is a letter, so put all the contents of numberBuffer together as a Literal 6 => result
read s => letterBuffer
read i => letterBuffer
read n => letterBuffer
( is a Left Parenthesis, so put all the contents of letterBuffer together as a function sin => result
read 7 => numberBuffer
read . => numberBuffer
read 0 => numberBuffer
read 4 => numberBuffer
x is a letter, so put all the contents of numberBuffer together as a Literal 7.04 => result
read x => letterBuffer
) is a Right Parenthesis, so remove all the contents of letterBuffer separately as Variables x => result
- is an Operator, but both buffers are empty, so there's nothing to remove
read m => letterBuffer
read i => letterBuffer
read n => letterBuffer
( is a Left Parenthesis, so put all the contents of letterBuffer together as a function min => result
read a => letterBuffer
, is a comma, so put all the contents of letterBuffer together as a Variable a => result, then push , as a Function Arg Separator => result
read 7 => numberBuffer
) is a Right Parenthesis, so put all the contents of numberBuffer together as a Literal 7 => result

Complete. You get the hang of it now, right?

We’re getting there, just a few more cases to handle.

This is the point where you sit down and think deeply about your algorithm and data modeling. What happens if my current character is an operator, and the numberBuffer is non-empty? Can both buffers ever simultaneously be non-empty?

Putting it all together, here’s what we come up with (the values to the left of the arrow depict our current character (ch) type, NB=numberBuffer, LB=letterBuffer, LP=left parenthesis, RP=right parenthesis):

loop through the array:
  what type is ch?
  digit => push ch to NB
  decimal point => push ch to NB
  letter => join NB contents as one Literal and push to result, then push ch to LB
  operator => join NB contents as one Literal and push to result OR push LB contents separately as Variables, then push ch to result
  LP => join LB contents as one Function and push to result OR (join NB contents as one Literal and push to result, push Operator * to result), then push ch to result
  RP => join NB contents as one Literal and push to result, push LB contents separately as Variables, then push ch to result
  comma => join NB contents as one Literal and push to result, push LB contents separately as Variables, then push ch to result
end loop

join NB contents as one Literal and push to result, push LB contents separately as Variables

Two things to note.

Notice where I added “push Operator * to result”? That’s us converting the implicit multiplication to explicit. Also, when emptying the contents of LB separately as Variables, we need to remember to insert the multiplication Operator in between them.
At the end of the function’s loop, we need to remember to empty whatever we have left in the buffers.

Translating It to Code

Putting it all together, your tokenize function should look like this now:
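
What follows is a sketch of how the pseudocode above might translate to JavaScript, not necessarily the author's exact code. The helper names `emptyNumberBufferAsLiteral` and `emptyLetterBufferAsVariables` are assumptions for illustration, and isOperator() is written with a character class instead of the alternation shown earlier. One addition beyond the pseudocode, also an assumption: when a letter directly follows a Literal (as in 2.2x), the sketch pushes an Operator * as well, following syntax rule 3.

```javascript
function Token(type, value) {
  this.type = type;
  this.value = value;
}

function isComma(ch) { return ch === ","; }
function isDigit(ch) { return /\d/.test(ch); }
function isLetter(ch) { return /[a-z]/i.test(ch); }
function isOperator(ch) { return /[+\-*\/^]/.test(ch); }
function isLeftParenthesis(ch) { return ch === "("; }
function isRightParenthesis(ch) { return ch === ")"; }

function tokenize(str) {
  var result = [];        // finished tokens
  var numberBuffer = [];  // digits and decimal points of the Literal being read
  var letterBuffer = [];  // letters of the Variables or Function name being read

  // Join the number buffer into one Literal (does nothing if it is empty).
  function emptyNumberBufferAsLiteral() {
    if (numberBuffer.length) {
      result.push(new Token("Literal", numberBuffer.join("")));
      numberBuffer = [];
    }
  }

  // Push each buffered letter as its own Variable, with "*" in between.
  function emptyLetterBufferAsVariables() {
    letterBuffer.forEach(function (letter, i) {
      if (i > 0) {
        result.push(new Token("Operator", "*"));
      }
      result.push(new Token("Variable", letter));
    });
    letterBuffer = [];
  }

  str.replace(/\s+/g, "").split("").forEach(function (ch) {
    if (isDigit(ch) || ch === ".") {
      numberBuffer.push(ch);
    } else if (isLetter(ch)) {
      if (numberBuffer.length) {
        // Literal directly followed by a letter: implicit multiplication.
        emptyNumberBufferAsLiteral();
        result.push(new Token("Operator", "*"));
      }
      letterBuffer.push(ch);
    } else if (isOperator(ch)) {
      emptyNumberBufferAsLiteral();
      emptyLetterBufferAsVariables();
      result.push(new Token("Operator", ch));
    } else if (isLeftParenthesis(ch)) {
      if (letterBuffer.length) {
        // Letters directly before "(" form a Function name.
        result.push(new Token("Function", letterBuffer.join("")));
        letterBuffer = [];
      } else if (numberBuffer.length) {
        // Literal directly before "(": implicit multiplication.
        emptyNumberBufferAsLiteral();
        result.push(new Token("Operator", "*"));
      }
      result.push(new Token("Left Parenthesis", ch));
    } else if (isRightParenthesis(ch)) {
      emptyNumberBufferAsLiteral();
      emptyLetterBufferAsVariables();
      result.push(new Token("Right Parenthesis", ch));
    } else if (isComma(ch)) {
      emptyNumberBufferAsLiteral();
      emptyLetterBufferAsVariables();
      result.push(new Token("Function Argument Separator", ch));
    }
  });

  // Flush whatever is left in the buffers once the loop ends.
  emptyNumberBufferAsLiteral();
  emptyLetterBufferAsVariables();

  return result;
}
```

Like the rest of the article, this assumes valid input: something like xsin(y) would be read as a single function named xsin, which rule 3 rules out anyway.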

We can run a little demo:

var tokens = tokenize("89sin(45) + 2.2x/7");
tokens.forEach(function (token, index) {
  console.log(index + " => " + token.type + "(" + token.value + ")");
});

Wrapping It Up

This is the point where you analyze your function and measure what it does versus what you want it to do. Ask yourself questions like, “Does the function work as intended?” and “Have I covered all edge cases?”

Edge cases for this could include negative numbers and the like. You should also run tests on the function. If at the end you are satisfied, you may then begin to seek out how you can improve it.

Thanks for reading. Please click the little heart to recommend this article, and share if you enjoyed it! And if you have tried another approach for building a math tokenizer, do let me know in the comments.
