Unicode Bidirectional Algorithm

http://www.unicode.org/reports/tr9/tr9-27.html


When working with bidirectional text, the characters are still interpreted in logical order; only the display is affected.
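As a small illustration of logical order, the sketch below stores a mixed Latin and Hebrew string and lists its characters in memory order. The sample string and the helper name `logical_names` are assumptions made for this example; they do not come from the report.

```python
import unicodedata

# Assumed sample: Latin "abc", a space, then the Hebrew letters alef, bet, gimel,
# stored in the order they are typed and read (logical order).
text = "abc \u05D0\u05D1\u05D2"


def logical_names(s):
    """Return the Unicode names of the characters in storage (logical) order."""
    return [unicodedata.name(ch) for ch in s]


for index, name in enumerate(logical_names(text)):
    print(index, name)
```

Indexing and slicing operate on this stored order. A bidi-aware renderer would draw the Hebrew run right to left, but the string itself is not changed by that.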


The directional types left-to-right and right-to-left are called strong types, and characters of those types are called strong directional characters. The directional types associated with numbers are called weak types, and characters of those types are called weak directional characters.
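The distinction can be explored with Python's standard `unicodedata` module, whose `bidirectional()` function returns the Bidi_Class abbreviation of a character. The grouping of abbreviations into the `STRONG` and `WEAK` sets below is an assumption made for this sketch (explicit formatting codes are deliberately left out), and the helper name `strength` is hypothetical.

```python
import unicodedata

# Assumed grouping for illustration: strong directional types and the
# number-related (weak) types. Explicit formatting codes are not classified here.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}


def strength(ch):
    """Classify one character as 'strong', 'weak', or 'other' by its Bidi_Class."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    return "other"


for ch in "a7 \u05D0":
    print(repr(ch), unicodedata.bidirectional(ch), strength(ch))
```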


Although the term embedding is used for some explicit codes, the text within the scope of the codes is not independent of the surrounding text. Characters within an embedding can affect the ordering of characters outside, and vice versa. The algorithm is designed so that the use of explicit codes can be equivalently represented by out-of-line information, such as stylesheet information. However, any alternative representation will be defined by reference to the behavior of the explicit codes in this algorithm.



2.1 Explicit Directional Embedding

Abbr.  Code    Name                     Description
LRE    U+202A  LEFT-TO-RIGHT EMBEDDING  Treat the following text as embedded left-to-right.
RLE    U+202B  RIGHT-TO-LEFT EMBEDDING  Treat the following text as embedded right-to-left.

The effect of a right-to-left line direction, for example, can be accomplished by embedding the text with RLE...PDF.


2.2 Explicit Directional Overrides

Abbr.  Code    Name                    Description
LRO    U+202D  LEFT-TO-RIGHT OVERRIDE  Force following characters to be treated as strong left-to-right characters.
RLO    U+202E  RIGHT-TO-LEFT OVERRIDE  Force following characters to be treated as strong right-to-left characters.


The right-to-left override, for example, can be used to force a part number made of mixed English, digits and Hebrew letters to be written from right to left.



2.3 Terminating Explicit Directional Code

Abbr.  Code    Name                        Description
PDF    U+202C  POP DIRECTIONAL FORMATTING  Restore the bidirectional state to what it was before the last LRE, RLE, RLO, or LRO.
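As a rough sketch of how the explicit codes are used in practice, the helpers below wrap a piece of text in an embedding or override and close it with PDF. The code points are the ones listed in the tables of sections 2.1 to 2.3; the function names and the sample part number are hypothetical.

```python
# Explicit directional formatting codes from sections 2.1 to 2.3.
LRE, RLE = "\u202A", "\u202B"   # embeddings
LRO, RLO = "\u202D", "\u202E"   # overrides
PDF = "\u202C"                  # terminates the most recent embedding or override


def embed(text, rtl):
    """Wrap text in RLE...PDF (rtl=True) or LRE...PDF (rtl=False)."""
    return (RLE if rtl else LRE) + text + PDF


def override(text, rtl):
    """Wrap text in RLO...PDF (rtl=True) or LRO...PDF (rtl=False)."""
    return (RLO if rtl else LRO) + text + PDF


# Assumed example: force a mixed part number to be laid out right to left.
part_number = override("AB-12\u05D0", rtl=True)
print([f"U+{ord(ch):04X}" for ch in part_number])
```

Pairing every opening code with its own PDF keeps the formatting from leaking into the text that follows.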



2.4 Implicit Directional Marks



Abbr.  Code    Name                Description
LRM    U+200E  LEFT-TO-RIGHT MARK  Left-to-right zero-width character
RLM    U+200F  RIGHT-TO-LEFT MARK  Right-to-left zero-width character


There is no special mention of the implicit directional marks in the following algorithm. That is because their effect on bidirectional ordering is exactly the same as a corresponding strong directional character; the only difference is that they do not appear in the display.
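This can be checked against the character database: in Python's `unicodedata` module the two marks report the same Bidi_Class values as ordinary strong characters. The helper name `bidi_class_of` and the comparison characters are assumptions made for this sketch.

```python
import unicodedata


def bidi_class_of(name):
    """Look a character up by its Unicode name and return its Bidi_Class."""
    return unicodedata.bidirectional(unicodedata.lookup(name))


for name in ("LEFT-TO-RIGHT MARK", "RIGHT-TO-LEFT MARK",
             "LATIN SMALL LETTER A", "HEBREW LETTER ALEF"):
    print(f"{name}: {bidi_class_of(name)}")
```

LRM shares class L with the Latin letter and RLM shares class R with the Hebrew letter, which is why the rules need no special case for them.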



3 Basic Display Algorithm

The Unicode Bidirectional Algorithm (UBA) takes a stream of text as input and proceeds in four main phases:

  • Separation into paragraphs. The rest of the algorithm is applied separately to the text within each paragraph.
  • Initialization. A list of directional character types is initialized, with one entry for each character in the original text. The value of each entry is the Bidi_Class property of the respective character. After this point, the original characters are no longer referenced until the reordering phase. A list of embedding levels, with one level per character, is then initialized.
  • Resolution of the embedding levels. A series of rules are applied to the lists of embedding levels and directional character types. Each rule is based on the current values of those lists, and can modify those values. Each rule is applied to each of the values in sequence before continuing to the next rule. The result of this phase is a modified list of embedding levels; the list of directional character types is no longer needed.
  • Reordering. The text within each paragraph is reordered for display: first, the text in the paragraph is broken into lines, then the resolved embedding levels are used to reorder the text of each line for display.
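The first two phases can be sketched in a few lines of Python. This is only an outline under stated assumptions: paragraphs are split after any character whose Bidi_Class is B (a CR LF pair is not treated specially here), the paragraph embedding level is passed in rather than computed, and the function names are hypothetical.

```python
import unicodedata


def split_paragraphs(text):
    """Phase 1 (simplified): end a paragraph after each Bidi_Class B character."""
    paragraphs, current = [], []
    for ch in text:
        current.append(ch)
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
    if current:
        paragraphs.append("".join(current))
    return paragraphs


def initialize(paragraph, paragraph_level=0):
    """Phase 2: one Bidi_Class entry and one embedding level per character."""
    types = [unicodedata.bidirectional(ch) for ch in paragraph]
    levels = [paragraph_level] * len(paragraph)
    return types, levels


for paragraph in split_paragraphs("a1 \u05D0\nxyz"):
    print(initialize(paragraph))
```

The later phases would work only on the `types` and `levels` lists, returning to the characters themselves for the reordering phase.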

 

3.1 Definitions

BD1. The bidirectional character types are values assigned to each Unicode character, including unassigned characters. The formal property name in the Unicode Character Database [UCD] is Bidi_Class.

BD2. Embedding levels are numbers that indicate how deeply the text is nested, and the default direction of text on that level. The minimum embedding level of text is zero, and the maximum explicit depth is level 61.

Embedding levels are explicitly set by both override format codes and by embedding format codes; higher numbers mean the text is more deeply nested. The reason for having a limitation is to provide a precise stack limit for implementations to guarantee the same results. Sixty-one levels is far more than sufficient for ordering, even with mechanically generated formatting; the display becomes rather muddied with more than a small number of embeddings.

BD3. The default direction of the current embedding level (for the character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd.

For example, in a particular piece of text, Level 0 is plain English text. Level 1 is plain Arabic text, possibly embedded within English level 0 text. Level 2 is English text, possibly embedded within Arabic level 1 text, and so on. Unless their direction is overridden, English text and numbers will always be an even level; Arabic text (excluding numbers) will always be an odd level. The exact meaning of the embedding level will become clear when the reordering algorithm is discussed, but the following provides an example of how the algorithm works.
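The even/odd rule in BD3 amounts to a parity check. A minimal sketch, with a hypothetical function name:

```python
def embedding_direction(level):
    """Return 'L' for an even embedding level and 'R' for an odd one (BD3)."""
    if level < 0:
        raise ValueError("embedding levels start at zero")
    return "L" if level % 2 == 0 else "R"


# Levels 0, 1, 2 from the example: English, embedded Arabic, English inside Arabic.
for level in (0, 1, 2):
    print(level, embedding_direction(level))
```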

BD4. The paragraph embedding level is the embedding level that determines the default bidirectional orientation of the text in that paragraph.

BD5. The direction of the paragraph embedding level is called the paragraph direction.

  • In some contexts the paragraph direction is also known as the base direction.

BD6. The directional override status determines whether the bidirectional type of characters is to be reset. The override status is set by using explicit directional controls. This status has three states, as shown in Table 2.

Table 2. Directional Override Status

Status         Interpretation
Neutral        No override is currently active
Right-to-left  Characters are to be reset to R
Left-to-right  Characters are to be reset to L
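To show how embedding levels (BD2), the override status (BD6) and the explicit codes of section 2 might fit together, here is a simplified stack-based sketch. It is an assumption-laden outline, not the report's reference code: an RLE or RLO moves to the next odd level, an LRE or LRO to the next even level, PDF pops the saved state, and codes that would exceed level 61 are counted and ignored. The overflow handling and the treatment of BN and paragraph separators are deliberately simplified, and all names are hypothetical.

```python
MAX_DEPTH = 61  # maximum explicit embedding level (BD2)


def explicit_levels(bidi_types, paragraph_level=0):
    """Return (level, type) per entry; formatting codes get level None."""
    level, override = paragraph_level, None   # None means the neutral status
    stack = []       # saved (level, override) pairs, restored by PDF
    overflow = 0     # explicit codes ignored because they exceeded MAX_DEPTH
    result = []
    for t in bidi_types:
        if t in ("LRE", "RLE", "LRO", "RLO"):
            if t in ("RLE", "RLO"):
                new_level = (level + 1) | 1    # least odd level above current
            else:
                new_level = (level + 2) & ~1   # least even level above current
            if overflow == 0 and new_level <= MAX_DEPTH:
                stack.append((level, override))
                level = new_level
                override = {"LRO": "L", "RLO": "R"}.get(t)
            else:
                overflow += 1
            result.append((None, t))
        elif t == "PDF":
            if overflow:
                overflow -= 1          # closes an ignored code
            elif stack:
                level, override = stack.pop()
            result.append((None, t))
        else:
            # An active override resets the character's type to L or R.
            result.append((level, override or t))
    return result


print(explicit_levels(["L", "RLE", "R", "LRE", "L", "PDF", "R", "PDF", "L"]))
print(explicit_levels(["RLO", "L", "EN", "PDF", "L"]))
```

The first call yields levels 0, 1, 2, 1, 0 for the ordinary characters; the second shows L and EN being reset to R while the override is active.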

BD7. A level run is a maximal substring of characters that have the same embedding level. It is maximal in that no character immediately before or after the substring has the same level (a level run is also known as a directional run).
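Given a list of resolved embedding levels, level runs can be found by grouping adjacent equal values. A minimal sketch using `itertools.groupby`; the function name and the `(start, end, level)` tuple layout are assumptions made for this example.

```python
from itertools import groupby


def level_runs(levels):
    """Split a list of embedding levels into maximal runs of equal level.

    Each run is returned as (start, end, level), with end exclusive.
    """
    runs = []
    start = 0
    for level, group in groupby(levels):
        length = len(list(group))
        runs.append((start, start + length, level))
        start += length
    return runs


print(level_runs([0, 0, 1, 1, 2, 1, 0]))
```

Because `groupby` only merges neighbouring equal values, two runs at the same level separated by a different level stay distinct, which matches the maximality condition in BD7.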

 

 

 

5.1 Reference Code

There are two versions of BIDI reference code available. Both have been tested to produce identical results. One version is written in Java, and the other is written in C++. The Java version is designed to closely follow the steps of the algorithm as described below. The C++ code is designed to show one of the optimization methods that can be applied to the algorithm, using a state table for one phase.

One of the most effective optimizations is to first test for right-to-left characters and not invoke the Bidirectional Algorithm unless they are present.
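A pre-check of this kind might look like the sketch below. It assumes a left-to-right paragraph direction, and the set of Bidi_Class values treated as right-to-left triggers (including AN, added here as a conservative choice) is an assumption for illustration; the function name `needs_bidi` is hypothetical.

```python
import unicodedata

# Assumed trigger set: strong right-to-left types, Arabic numbers, and the
# explicit right-to-left embedding and override codes.
RTL_TRIGGERS = {"R", "AL", "AN", "RLE", "RLO"}


def needs_bidi(text):
    """Return True if the text contains any character that may need reordering."""
    return any(unicodedata.bidirectional(ch) in RTL_TRIGGERS for ch in text)


print(needs_bidi("hello 123"))       # plain left-to-right text
print(needs_bidi("abc \u05D0"))      # contains a Hebrew letter
```

Text that fails the check can be displayed in stored order, skipping the rest of the algorithm.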

There are two directories containing source code for reference implementations at [Code9]. Implementers are encouraged to use this resource to test their implementations. There is an online demo of bidi code at http://unicode.org/cldr/utility/bidi.jsp, which shows the results, plus the levels and the rules invoked for each character.


 
