Computation, Languages, and Programs from the Perspective of Information Processing

Translated from: An Introduction to the Theory of Computation, by Eitan Gurari
Abridged translation by 刘建文 (http://blog.csdn.net/keminlau)

KEY: information processing, theory of computation, formal languages, programming languages

Chapter 1 GENERAL CONCEPTS

Computations are designed for processing information. They can be as simple as an estimation for driving time between cities, and as complex as a weather prediction.

The study of computation aims at providing an insight into the characteristics of computations. Such an insight can be used for predicting the complexity of desired computations, for choosing the approaches they should take, and for developing tools that facilitate their design.

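Computations are designed to solve problems (which can be generalized as information processing). A problem may be as simple as estimating the driving time between two cities, or as complex as forecasting the weather.

The study of computation aims to grasp the essential nature of the computing process. Understanding that nature helps in predicting the complexity of a target computation and in optimizing its design.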

The study of computation reveals that there are problems that cannot be solved. And of the problems that can be solved, there are some that require an infeasible amount of resources (e.g., millions of years of computation time). These revelations might seem discouraging, but they have the benefit of warning against trying to solve such problems. Approaches for identifying such problems are also provided by the study of computation.

On an encouraging note, the study of computation provides tools for identifying problems that can feasibly be solved, as well as tools for designing such solutions. In addition, the study develops precise and well-defined terminology for communicating intuitive thoughts about computations.

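Existing research on computation has shown that some problems cannot be solved by any algorithm, and that others, although solvable by an algorithm, cannot be solved within a reasonable amount of time. Taken at face value these two facts are discouraging, but seen from the other side they warn us not to attempt such problems. The techniques for telling hard problems from unsolvable ones also come from the study of computation.

On the positive side, the study of computation not only helps determine whether a problem is solvable, it also helps in designing a solution.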

The study of computation is conducted in this book through the medium of programs. Such an approach can be adopted because programs are descriptions of computations.

Any formal discussion about computation and programs requires a clear understanding of these notions, as well as of related notions. The purpose of this chapter is to define some of the basic concepts used in this book.

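We study computation through the medium of programs. This approach works because a program is a description of a computation (process).

Before any formal discussion of computation and programs, the definitions of the relevant basic concepts must be made clear, including computation, program, string, and language.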

The first section of this chapter considers the notion of strings, and the role that strings have in representing information. The second section relates the concept of languages to the notion of strings, and introduces grammars for characterizing languages. The third section deals with the notion of programs, and the concept of nondeterminism in programs. The fourth section formalizes the notion of problems, and discusses the relationship between problems and programs. The fifth section defines the notion of reducibility among problems.

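  • First, clarify the notion of strings and the role of strings in materializing information (representing information);
  • Second, clarify what languages consist of and how grammars characterize them;
  • Third, clarify the notion of programs and of nondeterminism in programs;
  • Fourth, clarify the formal definition of problems and the relationship between problems and programs;
  • Fifth, define reducibility among problems.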

 

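kemin: Here "represent" could be translated literally as "express" or "characterize", but I chose the word "materialize" (物化) because I find it more vivid and concrete, and better at bringing out the non-material nature of information. Information is abstract and formless, and the point here is to give it material form; neither "express" nor "symbolize" captures that well.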

 

1.1 Alphabets, Strings, and Representations

The ability to represent information is crucial to communicating and processing information. Human societies created spoken languages to communicate on a basic level, and developed writing to reach a more sophisticated level.

The English language, for instance, in its spoken form relies on some finite set of basic sounds as a set of primitives. The words are defined in terms of finite sequences of such sounds.

Sentences are derived from finite sequences of words. Conversations are achieved from finite sequences of sentences, and so forth.

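Information is abstract. To communicate with it or to process it, we must first materialize it, that is, turn it into messages (express it in symbols), as we do with natural languages.

Take English as an example. Spoken English is built by combining a finite set of basic sounds: finite sequences of sounds form words, finite sequences of words form sentences, and finite sequences of sentences form conversations.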

Written English uses some finite set of symbols as a set of primitives. The words are defined by finite sequences of symbols. Sentences are derived from finite sequences of words.

Paragraphs are obtained from finite sequences of sentences, and so forth.

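Written English is likewise built by combining a finite set of basic symbols: finite sequences of symbols form words, finite sequences of words form sentences, and finite sequences of sentences form paragraphs.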

Similar approaches have been developed also for representing elements of other sets. For instance, the natural numbers can be represented by finite sequences of decimal digits.

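Similar methods are used to "materialize" the elements of other sets of information. For example, the natural numbers can be "materialized" as finite sequences of decimal digits.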

Sets and Set Theory

 

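A set is a categorical notion and is therefore left undefined: a "set" is a totality determined by our perception or thought. Sets can be used to define non-primitive concepts (the set, together with the layered concepts derived from it, is a static structural notion). In the view of modern mathematics, for example, the objects studied by each branch of mathematics are either sets carrying some structure (such as groups, rings, fields, and topological spaces), or definable directly from sets (such as natural numbers, rational numbers, real numbers, and functions), or definable with the help of sets (such as categories, functors, and natural transformations).

Set theory is a discipline that studies the foundations of mathematics. It starts from a concept simpler than "number", namely the set, defines numbers and their operations, and from there develops the whole of mathematics. Sets can represent not only numbers and operations but also non-numeric information and its processing: deleting, inserting, and sorting data, or describing relations among data, are hard to handle with traditional numeric computation but can be realized with set operations.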

 

Computations, like natural languages, are expected to deal with information in its most general form. Consequently, computations function as manipulators of integers, graphs, programs, and many other kinds of entities. However, in reality computations only manipulate strings of symbols that represent the objects. The previous discussion necessitates the following definitions.

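Like communication in natural language, computation processes information, and does so in a more general way. The objects it handles include integers, graphs, programs, and many other entities unlike those of natural language. In reality, however, a computation manipulates the strings of symbols that represent these objects. Strings are the cornerstone of computer science. Although we usually think of a computer as a device that processes numbers, text, and pictures, it is more accurate to say that it processes the strings that represent those numbers, text, and pictures.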

Alphabets and Strings

A finite, nonempty ordered set will be called an alphabet if its elements are symbols, or characters (i.e., elements with "primitive" graphical representations). A finite sequence of symbols from a given alphabet will be called a string over the alphabet. A string that consists of a sequence a1, a2, . . . , an of symbols will be denoted by the juxtaposition a1a2 . . . an. Strings that have zero symbols, called empty strings, will be denoted by ε.

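A finite, nonempty set whose elements are symbols or characters (the most basic graphical marks, which cannot be decomposed further) is called an alphabet. A finite sequence of symbols taken from an alphabet is called a string over the alphabet.

The first use of set theory is to define objects of study rigorously, as the definitions of alphabet and string above show: an alphabet is defined from basic set-theoretic notions such as "empty", "finite", and "element", together with an on-the-spot definition of symbols as its elements.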

Ordering of Strings

Searching is probably the most commonly applied operation on information. Due to the importance of this operation, approaches for searching information and for organizing information to facilitate searching receive special attention. Sequential search, binary search, insertion sort, quick sort, and merge sort are some examples of such approaches. These approaches rely in most cases on the existence of a relationship that defines an ordering of the entities in question.

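Searching is probably the most common information-processing operation. Because it is so important, methods for searching information, and methods for organizing information so that it can be searched, receive a great deal of attention: sequential search and binary search, insertion sort, quick sort, and merge sort are examples. Both searching and sorting rely on a relationship among the pieces of information (symbols), such as an ordering relation.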

Why study sorting algorithms?

 

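The need to study sorting algorithms can be argued from both a practical and a teaching point of view. Practically, sorting algorithms still have many direct applications, and sorting is the first step in algorithms for many complex problems. For teaching, sorting algorithms embody many ideas of algorithm design and serve as examples for learning and studying data structures and algorithm analysis.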

 

A frequently used relationship for strings is the one that compares them alphabetically, as reflected by the ordering of names in telephone books. The relationship and ordering can be defined in the following manner.

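A relationship very commonly used for strings is alphabetical order, the ordering used in telephone books. It can be defined as follows.

Let Σ be an alphabet. If the strings α and β are in Σ* and satisfy either of the following two conditions, we say that α is alphabetically smaller in Σ* than β, or that β is alphabetically bigger in Σ* than α:

First, α is a proper prefix of β;

Second, there exist a string γ in Σ* and symbols a and b in Σ such that a precedes b in Σ, γa is a prefix of α, and γb is a prefix of β.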
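The two conditions of this definition can be checked mechanically. The following is a minimal Python sketch, not code from the book: the function name `alphabetically_smaller` and the convention of passing the ordered alphabet as a string `sigma` (earlier symbols precede later ones) are assumptions made for illustration.

```python
def alphabetically_smaller(alpha: str, beta: str, sigma: str) -> bool:
    """Return True if alpha is alphabetically smaller than beta over sigma.

    sigma lists the symbols of the alphabet in their order of precedence
    (an assumed convention for this sketch).
    """
    rank = {symbol: position for position, symbol in enumerate(sigma)}
    if any(symbol not in rank for symbol in alpha + beta):
        raise ValueError("both strings must be strings over the alphabet")

    # Condition 1: alpha is a proper prefix of beta.
    if len(alpha) < len(beta) and beta.startswith(alpha):
        return True

    # Condition 2: the symbols before the first mismatch form gamma;
    # the mismatching symbols play the roles of a and b.
    for a, b in zip(alpha, beta):
        if a != b:
            return rank[a] < rank[b]

    # No mismatch: the strings are equal, or beta is a prefix of alpha.
    return False
```

For the alphabet "abc", the string "ab" is smaller than "abc" by the first condition, and "ab" is smaller than "ac" by the second (with γ = "a").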

 

KEMIN: The letters of an alphabet have a precedence relation, so is an alphabet an ordered set or an unordered set?

Prefixes and Suffixes

Let x, y, z, w, v ∈ Σ*, with x = yz and w = yv.

  1. y is a prefix of x.
  2. If z ≠ ε, then y is a proper prefix of x.
  3. z is a suffix of x.
  4. If y ≠ ε, then z is a proper suffix of x.
  5. y is a common prefix of x and w.
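The five statements above translate almost directly into string predicates. This is a small Python sketch under the assumption that strings over Σ are modeled as ordinary Python strings and ε as the empty string; all function names are illustrative.

```python
from typing import List


def is_prefix(y: str, x: str) -> bool:
    """y is a prefix of x: x = yz for some string z."""
    return x.startswith(y)


def is_proper_prefix(y: str, x: str) -> bool:
    """y is a proper prefix of x: x = yz with z not empty."""
    return x.startswith(y) and y != x


def is_suffix(z: str, x: str) -> bool:
    """z is a suffix of x: x = yz for some string y."""
    return x.endswith(z)


def is_proper_suffix(z: str, x: str) -> bool:
    """z is a proper suffix of x: x = yz with y not empty."""
    return x.endswith(z) and z != x


def common_prefixes(x: str, w: str) -> List[str]:
    """Return every string that is a prefix of both x and w, shortest first."""
    result = []
    for length in range(min(len(x), len(w)) + 1):
        if x[:length] != w[:length]:
            break
        result.append(x[:length])
    return result
```

Note that the empty string is a prefix and a suffix of every string, so it always appears among the common prefixes.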

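Other important order relations (such as lexicographic order) and the properties of order relations are covered in the original text.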

Representations

Given the preceding definitions of alphabets and strings, representations of information can be viewed as the mapping of objects into strings in accordance with some rules. That is, formally speaking, a representation or encoding over an alphabet Σ of a set D is a function f from D to 2^(Σ*) that satisfies the following condition: f(e1) and f(e2) are disjoint nonempty sets for each pair of distinct elements e1 and e2 in D.

If Σ is a unary alphabet, then the representation is said to be a unary representation. If Σ is a binary alphabet, then the representation is said to be a binary representation.

In what follows each element in f(e) will be referred to as a representation, or encoding, of e.

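With alphabets and strings defined as above, materializing information (representations of information) can be viewed as mapping information objects to strings according to certain rules. Formally, a materialization, or encoding, over an alphabet Σ of a set D of information objects is a function f from D to 2^(Σ*) that satisfies the following condition: for every pair of distinct elements e1 and e2 in D, f(e1) and f(e2) are disjoint nonempty sets.

If Σ is a unary alphabet, the encoding is called a unary representation; if Σ is a binary alphabet, it is called a binary representation. Note that if e stands for an information object, each element of f(e) is an encoding (a materialization) of e.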

On the other hand, f3 is a unary representation over {1} of the natural numbers if it assigns to the ith natural number the set consisting of the ith alphabetically (= canonically) smallest unary string. In such a case, f3(0) = {ε}, f3(1) = {1}, f3(2) = {11}, f3(3) = {111}, f3(4) = {1111}, . . . , f3(i) = {1^i}, . . .

For example, suppose f3 is a unary encoding of the natural numbers over the alphabet {1} that assigns to the ith natural number the set consisting of the ith alphabetically (= canonically) smallest unary string. Then f3(0) = {ε}, f3(1) = {1}, f3(2) = {11}, f3(3) = {111}, f3(4) = {1111}, . . . , f3(i) = {1^i}, . . .
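The unary representation f3 and the defining condition of a representation (disjoint, nonempty images) can be sketched in a few lines of Python. The names `unary_representation` and `is_valid_representation` are assumptions for illustration, and the check only covers a finite sample of D, since D itself may be infinite.

```python
from typing import Callable, Hashable, Sequence, Set


def unary_representation(i: int) -> Set[str]:
    """f3: map the natural number i to the set containing the string 1^i."""
    if i < 0:
        raise ValueError("natural numbers only")
    return {"1" * i}  # "1" * 0 is the empty string, standing for epsilon


def is_valid_representation(
    f: Callable[[Hashable], Set[str]], sample: Sequence[Hashable]
) -> bool:
    """Check the representation condition on a finite sample of distinct objects:
    every image is nonempty and images of distinct objects are disjoint."""
    images = [f(e) for e in sample]
    if any(len(image) == 0 for image in images):
        return False
    for first in range(len(images)):
        for second in range(first + 1, len(images)):
            if images[first] & images[second]:
                return False
    return True
```

A function that maps every object to the same string, such as `lambda e: {"1"}`, fails the check because its images overlap.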

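KEMIN: Pay particular attention to what the information objects are here: the natural numbers. Because the objects and their encodings are both written with digits, the two are easy to confuse. Compare this example: with Σ = {A, B, C}, we can encode fruits as f(apple) = {A}, f(grape) = {AB}, f(orange) = {ABC}, f(banana) = {AC}, and so on.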

1.2 Formal Languages and Grammars

The universe of strings is a useful medium for the representation of information as long as there exists a function that provides the interpretation for the information carried by the strings.

An interpretation is just the inverse of the mapping that a representation provides, that is, an interpretation is a function g from ∑* to D for some alphabet ∑ and some set D. The string 111, for instance, can be interpreted as the number one hundred and eleven represented by a decimal string, as the number seven represented by a binary string, and as the number three represented by a unary string.

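Strings are an effective medium for materializing information once there is a function that interprets the information they carry. This interpretation function is the inverse of the materialization function: it maps strings back to information objects. That is, an interpretation is a function g from Σ* to D, for some alphabet Σ and some set D. Take the string 111: read as a decimal string it denotes the number one hundred and eleven; read as a binary string it denotes the number seven; read as a unary string it denotes the number three.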
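The three readings of the string 111 can be reproduced with three small interpretation functions. This is only a sketch; the function names are illustrative, and the unary interpreter assumes the alphabet {1}.

```python
def interpret_decimal(string: str) -> int:
    """Interpret a string over {0, ..., 9} as a decimal numeral."""
    return int(string, 10)


def interpret_binary(string: str) -> int:
    """Interpret a string over {0, 1} as a binary numeral."""
    return int(string, 2)


def interpret_unary(string: str) -> int:
    """Interpret a string over {1} as a unary numeral: its length."""
    if set(string) - {"1"}:
        raise ValueError("not a string over the unary alphabet {1}")
    return len(string)


readings = (
    interpret_decimal("111"),  # one hundred and eleven
    interpret_binary("111"),   # seven
    interpret_unary("111"),    # three
)
```

The same string yields three different numbers, which is why sender and receiver must agree on the interpretation function.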

The parties communicating a piece of information do the representing and interpreting. The representation is provided by the sender, and the interpretation is provided by the receiver. The process is the same no matter whether the parties are human beings or programs.

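Most people do not realize that whenever we communicate with others (in natural language), each of us is materializing and interpreting information: the former when we express an opinion, the latter when we make sense of what we hear. In other words, in communication the sender materializes the information and the receiver interprets it, and the process is the same whether the parties are people or programs.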

Consequently, from the point of view of the parties involved, a language can be just a collection of strings because the parties embed the representation and interpretation functions in themselves.

Languages

In general, if ∑ is an alphabet and L is a subset of ∑*, then L is said to be a language over ∑ , or simply a language if ∑ is understood. Each element of L is said to be a sentence or a word or a string of the language.

A language that can be defined by a formal system, that is, by a system that has a finite number of axioms and a finite number of inference rules, is said to be a formal language.

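Let Σ be an alphabet. If a set L is a subset of Σ*, we call L a language over Σ. The elements of L are called the sentences, words, or strings of the language.

A language can also be defined by a formal system, that is, by a finite number of axioms and a finite number of inference rules; such a language is called a formal language.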

Grammars

It is often convenient to specify languages in terms of grammars. The advantage in doing so arises mainly from the usage of a small number of rules for describing a language with a large number of sentences.

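The above defines only the outward form of a language, which is not enough; we must look inside the language. A grammar made up of a small number of rules is often used to characterize a language (its sentences). Because a language may have a very large number of sentences, grammar rules let us capture the unbounded with finite means.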

For instance, the possibility that an English sentence consists of a subject phrase followed by a predicate phrase can be expressed by a grammatical rule of the form <sentence> → <subject><predicate>. (The names in angular brackets are assumed to belong to the grammar metalanguage.) Similarly, the possibility that the subject phrase consists of a noun phrase can be expressed by a grammatical rule of the form <subject> → <noun>. In a similar manner it can also be deduced that "Mary sang a song" is a possible sentence in the language described by the following grammatical rules.

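For example, an English sentence consisting of a subject followed by a predicate can be expressed by the grammatical rule <sentence> → <subject><predicate>. Note that the names in angle brackets belong, by convention, to the grammar's metalanguage; they are not elements of the language being described.

Similarly, a subject consisting of a noun can be expressed by the grammatical rule <subject> → <noun>.

Knowing how grammatical rules work, we can deduce from the following rules that "Mary sang a song" is an English sentence: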

 

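These rules are not enough, however, to describe all correct English sentences: they accept "Mary sang a Mary" as a legal sentence and reject "Mary read a song" as illegal. The rule set above is therefore an incomplete grammatical system for English.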

The grammatical rules above also allow English sentences of the form "Mary sang a song" for other names besides Mary. On the other hand, the rules imply non-English sentences like "Mary sang a Mary," and do not allow English sentences like "Mary read a song." Therefore, the set of grammatical rules above consists of an incomplete grammatical system for specifying the English language.
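The rule set itself is not reproduced on this page, so the sketch below uses an assumed set of rules chosen only to be consistent with the statements above: it derives "Mary sang a song" and "Mary sang a Mary" but not "Mary read a song". The dictionary `RULES` and the function `expand` are hypothetical names, and the rules should not be read as the book's exact grammar.

```python
from itertools import product
from typing import List

# Assumed rules (a reconstruction for illustration, not quoted from the source).
RULES = {
    "<sentence>": [["<subject>", "<predicate>"]],
    "<subject>": [["<noun>"]],
    "<predicate>": [["<verb>", "<article>", "<noun>"]],
    "<noun>": [["Mary"], ["song"]],
    "<verb>": [["sang"]],
    "<article>": [["a"]],
}


def expand(symbol: str) -> List[List[str]]:
    """Return every word sequence derivable from symbol.

    The assumed grammar has no recursive rules, so the recursion terminates.
    """
    if symbol not in RULES:
        return [[symbol]]  # a word of the language, not a metalanguage name
    results = []
    for body in RULES[symbol]:
        # Combine every choice of expansion for each symbol in the rule body.
        for parts in product(*(expand(item) for item in body)):
            results.append([word for part in parts for word in part])
    return results


SENTENCES = {" ".join(words) for words in expand("<sentence>")}
```

Under these assumed rules the language has exactly four sentences, including the non-English "Mary sang a Mary", which illustrates why the grammar is incomplete for English.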

As this shows, languages and their properties need to be studied, and the grammars of languages are themselves objects of study.

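A formal grammar G is a four-tuple (N, Σ, P, S) consisting of the following elements:

  • A set N of nonterminal symbols.
  • A set Σ of terminal symbols, disjoint from N.
  • A set P of production rules, each of the form (string in (Σ ∪ N)*) → (string in (Σ ∪ N)*), where the string on the left-hand side must contain at least one nonterminal symbol.
  • A start symbol S, which belongs to N.

The language generated by a formal grammar G = (N, Σ, P, S) is the set of all strings that consist solely of symbols from the terminal set Σ and that can be obtained from the start symbol S by repeatedly applying production rules from P.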
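To make "repeatedly applying production rules" concrete, here is a minimal Python sketch that searches breadth-first through the strings derivable from the start symbol and collects the short terminal strings. The example grammar (N = {S}, Σ = {a, b}, P = {S → aSb, S → ε}) and the function name `generate` are assumptions for illustration. The length bound is a pruning heuristic that suits this example; grammars with many erasing rules may need more slack.

```python
from collections import deque
from typing import Iterable, List, Set, Tuple


def generate(
    productions: List[Tuple[str, str]],
    start: str,
    terminals: Iterable[str],
    max_len: int,
) -> Set[str]:
    """Collect the terminal strings of length <= max_len reachable from start."""
    terminal_set = set(terminals)
    seen = {start}
    queue = deque([start])
    language = set()
    while queue:
        form = queue.popleft()
        if all(symbol in terminal_set for symbol in form):
            if len(form) <= max_len:
                language.add(form)
            continue
        for left, right in productions:
            # Apply the rule left -> right at every position where left occurs.
            position = form.find(left)
            while position != -1:
                new_form = form[:position] + right + form[position + len(left):]
                # Heuristic pruning: allow one extra symbol for the nonterminal.
                if len(new_form) <= max_len + 1 and new_form not in seen:
                    seen.add(new_form)
                    queue.append(new_form)
                position = form.find(left, position + 1)
    return language


# Assumed example grammar: S -> aSb | epsilon, generating a^n b^n.
EXAMPLE = generate([("S", "aSb"), ("S", "")], "S", {"a", "b"}, 4)
```

With a bound of 4, the search finds the empty string, "ab", and "aabb".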

Classification of Grammars

 

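Certain types of grammars and the languages they generate have been studied in detail and given their own names. The most common classification of grammars is the Chomsky hierarchy, developed by Noam Chomsky in 1956, which divides all grammars into four types: unrestricted grammars, context-sensitive grammars, context-free grammars, and regular grammars. The corresponding classes of languages are the recursively enumerable languages, the context-sensitive languages, the context-free languages, and the regular languages. The four grammar types have successively stricter production rules, and the classes of languages they can express become successively smaller. Although they are less expressive than unrestricted and context-sensitive grammars, context-free grammars and regular grammars are the most important of the four because they can be implemented efficiently. For example, for suitable context-free grammars there are algorithms that generate efficient LL parsers and LR parsers.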

Programming languages

A compiler usually has two distinct components. A lexical analyzer, generated by a tool like lex, identifies the tokens of the programming language grammar, e.g. identifiers or keywords, which are themselves expressed in a simpler formal language, usually by means of regular expressions. At the most basic conceptual level, a parser, usually generated by a parser generator like yacc, attempts to decide if the source program is valid, that is, whether it belongs to the programming language for which the compiler was built. Of course, compilers do more than just parse the source code; they usually translate it into some executable format. Because of this, a parser usually outputs more than a yes/no answer, typically an abstract syntax tree, which is used by subsequent stages of the compiler to eventually generate an executable containing machine code that runs directly on the hardware, or some intermediate code that requires a virtual machine to execute.
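As a rough illustration of the lexical-analysis step, the sketch below describes tokens with regular expressions using Python's standard `re` module. It is written by hand and is not output of lex; the token names and the tiny keyword list are assumptions borrowed from the instruction forms used later in this chapter.

```python
import re
from typing import List, Tuple

# Each token class is described by a regular expression (assumed classes).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("KEYWORD", r"\b(?:if|then|do|until|read|write)\b"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("ASSIGN", r":="),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(source: str) -> List[Tuple[str, str]]:
    """Split source into (token class, text) pairs, dropping whitespace."""
    tokens = []
    position = 0
    while position < len(source):
        match = TOKEN_RE.match(source, position)
        if match is None:
            raise SyntaxError(
                f"unexpected character {source[position]!r} at position {position}"
            )
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens
```

A parser would then consume this token list and decide whether it forms a valid program, typically building an abstract syntax tree along the way.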

1.3 Programs

Our deep dependency on the processing of information brought about the deployment of programs in an ever increasing array of applications. Programs can be found at home, at work, and in businesses, libraries, hospitals, and schools. They are used for learning, playing games, typesetting, directing telephone calls, providing medical diagnostics, forecasting weather, flying airplanes, and for many other purposes.

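In fact we deal with information at every moment, yet we are not aware that we are "programming" all the time. Programs can be found at home, in offices, libraries, hospitals, schools, and elsewhere.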

To facilitate the task of writing programs for the multitude of different applications, numerous programming languages have been developed. The diversity of programming languages reflects the different interpretations that can be given to information. However, from the perspective of their power to express computations, there is very little difference among them. Consequently, different programming languages can be used in the study of programs.

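To meet the many different programming needs of everyday life, a variety of programming languages have been designed and developed. Their diversity reflects only the different interpretations that can be given to information (both data and operations); in terms of their power to express computations, they differ very little. Consequently, different programming languages can be used in the study of programs.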

The study of programs can benefit, however, from fixing the programming language in use. This enables a unified discussion about programs. The choice, however, must be for a language that is general enough to be relevant to all programs but primitive enough to simplify the discussion.

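The study of programs nevertheless benefits from fixing the programming language in use, because programs can then be discussed in a unified way. The chosen language should, however, be general enough to be relevant to all programs and primitive enough to keep the discussion simple.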

Here, a program is defined as a finite sequence of instructions over some domain D. The domain D, called the domain of the variables, is assumed to be a set of elements with a distinguished element, called the initial value of the variables. Each of the elements in D is assumed to be a possible assignment of a value to the variables of the program. The sequence of instructions is assumed to consist of instructions of the following form.

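Here, a program is defined as a finite sequence of instructions over some domain D. The domain D, called the domain of the variables, is assumed to be a set of elements with one distinguished element, called the initial value of the variables. Each element of D is assumed to be a possible value of the program's variables. The instructions of a program are assumed to take the following forms: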

1. Read instructions of the form

read x

where x is a variable.

2. Write instructions of the form

write x

where x is a variable.

3. Deterministic assignment instructions of the form

y := f(x1, . . . , xm)

where x1, . . . , xm, and y are variables, and f is a function from D^m to D.

4. Conditional if instructions of the form

if Q(x1, . . . , xm) then I

where I is an instruction, x1, . . . , xm are variables, and Q is a predicate from D^m to {false, true}.

5. Deterministic looping instructions of the form

do
a
until Q(x1, . . . , xm)

where a is a nonempty sequence of instructions, x1, . . . , xm are variables, and Q is a predicate from D^m to {false, true}.

6. Conditional accept instructions of the form

if eof then accept

 

 

 

 
