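According to the Javadoc, the StreamTokenizer class takes an input stream and parses it into "tokens", allowing the tokens to be read one at a time. The parsing process is controlled by a table and a number of flags that can be set to various states. The stream tokenizer can recognize identifiers, numbers, quoted strings, and various comment styles.
In short, it is a class that breaks a source file into tokens, each of which belongs to a category such as number, word, end of line, or end of file.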
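To make the token categories concrete, here is a minimal sketch that labels every token of a short input. The class name TokenKindsDemo, the helper describe and the label strings are made up for illustration; only the StreamTokenizer calls come from the JDK.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenKindsDemo {
    // Hypothetical helper: labels each token in the input with its category.
    static List<String> describe(String input) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(input));
        tokenizer.eolIsSignificant(true); // report line ends as TT_EOL tokens
        List<String> kinds = new ArrayList<>();
        try {
            int token;
            while ((token = tokenizer.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (token) {
                    case StreamTokenizer.TT_WORD:
                        kinds.add("WORD(" + tokenizer.sval + ")");
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        kinds.add("NUMBER(" + tokenizer.nval + ")");
                        break;
                    case StreamTokenizer.TT_EOL:
                        kinds.add("EOL");
                        break;
                    case '"':
                    case '\'':
                        // for a quoted string, the token type is the quote character itself
                        kinds.add("QUOTED(" + tokenizer.sval + ")");
                        break;
                    default:
                        // any other character is returned as its own token
                        kinds.add("CHAR(" + (char) token + ")");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        kinds.add("EOF");
        return kinds;
    }

    public static void main(String[] args) {
        System.out.println(describe("count = 42\n\"hi\""));
    }
}
```

With the default syntax table, the input above yields a word, an ordinary character, a number, an end-of-line token and a quoted string, followed by the end of the stream.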
This article uses the following source file for the demonstrations:
package com.iteye.liugang594.java.thread;

import java.util.concurrent.CountDownLatch;

public class TestCountDownLatch {

    private int number = 30;
    private long seconds = 40000L;

    /**
     * @param args
     * @throws InterruptedException
     */
    public static void main(String[] args) throws InterruptedException {

        //create countdown
        final CountDownLatch countDownLatch = new CountDownLatch(10);

        for(int i = 0;i< 10;i++){
            new Thread("Thread "+i){
                public void run() {
                    try {
                        //wait
                        countDownLatch.await();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    System.out.println(getName()+" started");
                }
            }.start();
            Thread.sleep(50);
            //count down
            countDownLatch.countDown();
        }
    }
}
1. Extracting numbers
First, let's see how to extract the numbers from the source above. There are six of them: 30, 40000L, 10, 0, 10 and 50.
// reader: a java.io.Reader over the sample source file
StreamTokenizer tokenizer = new StreamTokenizer(reader);
tokenizer.parseNumbers(); // already the default for a new tokenizer; shown for clarity
int nextToken = tokenizer.nextToken();
while (nextToken != StreamTokenizer.TT_EOF) {
    if (nextToken == StreamTokenizer.TT_NUMBER) {
        System.out.println("number " + tokenizer.nval + " on line " + tokenizer.lineno());
    }
    nextToken = tokenizer.nextToken();
}
Output:
number 30.0 on line 7
number 40000.0 on line 8
number 10.0 on line 17
number 0.0 on line 19
number 10.0 on line 19
number 0.0 on line 30
number 50.0 on line 31
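There is one problem: line 30 contains no number, yet 0.0 is reported. I consider this a quirk (arguably a bug) of StreamTokenizer: with number parsing enabled, '.' counts as a numeric character, so the dot that follows the closing brace in }.start() begins a number token whose value is 0.0. Dots inside names such as countDownLatch.await are absorbed into the surrounding word, which is why only line 30 is affected. Deleting that dot gives the expected result. Alternatively, treat the dot as an ordinary character, for example: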
StreamTokenizer tokenizer = new StreamTokenizer(reader);
tokenizer.parseNumbers();
tokenizer.ordinaryChar('.'); // a '.' no longer starts a number token
int nextToken = tokenizer.nextToken();
while (nextToken != StreamTokenizer.TT_EOF) {
    if (nextToken == StreamTokenizer.TT_NUMBER) {
        System.out.println("number " + tokenizer.nval + " on line " + tokenizer.lineno());
    }
    nextToken = tokenizer.nextToken();
}
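The stray 0.0 can be reproduced on a three-line input without the full sample file. The following is a minimal sketch; the class NumberExtractor, the method extractNumbers and its dotIsOrdinary flag are hypothetical names introduced here for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class NumberExtractor {
    // Hypothetical helper: returns the value of every TT_NUMBER token in the text.
    // When dotIsOrdinary is true, '.' is first demoted to an ordinary character.
    static List<Double> extractNumbers(String source, boolean dotIsOrdinary) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(source));
        if (dotIsOrdinary) {
            tokenizer.ordinaryChar('.');
        }
        List<Double> numbers = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    numbers.add(tokenizer.nval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        return numbers;
    }

    public static void main(String[] args) {
        String source = "int n = 30;\n}.start();\nsleep(50);";
        System.out.println(extractNumbers(source, false)); // includes the stray 0.0
        System.out.println(extractNumbers(source, true));  // only 30.0 and 50.0
    }
}
```

The second line of the input mirrors line 30 of the sample file: a dot that directly follows an ordinary character rather than a word.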
2. Removing comments
The sample source file contains some comments. Here is how to strip them:
StreamTokenizer tokenizer = new StreamTokenizer(reader);
tokenizer.resetSyntax(); // reset every character to ordinary
tokenizer.slashSlashComments(true); // recognize // comments
tokenizer.slashStarComments(true); // recognize /* */ comments
tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE); // treat every character as part of a word
tokenizer.commentChar('/'); // set the comment character
tokenizer.quoteChar('"'); // set the quote character; without it, a / inside a string would be treated as a comment
int token = tokenizer.nextToken();
while (token != StreamTokenizer.TT_EOF) { // continue until the end of the file
    if (token == '"') { // a quoted string: print the opening quote
        System.out.print((char) token);
    }
    if (tokenizer.sval != null) { // print the token content, if any
        System.out.print(tokenizer.sval);
    }
    if (token == '"') { // print the closing quote
        System.out.print((char) token);
    }
    token = tokenizer.nextToken();
}
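Note: quoted strings must be parsed as well; otherwise the //hello" part of "//hello" would be treated as a comment and dropped. Also be aware that, because of commentChar('/'), a lone / (a division operator, for instance) starts a comment that runs to the end of the line; the sample file contains none.
Checking the printed output shows that all comments have been removed and only the body of the source remains.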
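The following sketch wraps the same idea in a method that returns the stripped text. It deliberately differs from the snippet above in one respect: it uses ordinaryChar('/') instead of commentChar('/'), so a lone slash comes back as its own token and can be written out again. The names CommentStripper and stripComments are hypothetical.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CommentStripper {
    // Hypothetical helper: returns the source text without // and /* */ comments.
    static String stripComments(String source) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(source));
        tokenizer.resetSyntax();            // start from an empty syntax table
        tokenizer.wordChars(0, 255);        // every character is part of a word...
        tokenizer.ordinaryChar('/');        // ...except '/', so comments can be detected
        tokenizer.quoteChar('"');           // ...and '"', so strings are kept intact
        tokenizer.slashSlashComments(true); // skip // comments
        tokenizer.slashStarComments(true);  // skip /* */ comments
        StringBuilder out = new StringBuilder();
        try {
            int token;
            while ((token = tokenizer.nextToken()) != StreamTokenizer.TT_EOF) {
                if (token == StreamTokenizer.TT_WORD) {
                    out.append(tokenizer.sval);
                } else if (token == '"') {
                    // sval holds the string body without the surrounding quotes
                    out.append('"').append(tokenizer.sval).append('"');
                } else {
                    out.append((char) token); // a lone '/', e.g. a division operator
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "int a = 4 / 2; // half\nString s = \"//not a comment\"; /* gone */\n";
        System.out.print(stripComments(source));
    }
}
```

One limitation applies to both versions: StreamTokenizer decodes escape sequences such as \n inside quoted strings, so string literals that contain escapes are not reproduced character for character.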
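3. Extracting strings
This is similar to the previous section, except that this time only the content of the string tokens is printed: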
StreamTokenizer tokenizer = new StreamTokenizer(reader);
tokenizer.resetSyntax(); // reset every character to ordinary
tokenizer.slashSlashComments(true); // recognize // comments
tokenizer.slashStarComments(true); // recognize /* */ comments
tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE); // treat every character as part of a word
tokenizer.commentChar('/'); // set the comment character
tokenizer.quoteChar('"'); // set the quote character
int token = tokenizer.nextToken();
while (token != StreamTokenizer.TT_EOF) { // continue until the end of the file
    if (token == '"') { // a quoted string: sval holds its content
        System.out.println(tokenizer.sval);
    }
    token = tokenizer.nextToken();
}
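Note that comments must still be recognized by the tokenizer; otherwise strings that appear inside comments would be extracted too, for example: // "hello world"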
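As a small check of that note, the sketch below uses the same configuration as the snippet above and collects the string literals into a list. The names StringLiteralExtractor and extractStrings are hypothetical.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StringLiteralExtractor {
    // Hypothetical helper: returns the content of every double-quoted string
    // that appears outside of comments.
    static List<String> extractStrings(String source) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(source));
        tokenizer.resetSyntax();
        tokenizer.slashSlashComments(true);
        tokenizer.slashStarComments(true);
        tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE);
        tokenizer.commentChar('/');
        tokenizer.quoteChar('"');
        List<String> strings = new ArrayList<>();
        try {
            int token;
            while ((token = tokenizer.nextToken()) != StreamTokenizer.TT_EOF) {
                if (token == '"') {
                    strings.add(tokenizer.sval); // content without the quotes
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        return strings;
    }

    public static void main(String[] args) {
        String source = "String a = \"x\"; // \"hidden\"\n/* \"also hidden\" */ String b = \"y\";";
        System.out.println(extractStrings(source));
    }
}
```

The two quoted strings inside the comments are skipped because the tokenizer discards the comments before it ever sees their quotes.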
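4. Removing blank lines
Removing blank lines reduces the size of a file, which is useful, for example, before sending it over a network. The following snippet removes blank lines: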
StreamTokenizer tokenizer = new StreamTokenizer(reader);
tokenizer.resetSyntax(); // clear the syntax table before configuring it
// treat every character as part of a word
tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE);
// report each line end as a token; this requires \r and \n to be whitespace
tokenizer.eolIsSignificant(true);
tokenizer.whitespaceChars('\r', '\r');
tokenizer.whitespaceChars('\n', '\n');
int nextToken = tokenizer.nextToken();
while (nextToken != StreamTokenizer.TT_EOF) {
    // print the content between line ends if it is not blank
    if (nextToken != StreamTokenizer.TT_EOL) {
        if (tokenizer.sval != null && !"".equals(tokenizer.sval.trim())) {
            System.out.println(tokenizer.sval);
        }
    }
    nextToken = tokenizer.nextToken();
}
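The code first clears the syntax table, then treats every character in the Character range as part of a word, and then makes the end of a line a token of its own. \r and \n must be declared as whitespace characters for end-of-line detection to take effect. During scanning, any token other than an end of line is skipped if its content is blank and printed otherwise.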
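The same steps can be packaged as a method that returns the compacted text instead of printing it. This is a minimal sketch; the names BlankLineRemover and removeBlankLines are hypothetical, and ending every kept line with \n is a choice made for this example.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BlankLineRemover {
    // Hypothetical helper: returns the text without empty or whitespace-only lines.
    // Each kept line is terminated with '\n', whatever the original line ending was.
    static String removeBlankLines(String text) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        tokenizer.resetSyntax();               // clear the syntax table
        tokenizer.wordChars(0, 255);           // every character is part of a word...
        tokenizer.whitespaceChars('\r', '\r'); // ...except the line terminators
        tokenizer.whitespaceChars('\n', '\n');
        tokenizer.eolIsSignificant(true);      // report line ends as TT_EOL
        StringBuilder out = new StringBuilder();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                // a TT_WORD token is the full content of one line
                if (tokenizer.ttype == StreamTokenizer.TT_WORD
                        && !tokenizer.sval.trim().isEmpty()) {
                    out.append(tokenizer.sval).append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for a StringReader
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(removeBlankLines("a\n\n  \n\tb\r\n\r\nc"));
    }
}
```

Indentation on the kept lines is preserved, because the leading whitespace is part of the word token for that line.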