Java: reading words, trying to read two words at a time from a file

Latest recommended article published on 2022-07-12 20:21:25

雪山战鹰


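This post discusses a problem encountered when trying to read pairs of words from a text file in Java and store them in a TreeSet. In the code sample, Scanner splits its input on whitespace by default, so it can never match two consecutive words as one token. The solution is to read the file line by line, parse the words with split(), and build the word pairs from them.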


I'm trying to write a simple program that reads a text file and stores pairs of words in a Set. Here is the code I wrote for that:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.TreeSet;

public class Main {
    public static void main(String[] args) {
        TreeSet<String> phraseSet = new TreeSet<String>();
        try {
            Scanner readfile = new Scanner(new File("data.txt"));
            // Intended to match two words at a time
            while (readfile.hasNext("\\w{2}")) {
                String phrase = readfile.next("\\w{2}");
                phraseSet.add(phrase);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        for (String p : phraseSet) {
            System.out.println(p);
        }
    }
}

The code compiles but prints out a blank line (The while loop is never entered).

The data.txt file contents are:

There are seven words in this line.

And then there are few more words in this line.

I'm expecting the following Strings in my TreeSet (in sorted order, of course):

There are

are seven

seven words

words in

in this

this line

line And

And then

then there

there are

....

this line

Solution

Your main problem is that Scanner by default parses tokens by whitespace.

According to the API:

A Scanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace. The resulting tokens may then be converted into values of different types using the various next methods.

If you take a look at next(String pattern), you'll see that it

Returns the next token if it matches the pattern constructed from the specified string. If the match is successful, the scanner advances past the input that matched the pattern.

(emphasis mine)

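In other words, by the time you ask the Scanner to check a token against your pattern, it has already split the input on whitespace, so asking for a token with a space in the middle will always fail. Note also that \\w{2} matches exactly two word characters, not two words: the first token, "There", does not match it, which is why the while loop is never entered.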
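The token-by-token behaviour is easy to check in isolation. The following is a minimal sketch, not part of the original answer: the class name TokenMatchDemo and the helper firstTokenMatches are made up for illustration, and the Scanner reads from a String rather than from data.txt so that the example is self-contained.

```java
import java.util.Scanner;

public class TokenMatchDemo {

    // Hypothetical helper: reports whether the FIRST whitespace-delimited
    // token of the input matches the given regular expression.
    static boolean firstTokenMatches(String input, String regex) {
        try (Scanner sc = new Scanner(input)) {
            return sc.hasNext(regex);
        }
    }

    public static void main(String[] args) {
        // First token is "There" (five characters), so \w{2} does not match.
        System.out.println(firstTokenMatches("There are seven", "\\w{2}"));   // false

        // First token is "in" (two characters), so \w{2} matches.
        System.out.println(firstTokenMatches("in this line", "\\w{2}"));      // true

        // A pattern containing a space cannot match a single token,
        // because tokens never contain the delimiter.
        System.out.println(firstTokenMatches("There are seven", "\\w+ \\w+")); // false
    }
}
```

The third call is the case the question was really after: even a pattern that does describe two words fails, because the Scanner only ever offers one token at a time.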

A better way to do this would be to have the Scanner read in a line at a time, and then just split() the line and parse it yourself:

Scanner readfile = new Scanner(new File("data.txt"));
while (readfile.hasNextLine()) {
    String[] words = readfile.nextLine().split("\\s");
    // Pair each word with the one that follows it on the same line
    for (int i = 0; i < words.length - 1; i++) {
        phraseSet.add(words[i] + " " + words[i+1]);
    }
}

Your question didn't explicitly mention it, but from your example output, it looks like you want to ignore line breaks in reading. This approach makes that slightly more complicated, but you can just store off the last word of each line and add it when parsing the next, like so:

String lastWord = null;
while (readfile.hasNextLine()) {
    String[] words = readfile.nextLine().split("\\s");
    // Bridge the line break: pair the previous line's last word with this line's first
    if (lastWord != null) {
        phraseSet.add(lastWord + " " + words[0]);
    }
    for (int i = 0; i < words.length - 1; i++) {
        phraseSet.add(words[i] + " " + words[i+1]);
    }
    lastWord = words[words.length-1];
}

If this is actually what you're looking for, you're probably better off just using next() to pull the words out one at a time, as other answers have shown.
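A minimal sketch of that next()-based approach might look like the following. It is an illustration rather than the code from those other answers: the class name WordPairs and the method adjacentPairs are hypothetical, the input is a String instead of data.txt to keep the example self-contained, and stripping punctuation from the ends of each word is an assumption made so that "line." becomes "line", as in the expected output.

```java
import java.util.Scanner;
import java.util.TreeSet;

public class WordPairs {

    // Hypothetical helper: collects every pair of adjacent words.
    // Line breaks are ignored because next() treats them as ordinary whitespace.
    static TreeSet<String> adjacentPairs(String text) {
        TreeSet<String> pairs = new TreeSet<>();
        try (Scanner sc = new Scanner(text)) {
            String previous = null;
            while (sc.hasNext()) {
                // Assumption: drop non-word characters at either end of the token.
                String word = sc.next().replaceAll("^\\W+|\\W+$", "");
                if (word.isEmpty()) {
                    continue; // the token was punctuation only
                }
                if (previous != null) {
                    pairs.add(previous + " " + word);
                }
                previous = word;
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String text = "There are seven words in this line.\n"
                    + "And then there are few more words in this line.";
        for (String pair : adjacentPairs(text)) {
            System.out.println(pair);
        }
    }
}
```

Because only the previous word has to be remembered, there is no special case for line boundaries. To read from a file instead, the Scanner could be constructed from new File("data.txt") as in the question.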

To sum up

With its default whitespace delimiter, Scanner cannot look for multi-word tokens directly; you'll have to do the parsing yourself.