I'm trying to write a simple program that reads a text file and stores pairs of words in a Set. Here is the code I wrote for that:
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.TreeSet;
public class Main {
    public static void main(String[] args) {
        TreeSet<String> phraseSet = new TreeSet<String>();
        try {
            Scanner readfile = new Scanner(new File("data.txt"));
            while (readfile.hasNext("\\w{2}")) {
                String phrase = readfile.next("\\w{2}");
                phraseSet.add(phrase);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        for (String p : phraseSet) {
            System.out.println(p);
        }
    }
}
The code compiles but prints out a blank line (the while loop is never entered).
The data.txt file contents are:
There are seven words in this line.
And then there are few more words in this line.
I'm expecting the following Strings in my TreeSet (of course, in sorted order):
There are
are seven
seven words
words in
in this
this line
line And
And then
then there
there are
....
this line
Solution
Your main problem is that Scanner by default parses tokens by whitespace.
According to the API:
A Scanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace. The resulting tokens may then be converted into values of different types using the various next methods.
If you take a look at next(String pattern), you'll see that it
Returns the next token if it matches the pattern constructed from the specified string. If the match is successful, the scanner advances past the input that matched the pattern.
(The key words are "the next token".)
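In other words, by the time you ask the Scanner to check a token, it has already split the input on whitespace, so a pattern meant to match a token with a space in the middle will always fail. Note also that \\w{2} matches exactly two word characters, not two words: the first token, There, has five characters, so hasNext("\\w{2}") returns false immediately and the loop is never entered.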
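To see the tokenizing behaviour in isolation, here is a small sketch that feeds a String to a Scanner instead of a file. The class name TokenDemo and the helper firstTokenMatches are made up for this illustration; only Scanner(String) and hasNext(String) come from the standard library.

```java
import java.util.Scanner;

public class TokenDemo {
    // Hypothetical helper: does the first whitespace-delimited token
    // of the input match the given pattern in full?
    static boolean firstTokenMatches(String input, String pattern) {
        try (Scanner sc = new Scanner(input)) {
            return sc.hasNext(pattern);
        }
    }

    public static void main(String[] args) {
        String line = "There are seven words in this line.";
        // The first token is "There" (five characters), so \w{2} fails.
        System.out.println(firstTokenMatches(line, "\\w{2}"));       // false
        // The same token does match a five-character pattern.
        System.out.println(firstTokenMatches(line, "\\w{5}"));       // true
        // A pattern containing whitespace can never match a single token.
        System.out.println(firstTokenMatches(line, "\\w+\\s\\w+"));  // false
    }
}
```

The last call is the case from the question: the token has already been cut at the space, so a two-word pattern has nothing to match against.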
A better way to do this would be to have the Scanner read in a line at a time, and then just split() the line and parse it yourself:
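Scanner readfile = new Scanner(new File("data.txt"));
while (readfile.hasNextLine()) {
    // Split the line into words on runs of whitespace.
    String[] words = readfile.nextLine().trim().split("\\s+");
    // Pair each word with the word that follows it on the same line.
    for (int i = 0; i < words.length - 1; i++) {
        phraseSet.add(words[i] + " " + words[i + 1]);
    }
}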
Your question didn't explicitly mention it, but from your example output, it looks like you want pairs to continue across line breaks. This approach makes that slightly more complicated, but you can just store the last word of each line and pair it with the first word of the next line, like so:
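String lastWord = null;
while (readfile.hasNextLine()) {
    String line = readfile.nextLine().trim();
    if (line.isEmpty()) {
        continue; // skip blank lines so they don't produce empty words
    }
    String[] words = line.split("\\s+");
    // Pair the last word of the previous line with the first word of this one.
    if (lastWord != null) {
        phraseSet.add(lastWord + " " + words[0]);
    }
    for (int i = 0; i < words.length - 1; i++) {
        phraseSet.add(words[i] + " " + words[i + 1]);
    }
    lastWord = words[words.length - 1];
}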
If this is actually what you're looking for, you're probably better off just using next() to pull each word one at a time, as other answers have shown.
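For reference, a minimal sketch of that next()-based approach might look like the following. The class name WordPairs and the method adjacentPairs are invented for this example, and the sample text is inlined as a String (an assumption, so the sketch runs without data.txt).

```java
import java.util.Scanner;
import java.util.TreeSet;

public class WordPairs {
    // Hypothetical helper: collects every pair of adjacent whitespace-delimited
    // words. Line breaks count as whitespace, so pairs span lines.
    static TreeSet<String> adjacentPairs(Scanner in) {
        TreeSet<String> pairs = new TreeSet<>();
        String previous = null;
        while (in.hasNext()) {
            String current = in.next();
            if (previous != null) {
                pairs.add(previous + " " + current);
            }
            previous = current;
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Assumption: inlined sample text; use new Scanner(new File("data.txt"))
        // to read the real file instead.
        String text = "There are seven words in this line.\n"
                    + "And then there are few more words in this line.";
        try (Scanner in = new Scanner(text)) {
            for (String pair : adjacentPairs(in)) {
                System.out.println(pair);
            }
        }
    }
}
```

Because tokens are split on whitespace only, punctuation stays attached to the words (the pair across the line break comes out as "line. And"), so it would need to be stripped separately if you want "line And".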
To sum up:
With its default whitespace delimiter, Scanner cannot match multi-word tokens directly; you'll have to do the parsing yourself.