project1
For this project, you will write a Java program that processes all text files in a directory and its subdirectories, cleans and parses the text into word stems, and builds an in-memory inverted index. The index stores the mapping from each word stem to the documents, and the positions within those documents, where that stem was found.
For example, suppose we have the following mapping stored in our inverted index:
{
	"capybara": {
		"input/mammals.txt": [
			11
		]
	},
	"platypus": {
		"input/dangerous/venomous.txt": [
			2
		],
		"input/mammals.txt": [
			3,
			8
		]
	}
}
This indicates that after processing the word stems from files, the word capybara is found in the file input/mammals.txt in position 11. The word platypus is found in two files, input/mammals.txt and input/dangerous/venomous.txt. In the file input/mammals.txt, the word platypus appears twice, in positions 3 and 8. In the file input/dangerous/venomous.txt, the word platypus is in position 2.
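The nested mapping in the example above could be represented by nesting sorted collections. The following is only a minimal sketch under that assumption; the class name InvertedIndexSketch and its methods are hypothetical and not part of the project requirements.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: word stem -> (location -> sorted positions). */
class InvertedIndexSketch {
	// TreeMap and TreeSet keep stems, locations, and positions in sorted order.
	private final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();

	/** Records that a stem was found at a 1-based position in a location. */
	public void add(String stem, String location, int position) {
		index.computeIfAbsent(stem, key -> new TreeMap<>())
				.computeIfAbsent(location, key -> new TreeSet<>())
				.add(position);
	}

	/** Returns the positions of a stem in a location, or an empty set if none. */
	public Set<Integer> positions(String stem, String location) {
		TreeMap<String, TreeSet<Integer>> locations = index.get(stem);
		if (locations == null || !locations.containsKey(location)) {
			return Collections.emptySet();
		}
		return Collections.unmodifiableSet(locations.get(location));
	}

	@Override
	public String toString() {
		return index.toString();
	}

	public static void main(String[] args) {
		// Rebuilds the example mapping; insertion order does not matter.
		InvertedIndexSketch index = new InvertedIndexSketch();
		index.add("platypus", "input/mammals.txt", 8);
		index.add("platypus", "input/mammals.txt", 3);
		index.add("platypus", "input/dangerous/venomous.txt", 2);
		index.add("capybara", "input/mammals.txt", 11);
		System.out.println(index);
	}
}
```

Sorted collections are one possible choice here because the output must be alphabetically sorted; other designs that sort at output time would also work.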
The process of stemming reduces a word to a base form (or “stem”), so that words like interesting, interested, and interests all map to the stem interest. Stemming is a common preprocessing step in many web search engines.
Functionality
The core functionality of your project must satisfy the following requirements:
Process command-line arguments to determine the input to process and output to produce. See the Input and Output sections below for specifics.
Create a custom inverted index data structure that stores a mapping from a word stem to the location(s) the word was found, and the position(s) in that file the word is located. The positions should start at 1. This will require nesting multiple built-in data structures.
If provided a directory as input, find all files within that directory and all subdirectories and parse each text file found. Any file that ends in the .text or .txt extension (case insensitive) should be considered a text file. If provided a single file as input, only parse that individual file (regardless of its extension).
Use the UTF-8 character encoding for all file processing, including reading and writing.
Process text files into word stems by removing any non-letter symbols (including digits, punctuation, accents, special characters), converting the remaining alphabetic characters to lowercase, splitting the text into words by whitespace, and then stemming each word using the Apache OpenNLP toolkit.
Use the regular expression (?U)[^\p{Alpha}\p{Space}]+ to remove special characters from text.
Use the regular expression (?U)\p{Space}+ to split text into words by whitespace.
Use the SnowballStemmer English stemming algorithm in OpenNLP to stem words.
If the appropriate command-line arguments are provided, output the inverted index in pretty JSON format. See the Output section below for specifics.
Output user-friendly error messages in the case of exceptions or invalid input. Under no circumstance should your main() method output a stack trace to the user!
The functionality of your project will be evaluated with the Project1Test.java group of JUnit tests.
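The cleaning and splitting steps listed above can be sketched with the two regular expressions from the requirements. This is a hedged illustration only: the class and method names are hypothetical, the Normalizer step is an assumption about how accents might be separated from their base letters before removal, and the stemming step is omitted because it needs the OpenNLP library on the classpath.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical sketch of the clean-and-split steps (stemming not shown). */
class TextCleanerSketch {
	/** Matches runs of characters that are neither letters nor whitespace. */
	private static final Pattern CLEAN = Pattern.compile("(?U)[^\\p{Alpha}\\p{Space}]+");

	/** Matches runs of whitespace. */
	private static final Pattern SPLIT = Pattern.compile("(?U)\\p{Space}+");

	/** Removes non-letter symbols and converts the text to lowercase. */
	static String clean(String text) {
		// Assumption: NFD decomposition splits accented letters into a base
		// letter plus a combining mark, which the CLEAN regex then removes.
		String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
		return CLEAN.matcher(decomposed).replaceAll("").toLowerCase(Locale.ROOT);
	}

	/** Cleans the text and splits it into non-empty words by whitespace. */
	static List<String> parse(String text) {
		List<String> words = new ArrayList<>();
		for (String word : SPLIT.split(clean(text))) {
			// Leading whitespace produces an empty first element; skip it.
			if (!word.isEmpty()) {
				words.add(word);
			}
		}
		return words;
	}

	public static void main(String[] args) {
		System.out.println(parse("Hello, World! 123 It's"));
	}
}
```

Each word returned by parse would then be passed to the stemmer, and its 1-based position in the list would be the position stored in the index.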
Input
Your main method must be placed in a class named Driver. The Driver class should accept the following command-line arguments:
-text path where the flag -text indicates the next argument is a path to either a single file or a directory. If the value is a file, open and process that file regardless of its extension. If the value is a directory, find and process all of the text files (with .txt and .text extensions) in that directory and its subdirectories.
-index path where the flag -index is an optional flag that indicates the next argument is the path to use for the inverted index output file. If the path argument is not provided, use index.json as the default output path. If the -index flag is provided, always generate output even if it is empty. If the -index flag is not provided, do not produce an output file.
The command-line flag/value pairs may be provided in any order. Do not convert paths to absolute form when processing command-line input!
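One way to handle flag/value pairs in any order is to collect them into a map before deciding what to do. The sketch below is an illustration under assumptions: the class ArgumentSketch, its methods, and the rule that any argument starting with a dash is a flag are all hypothetical choices, not requirements from this page.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of order-independent flag/value parsing. */
class ArgumentSketch {
	/** Assumption: a flag is a dash followed by at least one character. */
	static boolean isFlag(String arg) {
		return arg != null && arg.length() > 1 && arg.startsWith("-");
	}

	/** Maps each flag to the value that follows it, or to null if none does. */
	static Map<String, String> parse(String[] args) {
		Map<String, String> flags = new HashMap<>();
		for (int i = 0; i < args.length; i++) {
			if (!isFlag(args[i])) {
				continue; // a value without a preceding flag is ignored
			}
			String flag = args[i];
			String value = null;
			if (i + 1 < args.length && !isFlag(args[i + 1])) {
				value = args[++i]; // consume the value along with the flag
			}
			flags.put(flag, value);
		}
		return flags;
	}

	/** Returns the index output path, or null if no output was requested. */
	static String indexPath(Map<String, String> flags) {
		if (!flags.containsKey("-index")) {
			return null; // -index flag absent: produce no output file
		}
		String value = flags.get("-index");
		return value == null ? "index.json" : value; // default path
	}

	public static void main(String[] args) {
		Map<String, String> flags = parse(args);
		System.out.println("text: " + flags.get("-text"));
		System.out.println("index: " + indexPath(flags));
	}
}
```

Keeping the values as plain strings until they are needed avoids accidentally converting paths to absolute form during parsing.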
Output
All output will be produced in “pretty” JSON format using \t tab characters for indentation. According to the JSON standard, numbers like integers should never be quoted. Any string or object key, however, should always be surrounded by " quotes. Objects (similar to maps) should use curly braces { and } and arrays should use square brackets [ and ]. Make sure there are no trailing commas after the last element.
The paths should be output in the form they were originally provided. The tests use normalized relative paths, so the output should also be normalized relative paths. As long as command-line parameters are not converted to absolute form, this should be the default output provided by the path object.
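To illustrate why relative paths are preserved by default, here is a hedged sketch of a text file finder built on Files.walk, which returns each path resolved against the starting path it was given. The class TextFileFinderSketch and its methods are hypothetical, and the sketch assumes Java 9 or later for List.of.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of finding text files without changing path form. */
class TextFileFinderSketch {
	/** Tests for a regular file ending in .txt or .text, ignoring case. */
	static boolean isTextFile(Path path) {
		if (!Files.isRegularFile(path)) {
			return false;
		}
		String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
		return name.endsWith(".txt") || name.endsWith(".text");
	}

	/** Returns the files to process for a single file or a directory. */
	static List<Path> find(Path input) throws IOException {
		if (!Files.isDirectory(input)) {
			// A single file is processed regardless of its extension.
			return List.of(input);
		}
		// Walked paths start with the given input path, so a relative
		// input yields relative results.
		try (Stream<Path> paths = Files.walk(input)) {
			return paths.filter(TextFileFinderSketch::isTextFile)
					.sorted()
					.collect(Collectors.toList());
		}
	}

	public static void main(String[] args) throws IOException {
		for (Path path : find(Path.of(args.length > 0 ? args[0] : "."))) {
			System.out.println(path);
		}
	}
}
```

Calling toString() on the returned paths then gives the location keys for the index in the same form the input was provided.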
The contents of your inverted index should be output in alphabetically sorted order as a nested JSON object using a “pretty” format. For example:
{
	"capybara": {
		"input/mammals.txt": [
			11
		]
	},
	"platypus": {
		"input/dangerous/venomous.txt": [
			2
		],
		"input/mammals.txt": [
			3,
			8
		]
	}
}
The project tests account for different path separators (forward slash / for Linux/Mac systems, and backward slash \ for Windows systems). Your code does not have to convert between the two!
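The pretty JSON format described in this section could be produced with a small recursive writer. The following is a sketch only: the class PrettyJsonSketch and its methods are hypothetical, it builds a String rather than writing to a file, it does not escape quotes or backslashes inside keys, and it assumes Java 11 or later for String.repeat.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of tab-indented JSON output for a nested index. */
class PrettyJsonSketch {
	/** Returns the tab indentation for a nesting level. */
	static String indent(int level) {
		return "\t".repeat(level);
	}

	/** Surrounds a key with quotes (sketch only: no escaping is done). */
	static String quote(String text) {
		return "\"" + text + "\"";
	}

	/** Formats positions as a JSON array, one unquoted number per line. */
	static String array(TreeSet<Integer> values, int level) {
		StringBuilder out = new StringBuilder("[");
		String separator = "";
		for (Integer value : values) {
			// The comma goes before every element except the first,
			// so there is never a trailing comma.
			out.append(separator).append("\n").append(indent(level + 1)).append(value);
			separator = ",";
		}
		return out.append("\n").append(indent(level)).append("]").toString();
	}

	/** Formats a location-to-positions map as a JSON object. */
	static String locations(TreeMap<String, TreeSet<Integer>> map, int level) {
		StringBuilder out = new StringBuilder("{");
		String separator = "";
		for (Map.Entry<String, TreeSet<Integer>> entry : map.entrySet()) {
			out.append(separator).append("\n").append(indent(level + 1));
			out.append(quote(entry.getKey())).append(": ");
			out.append(array(entry.getValue(), level + 1));
			separator = ",";
		}
		return out.append("\n").append(indent(level)).append("}").toString();
	}

	/** Formats the whole stem-to-locations index as a JSON object. */
	static String index(TreeMap<String, TreeMap<String, TreeSet<Integer>>> map) {
		StringBuilder out = new StringBuilder("{");
		String separator = "";
		for (Map.Entry<String, TreeMap<String, TreeSet<Integer>>> entry : map.entrySet()) {
			out.append(separator).append("\n").append(indent(1));
			out.append(quote(entry.getKey())).append(": ");
			out.append(locations(entry.getValue(), 1));
			separator = ",";
		}
		return out.append("\n}").toString();
	}
}
```

In a full solution the same logic would more likely write to a BufferedWriter opened with the UTF-8 character set, so that a large index is not held in memory twice.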
Examples
The following are a few non-comprehensive examples that illustrate the command-line arguments that can be passed to your Driver class via a "Run Configuration" in Eclipse, assuming you followed the Running Driver guide to set up the working directory to the project-tests directory.
Consider the following example:
-text "input/text/simple/hello.txt"
-index "actual/index-simple-hello.json"
The above arguments indicate that Driver should build an inverted index from the single hello.txt file in the input/text/simple subdirectory of the current working directory, and output the inverted index as JSON to the index-simple-hello.json file in the