Search Engine

Project 1

For this project, you will write a Java program that processes all text files in a directory and its subdirectories, cleans and parses the text into word stems, and builds an in-memory inverted index that maps each word stem to the documents, and the positions within those documents, where that stem was found.

For example, suppose we have the following mapping stored in our inverted index:

{
	"capybara": {
		"input/mammals.txt": [
			11
		]
	},
	"platypus": {
		"input/dangerous/venomous.txt": [
			2
		],
		"input/mammals.txt": [
			3,
			8
		]
	}
}
This indicates that, after processing the word stems from the files, the word capybara is found in the file input/mammals.txt at position 11. The word platypus is found in two files, input/mammals.txt and input/dangerous/venomous.txt. In the file input/mammals.txt, the word platypus appears twice, at positions 3 and 8. In the file input/dangerous/venomous.txt, the word platypus appears at position 2.

The process of stemming reduces a word to a base form (or “stem”), so that words like interesting, interested, and interests all map to the stem interest. Stemming is a common preprocessing step in many web search engines.

Functionality

The core functionality of your project must satisfy the following requirements:

Process command-line arguments to determine the input to process and output to produce. See the Input and Output sections below for specifics.

Create a custom inverted index data structure that stores a mapping from a word stem to the location(s) the word was found, and the position(s) in that file the word is located. The positions should start at 1. This will require nesting multiple built-in data structures.
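
One way to nest built-in data structures for this requirement is sketched below. It is only an illustration under assumptions: the class name InvertedIndexSketch and its methods add, addAll, and positions are hypothetical and not part of the assignment, and TreeMap and TreeSet are chosen here because they keep stems, locations, and positions sorted.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of an inverted index; all names here are illustrative. */
class InvertedIndexSketch {
	/** Maps stem -> location -> positions; the tree-based collections keep everything sorted. */
	private final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();

	/** Records that a stem was found at the given 1-based position within a location. */
	public void add(String stem, String location, int position) {
		index.computeIfAbsent(stem, k -> new TreeMap<>())
				.computeIfAbsent(location, k -> new TreeSet<>())
				.add(position);
	}

	/** Adds the stems of one location in order, numbering positions from 1. */
	public void addAll(List<String> stems, String location) {
		int position = 1;
		for (String stem : stems) {
			add(stem, location, position++);
		}
	}

	/** Returns a read-only view of the positions for a stem in a location (empty if absent). */
	public Set<Integer> positions(String stem, String location) {
		Map<String, TreeSet<Integer>> locations = index.get(stem);
		if (locations == null || !locations.containsKey(location)) {
			return Collections.emptySet();
		}
		return Collections.unmodifiableSet(locations.get(location));
	}

	@Override
	public String toString() {
		return index.toString();
	}

	public static void main(String[] args) {
		InvertedIndexSketch index = new InvertedIndexSketch();
		index.add("platypus", "input/mammals.txt", 3);
		index.add("platypus", "input/mammals.txt", 8);
		index.add("capybara", "input/mammals.txt", 11);
		index.add("platypus", "input/dangerous/venomous.txt", 2);
		System.out.println(index);
	}
}
```

Returning an unmodifiable view from positions keeps callers from changing the nested sets directly, which is one possible way to keep the index consistent.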

If provided a directory as input, find all files within that directory and all subdirectories and parse each text file found. Any files that end in the .text or .txt extension (case insensitive) should be considered a text file. If provided a single file as input, only parse that individual file (regardless of its extension).
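
The traversal rule above could be sketched roughly as follows, using Files.walk from the standard library. The class name TextFileFinderSketch and the methods isTextFile and findInputs are hypothetical helpers made up for this illustration, and sorting the result is an extra choice, not a requirement.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of finding the files to parse; all names are illustrative. */
class TextFileFinderSketch {
	/** Returns true for regular files ending in .txt or .text, ignoring case. */
	public static boolean isTextFile(Path path) {
		if (!Files.isRegularFile(path)) {
			return false;
		}
		String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
		return name.endsWith(".txt") || name.endsWith(".text");
	}

	/**
	 * Returns the paths to parse: every text file under a directory, or the
	 * input itself (regardless of extension) when it is not a directory.
	 */
	public static List<Path> findInputs(Path input) throws IOException {
		if (!Files.isDirectory(input)) {
			return List.of(input);
		}
		// try-with-resources closes the directory stream opened by Files.walk
		try (Stream<Path> paths = Files.walk(input)) {
			return paths.filter(TextFileFinderSketch::isTextFile)
					.sorted()
					.collect(Collectors.toList());
		}
	}

	public static void main(String[] args) throws IOException {
		Path input = Path.of(args.length > 0 ? args[0] : ".");
		findInputs(input).forEach(System.out::println);
	}
}
```

Because the paths returned by Files.walk are resolved against the starting path, a relative input stays relative in this sketch.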

Use the UTF-8 character encoding for all file processing, including reading and writing.

Process text files into word stems by removing any non-letter symbols (including digits, punctuation, accents, and special characters), converting the remaining alphabetic characters to lowercase, splitting the text into words by whitespace, and then stemming each word using the Apache OpenNLP toolkit.

Use the regular expression (?U)[^\p{Alpha}\p{Space}]+ to remove special characters from text.

Use the regular expression (?U)\p{Space}+ to split text into words by whitespace.

Use the SnowballStemmer English stemming algorithm in OpenNLP to stem words.
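
The cleaning and splitting steps could look roughly like the sketch below, which uses only the standard library and the two regular expressions given above. The class name TextCleanerSketch and the methods clean and parse are hypothetical. The Normalizer step is an assumption: it decomposes accented letters so that the first regular expression can strip the accent marks, and the specification does not state that this is the required approach. Stemming is left out so the sketch runs without OpenNLP; each returned word would then be passed to the stemmer.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical sketch of cleaning and splitting text; all names are illustrative. */
class TextCleanerSketch {
	/** Matches anything that is neither alphabetic nor whitespace. */
	private static final Pattern CLEAN = Pattern.compile("(?U)[^\\p{Alpha}\\p{Space}]+");

	/** Matches one or more whitespace characters. */
	private static final Pattern SPLIT = Pattern.compile("(?U)\\p{Space}+");

	/** Removes non-letter symbols and converts the remaining text to lowercase. */
	public static String clean(String text) {
		// Assumption: NFD splits an accented letter into a base letter plus a
		// combining mark, so the mark is removed along with other symbols.
		String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
		return CLEAN.matcher(decomposed).replaceAll("").toLowerCase(Locale.ROOT);
	}

	/** Cleans the text and splits it into words, skipping empty strings. */
	public static List<String> parse(String text) {
		List<String> words = new ArrayList<>();
		for (String word : SPLIT.split(clean(text))) {
			if (!word.isEmpty()) {
				words.add(word);
			}
		}
		return words;
	}

	public static void main(String[] args) {
		System.out.println(parse("Interesting, interested & INTERESTS!"));
	}
}
```

Skipping empty strings matters because splitting text that starts with whitespace produces an empty first element.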

If the appropriate command-line arguments are provided, output the inverted index in pretty JSON format. See the Output section below for specifics.

Output user-friendly error messages in the case of exceptions or invalid input. Under no circumstance should your main() method output a stack trace to the user!
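
As a rough illustration of reporting a problem without a stack trace, the sketch below catches the exception and prints a short message that names the offending path. The class name FriendlyErrorsSketch, the methods process and describe, and the message wording are all hypothetical; counting lines merely stands in for the real processing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

/** Hypothetical sketch of user-friendly error reporting; all names are illustrative. */
class FriendlyErrorsSketch {
	/** Stand-in for real processing: counts the lines of a UTF-8 file. */
	static long process(Path input) throws IOException {
		try (Stream<String> lines = Files.lines(input, StandardCharsets.UTF_8)) {
			return lines.count();
		}
	}

	/** Returns a one-line summary, or a readable error message if processing fails. */
	static String describe(Path input) {
		try {
			return "Processed " + process(input) + " lines from: " + input;
		}
		catch (IOException | UncheckedIOException e) {
			// Report what went wrong in plain words instead of printing a stack trace
			return "Unable to process the input path: " + input;
		}
	}

	public static void main(String[] args) {
		if (args.length == 0) {
			System.out.println("No input path was provided.");
			return;
		}
		System.out.println(describe(Path.of(args[0])));
	}
}
```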

The functionality of your project will be evaluated with the Project1Test.java group of JUnit tests.

Input

Your main method must be placed in a class named Driver. The Driver class should accept the following command-line arguments:

-text path where the flag -text indicates the next argument is a path to either a single file or a directory. If the value is a file, open and process that file regardless of its extension. If the value is a directory, find and process all of the text files (with .txt and .text extensions) in that directory and its subdirectories.

-index path where the flag -index is an optional flag that indicates the next argument is the path to use for the inverted index output file. If the path argument is not provided, use index.json as the default output path. If the -index flag is provided, always generate output even if it is empty. If the -index flag is not provided, do not produce an output file.

The command-line flag/value pairs may be provided in any order. Do not convert paths to absolute form when processing command-line input!
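
Flag/value pairs in any order could be handled by first collecting them into a map, as in the sketch below. The class name ArgumentParserSketch and its methods are hypothetical, and treating any argument that starts with a dash as a flag is a simplifying assumption. Paths are created with Path.of and never converted to absolute form.

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of parsing flag/value pairs; all names are illustrative. */
class ArgumentParserSketch {
	/** Maps each flag to its value, or to null when no value followed the flag. */
	private final Map<String, String> map = new HashMap<>();

	/** Assumption: a flag is any argument of length 2 or more that starts with a dash. */
	public static boolean isFlag(String arg) {
		return arg != null && arg.length() > 1 && arg.startsWith("-");
	}

	/** Stores every flag and, if the next argument is not a flag, its value. */
	public void parse(String[] args) {
		for (int i = 0; i < args.length; i++) {
			if (isFlag(args[i])) {
				boolean hasValue = i + 1 < args.length && !isFlag(args[i + 1]);
				map.put(args[i], hasValue ? args[++i] : null);
			}
		}
	}

	/** Returns true if the flag was provided, with or without a value. */
	public boolean hasFlag(String flag) {
		return map.containsKey(flag);
	}

	/** Returns the flag's value as a path, or the default if the value is missing. */
	public Path getPath(String flag, Path defaultValue) {
		String value = map.get(flag);
		return value == null ? defaultValue : Path.of(value);
	}

	public static void main(String[] args) {
		ArgumentParserSketch parser = new ArgumentParserSketch();
		parser.parse(args);
		System.out.println("-text: " + parser.getPath("-text", null));
		if (parser.hasFlag("-index")) {
			System.out.println("-index: " + parser.getPath("-index", Path.of("index.json")));
		}
	}
}
```

Checking hasFlag separately from getPath makes it possible to tell a missing -index flag (no output file) apart from an -index flag without a value (default output path).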

Output

All output will be produced in “pretty” JSON format using \t tab characters for indentation. According to the JSON standard, numbers like integers should never be quoted. Any string or object key, however, should always be surrounded by " quotes. Objects (similar to maps) should use curly braces { and } and arrays should use square brackets [ and ]. Make sure there are no trailing commas after the last element.

The paths should be output in the form they were originally provided. The tests use normalized relative paths, so the output should also be normalized relative paths. As long as command-line parameters are not converted to absolute form, this should be the default output provided by the path object.

The contents of your inverted index should be output in alphabetically sorted order as a nested JSON object using a “pretty” format. For example:

{
	"capybara": {
		"input/mammals.txt": [
			11
		]
	},
	"platypus": {
		"input/dangerous/venomous.txt": [
			2
		],
		"input/mammals.txt": [
			3,
			8
		]
	}
}
The project tests account for different path separators (forward slash / for Linux/Mac systems, and backward slash \ for Windows systems). Your code does not have to convert between the two!
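
The pretty JSON format described in this section could be produced along the lines of the sketch below. The class name JsonWriterSketch and its methods are hypothetical. The sketch assumes the nested maps already iterate in sorted order (for example, TreeMap and TreeSet) and writes keys without escaping, so it assumes keys contain no quote characters.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Map;

/** Hypothetical sketch of a tab-indented JSON writer; all names are illustrative. */
class JsonWriterSketch {
	private static void indent(Writer writer, int level) throws IOException {
		writer.write("\t".repeat(level));
	}

	/** Writes integers as an array; commas go before every element except the first. */
	public static void writeArray(Collection<Integer> elements, Writer writer, int level) throws IOException {
		writer.write("[");
		boolean first = true;
		for (Integer element : elements) {
			writer.write(first ? "\n" : ",\n");
			first = false;
			indent(writer, level + 1);
			writer.write(element.toString());
		}
		writer.write("\n");
		indent(writer, level);
		writer.write("]");
	}

	/** Writes a map of arrays as an object with quoted keys. */
	public static void writeObject(Map<String, ? extends Collection<Integer>> elements, Writer writer, int level) throws IOException {
		writer.write("{");
		boolean first = true;
		for (Map.Entry<String, ? extends Collection<Integer>> entry : elements.entrySet()) {
			writer.write(first ? "\n" : ",\n");
			first = false;
			indent(writer, level + 1);
			writer.write("\"" + entry.getKey() + "\": ");
			writeArray(entry.getValue(), writer, level + 1);
		}
		writer.write("\n");
		indent(writer, level);
		writer.write("}");
	}

	/** Writes the full stem -> location -> positions structure as a nested object. */
	public static void writeIndex(Map<String, ? extends Map<String, ? extends Collection<Integer>>> index, Writer writer, int level) throws IOException {
		writer.write("{");
		boolean first = true;
		for (Map.Entry<String, ? extends Map<String, ? extends Collection<Integer>>> entry : index.entrySet()) {
			writer.write(first ? "\n" : ",\n");
			first = false;
			indent(writer, level + 1);
			writer.write("\"" + entry.getKey() + "\": ");
			writeObject(entry.getValue(), writer, level + 1);
		}
		writer.write("\n");
		indent(writer, level);
		writer.write("}");
	}

	/** Writes the index to a file using the UTF-8 encoding. */
	public static void writeIndex(Map<String, ? extends Map<String, ? extends Collection<Integer>>> index, Path path) throws IOException {
		try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
			writeIndex(index, writer, 0);
		}
	}

	/** Returns the index as a JSON string, which is convenient for debugging. */
	public static String toJson(Map<String, ? extends Map<String, ? extends Collection<Integer>>> index) {
		try {
			StringWriter writer = new StringWriter();
			writeIndex(index, writer, 0);
			return writer.toString();
		}
		catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}
}
```

Writing the separator before each element rather than after it is one simple way to avoid a trailing comma after the last element.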

Examples

The following are a few (non-comprehensive) examples that illustrate the command-line arguments that can be passed to your Driver class via a "Run Configuration" in Eclipse, assuming you followed the Running Driver guide to set up the working directory as the project-tests directory.

Consider the following example:

-text "input/text/simple/hello.txt"
-index "actual/index-simple-hello.json"
The above arguments indicate that Driver should build an inverted index from the single hello.txt file in the input/text/simple subdirectory of the current working directory, and output the inverted index as JSON to the index-simple-hello.json file in the
