811312A Data Structures and Algorithms 2022 Assignment, Instructions

811312A Data Structures and Algorithms 2022 Assignment, Instructions

需要的朋友qq1703105484

This document describes both general and specific requirements for the assignment of the course Data Structures and Algorithms.

  1. General Instructions

The assignment is an individual work. You are not allowed to copy code from others. NOTE! All suspected misconduct will be investigated and brought to the attention of NJIT study office. However, the course materials, for instance exercise solutions and example programs can be used freely. When utilizing already made solutions, referencing practices need to be followed and explanation of how it has been modified need to be included. The solution must be provided with the standard features of C language. The use of extra code libraries is not allowed.

Everyone shall return one assignment implemented individually in C programming language and one written report. Return the solution and the report in one zip file in the course's Moodle workspace before the deadline of 23:59 on March 25, 2022. All late submissions will get reduced grade. If not approvable, a returned solution can be revised one time, according to inspector's advice. See more detailed requirements later.

The assignment submission shall contain the source code for your program, possible header files, and your written report. See detailed instruction and requirements further down in this document and see the checklist.

The assignment shall be returned at latest 23.59 on March 25, 2022.

The task

 
  1. Description of the Assignment

In this assignment, you shall find the 100 most frequently occurring words from a large text file. The program shall be implemented in C language. A continuous string of characters a..z and A..Z, with possible apostrophes ’, is considered a word. Words with uppercase and lowercase letters are considered equal. For example, in the text

Herman Melville’s book Moby Dick starts, as we all know, with the sentence ”Call me Ishmael”.

the words are: “herman”, “melville’s”, “book”, “moby”, “dick”, “starts”, “as”, “we”, “all”, “know”, “with”, “the”, “sentence”, “call”, “me”, and “ishmael”

A word does not contain separators. For instance, “book”, “book,” (“book” with a comma after it), and “book.” (“book” with a full stop after it) should all exclude the separator and be counted as “book”.

A number in numeral form is not a word. For instance, “1” is not a word, while “one” is.

The name of the input text file shall be given as an input from the user. That is, the input file should not be hardcoded into the source code of the program, but the program should ask the input file from the user.

The program shall print the total number of words in the input file, the number of different words in the input file, and the 100 most frequently occurring words and their frequencies. The words are printed in descending order according to their frequencies. The program shall also measure and print the processing time.

Because the input file can be very large, you need a suitable data structure to store words and their frequencies. This can be, for instance, a hash table or a binary search tree. You can find the most frequent words by sorting the structure with some fast algorithm. The challenge of the task is that the input file your program will be tested with is very large (close to 3 million words).

Therefore, regular table structure is out of the question, and linked list data structure is too slow. In addition, you will need to reserve enough memory.

There are examples of input files and corresponding outputs in Moodle. All the files are texts written in English. You should use these to test your program. The biggest file, Bulk.txt, does not have an output file but you can use it to test if your solution is fast. Bulk.txt can also contain words of different languages, but uses the same English alphabet, so it should not be a problem. Output of your program is not necessarily exactly the same with large input files. If they are not, think about why and include it in your report.

The report

The report must be written in English, and it shall contain

    1. Student’s name
    2. Student’s University of Oulu ID number
    3. Description of the solution and
    4. Analysis of the solution program.

In the description of your solution, you shall explain what data structures and algorithms you have chosen and why. You shall also explain the most important functions of your program.

You shall do analysis of the efficiency of your program. The size of the input is measured as the number of words in the text file. Measure the running times of your program with given test input files and make an estimate, how many words there can be in a file that you can process in 15 seconds. Based on this, estimate the maximum size (in megabytes) of files that your program can process in the mentioned time, when we know that the average word length in English is 5.1 letters.

Analysis contains thus the measurements of program’s running times and the mentioned estimates for number of words and file size in 15 seconds. The report shall naturally contain student’s name and student’s University of Oulu ID number. If you wish, you can also give feedback of the course and/or assignment in your report.

Remember to zip the files before returning.

The program

 
  1. Detailed requirements

The program shall read words from text file input (.txt file extension). The program shall ask for input file from user.

The program shall store the words it reads from a file into a suitable data structure. The program shall count how many words there are in an input file.

The program shall count how many different words there are in an input file. The program shall count the frequency of each occurring word.

The program shall measure the time spent for processing the input file. The program shall print the total number of words in the input file.

The program shall print the number of different words found in the input file.

The program shall print the 100 most frequent words in descending order based on frequency and include the frequency.

The program shall print the time it took for it to process the file and give the result (print-out does not have to be included in the measurement of time)

The program shall execute the given task in reasonable time even with the larges test material (Bulk.txt)

  • Over 10 seconds – unacceptable
  • Under 10 seconds – acceptable
  • Under 5 seconds – good
  • Under 2 seconds – very good

The code

You shall in write in the C language.

You shall write clean and readable code.

  • Decide on indentation and keep it consistent
  • Comment code to explain functions and variables – comments shall be in English
  • Use a consistent and meaningful name scheme for variables and functions

You shall implement all data structures and algorithms on your own do not use ready library functions

do not copy.

You shall give reference if you use available code found in study material and explain how you have used it.

The code shall not be dependent on any IDE – it should compile and run from command line. The program shall, before it terminates, free memory it has allocated.

The program reserves enough space for the input filename.

The report

You shall use the given template for the report.

The report shall include your name.

The report shall include your University of Oulu student ID number. The report shall be written in English.

The report shall briefly present the chosen data structures and algorithms.

  • Explain how you solved the problem.

The report shall briefly explain the most important functions of your program.

  • Why these functions, what are they used for?

The report shall include measurements of running times for different example input files (at a minimum 3 different files)

The report shall include comparison of the output with given output files and your elaboration on why they are different if they are different.

The report shall include the estimate of how big files it can process in 15 seconds (the analysis) and whether this is realistic

Your report may include feedback regarding the course (teaching, material, methods, anything) if you wish to include it. This will not affect your grade.

Submission

All files shall be included in one zip file.

You shall submit program source code whose file extension is .c and possible header files with extension .h

  • No .ccp files.
  • No executable files.
  • No input files.
  • No additional files.

You shall submit the report as a word file, an open office file, or a pdf file.

You shall use English in your file names, or at least the letters of the English alphabet.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值