Project 2: Elastic Array & Disk Usage Analyzer
Starter repository: available on GitHub.
As storage densities continue to increase, so too will humanity’s ability to find new ways to generate more and more data. Storage space often seems unlimited… until it’s not! In this project, we will design a helpful command line utility for users, developers, and system administrators to analyze how their disk space is being used. Here’s a demonstration of the tool, da:
$ ./da -l 15 -s /usr
32.4 MiB | Aug 21 2022 | /usr/lib/valgrind/libvex-armv8-linux.a
36.5 MiB | Aug 21 2022 | /usr/lib/valgrind/libvex-amd64-linux.a
38.6 MiB | Aug 21 2022 | /usr/lib/libclang.so
38.6 MiB | Aug 21 2022 | /usr/lib/libclang.so.10
46.9 MiB | Feb 01 2023 | /usr/bin/containerd
47.2 MiB | Mar 08 2023 | /usr/lib/libclang-cpp.so
47.2 MiB | Mar 08 2023 | /usr/lib/libclang-cpp.so.10
52.4 MiB | Sep 09 2022 | /usr/lib/libgo.so
52.4 MiB | Sep 09 2022 | /usr/lib/libgo.so.16
52.4 MiB | Sep 09 2022 | /usr/lib/libgo.so.16.0.0
71.0 MiB | Dec 03 2022 | /usr/bin/docker
83.7 MiB | Mar 08 2023 | /usr/lib/libLLVM-10.0.1.so
83.7 MiB | Mar 08 2023 | /usr/lib/libLLVM-10.so
83.7 MiB | Mar 08 2023 | /usr/lib/libLLVM.so
84.4 MiB | Feb 01 2023 | /usr/bin/dockerd
In this example, the user requested the top 15 files (-l 15), sorted by size (-s), from the /usr directory. If they’re really trying to save space on this machine, then maybe it’s time to remove docker? :-)
The output columns include the file size in human readable units, the last time the particular file was accessed, and the file name.
To get a sense of the functionality we will implement, take a look at the help/usage information:
$ ./da -h
Disk Analyzer (da): analyzes disk space usage
Usage: ./da [-ahs] [-l limit] [directory]
If no directory is specified, the current working directory is used.
Options:
* -a Sort the files by time of last access (descending)
* -h Display help/usage information
* -l limit Limit the output to top N files (default=unlimited)
* -s Sort the files by size (default, ascending)
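The option handling in the usage text could be sketched with getopt(3), roughly as follows. This is only an illustration under stated assumptions: struct da_options and parse_options are hypothetical names that do not come from the starter code, and a limit of 0 is used here to mean "unlimited".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical container for the parsed command line. */
struct da_options {
    bool sort_by_time;      /* -a given (otherwise sort by size) */
    bool show_help;         /* -h given */
    long limit;             /* -l N; 0 means unlimited (assumption) */
    const char *directory;  /* defaults to the current directory */
};

/* Fills in *opts from argv. Returns 0 on success, -1 on a bad option. */
int parse_options(int argc, char *argv[], struct da_options *opts)
{
    opts->sort_by_time = false;
    opts->show_help = false;
    opts->limit = 0;
    opts->directory = ".";

    optind = 1; /* restart scanning so the function can be called again */
    int c;
    while ((c = getopt(argc, argv, "ahl:s")) != -1) {
        switch (c) {
        case 'a':
            opts->sort_by_time = true;
            break;
        case 'h':
            opts->show_help = true;
            break;
        case 's':
            opts->sort_by_time = false;
            break;
        case 'l': {
            char *end;
            long n = strtol(optarg, &end, 10);
            if (end == optarg || *end != '\0' || n < 0) {
                return -1; /* limit was not a non-negative integer */
            }
            opts->limit = n;
            break;
        }
        default:
            return -1; /* getopt has already printed a diagnostic */
        }
    }

    /* The first non-option argument, if any, is the directory. */
    if (optind < argc) {
        opts->directory = argv[optind];
    }
    return 0;
}
```

A main function would call parse_options(argc, argv, &opts) first, print the usage text when it returns -1 or when show_help is set, and only then begin the traversal.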
Your implementation will be split into two parts: (1) building an elastic data structure that can store an unbounded number of elements (memory permitting), and (2) directory traversal and disk usage analysis. You will be able to leverage your code from the previous project to help you complete the directory traversal.
You can think of the elastic array as being somewhat analogous to the ArrayList in Java; it will automatically resize, allow a variety of retrieval operations, and provide utility functionality such as retrieving the number of elements, trimming the amount of heap space used to save memory, and sorting the elements. When you are finished, you’ll have produced a reusable library that may be helpful in future C projects.
The Elastic Array
While C has primitive array types, they must be dimensioned in advance and do not support convenience features like appending to the list or retrieving its size. Our goal for the elist library is to fill this gap in functionality. Your elist should support the following functions:
- elist_add – appends an element to the array
- elist_add_new – creates storage space for a new element and returns a pointer to it
- elist_capacity – retrieves the current list capacity
- elist_clear – removes all elements from the array
- elist_clear_mem – removes all elements from the array and zeroes them out
- elist_create – initializes a new elist data structure
- elist_destroy – destroys and frees any memory allocated by an elist
- elist_get – retrieves a particular element by its index
- elist_remove – removes an element at a particular index
- elist_set – replaces an element in the array at a particular index
- elist_set_capacity – increases or decreases the storage capacity of the array
- elist_size – retrieves the number of elements in the array
- elist_sort – sorts the array
Array elements will have a fixed size; i.e., the expected size of the elements will be provided to elist_create. This could be something like sizeof(int) or even sizeof(struct my_special_struct), but regardless, all elements will consume the same number of bytes on the heap.
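One way the fixed element size might be recorded is sketched below. The struct layout, its field names, and the two-argument signature of elist_create are assumptions made for illustration; the starter code's real definitions may differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical layout; the starter code's actual struct may differ. */
struct elist {
    size_t capacity;  /* number of element slots allocated */
    size_t size;      /* number of slots currently in use */
    size_t item_sz;   /* size of one element, in bytes */
    void *data;       /* contiguous storage: capacity * item_sz bytes */
};

/* Assumed signature: the spec only says the element size is passed in. */
struct elist *elist_create(size_t initial_capacity, size_t item_sz)
{
    if (initial_capacity == 0 || item_sz == 0) {
        return NULL;
    }

    struct elist *list = malloc(sizeof(struct elist));
    if (list == NULL) {
        return NULL;
    }

    /* calloc zeroes the storage and checks the multiplication for overflow. */
    list->data = calloc(initial_capacity, item_sz);
    if (list->data == NULL) {
        free(list);
        return NULL;
    }

    list->capacity = initial_capacity;
    list->size = 0;
    list->item_sz = item_sz;
    return list;
}

/* Releases the element storage and the list header itself. */
void elist_destroy(struct elist *list)
{
    if (list != NULL) {
        free(list->data);
        free(list);
    }
}
```

With this layout, a list of integers would be created with elist_create(16, sizeof(int)), and a list of structs with elist_create(16, sizeof(struct my_special_struct)).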
Elements added to the list via add or set will be copied onto the list on the heap; your array should not simply store pointers to the elements. This provides the most flexibility, since the user could maintain an array of pointers if that is the behavior they desire. The add_new function will return a pointer to a new, uninitialized memory block in the list so that the user can populate it with data to simplify usage and avoid extra copies when unnecessary:
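/* Option 1: build the element yourself, then copy it into the list. */
struct my_struct *s = malloc(sizeof(struct my_struct));
s->memb1 = 123;
s->memb2 = 456;
elist_add(list, s); // 's' is copied into the list
free(s);            // the list holds its own copy, so the original can be freed

// vs.

/* Option 2: ask the list for storage and fill it in place (no extra copy). */
struct my_struct *t = elist_add_new(list);
t->memb1 = 123;
t->memb2 = 456;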
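The copy semantics described above can be illustrated with a rough sketch of elist_get and elist_set. The struct layout, the signatures, and the 0 / -1 return convention are assumptions, not the starter code's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout, repeated so this sketch compiles on its own. */
struct elist {
    size_t capacity;
    size_t size;
    size_t item_sz;
    void *data;
};

/* Returns a pointer into the list's own storage, or NULL if idx is out of
 * range. The cast to char * makes the pointer arithmetic count in bytes. */
void *elist_get(struct elist *list, size_t idx)
{
    assert(list != NULL);
    if (idx >= list->size) {
        return NULL;
    }
    return (char *) list->data + idx * list->item_sz;
}

/* Overwrites the element at idx with a copy of *item.
 * Returns 0 on success, -1 if idx is out of range. */
int elist_set(struct elist *list, size_t idx, const void *item)
{
    void *slot = elist_get(list, idx);
    if (slot == NULL) {
        return -1;
    }
    memcpy(slot, item, list->item_sz);
    return 0;
}
```

Because elist_set copies the bytes, changing the caller's variable afterwards does not affect the value stored in the list.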
The array will start with an initial capacity, and once full you will double the capacity (RESIZE_MULTIPLIER = 2) and realloc the array’s storage. Removing a list element shifts every element after it down by one position; empty gaps are not allowed. The array will not be shrunk unless requested via set_capacity, and if elements exist beyond the requested new capacity then they will be dropped from the list.
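A minimal sketch of the growth and removal rules follows. The struct layout, the signatures, and the 0 / -1 return convention are assumptions for illustration only; RESIZE_MULTIPLIER comes from the text above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RESIZE_MULTIPLIER 2

/* Hypothetical layout, repeated so this sketch compiles on its own.
 * item_sz is assumed to be nonzero. */
struct elist {
    size_t capacity;
    size_t size;
    size_t item_sz;
    void *data;
};

/* Grows or shrinks the storage. Elements past the new capacity are dropped.
 * Returns 0 on success, -1 on failure (the list is left unchanged). */
int elist_set_capacity(struct elist *list, size_t capacity)
{
    if (capacity == 0 || capacity > SIZE_MAX / list->item_sz) {
        return -1;
    }
    void *resized = realloc(list->data, capacity * list->item_sz);
    if (resized == NULL) {
        return -1; /* the old block is still valid */
    }
    list->data = resized;
    list->capacity = capacity;
    if (list->size > capacity) {
        list->size = capacity;
    }
    return 0;
}

/* Copies *item onto the end of the list, doubling the capacity when full. */
int elist_add(struct elist *list, const void *item)
{
    if (list->size == list->capacity) {
        if (list->capacity > SIZE_MAX / RESIZE_MULTIPLIER) {
            return -1;
        }
        if (elist_set_capacity(list, list->capacity * RESIZE_MULTIPLIER) != 0) {
            return -1;
        }
    }
    char *dest = (char *) list->data + list->size * list->item_sz;
    memcpy(dest, item, list->item_sz);
    list->size++;
    return 0;
}

/* Removes the element at idx and closes the gap it leaves behind. */
int elist_remove(struct elist *list, size_t idx)
{
    if (idx >= list->size) {
        return -1;
    }
    char *slot = (char *) list->data + idx * list->item_sz;
    size_t tail_bytes = (list->size - idx - 1) * list->item_sz;
    /* Source and destination overlap, so memmove is required, not memcpy. */
    memmove(slot, slot + list->item_sz, tail_bytes);
    list->size--;
    return 0;
}
```

Starting from a capacity of 2, five calls to elist_add trigger two doublings and leave the capacity at 8.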
There are several C functions that will allow you to manipulate the memory allocated to the list. Some functions you may be interested in investigating include memcpy, memcmp, memmove, and memset.
To allow sorting functionality, you can use qsort(3). The user will provide a comparator that your sort function passes to qsort.
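Passing a user-supplied comparator through to qsort might look like the sketch below. The struct layout and the signature of elist_sort are assumptions, and compare_ints is a hypothetical example of a comparator a user could write.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical layout, repeated so this sketch compiles on its own. */
struct elist {
    size_t capacity;
    size_t size;
    size_t item_sz;
    void *data;
};

/* Sorts the used portion of the list with the caller's comparator. */
void elist_sort(struct elist *list,
                int (*comparator)(const void *, const void *))
{
    assert(list != NULL && comparator != NULL);
    qsort(list->data, list->size, list->item_sz, comparator);
}

/* Example comparator: ascending order for int elements. Comparing rather
 * than subtracting avoids signed overflow on extreme values. */
int compare_ints(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;
    return (x > y) - (x < y);
}
```

For the disk analyzer, the comparator would instead cast its arguments to the per-file struct and compare sizes or access times.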
The Disk Usage Analyzer
The disk analyzer will traverse the file system recursively, locating all the files under a given directory. During traversal, each file’s full path, size, and last access time will be recorded in our elastic array for further inspection, sorting, and final formatting.
You will most likely want to use opendir and readdir to provide this listing, and stat to retrieve access times and file sizes. It is recommended to store this information in a struct for each file, and place the structs in your elastic array.
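A recursive traversal along these lines could be sketched as follows. Everything named here (struct file_info, file_visitor, traverse, sum_sizes) is hypothetical. The sketch also uses lstat rather than stat so that symbolic links are not followed; that is a choice made for this example, so swap in stat if the project expects links to be followed. In the real project, the visitor would add each record to the elastic array.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical per-file record; field names are assumptions. */
struct file_info {
    char path[PATH_MAX];
    off_t size;
    time_t atime;
};

/* Called once for every regular file found during traversal. */
typedef void (*file_visitor)(const struct file_info *info, void *ctx);

/* Walks dir_path recursively. Returns -1 if dir_path cannot be opened. */
int traverse(const char *dir_path, file_visitor visit, void *ctx)
{
    DIR *dir = opendir(dir_path);
    if (dir == NULL) {
        perror(dir_path);
        return -1;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* Skip the self and parent entries to avoid infinite recursion. */
        if (strcmp(entry->d_name, ".") == 0
                || strcmp(entry->d_name, "..") == 0) {
            continue;
        }

        struct file_info info;
        int len = snprintf(info.path, sizeof(info.path), "%s/%s",
                           dir_path, entry->d_name);
        if (len < 0 || (size_t) len >= sizeof(info.path)) {
            continue; /* path too long for the buffer; skip it */
        }

        struct stat sb;
        if (lstat(info.path, &sb) == -1) {
            perror(info.path);
            continue;
        }

        if (S_ISDIR(sb.st_mode)) {
            traverse(info.path, visit, ctx); /* unreadable subdirs are reported and skipped */
        } else if (S_ISREG(sb.st_mode)) {
            info.size = sb.st_size;
            info.atime = sb.st_atime;
            visit(&info, ctx);
        }
    }

    closedir(dir);
    return 0;
}

/* Example visitor: adds each file's size to a total (ctx is a long long *). */
void sum_sizes(const struct file_info *info, void *ctx)
{
    *(long long *) ctx += (long long) info->size;
}
```

Calling traverse("/usr", sum_sizes, &total) with a long long total initialized to 0 would accumulate the combined size of every regular file under /usr.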
Output Formatting
Working left to right, the list output shown above is formatted as follows:
- 10 characters for the file size, followed by a separator ( | )
- 11 characters for the last access time, followed by a separator ( | )
- The remaining space should be used to display the file name
Note that you can pass sizes in as part of your format strings, e.g.:
printf("%10s | %11s\n", var1, var2);
would print var1 and var2 as 10-character and 11-character columns, respectively. Another fun fact: you can pass a variable width to printf; for example, printf("%*s\n", 10, str); prints str in a 10-character column.
To make the output more readable for human beings, write functions to perform unit conversions (bytes to human-readable units, like MiB, GiB, and so on) and format the date strings as shown in the demo above. For the date conversion, you are allowed to use strftime, and snprintf may help simplify your human_readable_size function. You should support units up to ZiB (zebibyte). Note that we are using units based on powers of 2, not SI units, so the abbreviations will be KiB, MiB, etc. as opposed to KB, MB, and so on.
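The unit conversion, date formatting, and column layout could be sketched as below. The parameter lists are assumptions (only the name human_readable_size appears in the text), and format_date and format_row are hypothetical helpers. The date format string "%b %d %Y" is inferred from the demo output.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Writes a size such as "32.4 MiB" into buf, using powers of 1024. */
void human_readable_size(char *buf, size_t buf_sz, double size,
                         unsigned int decimals)
{
    static const char *units[] = {
        "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB"
    };
    size_t last = sizeof(units) / sizeof(units[0]) - 1;
    size_t i = 0;

    /* Divide until the value fits the unit, stopping at ZiB. */
    while (size >= 1024.0 && i < last) {
        size /= 1024.0;
        i++;
    }
    snprintf(buf, buf_sz, "%.*f %s", (int) decimals, size, units[i]);
}

/* Writes a local-time date such as "Aug 21 2022" into buf.
 * Returns the number of characters written, or 0 on failure. */
size_t format_date(char *buf, size_t buf_sz, time_t when)
{
    struct tm *local = localtime(&when);
    if (local == NULL) {
        return 0;
    }
    return strftime(buf, buf_sz, "%b %d %Y", local);
}

/* Builds one output row: 10-character size column, 11-character date
 * column, then the path. Returns the result of snprintf. */
int format_row(char *out, size_t out_sz, const char *size_str,
               const char *date_str, const char *path)
{
    return snprintf(out, out_sz, "%10s | %11s | %s",
                    size_str, date_str, path);
}
```

Taking the size as a double is a deliberate simplification here: it lets the same function cover values up to the ZiB range, at the cost of exact byte precision for very large inputs.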
Learning Objectives
- Dynamic memory allocation (including malloc, calloc, realloc, and free)
- Memory manipulation (memcpy, memcmp, memmove, memset)
) - Pointer arithmetic
- structs
- Directory traversal
- String manipulation
- Sorting and function pointers (via qsort)
Implementation Restrictions
Restrictions: you may use any standard C library functionality. External libraries are not allowed unless permission is granted in advance. If in doubt, ask first. Your code must compile and run on your VM set up with Arch Linux as described in class; code that fails to do so will receive a grade of 0.
Testing Your Code
Check your code against the provided test cases. We’ll have interactive grading for projects, where you will demonstrate program functionality and walk through your logic.
Submission: submit via GitHub by checking in your code before the project deadline.
Grading
Check your code against the provided test cases. You should make sure your code runs on your Arch Linux VM.
Your grade is based on:
- Passing the test cases (make test).
  - Remember to run make testupdate to pull in the latest test cases.
  - When you are satisfied with your project, use make grade to test it on our test hardware.
  - You should continue to test your code for robustness beyond the provided test cases; i.e., just because you hard-coded a function to pass does not guarantee you will receive the points for its functionality.
- Code review: evaluation of code quality, stylistic consistency, cleanliness, efficiency, and documentation.
Changelog
- Initial project specification posted (2/23)