Date,Locality,District,New Cases,Hospitalizations,Deaths
5/21/2020,Accomack,Eastern Shore,709,40,11
5/21/2020,Albemarle,Thomas Jefferson,142,19,4
5/21/2020,Alleghany,Alleghany,9,4,0
5/21/2020,Amelia,Piedmont,22,7,1
5/21/2020,Amherst,Central Virginia,25,3,0
5/21/2020,Appomattox,Central Virginia,25,1,0
5/21/2020,Arlington,Arlington,1763,346,89
... // skipped down to the next day
5/20/2020,Accomack,Eastern Shore,709,39,11
5/20/2020,Albemarle,Thomas Jefferson,142,18,4
5/20/2020,Alleghany,Alleghany,10,4,0
5/20/2020,Amelia,Piedmont,21,7,1
5/20/2020,Amherst,Central Virginia,25,3,0
5/20/2020,Appomattox,Central Virginia,24,1,0
5/20/2020,Arlington,Arlington,1728,334,81
5/20/2020,Augusta,Central Shenandoah,88,4,1
... // continued
I have data for a US state, like the above, in a CSV file, and I would like to run some analysis on it and serve the results through a REST API. The analysis consists of various aggregations, such as: total cases across the state by date, total cases for the entire state, total cases grouped by district, total cases for a district by date, total cases for a county by date, and so on. Basically, all the simple group-bys that one could do with this data.
Now, my problem is figuring out how to properly store this data in Java, without a database. I have one working implementation that uses a list of Row objects, where each Row object holds a single row of the CSV. Using Java's Stream API, I have been able to filter the list and compute some of these statistics. I then package the statistics into a single Row object or a List and send it to the API to be serialized to JSON. This has worked OK, but I feel that it is not the best way.
Is there some other, more object-oriented way to use the Date, District, County, and Cases columns?
I was thinking of doing something like this:
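import java.time.LocalDate;
import java.util.List;

class State {
    List<District> districtList; // all districts in the state
    String name;
}
class District {
    List<County> countyList; // one County entry per county per date
    String name;
}
class County {
    LocalDate date;
    String name;
    int cases;
    // more stuff
}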
Then I would create one State object with a list of District objects, each with a list of many County objects, one per date.
Does this seem like overkill? Is there some other clean way to read this dataset into a data structure that makes it easy to aggregate summary information?
My current approach works, but I am looking for a better way!
Solution
From your description, your approach seems sound and properly object-oriented. However, without additional information (e.g. specific aggregations that may dictate otherwise), it seems odd to have multiple "duplicate" County objects in each District object, one per date. For example:
[{"date":"5/21/2020","name":"Accomack"},
{"date":"5/20/2020","name":"Accomack"}]
From an object-oriented view, it seems you'd want an additional level of aggregation by date, with each date containing a list of County rows.
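As a rough illustration of that extra date level, here is a minimal sketch. The names `DistrictByDateSketch`, `CountyCount`, `add`, and `casesOn` are hypothetical and not from the question; the sketch also assumes that only the Cases column is needed and that a sorted map keyed by date is acceptable.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DistrictByDateSketch {
    // Hypothetical value type: one county's count on a single day.
    static final class CountyCount {
        final String county;
        final int cases;

        CountyCount(String county, int cases) {
            this.county = county;
            this.cases = cases;
        }
    }

    // A district with the extra date level: each date maps to that day's county rows.
    static final class District {
        final String name;
        private final Map<LocalDate, List<CountyCount>> countiesByDate = new TreeMap<>();

        District(String name) {
            this.name = name;
        }

        // Files one county row under its date, creating the date bucket on first use.
        void add(LocalDate date, String county, int cases) {
            countiesByDate.computeIfAbsent(date, d -> new ArrayList<>())
                          .add(new CountyCount(county, cases));
        }

        // Sums the Cases column over every county in this district on one date.
        int casesOn(LocalDate date) {
            return countiesByDate.getOrDefault(date, Collections.emptyList())
                                 .stream()
                                 .mapToInt(c -> c.cases)
                                 .sum();
        }
    }

    public static void main(String[] args) {
        District central = new District("Central Virginia");
        central.add(LocalDate.of(2020, 5, 21), "Amherst", 25);
        central.add(LocalDate.of(2020, 5, 21), "Appomattox", 25);
        central.add(LocalDate.of(2020, 5, 20), "Amherst", 25);
        central.add(LocalDate.of(2020, 5, 20), "Appomattox", 24);
        System.out.println(central.casesOn(LocalDate.of(2020, 5, 21))); // prints 50
        System.out.println(central.casesOn(LocalDate.of(2020, 5, 20))); // prints 49
    }
}
```

With this shape, "total cases for a district by date" becomes a single map lookup plus a sum, at the cost of fixing date as the primary access path.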
One consideration: if your aggregations align better with a database-style approach, I would keep each row from the source data as-is and query the rows directly, filtering and sorting them with Stream lambdas.
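A minimal sketch of that flat, query-style approach is shown below. It assumes the column order from the CSV header, no quoted commas in fields, and dates in M/d/yyyy form. The names `RowAggregationSketch`, `Row`, `parse`, `casesByDate`, and `casesByDistrictAndDate` are hypothetical; the grouping itself uses the standard `Collectors.groupingBy` and `Collectors.summingInt`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class RowAggregationSketch {
    // Hypothetical row type: one instance per CSV data line, kept as-is.
    static final class Row {
        final LocalDate date;
        final String locality;
        final String district;
        final int cases;
        final int hospitalizations;
        final int deaths;

        Row(LocalDate date, String locality, String district,
            int cases, int hospitalizations, int deaths) {
            this.date = date;
            this.locality = locality;
            this.district = district;
            this.cases = cases;
            this.hospitalizations = hospitalizations;
            this.deaths = deaths;
        }
    }

    // Assumption: dates look like 5/21/2020.
    private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("M/d/yyyy");

    // Parses one data line; assumes six comma-separated fields with no quoting.
    static Row parse(String line) {
        String[] f = line.split(",");
        return new Row(LocalDate.parse(f[0], FMT), f[1], f[2],
                Integer.parseInt(f[3]), Integer.parseInt(f[4]), Integer.parseInt(f[5]));
    }

    // Statewide sum of the Cases column, keyed by date in ascending order.
    static Map<LocalDate, Integer> casesByDate(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                (Row r) -> r.date, TreeMap::new,
                Collectors.summingInt((Row r) -> r.cases)));
    }

    // Sum of the Cases column per district, then per date within each district.
    static Map<String, Map<LocalDate, Integer>> casesByDistrictAndDate(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                (Row r) -> r.district, TreeMap::new,
                Collectors.groupingBy((Row r) -> r.date, TreeMap::new,
                        Collectors.summingInt((Row r) -> r.cases))));
    }

    public static void main(String[] args) {
        // A few lines copied from the sample data in the question.
        List<Row> rows = Arrays.asList(
                parse("5/21/2020,Accomack,Eastern Shore,709,40,11"),
                parse("5/21/2020,Amherst,Central Virginia,25,3,0"),
                parse("5/21/2020,Appomattox,Central Virginia,25,1,0"),
                parse("5/20/2020,Accomack,Eastern Shore,709,39,11"),
                parse("5/20/2020,Amherst,Central Virginia,25,3,0"),
                parse("5/20/2020,Appomattox,Central Virginia,24,1,0"));
        System.out.println(casesByDate(rows)); // prints {2020-05-20=758, 2020-05-21=759}
        System.out.println(casesByDistrictAndDate(rows));
    }
}
```

Because the rows stay flat, each new aggregation is just another collector over the same list, and the resulting maps can be handed to the API layer for JSON serialization without any nested State/District/County hierarchy.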