INFS 5135 Assignment – Analysis of a dataset

Introduction

The aim of the assignment is to introduce you to the analysis of routine data sets (“wild datasets”). You will need to explore issues such as writing data dictionaries; assessing data quality; exploring the data using visual tools and performing some data wrangling; planning and performing data analysis; and writing a comprehensive report that presents your findings and summarises your recommendations.

For the assignment, you will be given a general scenario and a suggestion of a raw dataset. You will need to explore the given problem in more depth – this includes finding more data (datasets) relevant to the job.

You will be working in groups to produce both group and individual deliverables.

Project methodology

Data can be a product of a meticulously planned study, or it can be a side-product of practice (wild datasets). While planned studies typically yield well-defined, clean data, they are typically expensive in terms of both money and other resources. Such effort is not sustainable in the long term.

Data produced as part of routine activities or observation are, on the other hand, readily available at minimal cost. However, such data are typically incomplete, may contain errors, and require cleansing and transformation before they can be used beyond their primary purpose.

The framework we will be using for this assignment was developed by industry as the Cross-Industry Standard Process for Data Mining (CRISP-DM). This process has several phases:

Business understanding

Before you start any attempt to collect or analyse data, you need to get a good idea of why you are doing the exercise – understand the purpose. The main components are:

• Determine business objectives

– Initial situation/problem etc.

– Explore the context of the problem and context of the data collection (…types of organisations generating the data; processes involved in the data creation...)

• Assess situation

– Inventory of resources (personnel, data, software)

– Requirements (e.g. deadline), constraints (e.g. legal issues), risks

Understanding your business will help you determine the scope of the project, the timeframe, the budget, etc.

NB: The direction of your analysis is determined by your business needs. An attempt to analyse a dataset without prior identification of the main directions would lead to extensive exploration. While this may be justified in some cases, in real business it is seldom required. You are NOT doing academic research aiming to create new knowledge – you are trying to get answers to drive your business decisions!

Data understanding

The next step is to look at what data is needed (and available) and to write data definitions, so that we know exactly what we are talking about. This is very important when aggregating apparently identical data: the definitions may not be the same. Blood pressure readings may look exactly alike, yet it makes a real difference whether the data were acquired in the ICU via an intra-arterial cannula or are casual self-monitoring measurements the patient took at home. Nailing down the date format is equally important, especially when aggregating data from different sources: 02/03/12 can be the 2nd of March 2012 (DD/MM/YY), the 3rd of February 2012 (MM/DD/YY) or the 12th of March 2002 (YY/MM/DD). Explicitly describe any coding schemas, etc.
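As a minimal sketch of why the format must be pinned down, the same raw string parses to three different dates depending on the assumed format. This uses pandas; the string and the formats are illustrative only, not taken from the assignment data:

import pandas as pd

# The same raw string yields three different dates depending on the assumed format.
raw = "02/03/12"
print(pd.to_datetime(raw, format="%d/%m/%y"))  # 2012-03-02 (DD/MM/YY)
print(pd.to_datetime(raw, format="%m/%d/%y"))  # 2012-02-03 (MM/DD/YY)
print(pd.to_datetime(raw, format="%y/%m/%d"))  # 2002-03-12 (YY/MM/DD)

Recording the intended format in the data dictionary (and parsing with an explicit format rather than guessing) removes this ambiguity when sources are merged.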

• Collect initial data

– Acquire data listed in project resources

– Report locations of data, methods used to acquire them, ...

• Describe data

– Examine "surface" properties

– Report, for example, the format, quantity of data, ... → Data dictionary

– NB: the data dictionary summarises your knowledge of each piece of data – this description can be considered part of the dataset: each piece of data comes with metadata describing its meaning, coding, context of collection, etc. In many cases you will be given these descriptions along with the dataset (a minimal example entry is sketched at the end of this section).

• Explore data

– Examine central tendencies, distributions, look for patterns (visualisations etc.)

– Report insights suggesting examination of particular data subsets (data selection)

• Determine data quality (consider the dimensions of data quality)

– Completeness

– Uniqueness

– Timeliness

– Validity

– Accuracy

– Consistency

NB: this is an initial exploration – scouting the problem space. It helps you understand what data is available and it aligns your approach with the business objectives and the data available. At the same time, this phase can help verify whether the project is viable (feasibility) and refine the project scope, budget, resources, etc.

This phase is very different from a typical prospective research approach, where you design the study in such a way that you always know what you are getting…
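As a minimal illustration of the data dictionary entries mentioned above, one entry per field might record something like the sketch below. The field name, format, codes and quality issues shown are hypothetical, not taken from the assignment dataset:

# Hypothetical data dictionary entry for a single field (illustrative only).
data_dictionary_entry = {
    "field_name": "job_posted_date",       # name as it appears in the source file
    "description": "Date the job listing was posted",
    "data_type": "date",
    "format": "YYYY-MM-DD",                # pin the format down explicitly
    "source": "LinkedIn scrape, 2024",     # context of collection
    "allowed_values": "any valid 2024 calendar date",
    "derived": False,                      # True for attributes you construct yourself
    "known_quality_issues": "posted date missing for some listings",
}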

Data preparation

Typically, the data you get is not in the right format for analysis (it was collected for other purposes) and needs to be pre-processed.

• Select data

– Relevance to the data mining goals

– Quality of data

– Technical constraints, e.g. limits on data volume

• Clean data

– Raise data quality if possible

– Selection of clean subsets

– Insertion of defaults

• Construct data

– Derived attributes (e.g. age = NOW – DOB; possibly subsequent coding of age into buckets, etc.; see the sketch after this list) – do not forget to add these attributes to your data dictionary!

• Integrate data

– Merge data from different sources

– Merge data within source (tuple merging)

 

• Format data

– Data must conform to the requirements of the initially selected mining tools (e.g. Weka and Disco expect different input formats).
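A minimal sketch of constructing a derived attribute and bucketing it, using pandas; the column names and the age bands are hypothetical, not part of the assignment data:

import pandas as pd

# Hypothetical frame with a date-of-birth column; in practice this comes from your dataset.
df = pd.DataFrame({"dob": pd.to_datetime(["1990-05-17", "1975-11-02", "2001-01-23"])})

# Derived attribute: age in whole years at the time of analysis (age = NOW - DOB).
now = pd.Timestamp.now()
df["age"] = ((now - df["dob"]).dt.days // 365).astype(int)

# Subsequent coding of age into buckets; add both "age" and "age_band" to the data dictionary.
df["age_band"] = pd.cut(df["age"], bins=[0, 25, 45, 65, 120],
                        labels=["<=25", "26-45", "46-65", "65+"])
print(df)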

Modelling

This phase goes hand-in-hand with the data preparation. Here you select what analytic techniques you are planning to use, in which sequence etc. Once you have the analysis design, you execute it.

• Select modelling technique

– Finalise the methods selection with respect to the characteristics of the data and purpose of the analysis

– E.g., linear regression, correlation, association detection, decision tree construction…

• Generate test design

– Define your testing plan – what needs to be done to verify the results from analysis (verify the validity of your model). E.g.:

• Separate test data from training data (in case of supervised learning)

• Define quality measures for the model

• Build model

– List parameters and chosen values

– Assess the model (see the sketch after this list)
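A minimal sketch of these steps for a supervised model, assuming scikit-learn and a small made-up dataset; none of the parameter choices are prescribed by the assignment, they only illustrate separating test data, building a model and assessing it:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Made-up feature matrix X and binary label y standing in for your prepared data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Test design: separate test data from training data (supervised learning).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the model; list the chosen parameters (here: max_depth) in your report.
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Assess the model on data it has never seen, using the quality measure you defined.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))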

At the end of the Data preparation/Modelling phase you have a set of results coming from the analysis (you have a model).

NB: this needs to be assessed and evaluated from the technical point of view (to mitigate issues such as overfitting etc.).

Evaluation

Here you evaluate the results (model) from the business perspective (Did we learn something new? How do the results fit into knowledge we already have? Does the predictive model work? etc.).

• Evaluate results from business perspective

– Test models on test applications if possible

• Review process

– Determine if there are any important factors or tasks that have been overlooked

• Determine next steps (Recommendations)

– Depending on your analysis (results, interpretations), you need to recommend what the next step will be. In general, the next step can be:

• Deploy the solution (you have reached a stage where you have a viable solution)

• Kill the project (you have exhausted all meaningful options and decide that continuing the project is not viable/feasible from the business point of view)

• Go into the next iteration.

• Improve the model.

• Build an entirely new model.

NB: Do not jump to decisions without the analytic evidence to support such decisions (recommendations).

Deployment

In this phase you conclude the project.

• Plan deployment

– Determine how results (discovered knowledge) are effectively used to reach business objectives

• Plan monitoring and maintenance

– Results become part of day-to-day business and therefore need to be monitored and maintained.

• Final report

• Project review

– Assess what went right and what went wrong, debriefing

NB: Deployment can be a launch of a new project with its own problems. E.g. you have a static data extract you can use to develop a solution. Once you have a viable solution, deploying it will require connection to live data input feeds. This opens a whole new set of issues to be solved:

· Automate data extraction

· Automate semantic interoperability and data linkage

· Automate data quality monitoring

· Design, develop and deploy security context

· Etc.

Caveats

The CRISP-DM framework describes the phases in a rather linear (cyclic) fashion. In theory, it can be done that way. However, in reality, this is an exploration process frequently based on trial and error. You will work with the data and use frequent visualisation to “see” the patterns. Then you confirm what you “see” with more formal statistics.

General scenario

A US consulting company was engaged to analyse the job market for people with business analyst qualifications. They were able to scrape data from LinkedIn on job listings posted in 2024.

Your task is to look at patterns related to jobs requiring a Business Analyst qualification (such as which jobs require this skill, which employers look for people with this skill, where the jobs are located, what other skills are listed alongside the Business Analyst skill, etc.).

You will have to deal with several challenges, such as the size of the datasets, decomposing skills listed in one field, matching skills to job descriptions, etc. (A sketch of decomposing a skills field follows this paragraph.)
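A minimal sketch of decomposing a multi-valued skills field into one row per skill, using pandas; the column names and the comma delimiter are assumptions to be checked against your actual data dictionary:

import pandas as pd

# Hypothetical extract: one row per job listing, with skills packed into a single field.
jobs = pd.DataFrame({
    "job_id": [1, 2],
    "skills": ["Business Analyst, SQL, Tableau", "Business Analyst, Python"],
})

# Split on the delimiter, explode to one row per skill, and strip stray whitespace.
skills_long = (
    jobs.assign(skill=jobs["skills"].str.split(","))
        .explode("skill")
        .assign(skill=lambda d: d["skill"].str.strip())
        .drop(columns="skills")
)
print(skills_long)

Counting which skills co-occur with the Business Analyst skill then becomes a simple group-by over this long format.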

You will need to explore the problem space (reading and mind maps) and declare a narrowed-down focus (the time and resource limitations do not allow a complete study). You will need to decide how to work with a collection of large data sources, extract the relevant parts and possibly find additional data (from public sources). In this course you are expected to do the first iteration (and recommend next steps at the end of it – this typically leads to planning the 2nd iteration of the project). NB: you may not be able to reach a stage where you have a business solution, so do not jump to conclusions!

Business understanding

Explore the dataset and the source of this data. You will discuss this in your group and document the discussion by drawing mind maps (individual as preparation for the group discussion; then final group mind map representing your understanding of the problem).

Annotate (CRAAP) relevant publications (you annotate 2 publications, but you read as many as necessary). Brainstorm and summarise your findings in the group. Decide on the focus for your analysis – what factors you expect to go into the model and why.

Write a brief justification of a project – make your decisions explicit.

Analysis of data

For this part of your assignment, you will need to identify and acquire relevant datasets from public sources. Your task will be to have a look at the data (informed by your previous reading – if your understanding is not yet sufficient, you will need to do some more reading) in order to understand it:

· Extract a data dictionary from the data source documentation and add descriptions of any data you construct. Note any assumptions you make. NB: you add everything you know about each piece of data into the data dictionary.

· Select which data you will be using for your analysis (and justify your choice)

· Consider any additional data/datasets you may need for your analysis (and document them)

· Construct data you think you need – justify why you need this data, and describe in detail (in data dictionary) how you are going to construct the data point (formulas, …)

· Explore the data (e.g. basic statistics, graphs…; see the sketch after this list)

· Comment on data quality (refer to the 6 dimensions of data quality mentioned in the lecture) – BOTH at the dataset level (e.g. selection bias) and variable level
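A minimal sketch of the exploration pass mentioned above, using pandas; the columns (and the commented-out file name) are hypothetical and should be replaced with whatever your actual extract contains:

import pandas as pd

# Tiny stand-in for the real extract; in practice you would load your prepared file,
# e.g. df = pd.read_csv("job_listings_sample.csv")  (hypothetical file name).
df = pd.DataFrame({
    "job_title": ["Business Analyst", "Data Analyst", None],
    "location": ["Sydney", "Adelaide", "Adelaide"],
    "salary": [95000, 88000, None],
})

print(df.shape)                       # quantity of data (rows, columns)
print(df.dtypes)                      # surface properties: type of each column
print(df.describe(include="all"))     # central tendencies, counts, unique values
print(df["location"].value_counts())  # e.g. where listings are concentrated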

Based on your understanding of the purpose of the analysis and the data you have, you:

· Make your choices on analytic methods (start with basic stats and visualisations) and justify your choices.

· Formatting/re-formatting data – what changes need to be done for methods you apply (NB: if there is no need for re-formatting, briefly state this)

· Write an analysis plan – to discuss in the class

At this point you should have a reasonably clear idea on what you plan to do with the data, as well as what transformations were needed to prepare the data. You execute your analytic plan (modelling...):

· Perform the analysis as you proposed it, considering any comments you may have received.

You may need to go back to data preparation or do some additional reading – the process is not linear! Do not forget to check the (technical) validity of your results (e.g. overfitting...)

Now that you have your results, you evaluate them and write comments and recommendations. You need to distinguish findings (facts you found; evidence coming directly from your analysis), interpretations (what *you* think the findings mean – use your data/business understanding here) and, finally, recommendations (what you suggest as next steps: do more analysis, collect a specific dataset, do a study focussing on something more specific, or explain how to use the model if you think it is good). Make sure your recommendations are consistent with your findings and interpretations.

(Evaluation of a predictive model – generate a confusion matrix; comment on it and recommend what might be the next steps to improve the performance of the model)
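As a minimal illustration of that evaluation step, assuming scikit-learn; the labels below are made up, not results from the assignment data:

from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical true vs. predicted labels from a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))       # rows: actual class, columns: predicted class
print(classification_report(y_true, y_pred))  # precision, recall and F1 per class

Commenting on which kinds of misclassification dominate is what drives the recommendation on next steps.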

Formatting

Your document is supposed to be aimed at a professional audience (a consulting company) – adjust your style accordingly. Both assignments form one project, so you recycle some of your deliverables from Assignment 1 in Assignment 2 (data dictionary, data quality comments, etc.). Re-using some of these components may lead to higher Turnitin scores.

Please do not write lengthy introductions (your audience is expected to know their business!).

Images and tables are expected to have captions. Lengthy components (such as Data dictionary) are expected to be presented as appendices (and referred to in the document whenever appropriate).

Use references only if you need them (no merit in “backfilling” references) but use as many references as necessary to document any work which is not yours. Preferred format is Harvard, but you can use any other format if you use it consistently throughout the entire document.

Word count – use any number of words you need. There will be no penalty for exceeding the word count, but I may consider deducting points for excessive “fluff” (unnecessary filler).

Assignments

The work described above is split into 2 assignments. The following sections describe the expected deliverables for groups and individuals (NB: in many cases the group deliverable is derived from individual contributions).

Assignment 1

In this part you will do:

Group:

· Mind map of the problem (result of brainstorming; distillation from individual mind maps)

· Project justification and scope (explain what is going to be the main goal of your analysis)

· Data dictionary – this includes a consistent description of data – both copied/adjusted from the data source and description of data the group members constructed

· Summary of dataset exploration (datasets you were given PLUS any additional datasets you consider using in your analysis – e.g. socio-economic data...).

· Analysis plan with justification, assigning work to individuals (what do you think the data is telling you – based on your preliminary exploration – and how are you going to confirm/reject your hypotheses with science – statistical/analytic methods...)

Individual

· Bibliography - find, read, and evaluate (CRAAP test) 2 sources as a basis for group discussions and brainstorming. Draft an individual mind map. 

· Mind map of the problem (your individual preparation for the group discussion)

· Data dictionary – detailed description of the data you are going to use (including data construction – any derived data you may need to create)

· Result of exploring the data – each group member will submit result of their exploration of the dataset (what was done, why it was done, what was the result, what do you think about the result; this includes any visualisations you have done)

· Data quality analysis – at the dataset level and at the level of each variable (what are the problems? can they be fixed? how will data quality influence the validity/trustworthiness of the results?). Refer to the data quality dimensions to structure this deliverable. Use quantitative measures whenever applicable (e.g. if you identify missing data, state the proportion of missing data for each variable and look for patterns of missing data – randomly distributed or showing specific relationships...).
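A minimal sketch of quantifying missing data per variable and checking one simple missingness pattern, using pandas; the columns are hypothetical stand-ins for your own dataset:

import pandas as pd

# Tiny stand-in extract with deliberately missing values.
df = pd.DataFrame({
    "company": ["A", "B", "C", "D", None],
    "salary": [95000, None, None, 88000, 90000],
    "remote": [True, True, False, None, False],
})

# Proportion of missing values for each variable (the completeness dimension).
print(df.isna().mean().round(2))

# One simple pattern check: is salary missing more often for remote listings?
print(df.assign(salary_missing=df["salary"].isna())
        .groupby("remote", dropna=False)["salary_missing"].mean())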

NB: check the course calendar for which task is due at what time. Feedback will be given in the practical class (for external students, feedback will be given either in writing by e-mail or via phone/teleconference).

Hint: you may use a smaller extract (sample) from the large dataset for initial exploration. You will then have to think about how to prepare the dataset you will actually be analysing (removing unnecessary parts, etc.).

Assignment 2

In this part you will do:

Group:

· Results of analysis compilation – summary of results/findings produced by individual group members (results from your objective assessment: in your exploration you identified interesting patterns/relationships – now you need to confirm these with objective methods and report the results. E.g. exploration: two lines appear to correlate → hypothesis of correlation → you calculate the correlation coefficient → result confirming/rejecting the correlation; see the sketch after these lists).

· Description of the model you created, how you tested it and its performance (test results)

· Final report - includes interpretation of findings and Recommendations

Individual:

· Analysis – as assigned by the group in the analysis plan

· Result of analysis (what you found, what the data tells you – i.e. trends, patterns; results of testing your model etc.; NB: this is about facts, not interpretations or opinions)

· Interpretation of your findings (here is where you express your opinions, interpretations etc. of what you found. Interpretation puts your results into context with other aspects of the “business” – you may need to do some additional reading). In this section you submit all your results and interpretations, even if they do not become part of the group final report.
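A minimal sketch of the exploration-to-confirmation step referred to in the group deliverables above, using SciPy and made-up numbers; a Pearson correlation is only one of several methods you might justify:

from scipy.stats import pearsonr

# Hypothetical paired observations that looked correlated during visual exploration.
x = [10, 12, 15, 19, 22, 25, 27, 30]
y = [31, 35, 42, 50, 58, 64, 66, 73]

r, p_value = pearsonr(x, y)
print(f"r = {r:.2f}, p = {p_value:.4f}")  # report this as confirming/rejecting the suspected correlation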

Submission of your work

Assignments are submitted via learnonline (refer to the BIA schedule document on deadlines and mode of presentation). For the learnonline group submission please follow these rules:

Assignment 1:

1. you write a consolidated document for the *group* (this is the document you would give to your client as consultants - clients do not particularly care about who did what part...). NB: you are writing for a senior person - please do not give lengthy introductions... The structure of the document is roughly governed by the deliverables described in the assignment specification; however, I accept any meaningful structure (think about making sure we can easily find the markable components!).

2. your individual material - this will be included in the form you produce it in. No further editing/formatting is required - this is material for us to work out the individual mark, so you do not need to make it look the same for all members (I recommend you negotiate a uniform format before you write the individual parts - this makes compiling the material into a consolidated report easier...). Your individual material will be added to the document as appendices - one appendix for each member of the group, with the e-mail ID and the student name as the header of the appendix. E.g.: Appendix X - Jan Stanek - STAJY001.

3. The contribution of members of the group has to be made explicit: you add an appendix explaining the contribution of each member (who did what). NB: if you have non-contributing members DO NOT exclude them from communication - keep all of them posted and in general treat them in the same way as the contributing members and give them all the opportunity to contribute. Make the contribution list available for all members. 

4. The complete material (consolidated document + appendices with individual work) will be submitted by one member of the group. The group will choose who will submit - and this is the submission I will be taking for marking. All other members can submit whatever they want (if you think the submitting member will miss the deadline, etc.), but this material will possibly not be seen by me (although I can use it for conflict resolution...). The main reason for having one identified submission for marking per group is to avoid version conflicts and confusion stemming from several slightly different versions of the same report being submitted by members of the same group.

5. The name of the submitted file for marking will be: "<practical> Group <group number> report for marking", where <practical> is "Monday", "Tuesday", "Wednesday" or "External", and <group number> is the group number we assigned to your group. E.g.: "Thursday Group 1 report for marking". All other members of the group can submit their material (or indeed whatever version of the material they want) with the file name: "<practical> Group <group number> - individual <your network ID>", where <network ID> is your e-mail ID - e.g. "Thursday Group 1 individual STAJY001". The preferred format of the file is Word or PDF. If there are more files, pack them as ZIP or RAR.

6. Each submission is expected to have the group identification (e.g. "Tuesday Group 1") and student list including your networkID ("Stanek Jan, STAJY001"...) on its first page (I print the material for marking...). Please use networkIDs, not your student ID numbers.

Assignment 2 submission

The final report is aimed at the general public.

I suggest the following structure:

Background and motivation - You recycle your intro and justification here. Any argument needs to be supported by references (use your bibliography).

Material for analysis - Brief (!) comments on what data you got and what its quality is. Your data dictionary goes into the appendices (copy, paste and edit from Assignment 1...) - it would be rather disruptive to have these details in the main document. Data quality issues should be commented on (recycle your Assignment 1 material), but put any lengthy tables, statistics, etc. in an appendix.

Analytic approach - What have you done with the data (use your analysis plan as a start)? Typically a schematic diagram of what you did with the data would go here (with most details listed in the appendix). NB: your actual analysis may differ from the plan stated in Assignment 1 - as you proceed with the analysis your plan may change: make all relevant changes.

Findings - Here go the relevant (from the business point of view) results (all other results will go into an appendix). To keep the message prominent, the graphic representation is preferred (NB: use captions explaining what each graph, table shows - with reference to a more detailed result in the appendix wherever appropriate). Describe your resulting model.

Interpretations - Here you express your opinions on what the results mean - you interpret the results. You evaluate the model and comment on its strengths and weaknesses.

Recommendations - You write your suggestions, based on your results and their interpretations - what are the options for further steps for the decision-makers.

E.g.: in your analysis, you describe how you identify outliers (method); in your results, you show what you found; in the interpretation, you comment on whether these are errors, extreme/rare values, or a mix of different subpopulations (or something else). If you think the outliers are important and the decision-makers need to do something about them - you write a recommendation (NB: not all findings lead to a recommendation).
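As one possible method for the outlier-identification step in the example above (a simple 1.5 * IQR rule sketched with pandas; this is an illustration, not the prescribed method, and the salary figures are made up):

import pandas as pd

# Hypothetical advertised salaries, including one suspicious value.
salary = pd.Series([85000, 90000, 92000, 95000, 99000, 102000, 950000])

q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = salary[(salary < q1 - 1.5 * iqr) | (salary > q3 + 1.5 * iqr)]
print(outliers)  # finding: these values lie outside the 1.5 * IQR fences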

E.g. if you remember the Disco process mining exercise: there are 198 cases where a patient bypassed triage (finding); these cases can be an error in coding, or these can be patients who arrived by ambulance and whose triage was already done by the paramedics (two alternative interpretations of the finding - list both); if this is considered important enough, you can recommend the following options:

1. get the IDs of these patients and do an audit (go to archive, pull out the paper records and try to find reasons why the patient was not triaged);

2. get more data (from the full data dictionary you know that there is a piece of data stating the mode of arrival - seeing the patients in question coded as arrived by an ambulance will support that interpretation),

3. run a prospective study looking into recording triage... (The decision-makers then judge these options from the point of view of time, effort, expenses etc.) 

The structure of your report should be shaped so that it is easy to identify the markable components.

References – you include any references on data sources you used; any papers, guidelines, formulas, images etc. you did not create. Any such material should make sense in context with your analysis and report (i.e. do not include references just because you want to have a lot of them…)

The submission process is similar to that for Assignment 1 (please make sure your material has your group day and number and the list of team members on the first page - I typically print the material for marking).

1. you write a consolidated document for the *group* (this is the document you would give to your client as consultants - clients do not particularly care about who did what part...). 

2. your individual material - this will be included in the form you produce it in. No further editing/formatting is required - this is material for me to work out the individual mark, so you do not need to make it look the same for all members (I recommend you negotiate a uniform format before you write the individual parts - this makes compiling the material into a consolidated report easier...). Your individual material will be added to the document as appendices - one appendix for each member of the group, with the e-mail ID and the student's name as the header of the appendix. These appendices will come after the appendices needed for your group final report. E.g.: Appendix X - Jan Stanek - STAJY001.

3. The contribution of members of the group must be made explicit: you add an appendix explaining the contribution of each member (who did what). NB: if you have non-contributing members DO NOT exclude them from communication - keep all of them posted and in general treat them in the same way as the contributing members and give them all the opportunity to contribute. Make the contribution list available for all members.

4. The complete material (consolidated document + appendices with individual work) will be submitted by one member of the group. The group will choose who will submit - and this is the submission I will take for marking. All other members can submit whatever they want (if you think the submitting member will miss the deadline, etc.), but this material will possibly not be seen by me (I can use it for conflict resolution...). The main reason for having one identified submission for marking per group is to avoid version conflicts and confusion stemming from several slightly different versions of the same report submitted by members of the same group.

5. The name of the submitted file for marking will be: "<practical> Group <group number> report for marking", where <practical> is "Monday", "Tuesday", "Wednesday" or "External", and <group number> is the group number I assigned to your group. E.g.: "Thursday Group 1 report for marking". All other members of the group can submit their material (or indeed whatever version of the material they want) with the file name: "<practical> Group <group number> - individual <your network ID>", where <network ID> is your e-mail ID - e.g. "Thursday Group 1 individual STAJY001". The format of the file can be Word or PDF (if you want to submit something else, write me an e-mail and ask). If there are more files, pack them as ZIP or RAR.

6. Each submission is expected to have the group identification (e.g. "Tuesday Group 1") and student list including your networkID ("Stanek Jan, STAJY001") on its first page.
