Analysis for INFO 212 Data Science Programming I Assignment 1

Big Data: First Assignment

Assignment 1: NumPy and basic Python

Problem description:

Background on the dataset

The GreenHub initiative is a collaborative approach to Android energy consumption analysis. Its most important component is a dataset. The entries in the GreenHub dataset include multiple pieces of information, e.g., active sensors, memory usage, battery voltage, and temperature, running applications, model, and manufacturer, and network details. This raw data was obtained by continuous crowdsourcing through a mobile application called BatteryHub. It is worth noting that all such data is publicly available, while maintaining the anonymity and privacy of all its users. Indeed, it is impossible to associate any data with the user who originated it. The dataset is sizable and thus far it comprises of 23+ million unique samples, including more than 700+ million data points pertaining to processes running on these devices. The dataset is also diverse. It includes data stemming from 1.6k+ different brands, 11.8k+ smartphone models, from over 50 Android versions, across 160 countries. Overall, the dataset comprises more than 120GB of uncompressed data.

Note that although the files in this dataset are csv files, their values are separated by semicolons (;) rather than commas.
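
As a minimal sketch of what the semicolon separator means in practice, the snippet below reads a made-up three-column sample with pandas. The sample rows are hypothetical; only the column names and the `sep=";"` setting come from the assignment.

```python
import io

import pandas as pd

# Hypothetical sample in the same layout as the dataset files:
# one header row, then semicolon-separated values.
sample = io.StringIO(
    "id;device_id;temperature\n"
    "1;100;25.5\n"
    "2;101;30.0\n"
)

# sep=";" tells pandas to split on semicolons instead of commas.
df = pd.read_csv(sample, sep=";")
print(df.columns.tolist())
print(df.shape)
```

Without `sep=";"`, pandas would treat each whole line as a single column, because the default separator is a comma.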

For this assignment, you will have to examine a small part of this dataset, the dataset-samples part of the dataset (hereafter just the “dataset-samples” dataset). This part of the dataset consists of multiple files, each one grouping approximately 300,000 observations collected from a number of devices. Each of these observations includes data such as the temperature of the device when the observation was collected, whether wifi was on or off, amount of free memory, cpu usage, current battery level, among others. The files are organized so that each row corresponds to one observation. The values in each observation are separated by semicolon (’;’) and in this manner organized as columns. The first row of each file contains the labels for the columns. The following is the list of columns of each of these files:

Column names:

[‘id’, ‘device_id’, ‘timestamp’, ‘battery_state’, ‘battery_level’, ‘timezone’, ‘country_code’, ‘memory_active’, ‘memory_inactive’, ‘memory_free’, ‘memory_user’, ‘charger’, ‘health’, ‘voltage’, ‘temperature’, ‘usage’, ‘up_time’, ‘sleep_time’, ‘network_status’, ‘network_type’, ‘mobile_network_type’, ‘mobile_data_status’, ‘mobile_data_activity’, ‘wifi_status’, ‘wifi_signal_strength’, ‘wifi_link_speed’, ‘screen_on’, ‘screen_brightness’, ‘roaming_enabled’, ‘bluetooth_enabled’, ‘location_enabled’, ‘power_saver_enabled’, ‘nfc_enabled’, ‘developer_mode’, ‘free’, ‘total’, ‘free_system’, ‘total_system’]

Tasks to complete:

You should perform the following tasks:

  0. Obtain the dataset-samples dataset.

Dataset download link: http://yang.lzu.edu.cn/dsp/dataset-samples.7z (this link may no longer work)

  1. Familiarize yourself with the data contained in the .csv files.
  2. Build a function that takes as argument a list of column names and creates a NumPy array containing the values in the corresponding columns of the dataset-samples dataset, for all the files in the dataset. The number of columns of the matrix should be the same as the list of column names received as argument and the number of rows should be 49296480, the overall number of rows in this dataset. The function should return that NumPy array, filled in with the values read from the observations, for the corresponding columns.
  3. Using the aforementioned function, determine the minimum and maximum temperatures among all the observations in the dataset. What are the ids of the devices with these values?
  4. Determine what is the percentage of observations where location is enabled.
  5. What is the average battery level when power saver mode is enabled? And what is the average battery level when it is not enabled?
  6. What is the percentage of the observations where the network is off, mobile data is off, and wi-fi is off?
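Put simply, this is basic data processing, but the dataset is very large: the archive is 1.46 GB, it decompresses to 10.2 GB, and it contains 165 csv files in total. And this is still not all of the data; as mentioned above,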

Overall, the dataset comprises more than 120GB of uncompressed data

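The full dataset is more than 120 GB uncompressed, so the dataset used here is only a small part of it.
The submission is a fully executed Jupyter notebook, that is, a file with the .ipynb extension.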

The deliverable for this assignment is a Jupyter Notebook with all the code that you wrote. The notebook should also use the functions you built so that the responses to the questions posed in the tasks are visible in the notebook. For example, it should include the code to determine the percentage of observations where location is enabled but also show that percentage, as calculated for the dataset.

Tasks 0 and 1 need no discussion: they are just obtaining the dataset and getting to know it.

For task 2,

Build a function that takes as argument a list of column names and creates a NumPy array containing the values in the corresponding columns of the dataset-samples dataset, for all the files in the dataset. The number of columns of the matrix should be the same as the list of column names received as argument and the number of rows should be 49296480, the overall number of rows in this dataset. The function should return that NumPy array, filled in with the values read from the observations, for the corresponding columns.

I did this with pandas. The assignment title is "NumPy and basic Python", but the instructor later said:

- The first question is could we use other libraries like Pandas?
Well, the specification does not say you cannot, so I won’t contradict it. Having said that, you will have plenty of opportunity to use Pandas. Therefore, I’d rather you processed the data manually and used NumPy arrays instead.

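He does not object, but he would rather see students write this assignment with NumPy and without pandas. I had already learned pandas, though, so I took the lazy option without hesitation.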

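Put simply, task 2 asks for a function whose parameter is a list of column names and whose return value is a matrix holding the data from all the files, excluding the header rows. The task description is slightly off, though: it says the number of rows should be 49296480, but the header rows should not be counted. There are 165 files, so the final matrix should have 49296480 - 165 = 49296315 rows.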

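import os
from functools import reduce

import numpy as np
import pandas as pd

# helper for reduce: join two arrays column-wise
def stack(x, y):
    return np.hstack((x, y))
# extract column x of the current file (the global fh) as an (n, 1) object array
def to_array(x):
    return fh.loc[:, [x]].values.astype('object')

# parameter is a list of column names
# output is a matrix holding the values of those columns for all files
def readcol(name_list):
    path=r'C:\Users\·lzu\Downloads\dataset-samples'
    length = len(name_list)
    # start with zero rows and one column per requested name
    res = np.empty([0,length]).astype('object')
    for filename in os.listdir(path):
        fn = os.path.join(path, filename)
        # fh is global so that to_array() can read it
        global fh
        fh = pd.read_csv(fn, sep=";")
        array = list(map(to_array, name_list))
        # cast each column to object so the values keep their original form
        data = reduce(stack, array) if length>1 else fh.loc[:, name_list].values
        res = np.append(res, data, axis=0)
    return res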

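The important part of this function is the initialization of res. First, NumPy's empty is used to build an empty array with zero rows and as many columns as there are names in the input list, and then the array is cast to the object type. The reason for the cast is worth thinking about: a matrix has a single dtype, but the dataset contains floats, integers, and strings, and mixing them would cause trouble. For example, without the cast, combining integers and floats promotes the matrix dtype to the more general type, float64, so an integer such as 232 would turn into 232.0.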
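
The promotion described above can be seen on a tiny example. This is only an illustrative sketch: the two small arrays are made up, and the point is the difference between stacking with and without the object cast.

```python
import numpy as np

ints = np.array([[232], [17]])        # an integer column
floats = np.array([[36.6], [40.1]])   # a float column

# Without a cast, NumPy promotes both columns to a common dtype (float64),
# so the integer 232 is stored as the float 232.0.
mixed = np.hstack((ints, floats))
print(mixed.dtype, mixed[0, 0])

# Casting each column to object first keeps the original Python values.
kept = np.hstack((ints.astype('object'), floats.astype('object')))
print(kept.dtype, type(kept[0, 0]), type(kept[0, 1]))
```

The price of the object dtype is speed and memory: each element becomes a separate Python object, so numeric operations on such an array are slower than on a float64 array.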

fh is declared as a global variable so that to_array() can access it.

Even when the list has length 1, this function takes a long time to run.
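
Part of that cost likely comes from parsing every column of every file and from growing the result with np.append on each iteration, which copies the whole array each time. The sketch below is one possible alternative, not the method used in this post: `readcol_fast` is a hypothetical name, it takes the directory as a second argument, and it is demonstrated on two made-up files in a temporary directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def readcol_fast(name_list, path):
    """Hypothetical variant of readcol: parse only the requested columns
    and concatenate once at the end instead of appending inside the loop."""
    chunks = []
    for filename in sorted(os.listdir(path)):
        fn = os.path.join(path, filename)
        # usecols makes pandas skip the columns that are not needed
        fh = pd.read_csv(fn, sep=";", usecols=name_list)
        # usecols does not reorder columns, so select in the requested order
        chunks.append(fh[name_list].to_numpy(dtype=object))
    return np.concatenate(chunks, axis=0)

# Tiny demonstration on two made-up files
with tempfile.TemporaryDirectory() as tmp:
    for i, rows in enumerate(["1;20.5;a\n2;21.0;b\n", "3;19.0;c\n"]):
        with open(os.path.join(tmp, f"part{i}.csv"), "w") as f:
            f.write("id;temperature;health\n" + rows)
    result = readcol_fast(["temperature", "id"], tmp)

print(result.shape)
```

Whether this is actually faster on the real 10.2 GB dataset has not been measured here; it would need to be timed with %%time in the same way as the original function.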

For task 3,

Using the aforementioned function, determine the minimum and maximum temperatures among all the observations in the dataset. What are the ids of the devices with these values?

This must be done with the function defined above.
An explanation of "observation" can be found in a business statistics textbook:

An observation is a single member of a collection of items that we want to study, such as a person, firm, or region. An example of an observation is an employee or an invoice mailed last month.

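The most intuitive way to see this is with a figure:
[Figure: observation]
In plain terms, an observation is one row of data.
Task 3 asks for the maximum and minimum temperatures and the corresponding ids.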

First, load the data into memory:

# get the temperature column values
temperatures = readcol(['temperature'])

Running this in Jupyter Notebook with the %%time cell magic to record the time gives Wall time: 2min 47s.

# get the maximum temperature value
# nanmax ignores nan values
max_temp = np.nanmax(temperatures)
max_temp
255.0
# get the minimum temperature value
# nanmin ignores nan values
min_temp = np.nanmin(temperatures)
min_temp
-128.0

Define a function that returns the corresponding ids:

# return the unique values of body at the positions where index_array equals value
def with_index(body, index_array, value):
    return list(set(body[np.where(index_array == value)]))

Load the device_id column into memory:

# get the device_id column values
device_ids = readcol(['device_id'])

Get max_id and min_id; there may be more than one of each:

# get the ids of the devices whose temperature is the maximum or the minimum
max_id = with_index(device_ids, temperatures, max_temp)
min_id = with_index(device_ids, temperatures, min_temp)
# device id with max temperature
max_id
[305332]
# device id with min temperature
min_id
[136969]
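
To see why with_index may return more than one id, and why it removes duplicates, here is a small sketch on made-up arrays shaped like the output of readcol (one column, one row per observation). The values are hypothetical.

```python
import numpy as np

def with_index(body, index_array, value):
    # positions where index_array equals value, applied to body, de-duplicated
    return list(set(body[np.where(index_array == value)]))

# Hypothetical observations: device 9 reports the maximum twice, device 4 once.
temps = np.array([[30.0], [255.0], [12.5], [255.0], [255.0]])
ids = np.array([[7], [9], [7], [9], [4]])

hottest = with_index(ids, temps, temps.max())
print(sorted(hottest))
```

np.where on a two-dimensional array returns a pair of row and column index arrays, so indexing body with it yields a flat array of the matching ids, which set() then reduces to the distinct devices.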

Next comes task 4,

Determine what is the percentage of observations where location is enabled.

The info method of a pandas DataFrame lists the column names, which is how I found the location_enabled column that the task refers to.

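Running value_counts on the location_enabled column shows that it contains only 0 and 1. This is unlike wifi_status, used later, which takes the values enabled, disabled, unknown, enabling, and disabling. So when wi-fi is off, which value should be chosen? Probably disabled, but arriving at that answer takes some thought.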
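
A quick way to inspect the distinct values of a column is sketched below on a made-up DataFrame. The rows are hypothetical; only the column names and the listed wifi_status values come from the dataset.

```python
import pandas as pd

# Hypothetical rows using values that appear in the real columns
df = pd.DataFrame({
    "location_enabled": [1, 0, 0, 1, 0],
    "wifi_status": ["enabled", "disabled", "unknown", "disabled", "enabling"],
})

location_counts = df["location_enabled"].value_counts().to_dict()
wifi_counts = df["wifi_status"].value_counts().to_dict()
print(location_counts)
print(wifi_counts)
```

Seeing the full set of values first makes it explicit that treating only disabled as "off" leaves out the unknown, enabling, and disabling states, which is a judgment call.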

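For this task, the approach is not to read all the data into memory and then filter it. Instead, after each file is read, its contribution to the numerator and the denominator is counted and accumulated, and the division is done once at the end to get the percentage.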

# function for the fraction of observations where location is enabled
def percentage_location():
    path=r'C:\Users\·lzu\Downloads\dataset-samples'
    numerator = 0
    denominator = 0
    for filename in os.listdir(path):
        fn = os.path.join(path, filename)
        fh = pd.read_csv(fn, sep=";")
        # rows of this file where location is enabled
        splited_fh = fh[fh.loc[:, 'location_enabled'] == 1]
        numerator += splited_fh.loc[:, 'location_enabled'].count()
        denominator += len(fh)
    # returned as a fraction between 0 and 1, not multiplied by 100
    return numerator/denominator
# fraction of observations where location is enabled (about 37.3%)
percentage_location()
0.37298972549976606
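
The accumulate-then-divide pattern can be checked on small in-memory "files". This is a sketch under the assumption that each file has a location_enabled column of 0s and 1s; the two sample files are made up.

```python
import io

import pandas as pd

# Two hypothetical files with 3 and 2 observations
files = [
    io.StringIO("location_enabled\n1\n0\n1\n"),
    io.StringIO("location_enabled\n0\n0\n"),
]

numerator = 0
denominator = 0
for f in files:
    fh = pd.read_csv(f, sep=";")
    # count matching rows in this file only, then add to the running totals
    numerator += int((fh["location_enabled"] == 1).sum())
    denominator += len(fh)

fraction = numerator / denominator
print(fraction)
```

Only one file is held in memory at a time, and the result is the same as filtering the concatenated data, because counts add up across files.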

Task 5,

What is the average battery level when power saver mode is enabled? And what is the average battery level when it is not enabled?

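The average battery level while power saver mode is enabled restricts the set of observations under consideration. For the case where it is not enabled, it is enough to negate the condition, for example with !=.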

Task 5 is handled like task 4: accumulate, then divide once at the end.

# function for average battery level when power saver mode is enabled or not enabled
def average_battery(mode):
    path=r'C:\Users\·lzu\Downloads\dataset-samples'
    numerator = 0
    denominator = 0
    for filename in os.listdir(path):
        fn = os.path.join(path, filename)
        fh = pd.read_csv(fn, sep=";")
        enabled = fh.loc[:, 'power_saver_enabled'] == 1
        # mode=True keeps rows with power saver enabled, mode=False keeps the rest
        splited_fh = fh[enabled] if mode else fh[~enabled]
        battery_value = splited_fh.loc[:, 'battery_level'].values
        # drop the nan values
        battery_value = battery_value[~np.isnan(battery_value)]
        numerator += sum(battery_value)
        denominator += len(battery_value)
    return numerator/denominator
# average battery level when power saver mode is enabled
average_battery(True)
51.385269810249866
# average battery level when power saver mode is not enabled
average_battery(False)
55.253716605866785
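
The masking and negation can be seen on a few made-up rows. In this sketch np.nanmean stands in for the manual isnan filtering used above; the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical observations; one battery_level is missing
df = pd.DataFrame({
    "power_saver_enabled": [1, 0, 1, 0, 0],
    "battery_level": [40.0, 80.0, np.nan, 60.0, 70.0],
})

enabled = df["power_saver_enabled"] == 1

# ~enabled negates the Boolean mask element by element
mean_on = np.nanmean(df.loc[enabled, "battery_level"].to_numpy())
mean_off = np.nanmean(df.loc[~enabled, "battery_level"].to_numpy())
print(mean_on, mean_off)
```

Across many files, the sums and counts still have to be accumulated separately as in average_battery, since averaging the per-file means would weight small and large files equally.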

The last task,

What is the percentage of the observations where the network is off, mobile data is off, and wi-fi is off?

All three conditions must hold at once, which suggests a Boolean AND.

# function for the fraction of the observations where the network is off, mobile data is off, and wi-fi is off
def percentage_three_condition():
    path=r'C:\Users\·lzu\Downloads\dataset-samples'
    numerator = 0
    denominator = 0
    for filename in os.listdir(path):
        fn = os.path.join(path, filename)
        fh = pd.read_csv(fn, sep=";")
        network_status = fh.loc[:, 'network_status'].values == 'disconnected'
        mobile_data_status = fh.loc[:, 'mobile_data_status'].values == 'disconnected'
        wifi_status = fh.loc[:, 'wifi_status'].values == 'disabled'

        denominator += len(fh)
        # True only where all three conditions hold
        three_bool_list = network_status & mobile_data_status & wifi_status
        # summing a Boolean array counts its True entries
        numerator += int(three_bool_list.sum())

    return numerator/denominator

These four lines

network_status = fh.loc[:, 'network_status'].values == 'disconnected'
mobile_data_status = fh.loc[:, 'mobile_data_status'].values == 'disconnected'
wifi_status = fh.loc[:, 'wifi_status'].values == 'disabled'
three_bool_list = network_status & mobile_data_status & wifi_status

are the core of the solution to this task.

# the percentage of the observations where the network is off, mobile data is off, and wi-fi is off
percentage_three_condition()
0.1886132665291513
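
The element-wise AND can be traced by hand on four made-up observations. The status strings match the ones compared against above; the rows themselves are hypothetical.

```python
import numpy as np

# Four hypothetical observations; only the first has all three switched off
network = np.array(["disconnected", "connected", "disconnected", "disconnected"])
mobile = np.array(["disconnected", "disconnected", "connected", "disconnected"])
wifi = np.array(["disabled", "disabled", "disabled", "enabled"])

# Each comparison gives a Boolean array; & combines them element by element
all_off = (network == "disconnected") & (mobile == "disconnected") & (wifi == "disabled")

fraction_off = all_off.sum() / len(all_off)
print(all_off.tolist(), fraction_off)
```

The parentheses around each comparison matter, because & binds more tightly than == in Python.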

Debug a lot and search a lot, and you will certainly master this.
