Study notes on "Machine Learning for OpenCV": Data preprocessing, handling missing data

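These are study notes on "Machine Learning for OpenCV", covering how to handle missing values during data preprocessing. They show how to use scikit-learn's SimpleImputer class to fill in missing values with the mean, median, most_frequent and constant strategies, and then verify the result.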

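1. Handling missing data

1.1 Imputing missing values

Most machine learning algorithms cannot handle not-a-number (NaN) values, which are written as nan in Python (for example numpy.nan). Every nan therefore has to be replaced with a suitable fill value before training. This operation is called imputing missing values.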
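To make the problem concrete, here is a minimal sketch of why NaN gets in the way and what imputation does by hand. The array `col` and its values are made up for illustration and do not come from the book.

```python
import numpy as np

# A small column with one missing entry (illustrative values).
col = np.array([2.0, np.nan, 4.0])

# NaN propagates through ordinary arithmetic, so the plain mean is NaN too.
plain_mean = col.mean()

# np.isnan locates the missing entries (NaN never compares equal to itself,
# so `col == np.nan` would not work).
missing_mask = np.isnan(col)

# np.nanmean ignores NaN and gives a candidate fill value.
fill_value = np.nanmean(col)

# Replace only the missing entries with the fill value.
filled = np.where(missing_mask, fill_value, col)

print(plain_mean)    # NaN
print(missing_mask)
print(filled)
```

This is the same idea that an imputer automates, column by column, for a whole feature matrix.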

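1.2 Implementation in scikit-learn

The sklearn.impute.SimpleImputer class in scikit-learn provides four different methods (strategies) for imputing missing values. SimpleImputer always computes its statistics column by column (per feature); it has no axis parameter, unlike the older Imputer class, whose axis argument defaulted to 0.

(1) mean: replace every nan with the mean of the non-missing values in the same column.

(2) median: replace every nan with the median of the non-missing values in the same column.

(3) most_frequent: replace every nan with the most frequent value in the same column.

(4) constant: replace every nan with a constant, set through the fill_value parameter.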
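The column statistics behind the first three strategies can be reproduced with plain NumPy. The following is only a sketch of what those statistics mean, not scikit-learn's actual implementation; the matrix `A` and the helper `column_mode` are hypothetical names introduced for illustration.

```python
import numpy as np

# Illustrative matrix with one missing value per column.
A = np.array(
    [[1.0, 10.0],
     [np.nan, 20.0],
     [3.0, 20.0],
     [8.0, np.nan]]
)

# mean / median: NaN-aware statistics computed per column (axis=0).
col_mean = np.nanmean(A, axis=0)
col_median = np.nanmedian(A, axis=0)


def column_mode(col):
    """Most frequent non-NaN value of a 1-D array (hypothetical helper).

    np.unique returns sorted values, and argmax returns the first maximum,
    so ties are resolved in favour of the smallest value.
    """
    values, counts = np.unique(col[~np.isnan(col)], return_counts=True)
    return values[np.argmax(counts)]


# most_frequent: the mode of each column.
col_mode = np.array([column_mode(A[:, j]) for j in range(A.shape[1])])

# Filling with the column means, as the 'mean' strategy would.
A_mean = np.where(np.isnan(A), col_mean, A)

print(col_mean, col_median, col_mode)
print(A_mean)
```

Broadcasting `col_mean` against `A` in `np.where` is what makes each nan pick up the statistic of its own column.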

1.3 Verification

# -*- coding:utf-8 -*-
import numpy as np
from numpy import nan
from sklearn.impute import SimpleImputer

X = np.array(
    [[nan, 0, -3],
     [2, 9, -8],
     [1, nan, 1],
     [5, 2, 4],
     [7, 6, -3]]
)

# Impute with the column mean
imp_mean = SimpleImputer(missing_values=nan, strategy='mean')
X_mean = imp_mean.fit_transform(X)
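The same check can be run for all four strategies. The following is a minimal, self-contained sketch that reuses the matrix X from the listing above; it is not the original author's code, and `fill_value=0` for the constant strategy is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same example matrix as in the listing above.
X = np.array(
    [[np.nan, 0, -3],
     [2, 9, -8],
     [1, np.nan, 1],
     [5, 2, 4],
     [7, 6, -3]]
)

# One imputer per strategy; fit_transform learns the per-column
# statistic and fills the nan entries in a single call.
X_mean = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform(X)
X_median = SimpleImputer(missing_values=np.nan, strategy='median').fit_transform(X)
X_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent').fit_transform(X)
# fill_value=0 is an illustrative choice, not taken from the source.
X_const = SimpleImputer(missing_values=np.nan, strategy='constant',
                        fill_value=0).fit_transform(X)

for name, result in [('mean', X_mean), ('median', X_median),
                     ('most_frequent', X_freq), ('constant', X_const)]:
    print(name)
    print(result)
```

With the mean strategy, the nan in the first column becomes 3.75 (the mean of 2, 1, 5, 7) and the nan in the second column becomes 4.25 (the mean of 0, 9, 2, 6). The third column has no missing values and is left unchanged by every strategy.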