Learning Data Mining with Python - Chapter 2

All code and data files come from Learning Data Mining with Python by Robert Layton.
The environment used is Jupyter Notebook.

Chapter 2

2.1 scikit-learn Estimators

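To help users work with its many classification algorithms, scikit-learn wraps them in what it calls estimators. An estimator exposes two main functions, fit() and predict(), which correspond to the training step and the testing step. Below we look at the nearest-neighbor algorithm in scikit-learn.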
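As a minimal sketch of the estimator interface, the snippet below trains a nearest-neighbor classifier on a tiny hand-made dataset and then predicts two new points. The toy values and the names X_toy and y_toy are made up for illustration and are not from the book.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical, for illustration only): two well-separated clusters.
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                  [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

estimator = KNeighborsClassifier(n_neighbors=3)
estimator.fit(X_toy, y_toy)  # training step

# Testing step: one point near each cluster.
predictions = estimator.predict(np.array([[0.05, 0.05], [1.0, 1.0]]))
print(predictions)  # [0 1]
```

Every scikit-learn classifier follows the same fit/predict pattern, so swapping in a different algorithm usually only changes the line that constructs the estimator.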

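Dataset: the Ionosphere dataset, available from the UCI Machine Learning Repository. Each row has 35 values: the first 34 are measurements collected by the antennas, and the last is either "g" or "b", indicating whether the reading is good or bad.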

First, import the numpy and csv libraries, load the dataset, and create the NumPy arrays X and y to hold the features and the class labels.
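The original code screenshot is not available, so here is one way the loading step could look. The helper name load_ionosphere and the path data/ionosphere.data are assumptions; point the path at wherever you saved the UCI file.

```python
import csv
import os

import numpy as np


def load_ionosphere(path):
    """Read an Ionosphere-style CSV: 34 numeric values followed by 'g' or 'b'."""
    features, labels = [], []
    with open(path, newline="") as input_file:
        for row in csv.reader(input_file):
            if not row:  # skip blank lines, e.g. a trailing newline
                continue
            features.append([float(value) for value in row[:-1]])
            labels.append(row[-1] == "g")  # True for "good", False for "bad"
    return np.array(features, dtype=float), np.array(labels, dtype=bool)


# Assumed location of the downloaded file; adjust as needed.
data_filename = os.path.join("data", "ionosphere.data")
if os.path.exists(data_filename):
    X, y = load_ionosphere(data_filename)
    print(X.shape, y.shape)
```

Storing the label as a boolean (True for "g") keeps y numeric, which is convenient when comparing predictions later.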
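Next, split the dataset into a training set and a test set, fit the K-nearest-neighbors classifier from sklearn on the training set, and evaluate it on the test set.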
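A sketch of the split-train-test step follows. To keep it runnable without the data file, it uses synthetic data from make_classification with the same shape as Ionosphere (351 rows, 34 features); that stand-in and random_state=14 are assumptions, so the printed accuracy will not match the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): synthetic, shaped like Ionosphere (351 x 34).
X, y = make_classification(n_samples=351, n_features=34, random_state=14)

# Hold out part of the data for testing (scikit-learn's default split).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

estimator = KNeighborsClassifier()  # default n_neighbors=5
estimator.fit(X_train, y_train)
y_predicted = estimator.predict(X_test)

# Accuracy: the share of test samples whose predicted label is correct.
accuracy = np.mean(y_test == y_predicted) * 100
print(f"Accuracy: {accuracy:.1f}%")
```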

2.2 Cross-Validation

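In the previous experiment, a test set that happened to be easy would make the algorithm look good, while an unlucky split could make it look poor. Cross-validation addresses the problem of relying on a single test. Split the whole dataset into several parts (folds); then, for each part, use it as the test set, train the algorithm on the remaining parts, and record the score. The recorded scores are then averaged.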

scikit-learn provides a ready-made method for cross-validation.
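The procedure above can be sketched with cross_val_score from sklearn.model_selection. The synthetic stand-in data and the choice of cv=5 are assumptions for illustration; the book's exact settings may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): synthetic, shaped like Ionosphere (351 x 34).
X, y = make_classification(n_samples=351, n_features=34, random_state=14)

estimator = KNeighborsClassifier()

# cv=5 (chosen for illustration): each of the 5 folds is the test set once.
scores = cross_val_score(estimator, X, y, scoring="accuracy", cv=5)
average_accuracy = np.mean(scores) * 100
print(f"Average accuracy: {average_accuracy:.1f}%")
```

cross_val_score returns one score per fold, so the spread of the scores also hints at how sensitive the result is to the particular split.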
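The result is slightly worse than the single-split score, so next we tune the parameters.
We try every number of neighbors (n_neighbors) from 1 to 20 to see which works best.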
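A sketch of that parameter sweep is shown below, again on synthetic stand-in data with cv=5 (both assumptions), so the best value it reports says nothing about the real Ionosphere data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): synthetic, shaped like Ionosphere (351 x 34).
X, y = make_classification(n_samples=351, n_features=34, random_state=14)

avg_scores = []
parameter_values = list(range(1, 21))  # 1 through 20 inclusive
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(estimator, X, y, scoring="accuracy", cv=5)
    avg_scores.append(np.mean(scores))

# Pick the n_neighbors value with the highest mean accuracy.
best_n = parameter_values[int(np.argmax(avg_scores))]
print(f"Best n_neighbors on this data: {best_n}")
```

In a notebook, plt.plot(parameter_values, avg_scores, '-o') with matplotlib.pyplot imported as plt draws the accuracy curve.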
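As an aside, %matplotlib inline is a magic command provided by IPython (the kernel behind Jupyter Notebook), not by Python itself; it embeds matplotlib figures directly in the notebook.
[Figure: accuracy for each value of n_neighbors; image not available]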

2.3 Preprocessing

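To see why preprocessing matters, we first damage the Ionosphere data on purpose. So that the original dataset stays intact, we make a copy of X and, in every other row, divide the value of the second feature by 10. We then compute the accuracy directly on the damaged copy.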
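The sketch below follows the description in the text literally: every other row, second feature, divided by 10. It runs on synthetic stand-in data (an assumption), so the size of any accuracy change here is not representative of the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): synthetic, shaped like Ionosphere (351 x 34).
X, y = make_classification(n_samples=351, n_features=34, random_state=14)

X_broken = np.array(X)   # np.array copies, so X itself stays untouched
X_broken[::2, 1] /= 10   # rows 0, 2, 4, ...; second feature (index 1)

estimator = KNeighborsClassifier()
original_scores = cross_val_score(estimator, X, y, scoring="accuracy", cv=5)
broken_scores = cross_val_score(estimator, X_broken, y, scoring="accuracy", cv=5)
print(f"Original: {np.mean(original_scores) * 100:.1f}%")
print(f"Broken:   {np.mean(broken_scores) * 100:.1f}%")
```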
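As expected, the accuracy drops. Next we use the MinMaxScaler class for feature-wise normalization, that is, x' = (x - x_min) / (x_max - x_min), computed separately for each feature, which rescales every feature to the range 0 to 1. We then call fit_transform() on the MinMaxScaler preprocessor, which fits it to the data and applies the transformation in one step.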
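A minimal sketch of the rescaling step, assuming the damaged copy is built as described in the text and using synthetic stand-in data with cv=5 (both assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): synthetic, shaped like Ionosphere (351 x 34).
X, y = make_classification(n_samples=351, n_features=34, random_state=14)
X_broken = np.array(X)
X_broken[::2, 1] /= 10   # same deliberate damage as described in the text

# Learn each feature's min and max, then rescale every column to [0, 1].
X_transformed = MinMaxScaler().fit_transform(X_broken)

estimator = KNeighborsClassifier()
transformed_scores = cross_val_score(
    estimator, X_transformed, y, scoring="accuracy", cv=5
)
print(f"After scaling: {np.mean(transformed_scores) * 100:.1f}%")
```

Nearest-neighbor methods rely on distances, so features with larger numeric ranges dominate unless all features are brought onto a common scale.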
The accuracy goes back up.

2.4 Pipelines

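As experiments accumulate, the workflow gets more complex: features go through various operations, and keeping track of them all is not easy. Pipelines address this. A pipeline takes a sequence of data mining steps as input; the last step must be an estimator and the earlier steps are transformers. The input dataset passes through each transformer in turn, with the output of one step becoming the input of the next, and the final estimator then classifies the data.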

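Here the pipeline has two steps:
(1) use MinMaxScaler to scale feature values to the range 0 to 1;
(2) specify the KNeighborsClassifier classifier.
Each step is written as a ('name', step) tuple.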

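The two steps can be sketched with sklearn.pipeline.Pipeline as follows. The step names "scale" and "predict" are arbitrary labels chosen here, and the data is a synthetic stand-in damaged as described in section 2.3 (both assumptions).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): synthetic, shaped like Ionosphere (351 x 34).
X, y = make_classification(n_samples=351, n_features=34, random_state=14)
X_broken = np.array(X)
X_broken[::2, 1] /= 10   # same deliberate damage as in section 2.3

# Each step is a ('name', step) tuple; the last step is the estimator.
scaling_pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("predict", KNeighborsClassifier()),
])

scores = cross_val_score(scaling_pipeline, X_broken, y, scoring="accuracy", cv=5)
print(f"Pipeline accuracy: {np.mean(scores) * 100:.1f}%")
```

One design point: when a pipeline is cross-validated, the scaler is fitted on each training fold only, so no information from the test fold leaks into the preprocessing.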
The result is the same as before.

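In this chapter it makes no visible difference whether we use a pipeline or not, but it becomes essential in later chapters that use more advanced testing methods.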
