Machine Learning Notes: C4.5

#coding=utf-8
#Attributes are assigned per class in advance; the data structure is
#[[[attr1,attr2,...],label],...(sample n)]
#ID3: compare information gains and pick the largest; a larger gain means a
#lower entropy after the split, i.e. a more significant split (the larger the
#probability ratio, the more the attribute influences the outcome)
"""
上面为了简便,将特征属性离散化了,其实日志密度和好友密度都是连续的属性。
对于特征属性为连续值,可以如此使用ID3算法:
先将D中元素按照特征属性排序,则每两个相邻元素的中间点可以看做潜在分裂点,
从第一个潜在分裂点开始,分裂D并计算两个集合的期望信息,
具有最小期望信息的点称为这个属性的最佳分裂点,
其信息期望作为此属性的信息期望

关于是否需要去重复
视情况而定

---**C4.5考虑了特征过多的影响**----
C4.5算法:

#这里 对决策树增加一步预处理,当某一label的输入data列全为一致即 len(set([]))==1,则对该列进行排除,因为提供不了分类信息

或者将 该列的增益计算直接变为-100,变为最低,不去选择

G=G0-G1

S=G/H

1、同一label 不同属性 纯度越大 G1越小,G越大

2、同一label 不确定性(样本数量分布越均匀)越大 H越大,越不倾向选择


3、考虑其他细节,如果遇到缺失值,可以按照
 取该类不缺失的样本代表全局 而不是进行全部信息都有才保留的选择
。
"""
import math
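Point 3 of the note above (missing values) admits a simple concrete reading, sketched here with an illustrative helper that is not part of the original code: estimate from the non-missing entries of a column rather than discarding every incomplete sample.

```python
def fillMissing(column):
      #replace each missing entry (None) with the majority non-missing value
      known=[v for v in column if v is not None]
      major=max(set(known),key=known.count)
      return [major if v is None else v for v in column]
```

For example, `fillMissing(['sd',None,'sd','ld'])` gives `['sd','sd','sd','ld']`.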
from drawTree import DrawTree

class ID3:
      def __init__(self,data,labels,inpu):
            self.data=data
            self.labels=labels
            self.inpu=inpu
            self.tree=self.getTree(data,labels,'begin')
            self.getR(self.tree)

      def getTree(self,dataSet,labels,st):
            re=[dataSet[i][1] for i in range(len(dataSet))]
            #1. If every sample has the same class, len(set(re))==1:
            #   classification is complete and no further recursion is needed.
            #2. If no labels remain (len(labels)==0), the class cannot be
            #   determined.
            if len(set(re))==1:
                  return list(set(re))[0]
            elif len(labels)==1:
                  #Only one label left: map each attribute value to its
                  #majority class.
                  dicc={}
                  dd=[dataSet[j][0][0] for j in range(len(dataSet))]
                  res=list(set(re))
                  for i in range(len(list(set(dd)))):
                        ma=-99
                        index=0
                        for k in range(len(res)):
                              a=dataSet.count([[list(set(dd))[i]],res[k]])
                              if a>ma:
                                    ma=a
                                    index=res[k]
                        dicc[list(set(dd))[i]]=index
                  return dicc

            else:
                  dat=[[dataSet[j][0][i] for j in range(len(dataSet))]\
                       for i in range(len(dataSet[0][0]))]
                  s0=self.calS(0,re)
                  ma=-99
                  index=0
                  for k in range(len(dataSet[0][0])):
                        #C4.5 gain ratio: S=(G0-G1)/H, where G0 is the class
                        #entropy, G1 the conditional entropy given attribute k,
                        #and H the attribute's own entropy (split information)
                        s1,s2=self.calS(dat[k],re)
                        if s2==0:
                              #constant column: provides no class information
                              s=-100
                        else:
                              s=(s0-s1)/s2
                        if s>ma:
                              ma=s
                              index=k
                  best=labels[index]
                  tree={labels[index]:{}}
                  nex=list(set(dat[index]))
                  labels=labels[0:index]+labels[index+1:]
                  for i in range(len(nex)):
                        data=[]
                        for j in range(len(dat[index])):
                              if dat[index][j]==nex[i]:
                                    li=[dataSet[j][0],re[j]]
                                    data.append(li)
                        #drop the chosen attribute's column, split on its
                        #values, and recurse on each subset
                        data=[[data[k][0][0:index]+data[k][0][index+1:],\
                               data[k][1]] for k in range(len(data))]
                        st=nex[i]
                        subTree=self.getTree(data,labels,st)
                        tree[best][nex[i]]=subTree
            return tree


      def getR(self,tree):
            #follow the input sample's attribute values down the tree
            idx=self.inpu[self.getIndex(list(tree.keys())[0])]
            try:
                  tree=tree[list(tree.keys())[0]][idx]
                  if isinstance(tree,dict):
                        self.getR(tree)
                  else:
                        self.result=tree
            except KeyError:
                  tree=tree[list(tree.keys())[0]]
                  self.result=tree


      def getIndex(self,ll):
            #position of label ll in the original label list
            for i in range(len(self.labels)):
                  if ll==self.labels[i]:
                        return i

      def calS(self,liA,liB):
            S=0
            if liA==0:
                  #entropy of the class list liB
                  res=list(set(liB))
                  for i in range(len(res)):
                        pi=1.0*liB.count(res[i])/len(liB)
                        if pi!=0:
                              S+=-1.0*pi*math.log(pi,2)
                  return S
            else:
                  #S1: conditional entropy of liB given liA
                  #S2: entropy of liA itself (split information)
                  S1=0
                  S2=0
                  res=list(set(liB))
                  liAS=list(set(liA))
                  inp=[[] for i in range(len(liAS))]
                  for i in range(len(liA)):
                        for j in range(len(liAS)):
                              if liA[i]==liAS[j]:
                                    inp[j].append(i)
                  for i in range(len(inp)):
                        p1=1.0*len(inp[i])/len(liA)
                        ansB=[liB[inp[i][j]] for j in range(len(inp[i]))]
                        S2+=-1.0*p1*math.log(p1,2)
                        for k in range(len(res)):
                              pi=1.0*ansB.count(res[k])/len(ansB)
                              if pi!=0:
                                    S1+=-1.0*p1*pi*math.log(pi,2)
                  return S1,S2
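
The C4.5 gain-ratio criterion from the note at the top (S = G/H with G = G0 - G1) can also be checked with a standalone sketch; `gainRatio` is an illustrative helper, not part of the class:

```python
import math

def gainRatio(attr,classes):
      def H(xs):
            #entropy of a list of symbols
            s=0.0
            for x in set(xs):
                  p=1.0*xs.count(x)/len(xs)
                  s-=p*math.log(p,2)
            return s
      g0=H(classes)                    #class entropy G0
      g1=0.0                           #conditional entropy G1
      for v in set(attr):
            sub=[classes[i] for i in range(len(attr)) if attr[i]==v]
            g1+=1.0*len(sub)/len(attr)*H(sub)
      h=H(attr)                        #split information H
      if h==0:
            return -100                #constant attribute: never choose it
      return (g0-g1)/h
```

A perfectly informative attribute gives a ratio of 1.0, e.g. `gainRatio(['a','a','b','b'],['y','y','n','n'])`; a constant attribute returns -100, matching the preprocessing rule in the note.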





#Training data, in the format described above
data=[
            [['sd','ss','np'],'n'],
            [['sd','ls','yp'],'y'],
            [['ld','ms','yp'],'y'],
            [['md','ms','yp'],'y'],
            [['ld','ms','yp'],'y'],
            [['md','ls','np'],'y'],
            [['md','ss','np'],'n'],
            [['ld','ms','np'],'y'],
            [['md','ss','np'],'y'],
            [['sd','ss','yp'],'n']
      ]

inpu=['sd','ls','yp']
labels=['daily','friends','photo']
id3=ID3(data,labels,inpu)

print('tree:',id3.tree)
print('inpu:',id3.result)


draw=DrawTree(id3.tree)
draw.showSave('12.jpg')



