A decision tree is a supervised machine learning method. There is a scene in the movie "Inglourious Basterds" where people in a pub play Twenty Questions: one player draws a card naming a target (it may be a person or a thing), the others ask questions, and the card holder may only answer "yes" or "no". By gradually narrowing the scope over at most twenty questions, the players pin down the exact answer. This is similar to how a decision tree works. (Figure 1) shows a way of judging the category of an email. As can be seen, the judging procedure is very simple: it is nothing but a series of threshold tests. The key question is how to build the decision tree, that is, how to train it.

(Figure 1)

The pseudocode for building a decision tree is as follows:

 Check if every item in the dataset is in the same class:
     If so return the class label
     Else
         find the best feature to split the data
         split the dataset
         create a branch node
         for each split
             call createBranch and add the result to the branch node
         return branch node

There is only one guiding principle: make the labels at each node as uniform as possible. Note the line "find the best feature to split the data" in the pseudocode above; how do we find that best feature? The general rule is to make each branch after the split as pure as possible, that is, to maximize the accuracy of the division. As shown in (Figure two), we have 5 animals from the sea and want to determine whether each of them is a fish. Which feature should we test first?

(Figure two)
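The data in Figure two can be written out as Python lists, following the createDataSet function from Machine Learning in Action (the two features are "can survive without surfacing" and "has flippers"; 1 means yes, 0 means no):

```python
def createDataSet():
    # each row: [no surfacing, flippers, class label]
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']  # feature names, in column order
    return dataSet, labels
```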

To improve the accuracy, should we first test "can survive without surfacing" or "has flippers"? We need a criterion for this choice; commonly used ones are information entropy (from information theory) and Gini impurity, and the former is used here. Our goal is to select the feature whose split yields the largest information gain on the labels. The information gain is the entropy of the labels in the original dataset minus the entropy of the labels in the split datasets; in other words, the information gain is the decrease in entropy, which makes the dataset more ordered. The entropy is calculated as shown in (Formula 1):

    H = -sum_i p(x_i) * log2 p(x_i)
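As a quick sanity check of the formula, here is the entropy of the Figure two labels worked out by hand in Python 3 (2 of the 5 animals are fish, 3 are not):

```python
from math import log2

# class probabilities for the Figure two labels: 2 'yes', 3 'no' out of 5
p_yes, p_no = 2 / 5, 3 / 5
entropy = -(p_yes * log2(p_yes) + p_no * log2(p_no))
print(round(entropy, 3))  # 0.971
```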

With the guiding principle settled, we move into the coding phase. First, look at the code for calculating entropy:

 from math import log

 def calcShannonEnt(dataSet):
     numEntries = len(dataSet)
     labelCounts = {}
     for featVec in dataSet:
         # count the number of occurrences of each class label (the last column)
         currentLabel = featVec[-1]
         if currentLabel not in labelCounts.keys():
             labelCounts[currentLabel] = 0
         labelCounts[currentLabel] += 1
     shannonEnt = 0.0
     for key in labelCounts:
         prob = float(labelCounts[key]) / numEntries
         shannonEnt -= prob * log(prob, 2)  # log base 2
     return shannonEnt
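For comparison, a more compact equivalent sketch in Python 3 using collections.Counter, applied to the Figure two data (function name shannon_entropy is my own):

```python
from collections import Counter
from math import log2

def shannon_entropy(dataset):
    # count class labels (last column) and sum -p * log2(p) over classes
    counts = Counter(row[-1] for row in dataset)
    total = len(dataset)
    return -sum((c / total) * log2(c / total) for c in counts.values())

fish_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
             [0, 1, 'no'], [0, 1, 'no']]
print(shannon_entropy(fish_data))  # ≈ 0.971
```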

With the entropy calculation in hand, next look at the code that selects the feature with the largest information gain:

 def splitDataSet(dataSet, axis, value):
     retDataSet = []
     for featVec in dataSet:
         if featVec[axis] == value:
             reducedFeatVec = featVec[:axis]  # chop out the axis used for splitting
             reducedFeatVec.extend(featVec[axis+1:])
             retDataSet.append(reducedFeatVec)
     return retDataSet

 def chooseBestFeatureToSplit(dataSet):
     numFeatures = len(dataSet[0]) - 1  # the last column is used for the labels
     baseEntropy = calcShannonEnt(dataSet)
     bestInfoGain = 0.0; bestFeature = -1
     for i in range(numFeatures):  # iterate over all the features
         featList = [example[i] for example in dataSet]  # all values of this feature
         uniqueVals = set(featList)  # get a set of unique values
         newEntropy = 0.0
         for value in uniqueVals:
             subDataSet = splitDataSet(dataSet, i, value)
             prob = len(subDataSet) / float(len(dataSet))
             newEntropy += prob * calcShannonEnt(subDataSet)
         infoGain = baseEntropy - newEntropy  # info gain = reduction in entropy
         if infoGain > bestInfoGain:  # compare to the best gain so far
             bestInfoGain = infoGain  # if better than current best, set to best
             bestFeature = i
     return bestFeature  # returns the index of the best feature
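On the Figure two data, the first feature ("no surfacing") should win, since it produces the purer split. A self-contained Python 3 sketch of the same selection logic (names entropy/split/best_feature are my own) confirms this:

```python
from collections import Counter
from math import log2

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def split(rows, axis, value):
    # keep rows where column `axis` equals `value`, dropping that column
    return [r[:axis] + r[axis+1:] for r in rows if r[axis] == value]

def best_feature(rows):
    base = entropy(rows)
    best_gain, best_idx = -1.0, -1
    for i in range(len(rows[0]) - 1):
        # weighted entropy of the subsets produced by splitting on feature i
        new_entropy = sum(len(s) / len(rows) * entropy(s)
                          for s in (split(rows, i, v)
                                    for v in set(r[i] for r in rows)))
        if base - new_entropy > best_gain:
            best_gain, best_idx = base - new_entropy, i
    return best_idx

fish = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(best_feature(fish))  # 0 -> "no surfacing" has the larger information gain
```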

From the last if statement it can be seen that the feature with the largest information gain is selected as the splitting feature. Now that we have a splitting criterion, we continue to the next step: building the decision tree. Following the pseudocode at the top, we recursively split the dataset until the recursion terminates, producing the tree. The code is as follows:


 import operator

 def majorityCnt(classList):
     classCount = {}
     for vote in classList:
         if vote not in classCount.keys():
             classCount[vote] = 0
         classCount[vote] += 1
     sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
     return sortedClassCount[0][0]

 def createTree(dataSet, labels):
     classList = [example[-1] for example in dataSet]
     if classList.count(classList[0]) == len(classList):
         return classList[0]  # stop splitting when all of the classes are equal
     if len(dataSet[0]) == 1:  # stop splitting when there are no more features
         return majorityCnt(classList)
     bestFeat = chooseBestFeatureToSplit(dataSet)
     bestFeatLabel = labels[bestFeat]
     myTree = {bestFeatLabel: {}}
     del(labels[bestFeat])
     featValues = [example[bestFeat] for example in dataSet]
     uniqueVals = set(featValues)
     for value in uniqueVals:
         subLabels = labels[:]  # copy labels, so recursion doesn't mess up the existing list
         myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
     return myTree
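To see the recursion end to end, here is a self-contained Python 3 sketch of the same tree-building idea run on the Figure two data (function names are my own; the resulting nested dict matches the tree from Machine Learning in Action):

```python
from collections import Counter
from math import log2

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def split(rows, axis, value):
    return [r[:axis] + r[axis+1:] for r in rows if r[axis] == value]

def best_feature(rows):
    base = entropy(rows)
    best_gain, best_idx = -1.0, -1
    for i in range(len(rows[0]) - 1):
        new_entropy = sum(len(s) / len(rows) * entropy(s)
                          for s in (split(rows, i, v)
                                    for v in set(r[i] for r in rows)))
        if base - new_entropy > best_gain:
            best_gain, best_idx = base - new_entropy, i
    return best_idx

def create_tree(rows, labels):
    classes = [r[-1] for r in rows]
    if classes.count(classes[0]) == len(classes):
        return classes[0]                      # all rows share one class: leaf
    if len(rows[0]) == 1:                      # no features left: majority vote
        return Counter(classes).most_common(1)[0][0]
    best = best_feature(rows)
    label = labels[best]
    rest = labels[:best] + labels[best+1:]     # copy so recursion is safe
    return {label: {v: create_tree(split(rows, best, v), rest)
                    for v in set(r[best] for r in rows)}}

fish = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(create_tree(fish, ['no surfacing', 'flippers']))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```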

The decision tree constructed from the samples in Figure two is shown in (Figure three):

(Figure three)

With the decision tree built, we can use it to classify new samples. The classification code is as follows:

 def classify(inputTree, featLabels, testVec):
     firstStr = inputTree.keys()[0]
     secondDict = inputTree[firstStr]
     featIndex = featLabels.index(firstStr)  # translate the feature name to an index
     key = testVec[featIndex]
     valueOfFeat = secondDict[key]
     if isinstance(valueOfFeat, dict):  # an internal node: keep descending
         classLabel = classify(valueOfFeat, featLabels, testVec)
     else:  # a leaf node: return its label
         classLabel = valueOfFeat
     return classLabel
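A compact Python 3 sketch of the same traversal, run against the tree from Figure three (the one-liner descent is my own phrasing of the book's logic):

```python
def classify(tree, feat_labels, test_vec):
    # walk down the nested-dict tree until a leaf (a plain label) is reached
    feature = next(iter(tree))  # root feature name (the dict's single key)
    subtree = tree[feature][test_vec[feat_labels.index(feature)]]
    return classify(subtree, feat_labels, test_vec) if isinstance(subtree, dict) else subtree

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(classify(tree, ['no surfacing', 'flippers'], [1, 1]))  # 'yes'
print(classify(tree, ['no surfacing', 'flippers'], [1, 0]))  # 'no'
```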

Finally, the code to serialize the decision tree (that is, store the trained model on disk) using pickle:

 def storeTree(inputTree, filename):
     import pickle
     fw = open(filename, 'w')
     pickle.dump(inputTree, fw)
     fw.close()

 def grabTree(filename):
     import pickle
     fr = open(filename)
     return pickle.load(fr)
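Note that the book's code above is Python 2; under Python 3, pickle requires the files to be opened in binary mode. A round-trip sketch (names store_tree/grab_tree are my own):

```python
import os
import pickle
import tempfile

def store_tree(tree, filename):
    with open(filename, 'wb') as fw:  # binary mode is required in Python 3
        pickle.dump(tree, fw)

def grab_tree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
path = os.path.join(tempfile.mkdtemp(), 'tree.pkl')
store_tree(tree, path)
print(grab_tree(path) == tree)  # True
```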

Advantages: once built, the tree classifies new samples very quickly.

Disadvantages: decision trees overfit easily; pruning can be used to avoid this.

Reference: Machine Learning in Action

The above is the whole content of this article. I hope it helps your study, and I hope you will continue to support Script Home.


When reprinting, please cite: Python Machine Learning Theory and Practice (2): Decision Trees | Script Home
