The task in this lab is to take a loan-approval dataset and write code that trains a model, so the computer learns to decide on its own whether an applicant should get a loan. We'll use two classic criteria, ID3 and the Gini index, and see what each of them is really about.
Background: playing AI loan officer
Imagine a bank that receives a pile of loan applications every day; reviewing each one by hand would be painfully slow. Our goal is to automate the process with machine learning. We have 16 historical records that have already been reviewed, each listing the applicant's age group, whether they have a job, whether they own a house, their credit rating, and whether the loan was ultimately approved.
What does the data look like?
Training set (dataset.csv): 16 records, used to teach the model.
Test set (testset.csv): 7 new records, used to quiz the model on what it has learned.
Features (already encoded as integers):
Age group: 0 (young), 1 (middle-aged), 2 (senior)
Has a job: 0 (no), 1 (yes)
Owns a house: 0 (no), 1 (yes)
Credit rating: 0 (fair), 1 (good), 2 (excellent)
Label: whether the loan was approved, 0 (no), 1 (yes)
A crash course in theory: what is a decision tree actually thinking?
A decision tree basically mimics how a person makes a decision. Say you're deciding whether to go out: you might first ask "what's the weather like?", then, if it's raining, ask "did I bring an umbrella?", and finally decide to go out or stay home.
A decision tree does the same thing. It starts from all the data (the root node), finds the best question to ask (the optimal feature), and splits the data into groups. It then repeats the process on each group, and once a group is "pure" enough (say, everyone in it got the loan), it stops.
So how do we find the "best question"? ID3 and the Gini index are two different yardsticks for that.
ID3: the bigger the information gain, the better
ID3 is built around information entropy. Intuitively, entropy measures how messy a set is. If half the records in a set say "approve" and half say "reject", the set is very mixed and its entropy is high; if every record says "approve", the set is perfectly tidy and its entropy is 0.
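In symbols (the standard textbook definition, where $p_k$ is the fraction of samples in dataset $D$ belonging to class $k$):

$$H(D) = -\sum_{k} p_k \log_2 p_k$$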
Information gain is simply how much that overall messiness drops after we split the data on some feature (say, "owns a house"). The bigger the drop, the more useful the feature, so we split on the feature with the largest information gain.
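Written out (again the standard definition), the gain from splitting $D$ on a feature $A$ is

$$\operatorname{Gain}(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|}\, H(D_v)$$

where $D_v$ is the subset of $D$ in which $A$ takes the value $v$.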
Gini index: the lower the impurity, the better
The idea behind the Gini index is even more direct. Gini impurity is the probability that two samples drawn at random from a set belong to different classes; the higher that probability, the less pure the set.
Just as with ID3, we pick the feature whose split lowers the Gini impurity the most, i.e. the one with the largest Gini gain.
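The matching formulas, using the same notation as above, are:

$$\operatorname{Gini}(D) = 1 - \sum_{k} p_k^2, \qquad \operatorname{GiniGain}(D, A) = \operatorname{Gini}(D) - \sum_{v} \frac{|D_v|}{|D|}\, \operatorname{Gini}(D_v)$$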
Writing the code
With the theory out of the way, it's time to code. Let's break the whole process down and go step by step.
Basic tools: functions that measure purity
First we turn the formulas from the theory section into code; think of it as laying out the nuts and bolts needed to build the tree. We need four functions: two that measure purity (entropy and Gini impurity) and two that measure how much purity improves after a split (information gain and Gini gain).
```python
import numpy as np


def calculate_entropy(y):
    # Entropy of a label array: -sum(p * log2(p)) over the observed classes.
    counts = np.bincount(y)
    probabilities = counts / len(y)
    probabilities = probabilities[probabilities > 0]  # drop zero counts to avoid log2(0)
    entropy = -np.sum(probabilities * np.log2(probabilities))
    return entropy


def calculate_information_gain(X, y, feature_index):
    # Entropy before the split minus the weighted entropy of each subset after it.
    total_entropy = calculate_entropy(y)
    feature_values = np.unique(X[:, feature_index])
    weighted_entropy = 0
    for value in feature_values:
        subset_indices = np.where(X[:, feature_index] == value)[0]
        subset_entropy = calculate_entropy(y[subset_indices])
        weighted_entropy += (len(subset_indices) / len(y)) * subset_entropy
    return total_entropy - weighted_entropy


def calculate_gini_impurity(y):
    # Gini impurity: 1 - sum(p^2) over the classes.
    counts = np.bincount(y)
    probabilities = counts / len(y)
    gini = 1 - np.sum(probabilities**2)
    return gini


def calculate_gini_gain(X, y, feature_index):
    # Gini impurity before the split minus the weighted impurity after it.
    total_gini = calculate_gini_impurity(y)
    feature_values = np.unique(X[:, feature_index])
    weighted_gini = 0
    for value in feature_values:
        subset_indices = np.where(X[:, feature_index] == value)[0]
        subset_gini = calculate_gini_impurity(y[subset_indices])
        weighted_gini += (len(subset_indices) / len(y)) * subset_gini
    return total_gini - weighted_gini
```
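As a quick sanity check (made-up toy label arrays, not the lab data), the helpers behave exactly as the theory says: a 50/50 set is maximally messy, and a single-class set scores zero on both measures.

```python
import numpy as np

y_mixed = np.array([0, 0, 1, 1])  # half rejected, half approved
y_pure = np.array([1, 1, 1, 1])   # everyone approved

print(calculate_entropy(y_mixed))        # 1.0  (maximum entropy for two classes)
print(calculate_entropy(y_pure))         # -0.0, i.e. zero
print(calculate_gini_impurity(y_mixed))  # 0.5  (maximum Gini impurity for two classes)
print(calculate_gini_impurity(y_pure))   # 0.0
```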
The skeleton: the DecisionTree class
Next we set up the frame of a DecisionTree class. This is the tree's "brain": it knows which splitting criterion to use (criterion) and it drives the whole process of building the tree (fit) and making predictions (predict).
```python
class DecisionTree:
    def __init__(self, criterion='entropy', max_depth=None):
        self.criterion = criterion  # 'entropy' for ID3, otherwise Gini gain is used
        self.max_depth = max_depth  # optional depth limit
        self.tree = None            # nested dict built by fit()

    def fit(self, X, y):
        self.tree = self._build_tree(X, y, 0)

    def predict(self, X):
        return [self._predict_single(x, self.tree) for x in X]
```
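Once the recursive helpers in the next step are filled in, using the class is just construct, fit, predict. Here is a minimal sketch with toy arrays invented purely for illustration (the real script loads the CSV files instead):

```python
import numpy as np

# Toy data made up for this example: 3 applicants, 4 features, binary labels.
X_toy = np.array([[0, 0, 1, 2],
                  [1, 1, 0, 1],
                  [2, 0, 1, 0]])
y_toy = np.array([1, 1, 0])

model = DecisionTree(criterion='gini', max_depth=3)  # or criterion='entropy' for ID3
model.fit(X_toy, y_toy)
print(model.predict(X_toy))  # predicts the training labels back: 1, 1, 0
```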
The core logic: building the tree recursively
This is the key step, the _build_tree function. It's recursive, meaning it calls itself to grow the tree level by level.
Its logic is:
First, decide whether to stop: if the current group of data is already pure (everyone has the same label), or some other stopping condition holds (for example, no features are left or the depth limit is reached), return a result and stop splitting.
Find the best feature: using the gain functions written above, go through every feature and pick the one that leaves the data purest.
Split: partition the data by the values of that best feature, then repeat the whole process on each part (by calling itself) to build the subtrees.
```python
    def _get_best_split(self, X, y):
        # Try every feature and keep the one with the largest gain under the chosen criterion.
        best_gain = 0  # only accept splits that actually improve purity
        best_feature_index = -1
        for i in range(X.shape[1]):
            if self.criterion == 'entropy':
                gain = calculate_information_gain(X, y, i)
            else:
                gain = calculate_gini_gain(X, y, i)
            if gain > best_gain:
                best_gain = gain
                best_feature_index = i
        return best_feature_index

    def _build_tree(self, X, y, depth):
        # Stop if the node is already pure.
        if len(np.unique(y)) == 1:
            return y[0]
        # Stop if there are no features or the depth limit has been reached.
        if X.shape[1] == 0 or (self.max_depth is not None and depth == self.max_depth):
            return Counter(y).most_common(1)[0][0]
        # Otherwise, find the best feature to split on.
        best_feature_index = self._get_best_split(X, y)
        if best_feature_index == -1:  # no feature gives a useful split
            return Counter(y).most_common(1)[0][0]
        # Split on that feature and recurse into each branch.
        tree = {best_feature_index: {}}
        feature_values = np.unique(X[:, best_feature_index])
        for value in feature_values:
            subset_indices = np.where(X[:, best_feature_index] == value)[0]
            subtree = self._build_tree(X[subset_indices], y[subset_indices], depth + 1)
            tree[best_feature_index][value] = subtree
        return tree
```
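Note that the "tree" this builds is nothing fancy, just a nested dict keyed first by feature index and then by feature value, with class labels at the leaves. The tree the lab data ends up producing (see the run results below) looks like this in dict form:

```python
# Root splits on feature 2 (owns a house); the 0-branch then splits on feature 1 (has a job).
tree = {2: {0: {1: {0: 0, 1: 1}},
            1: 1}}
```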
Final assembly: the complete code
Now we put all the pieces together, add the parts that load the data, run the two models, and print the results, and we have the finished script.
```python
import numpy as np
import pandas as pd
from collections import Counter


def calculate_entropy(y):
    counts = np.bincount(y)
    probabilities = counts / len(y)
    probabilities = probabilities[probabilities > 0]
    entropy = -np.sum(probabilities * np.log2(probabilities))
    return entropy


def calculate_information_gain(X, y, feature_index):
    total_entropy = calculate_entropy(y)
    feature_values = np.unique(X[:, feature_index])
    weighted_entropy = 0
    for value in feature_values:
        subset_indices = np.where(X[:, feature_index] == value)[0]
        subset_entropy = calculate_entropy(y[subset_indices])
        weighted_entropy += (len(subset_indices) / len(y)) * subset_entropy
    return total_entropy - weighted_entropy


def calculate_gini_impurity(y):
    counts = np.bincount(y)
    probabilities = counts / len(y)
    gini = 1 - np.sum(probabilities**2)
    return gini


def calculate_gini_gain(X, y, feature_index):
    total_gini = calculate_gini_impurity(y)
    feature_values = np.unique(X[:, feature_index])
    weighted_gini = 0
    for value in feature_values:
        subset_indices = np.where(X[:, feature_index] == value)[0]
        subset_gini = calculate_gini_impurity(y[subset_indices])
        weighted_gini += (len(subset_indices) / len(y)) * subset_gini
    return total_gini - weighted_gini


class DecisionTree:
    def __init__(self, criterion='entropy', max_depth=None):
        self.criterion = criterion
        self.max_depth = max_depth
        self.tree = None

    def _get_best_split(self, X, y):
        best_gain = 0  # only accept splits that actually improve purity
        best_feature_index = -1
        for i in range(X.shape[1]):
            if self.criterion == 'entropy':
                gain = calculate_information_gain(X, y, i)
            else:
                gain = calculate_gini_gain(X, y, i)
            if gain > best_gain:
                best_gain = gain
                best_feature_index = i
        return best_feature_index

    def _build_tree(self, X, y, depth):
        if len(np.unique(y)) == 1:
            return y[0]
        if X.shape[1] == 0 or (self.max_depth is not None and depth == self.max_depth):
            return Counter(y).most_common(1)[0][0]
        best_feature_index = self._get_best_split(X, y)
        if best_feature_index == -1:
            return Counter(y).most_common(1)[0][0]
        tree = {best_feature_index: {}}
        feature_values = np.unique(X[:, best_feature_index])
        for value in feature_values:
            subset_indices = np.where(X[:, best_feature_index] == value)[0]
            subtree = self._build_tree(X[subset_indices], y[subset_indices], depth + 1)
            tree[best_feature_index][value] = subtree
        return tree

    def fit(self, X, y):
        self.tree = self._build_tree(X, y, 0)

    def _predict_single(self, x, tree):
        # Walk down the nested dict until a leaf (a plain class label) is reached.
        if not isinstance(tree, dict):
            return tree
        feature_index = list(tree.keys())[0]
        branches = tree[feature_index]
        value = x[feature_index]
        if value in branches:
            return self._predict_single(x, branches[value])
        return None  # feature value never seen during training

    def predict(self, X):
        return [self._predict_single(x, self.tree) for x in X]


def print_tree(tree, indent=""):
    if not isinstance(tree, dict):
        print(indent + "Predict:", tree)
        return
    feature_index, branches = next(iter(tree.items()))
    print(indent + f"Feature {feature_index}:")
    for value, subtree in branches.items():
        print(indent + f"  Value {value}:")
        print_tree(subtree, indent + "    ")


def calculate_accuracy(y_true, y_pred):
    correct = np.sum(y_true == y_pred)
    return correct / len(y_true)


if __name__ == '__main__':
    train_data = pd.read_csv('dataset.csv', header=None)
    test_data = pd.read_csv('testset.csv', header=None)
    X_train = train_data.iloc[:, :-1].values
    y_train = train_data.iloc[:, -1].values
    X_test = test_data.iloc[:, :-1].values
    y_test = test_data.iloc[:, -1].values

    print("--- Building decision tree (ID3) ---")
    tree_id3 = DecisionTree(criterion='entropy')
    tree_id3.fit(X_train, y_train)
    print("\nID3 tree structure:")
    print_tree(tree_id3.tree)
    y_pred_id3 = tree_id3.predict(X_test)
    accuracy_id3 = calculate_accuracy(y_test, y_pred_id3)
    print(f"\nID3 model accuracy: {accuracy_id3:.4f}")

    print("\n--- Building decision tree (Gini index) ---")
    tree_gini = DecisionTree(criterion='gini')
    tree_gini.fit(X_train, y_train)
    print("\nGini index tree structure:")
    print_tree(tree_gini.tree)
    y_pred_gini = tree_gini.predict(X_test)
    accuracy_gini = calculate_accuracy(y_test, y_pred_gini)
    print(f"\nGini model accuracy: {accuracy_gini:.4f}")
```
Run results
```text
--- Building decision tree (ID3) ---

ID3 tree structure:
Feature 2:
  Value 0:
    Feature 1:
      Value 0:
        Predict: 0
      Value 1:
        Predict: 1
  Value 1:
    Predict: 1

ID3 model accuracy: 1.0000

--- Building decision tree (Gini index) ---

Gini index tree structure:
Feature 2:
  Value 0:
    Feature 1:
      Value 0:
        Predict: 0
      Value 1:
        Predict: 1
  Value 1:
    Predict: 1

Gini model accuracy: 1.0000
```
Let's analyze the results