The task in this lab is to take a loan-approval dataset and write code that trains a model, so the computer learns to decide on its own whether an applicant should get a loan. We'll use two classic criteria, ID3 and the Gini index, and see what each of them is really about.
Background: playing AI loan officer
Imagine a bank that receives a pile of loan applications every day; reviewing each one by hand would be painfully slow. Our goal is to automate the process with machine learning. We have 16 historical records that have already been reviewed, each listing the applicant's age group, whether they have a job, whether they own a house, their credit rating, and whether the loan was ultimately approved.
What does the data look like?
Training set (dataset.csv): 16 records, used to teach the model.
Test set (testset.csv): 7 new records, used to quiz the model on what it has learned.
Features (already encoded as integers):
Age group: 0 (young), 1 (middle-aged), 2 (senior)
Has a job: 0 (no), 1 (yes)
Owns a house: 0 (no), 1 (yes)
Credit rating: 0 (fair), 1 (good), 2 (excellent)
Label: whether the loan was approved, 0 (no), 1 (yes)
A crash course in theory: what is a decision tree actually thinking?
A decision tree basically mimics how a person makes a decision. Say you're deciding whether to go out: you might first ask "what's the weather like?", then, if it's raining, ask "did I bring an umbrella?", and finally decide to go out or stay home.
A decision tree does the same thing. It starts from all the data (the root node), finds the best question to ask (the optimal feature), and splits the data into groups. It then repeats the process on each group, and once a group is "pure" enough (say, everyone in it got the loan), it stops.
So how do we find the "best question"? ID3 and the Gini index are two different yardsticks for that.
ID3: the bigger the information gain, the better
ID3 is built around information entropy. Intuitively, entropy measures how messy a set is. If half the records in a set say "approve" and half say "reject", the set is very mixed and its entropy is high; if every record says "approve", the set is perfectly tidy and its entropy is 0.
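In symbols (the standard textbook definition, where $p_k$ is the fraction of samples in dataset $D$ belonging to class $k$):

$$H(D) = -\sum_{k} p_k \log_2 p_k$$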
Information gain is simply how much that overall messiness drops after we split the data on some feature (say, "owns a house"). The bigger the drop, the more useful the feature, so we split on the feature with the largest information gain.
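Written out (again the standard definition), the gain from splitting $D$ on a feature $A$ is

$$\operatorname{Gain}(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|}\, H(D_v)$$

where $D_v$ is the subset of $D$ in which $A$ takes the value $v$.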
Gini index: the lower the impurity, the better
The idea behind the Gini index is even more direct. Gini impurity is the probability that two samples drawn at random from a set belong to different classes; the higher that probability, the less pure the set.
Just as with ID3, we pick the feature whose split lowers the Gini impurity the most, i.e. the one with the largest Gini gain.
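The matching formulas, using the same notation as above, are:

$$\operatorname{Gini}(D) = 1 - \sum_{k} p_k^2, \qquad \operatorname{GiniGain}(D, A) = \operatorname{Gini}(D) - \sum_{v} \frac{|D_v|}{|D|}\, \operatorname{Gini}(D_v)$$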
Writing the code
With the theory out of the way, it's time to code. Let's break the whole process down and go step by step.
Basic tools: functions that measure purity
First we turn the formulas from the theory section into code; think of it as laying out the nuts and bolts needed to build the tree. We need four functions: two that measure purity (entropy and Gini impurity) and two that measure how much purity improves after a split (information gain and Gini gain).
```python
import numpy as np


def calculate_entropy(y):
    # Entropy of a label array: -sum(p * log2(p)) over the observed classes.
    counts = np.bincount(y)
    probabilities = counts / len(y)
    probabilities = probabilities[probabilities > 0]  # drop zero counts to avoid log2(0)
    entropy = -np.sum(probabilities * np.log2(probabilities))
    return entropy


def calculate_information_gain(X, y, feature_index):
    # Entropy before the split minus the weighted entropy of each subset after it.
    total_entropy = calculate_entropy(y)
    feature_values = np.unique(X[:, feature_index])
    weighted_entropy = 0
    for value in feature_values:
        subset_indices = np.where(X[:, feature_index] == value)[0]
        subset_entropy = calculate_entropy(y[subset_indices])
        weighted_entropy += (len(subset_indices) / len(y)) * subset_entropy
    return total_entropy - weighted_entropy


def calculate_gini_impurity(y):
    # Gini impurity: 1 - sum(p^2) over the classes.
    counts = np.bincount(y)
    probabilities = counts / len(y)
    gini = 1 - np.sum(probabilities**2)
    return gini


def calculate_gini_gain(X, y, feature_index):
    # Gini impurity before the split minus the weighted impurity after it.
    total_gini = calculate_gini_impurity(y)
    feature_values = np.unique(X[:, feature_index])
    weighted_gini = 0
    for value in feature_values:
        subset_indices = np.where(X[:, feature_index] == value)[0]
        subset_gini = calculate_gini_impurity(y[subset_indices])
        weighted_gini += (len(subset_indices) / len(y)) * subset_gini
    return total_gini - weighted_gini
```
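As a quick sanity check (made-up toy label arrays, not the lab data), the helpers behave exactly as the theory says: a 50/50 set is maximally messy, and a single-class set scores zero on both measures.

```python
import numpy as np

y_mixed = np.array([0, 0, 1, 1])  # half rejected, half approved
y_pure = np.array([1, 1, 1, 1])   # everyone approved

print(calculate_entropy(y_mixed))        # 1.0  (maximum entropy for two classes)
print(calculate_entropy(y_pure))         # -0.0, i.e. zero
print(calculate_gini_impurity(y_mixed))  # 0.5  (maximum Gini impurity for two classes)
print(calculate_gini_impurity(y_pure))   # 0.0
```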
The skeleton: the DecisionTree class
Next we set up the frame of a DecisionTree class. This is the tree's "brain": it knows which splitting criterion to use (criterion) and it drives the whole process of building the tree (fit) and making predictions (predict).
```python
class DecisionTree:
    def __init__(self, criterion='entropy', max_depth=None):
        self.criterion = criterion  # 'entropy' for ID3, otherwise Gini gain is used
        self.max_depth = max_depth  # optional depth limit
        self.tree = None            # nested dict built by fit()

    def fit(self, X, y):
        self.tree = self._build_tree(X, y, 0)

    def predict(self, X):
        return [self._predict_single(x, self.tree) for x in X]
```
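Once the recursive helpers in the next step are filled in, using the class is just construct, fit, predict. Here is a minimal sketch with toy arrays invented purely for illustration (the real script loads the CSV files instead):

```python
import numpy as np

# Toy data made up for this example: 3 applicants, 4 features, binary labels.
X_toy = np.array([[0, 0, 1, 2],
                  [1, 1, 0, 1],
                  [2, 0, 1, 0]])
y_toy = np.array([1, 1, 0])

model = DecisionTree(criterion='gini', max_depth=3)  # or criterion='entropy' for ID3
model.fit(X_toy, y_toy)
print(model.predict(X_toy))  # predicts the training labels back: 1, 1, 0
```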
The core logic: building the tree recursively
This is the key step, the _build_tree function. It's recursive, meaning it calls itself to grow the tree level by level.
Its logic is:
First, decide whether to stop: if the current group of data is already pure (everyone has the same label), or some other stopping condition holds (for example, no features are left or the depth limit is reached), return a result and stop splitting.
Find the best feature: using the gain functions written above, go through every feature and pick the one that leaves the data purest.
Split: partition the data by the values of that best feature, then repeat the whole process on each part (by calling itself) to build the subtrees.
```python
    def _get_best_split(self, X, y):
        # Try every feature and keep the one with the largest gain under the chosen criterion.
        best_gain = 0  # only accept splits that actually improve purity
        best_feature_index = -1
        for i in range(X.shape[1]):
            if self.criterion == 'entropy':
                gain = calculate_information_gain(X, y, i)
            else:
                gain = calculate_gini_gain(X, y, i)
            if gain > best_gain:
                best_gain = gain
                best_feature_index = i
        return best_feature_index

    def _build_tree(self, X, y, depth):
        # Stop if the node is already pure.
        if len(np.unique(y)) == 1:
            return y[0]
        # Stop if there are no features or the depth limit has been reached.
        if X.shape[1] == 0 or (self.max_depth is not None and depth == self.max_depth):
            return Counter(y).most_common(1)[0][0]
        # Otherwise, find the best feature to split on.
        best_feature_index = self._get_best_split(X, y)
        if best_feature_index == -1:  # no feature gives a useful split
            return Counter(y).most_common(1)[0][0]
        # Split on that feature and recurse into each branch.
        tree = {best_feature_index: {}}
        feature_values = np.unique(X[:, best_feature_index])
        for value in feature_values:
            subset_indices = np.where(X[:, best_feature_index] == value)[0]
            subtree = self._build_tree(X[subset_indices], y[subset_indices], depth + 1)
            tree[best_feature_index][value] = subtree
        return tree
```
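Note that the "tree" this builds is nothing fancy, just a nested dict keyed first by feature index and then by feature value, with class labels at the leaves. The tree the lab data ends up producing (see the run results below) looks like this in dict form:

```python
# Root splits on feature 2 (owns a house); the 0-branch then splits on feature 1 (has a job).
tree = {2: {0: {1: {0: 0, 1: 1}},
            1: 1}}
```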
Final assembly: the complete code
Now we put all the pieces together, add the parts that load the data, run the two models, and print the results, and we have the finished script.
```python
import numpy as np
import pandas as pd
from collections import Counter


def calculate_entropy(y):
    counts = np.bincount(y)
    probabilities = counts / len(y)
    probabilities = probabilities[probabilities > 0]
    entropy = -np.sum(probabilities * np.log2(probabilities))
    return entropy


def calculate_information_gain(X, y, feature_index):
    total_entropy = calculate_entropy(y)
    feature_values = np.unique(X[:, feature_index])
    weighted_entropy = 0
    for value in feature_values:
        subset_indices = np.where(X[:, feature_index] == value)[0]
        subset_entropy = calculate_entropy(y[subset_indices])
        weighted_entropy += (len(subset_indices) / len(y)) * subset_entropy
    return total_entropy - weighted_entropy


def calculate_gini_impurity(y):
    counts = np.bincount(y)
    probabilities = counts / len(y)
    gini = 1 - np.sum(probabilities**2)
    return gini


def calculate_gini_gain(X, y, feature_index):
    total_gini = calculate_gini_impurity(y)
    feature_values = np.unique(X[:, feature_index])
    weighted_gini = 0
    for value in feature_values:
        subset_indices = np.where(X[:, feature_index] == value)[0]
        subset_gini = calculate_gini_impurity(y[subset_indices])
        weighted_gini += (len(subset_indices) / len(y)) * subset_gini
    return total_gini - weighted_gini


class DecisionTree:
    def __init__(self, criterion='entropy', max_depth=None):
        self.criterion = criterion
        self.max_depth = max_depth
        self.tree = None

    def _get_best_split(self, X, y):
        best_gain = 0  # only accept splits that actually improve purity
        best_feature_index = -1
        for i in range(X.shape[1]):
            if self.criterion == 'entropy':
                gain = calculate_information_gain(X, y, i)
            else:
                gain = calculate_gini_gain(X, y, i)
            if gain > best_gain:
                best_gain = gain
                best_feature_index = i
        return best_feature_index

    def _build_tree(self, X, y, depth):
        if len(np.unique(y)) == 1:
            return y[0]
        if X.shape[1] == 0 or (self.max_depth is not None and depth == self.max_depth):
            return Counter(y).most_common(1)[0][0]
        best_feature_index = self._get_best_split(X, y)
        if best_feature_index == -1:
            return Counter(y).most_common(1)[0][0]
        tree = {best_feature_index: {}}
        feature_values = np.unique(X[:, best_feature_index])
        for value in feature_values:
            subset_indices = np.where(X[:, best_feature_index] == value)[0]
            subtree = self._build_tree(X[subset_indices], y[subset_indices], depth + 1)
            tree[best_feature_index][value] = subtree
        return tree

    def fit(self, X, y):
        self.tree = self._build_tree(X, y, 0)

    def _predict_single(self, x, tree):
        # Walk down the nested dict until a leaf (a plain class label) is reached.
        if not isinstance(tree, dict):
            return tree
        feature_index = list(tree.keys())[0]
        branches = tree[feature_index]
        value = x[feature_index]
        if value in branches:
            return self._predict_single(x, branches[value])
        return None  # feature value never seen during training

    def predict(self, X):
        return [self._predict_single(x, self.tree) for x in X]


def print_tree(tree, indent=""):
    if not isinstance(tree, dict):
        print(indent + "Predict:", tree)
        return
    feature_index, branches = next(iter(tree.items()))
    print(indent + f"Feature {feature_index}:")
    for value, subtree in branches.items():
        print(indent + f"  Value {value}:")
        print_tree(subtree, indent + "    ")


def calculate_accuracy(y_true, y_pred):
    correct = np.sum(y_true == y_pred)
    return correct / len(y_true)


if __name__ == '__main__':
    train_data = pd.read_csv('dataset.csv', header=None)
    test_data = pd.read_csv('testset.csv', header=None)
    X_train = train_data.iloc[:, :-1].values
    y_train = train_data.iloc[:, -1].values
    X_test = test_data.iloc[:, :-1].values
    y_test = test_data.iloc[:, -1].values

    print("--- Building decision tree (ID3) ---")
    tree_id3 = DecisionTree(criterion='entropy')
    tree_id3.fit(X_train, y_train)
    print("\nID3 tree structure:")
    print_tree(tree_id3.tree)
    y_pred_id3 = tree_id3.predict(X_test)
    accuracy_id3 = calculate_accuracy(y_test, y_pred_id3)
    print(f"\nID3 model accuracy: {accuracy_id3:.4f}")

    print("\n--- Building decision tree (Gini index) ---")
    tree_gini = DecisionTree(criterion='gini')
    tree_gini.fit(X_train, y_train)
    print("\nGini index tree structure:")
    print_tree(tree_gini.tree)
    y_pred_gini = tree_gini.predict(X_test)
    accuracy_gini = calculate_accuracy(y_test, y_pred_gini)
    print(f"\nGini model accuracy: {accuracy_gini:.4f}")
```
Run results
```text
--- Building decision tree (ID3) ---

ID3 tree structure:
Feature 2:
  Value 0:
    Feature 1:
      Value 0:
        Predict: 0
      Value 1:
        Predict: 1
  Value 1:
    Predict: 1

ID3 model accuracy: 1.0000

--- Building decision tree (Gini index) ---

Gini index tree structure:
Feature 2:
  Value 0:
    Feature 1:
      Value 0:
        Predict: 0
      Value 1:
        Predict: 1
  Value 1:
    Predict: 1

Gini model accuracy: 1.0000
```
Let's analyze the results