Machine Learning for Trading: Build a Decision Tree Learner

Introduction

The focus of this article is building a decision tree learner, evaluating its performance in different settings, and finally comparing it with a linear regression learner. The main learning goals are:

  • Supervised Learning: Understanding what supervised learning is and how to train, query, and evaluate a learner
  • Decision Tree: Learning how to build a decision tree and use it in different configurations

Supervised Learning

Consider a series of predictor measurements x_i, i = 1, 2, \dots, N (here predictor simply means input), where each predictor has an associated response measurement y_i. From a statistical point of view, the goal is usually to find a fit that relates the predictors to the responses as accurately as possible; in symbols, we assume y_i = f(x_i) + \epsilon_i and try to estimate f from the N training pairs. This is interesting for two reasons: it allows us to understand the relationships between the predictors and the responses, and we can use such a fit to predict future responses given new predictor data. Problems where a series of predictor measurements comes with an accompanying series of response measurements belong to the supervised learning category.

The opposite of supervised learning is unsupervised learning, where there is no associated series of response measurements for the predictors. This makes the learning problem more challenging, as we are working blind; another way to say this is that we do not have a response variable to supervise our analysis. This is where the terms supervised and unsupervised originate.

CART Algorithms

Classification and Regression Tree (CART) algorithms are used in supervised machine learning to extract information from large data sets for prediction purposes. In this article we will build a decision tree learner focused on regression, meaning the data we will work with is numerical.

Decision Tree Learner

Our main learner is a decision tree learner based on the algorithm by J. R. Quinlan. A minimal sketch of the pseudocode, reconstructed here to match the implementation that follows, is shown below; NA marks a leaf entry and leaf_size is the aggregation hyperparameter discussed afterwards.
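
build_tree(data):
    if rows(data) <= leaf_size:        return [NA, mean(data.y), NA, NA]
    if all values of data.y are equal: return [NA, data.y[0], NA, NA]
    i = best feature to split on (highest correlation with data.y)
    split_val = median(data[:, i])
    left_tree  = build_tree(rows where data[:, i] <= split_val)
    right_tree = build_tree(rows where data[:, i] >  split_val)
    root = [i, split_val, 1, rows(left_tree) + 1]
    return append(root, left_tree, right_tree)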

The algorithm builds the decision tree recursively. In the actual implementation a hyperparameter called leaf size is also introduced, which determines the aggregation threshold of the learner: once a subset of the data has leaf_size rows or fewer, it is collapsed into a single leaf holding the mean of the responses. The decision on what is the best feature to split on is made by selecting the feature with the highest correlation coefficient with the response.

We train our learner on financial stock market data. Our predictor measurements are the adjusted closing prices of several index funds, and our goal is to predict the MSCI Emerging Markets (EM) index. We will use 60% of the data set to train the learner and the remaining 40% to evaluate its predictions, as sketched below.
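
As a minimal sketch of the setup (the file name Istanbul.csv and its layout, a leading date column with the EM index in the last column, are assumptions based on the data set named in the experiment later):

import numpy as np

# skip the header row and drop the date column; the remaining columns are the
# predictors, with the EM index (the response) in the last column
data = np.genfromtxt('Istanbul.csv', delimiter=',', skip_header=1)[:, 1:]

# 60/40 in-sample / out-of-sample split
train_rows = int(0.6 * data.shape[0])
train_x, train_y = data[:train_rows, :-1], data[:train_rows, -1]
test_x, test_y = data[train_rows:, :-1], data[train_rows:, -1]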

The implementation of our decision tree is given below.

import numpy as np


class DTLearner:
    """
    Decision Tree Learner

    This class can be instantiated to create a decision tree learner for numerical analysis (regression)

    Args:
        leaf_size (int):    the leaf size hyperparameter, used to decide the threshold of leaf aggregation
        verbose (bool):     toggle to allow printing to screen

    Returns:
        An instance of the class with a tree built from the data, the instance can then be used to query predictions
    """
    def __init__(self, leaf_size=1, verbose=False):
        self.leaf_size = leaf_size
        self.verbose = verbose
        self.tree = None

    def add_evidence(self, data_x, data_y):
        """
        Add training evidence

        The method combines the separate data sets into a single data set, which makes building the tree easier later.
        Usually the shape of data_x is something like (r, c) while data_y is (r,). We can combine these two data sets
        by concatenating along the column axis, but data_y must first be reshaped from a vector into a 2-d array with
        a single column.

        Args:
            data_x (nd numpy array):    a multidimensional numpy array holding the factor values
            data_y  (1d numpy array):   a vector of labels (a.k.a. Y values)

        Returns:
            None, it just combines the data and invokes build_tree method
        """
        data = np.concatenate((data_x, data_y[:, None]), axis=1)
        self.tree = self.build_tree(data)
        if self.verbose:
            print(f'tree shape: {self.tree.shape}')
            print(self.tree)

    def find_best_feature(self, data):
        """
        Find the best feature to split

        It finds the best feature to split the rows on using the correlation coefficient. We need to watch out
        for the situation where the standard deviation of feature_x is zero, since this leads to a division by
        zero in the calculation of the correlation coefficient (numpy emits a runtime warning and returns nan)

        Args:
            data (nd numpy array):  the data set as an n-dimensional numpy array

        Returns:
            The best feature to split the data on, the returned value is an integer representing the column index.
            If there is a situation where the correlation coefficient for multiple columns is the same, it will
            deterministically return the last such column.
        """
        best_feature = None
        best_corr_coef = float('-inf')

        for col in range(data.shape[1] - 1):
            feature_x = data[:, col]
            # check std first, to prevent division by zero when numpy tries to
            # calculate the covariance matrix
            std = np.std(feature_x)
            if std > 0:
                c = np.corrcoef(feature_x, y=data[:, -1])[0, 1]
            else:
                c = 0
            if c >= best_corr_coef:
                best_corr_coef = c
                best_feature = col
        return best_feature

    def build_tree(self, data):
        """
        Builds the decision tree for numerical analysis based on the JR Quinlan algorithm

        Args:
            data (nd numpy array):  the full data set as an n-dimensional numpy array

        Returns:
            The decision tree as a nd numpy array in a tabular format where each row is
            [best_feature_idx, split_val, relative_left_tree_node_idx, relative_right_tree_node_idx]

            A leaf can be determined when a given row has its best_feature_idx set to numpy nan.
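
            For example, a (hypothetical) tree that splits on feature 0 at value 0.5 and has two leaves with
            mean responses 0.2 and 0.8 would be stored as:

                [[ 0. ,  0.5,  1. ,  2. ],
                 [nan,   0.2,  nan,  nan],
                 [nan,   0.8,  nan,  nan]]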
        """
        if data.shape[0] <= self.leaf_size:
            return np.atleast_2d([np.nan, np.mean(data[:, -1]), np.nan, np.nan])

        if np.all(data[:, -1] == data[0, -1]):
            return np.atleast_2d([np.nan, data[0, -1], np.nan, np.nan])

        best_feature_idx = self.find_best_feature(data)
        split_val = np.median(data[:, best_feature_idx])
        feature_vals = data[:, best_feature_idx]

        # if the median cannot separate the rows (i.e. every feature value falls
        # on one side of the split), aggregate them into a leaf holding the mean
        # of the response column
        if np.all(feature_vals <= split_val) or np.all(feature_vals > split_val):
            return np.atleast_2d([np.nan, np.mean(data[:, -1]), np.nan, np.nan])

        true_rows = data[feature_vals <= split_val]
        false_rows = data[feature_vals > split_val]

        left_tree = self.build_tree(true_rows)
        right_tree = self.build_tree(false_rows)

        root = np.atleast_2d([best_feature_idx, split_val, 1, left_tree.shape[0] + 1])

        return np.vstack((root, left_tree, right_tree))

    def predict(self, point, node=0):
        """
        Predict the label for a given point

        Args:
            point (1d numpy array): 1 dimensional numpy array where each entry is the value for a feature
            node (int): the current decision node

        Returns:
            A prediction for the label of a given point from the decision tree.
        """
        if np.isnan(self.tree[node, 0]):
            return self.tree[node, 1]

        feature = int(self.tree[node, 0])
        split_val = self.tree[node, 1]

        if point[feature] <= split_val:
            node += int(self.tree[node, 2])
        else:
            node += int(self.tree[node, 3])

        return self.predict(point, node)

    def query(self, points):
        """
        Query the decision tree to get predictions for the given points

        Args:
            points (nd numpy array):    nd numpy array where each entry is the value for a feature

        Returns:
            An 1d numpy array where each entry is the prediction for a given point
        """
        predictions = np.empty([len(points), ])
        for idx, point in enumerate(points):
            predictions[idx] = self.predict(point)
        return predictions

    def print_tree(self, node=0, spacing=''):
        """
        Print the decision tree

        Args:
            node (int): current node in the tree
            spacing (str):  current spacing

        Returns:
            None, it will just print the tree to the console
        """
        if np.isnan(self.tree[node, 0]):
            print(f'{spacing} Label: {self.tree[node, 1]}')
            return

        print(f'{spacing} X{int(self.tree[node, 0])} <= {self.tree[node, 1]}')
        print(f'{spacing} --> True:')
        self.print_tree(int(self.tree[node, 2]) + node, f'{spacing}\t')
        print(f'{spacing} --> False:')
        self.print_tree(int(self.tree[node, 3]) + node, f'{spacing}\t')
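
With the learner implemented, training and querying is straightforward. A brief usage sketch, reusing the train_x, train_y and test_x arrays from the split above:

learner = DTLearner(leaf_size=1, verbose=False)
learner.add_evidence(train_x, train_y)  # build the tree from the training data
pred_y = learner.query(test_x)          # out-of-sample predictions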

We can now experiment with our tree and evaluate its performance. In this experiment we train several instances of the decision tree learner while varying the leaf size. In each run we evaluate the performance of the learner by calculating the root mean squared error (RMSE) for in-sample and out-of-sample predictions.
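
For reference, the RMSE over N samples is

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}

which is exactly what the two rmse computations in the function below evaluate.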

import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import DTLearner as dtl  # assuming the DTLearner class above is saved in DTLearner.py


def experiment_1(train_x, train_y, test_x, test_y):
    max_leaf_size = 50
    in_sample_rmse = np.empty([max_leaf_size, ])
    out_of_sample_rmse = np.empty([max_leaf_size, ])

    for leaf_size in range(1, max_leaf_size + 1):
        # create a learner and train it
        learner = dtl.DTLearner(leaf_size=leaf_size)
        learner.add_evidence(train_x, train_y)  # train it

        # evaluate in sample
        pred_y = learner.query(train_x)  # get the predictions
        rmse_in = math.sqrt(((train_y - pred_y) ** 2).sum() / train_y.shape[0])
        in_sample_rmse[leaf_size - 1] = rmse_in

        # evaluate out of sample
        pred_y = learner.query(test_x)  # get the predictions
        rmse_out = math.sqrt(((test_y - pred_y) ** 2).sum() / test_y.shape[0])
        out_of_sample_rmse[leaf_size - 1] = rmse_out

    df_temp = pd.DataFrame(
        {'RMSE In Sample': in_sample_rmse, 'RMSE Out of Sample': out_of_sample_rmse},
        index=range(1, max_leaf_size + 1),  # align the x axis with the actual leaf size
    )
    df_temp.plot()

    plt.title('RMSE variation with leaf size using Istanbul data set and DTLearner')
    plt.xlabel('Leaf Size')
    plt.ylabel('RMSE')

    plt.xlim((0, 50))
    plt.savefig('./experiment_1.png')

The result of the experiment can be seen in the figure below. When evaluating the performance of our learner we are interested in whether overfitting occurs. Overfitting is an undesirable situation where a learner fits the training data excessively, in other words it simply memorizes the training data. As a result, the learner is unable to handle test data properly and provide accurate estimates.

Looking at the figure, it can be seen that overfitting occurs for a leaf size < 10. In this range, the RMSE for in-sample data, i.e. training data, increases with leaf size, while the RMSE for out-of-sample data, i.e. test data, decreases with leaf size. As leaf size increases, the flexibility of the learner decreases: at low leaf sizes the learner fits almost one-to-one to the training data, which explains the increasing behavior of the in-sample RMSE, since the learner is forced to aggregate more rows per leaf and its fit to the training data loosens. The out-of-sample RMSE, on the other hand, decreases because increasing the leaf size forces the learner to generalize. As leaf size approaches 10, the RMSE appears optimal for both in-sample and out-of-sample data. For leaf size > 10, both in-sample and out-of-sample RMSE appear to increase, meaning the learner's ability to provide accurate estimates degrades.