Overview

A study on Gradient Boosting classifiers

Juliano Garcia de Oliveira, NUSP: 9277086
Advisor: Prof. Roberto Hirata

Abstract

Gradient Boosting Machines (GBMs) are a family of supervised machine learning algorithms that have achieved state-of-the-art results in a wide range of problems and won many machine learning competitions. When building any machine learning model, hyperparameter optimization can become a costly and time-consuming task, depending on the number of hyperparameters and the size of the search space. Machine learning users who are not experienced researchers or data science professionals can struggle to decide which hyperparameters and values to choose when starting to tune a model, especially with newer GBM implementations such as the XGBoost and LightGBM libraries. In this work, a large-scale experiment with 70 datasets is conducted using the OpenML platform, measuring the sensitivity of binary classifier evaluation metrics to changes in three LightGBM hyperparameters. A solid statistical framework is applied to the study results, analyzing the behavior from three different viewpoints: results by hyperparameter, results by dataset characteristics, and results by performance metric. The experiments indicate insightful relationships among the hyperparameters of gradient boosting classifiers, uncovering which hyperparameter combinations produced models with the largest change in the metrics relative to the baseline, which metrics are most sensitive, and which characteristics of the studied datasets stood out. These results are presented here to facilitate the model building of gradient boosting classifiers for machine learning users.

Initial Project Outline

This project consists of a study of hyperparameter effects in Gradient Boosting Machines (GBMs), namely in the LightGBM library. Gradient Boosting Machines are among the state-of-the-art machine learning techniques for dealing with structured data. The XGBoost (another famous library) and LightGBM implementations are widely used in industry, and are also part of almost all winning solutions in machine learning competitions on Kaggle, according to The State of Data Science and Machine Learning 2019.

The performance of GBM models depends heavily on hyperparameter tuning. The objective of this work is to study the impact of different hyperparameters on the performance of GBM classification models across different datasets, and how dataset characteristics affect model performance. With this study, I expect to obtain more insight into the boundaries and capabilities of GBM models and the important aspects that make them so valuable for modern data science solutions in real-world applications.
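
As a toy illustration of this kind of sensitivity measurement (not the exact experimental protocol of the thesis), one can train a baseline LightGBM classifier with default hyperparameters, train a variant with a single hyperparameter changed, and compare an evaluation metric. The synthetic dataset, the changed hyperparameter (num_leaves) and the metric (ROC AUC) below are illustrative choices only:

```python
# Hypothetical sketch: measure how much one metric moves when a single
# LightGBM hyperparameter is changed from its default value.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

baseline = LGBMClassifier(random_state=0).fit(X_tr, y_tr)                 # default hyperparameters
variant = LGBMClassifier(num_leaves=255, random_state=0).fit(X_tr, y_tr)  # one hyperparameter changed

auc_base = roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])
auc_var = roc_auc_score(y_te, variant.predict_proba(X_te)[:, 1])
print(f"baseline AUC = {auc_base:.4f}")
print(f"variant  AUC = {auc_var:.4f} (relative change {(auc_var - auc_base) / auc_base:+.2%})")
```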

You can find my formal project outline here.

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically Decision Trees.

Gradient Boosting yields a model which is an ensemble of individual weak models: $$ F_M(x) = F_0(x) + \sum_{m=1}^{M} \gamma_m h_m(x)$$ where $F_0$ is an initial constant model and each $h_m$ is the weak learner added at stage $m$, weighted by $\gamma_m$.

A weak learner is a model that is only slightly better at prediction than random guessing. In GBMs, these models are boosted during the training process, and are usually shallow (low-depth) decision trees.
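
To make the additive formulation above concrete, the following is a minimal from-scratch sketch of gradient boosting for regression with squared loss, where each weak learner $h_m$ is a shallow decision tree fit to the residuals (the negative gradient) of the current ensemble; the data and hyperparameter values are illustrative only:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

M = 50               # number of boosting stages
learning_rate = 0.1  # shrinkage applied to each weak learner
F = np.full_like(y, y.mean())  # F_0: constant initial model
trees = []

for m in range(M):
    residuals = y - F                       # negative gradient of squared loss
    h = DecisionTreeRegressor(max_depth=2)  # weak learner: a shallow tree
    h.fit(X, residuals)
    F += learning_rate * h.predict(X)       # F_m = F_{m-1} + lr * h_m
    trees.append(h)

print("training MSE:", np.mean((y - F) ** 2))
```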

XGBoost is a scalable and flexible gradient boosting library which supports regression, classification, ranking, and user-defined objectives. It is widely used in winning Kaggle solutions!
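
As a small illustration of the user-defined objectives mentioned above, here is a minimal sketch using XGBoost's native API: the hand-written logistic objective simply reproduces what the built-in binary objective already does, and is only meant to show the mechanism (the dataset and parameter values are illustrative):

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

def logistic_obj(preds, dtrain):
    """User-defined objective: gradient and hessian of the logistic loss."""
    labels = dtrain.get_label()
    probs = 1.0 / (1.0 + np.exp(-preds))  # preds are raw margins
    grad = probs - labels
    hess = probs * (1.0 - probs)
    return grad, hess

# Train with the custom objective: 'obj' takes a callable returning (grad, hess).
booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=logistic_obj)

# With a custom objective, predict() returns raw margins, so apply the sigmoid.
margins = booster.predict(dtrain)
probs = 1.0 / (1.0 + np.exp(-margins))
print("training accuracy:", np.mean((probs > 0.5) == y))
```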

LightGBM is a fast, distributed, high-performance gradient boosting framework developed by Microsoft Research. It is based on decision tree algorithms and is used for many different machine learning tasks.
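
Below is a minimal sketch of a LightGBM binary classifier through its scikit-learn interface; the hyperparameter values shown are common illustrative choices, not the values evaluated in this study:

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Illustrative hyperparameters that are commonly tuned in LightGBM classifiers.
model = LGBMClassifier(
    n_estimators=200,    # number of boosting rounds
    num_leaves=31,       # maximum leaves per tree (controls tree complexity)
    learning_rate=0.05,  # shrinkage applied to each tree's contribution
    random_state=42,
)

# 5-fold cross-validated ROC AUC as an example evaluation metric.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"mean AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```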

Whoami

I'm Juliano Garcia and this is my undergraduate thesis webpage. You can contact me via my Github profile, Linkedin, Quora profile, or by good old email:

julianogarcia_1997@hotmail.com