Forecasting with Trees (2021) (amazon.science)
64 points by tosh on March 6, 2024 | 22 comments


I think the skill level required to start grinding boosted trees is relatively low compared to NNs. There's lots you can do without special hardware. It's more democratic. It trains very quickly compared to NNs. It works for big and small data. The implementations are very sophisticated at this stage. You can customise the loss, the splits, the base algorithm. Inference is fast. And so on. Boosted trees have a lot going for them in the black-box model space.
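For illustration, a minimal sketch of customising the loss, using LightGBM's scikit-learn API (the asymmetric objective here is made up for the example, not from the article):

    import numpy as np
    import lightgbm as lgb
    from sklearn.datasets import make_regression

    def asymmetric_l2(y_true, y_pred):
        # Penalize under-prediction 3x more than over-prediction.
        # LightGBM expects the gradient and hessian of the loss w.r.t. y_pred.
        residual = y_pred - y_true
        grad = np.where(residual < 0, 6.0 * residual, 2.0 * residual)
        hess = np.where(residual < 0, 6.0, 2.0)
        return grad, hess

    X, y = make_regression(n_samples=10_000, n_features=15, noise=0.1, random_state=0)
    model = lgb.LGBMRegressor(objective=asymmetric_l2, n_estimators=200)
    model.fit(X, y)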


Additionally, trees have fewer, often computably optimal hyperparameters, whereas DNNs often require extensive hyperparameter tuning and even neural architecture search (how many layers, what activations, what optimization method...). Furthermore, trees are generally more interpretable. There have been some interesting recent papers relating random forests to adaptive smoothers, trying to understand why they beat DNNs on tabular data.
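For example, the grid that matters for a boosted-trees model is small enough to search exhaustively; a minimal sketch, assuming LightGBM and scikit-learn (the parameter values are only illustrative):

    import lightgbm as lgb
    from sklearn.datasets import make_regression
    from sklearn.model_selection import GridSearchCV

    X, y = make_regression(n_samples=5_000, n_features=15, random_state=0)
    grid = {
        "num_leaves": [31, 127],
        "learning_rate": [0.05, 0.1],
        "n_estimators": [200, 500],
    }
    search = GridSearchCV(lgb.LGBMRegressor(), grid, cv=3,
                          scoring="neg_mean_absolute_error")
    search.fit(X, y)
    print(search.best_params_)  # the whole search is a dozen fits, not an architecture search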


Are you saying that the prevalence of trees among well-performing solutions is not due to trees outperforming other architectures, but rather that more people try them, so they show up in winning solutions more often simply because of the adoption rate?


I haven't followed prediction contests for a while because, frankly, the field has moved on (more sideways, actually, with LLMs).

Back when I did follow, until a few years ago, the winning models were ensembles of ensembles (and RF, for example, is itself an ensemble). That the best single models are ensembles, or evolutions of ensembles, is therefore not surprising.

When dealing with numerical data, squeezing blood from a stone, which is what happens in the later stages of a prediction competition, is very rarely worth it in the real world. When the model is not mechanistic but only correlative (almost no model is purely one or the other, anyway), chasing the last decimal place of mean absolute error or a similar metric means building an increasingly complex structure, over which we have little control, on a foundation of sand. All it takes is a little wind, such as a change in the distribution of the data over time (which always happens), and unstable structures are bound to collapse.



This talks about TensorFlow, and I've been looking at scikit-learn's random forest regression.

I have about one million rows of tabular data, with 15 features, to make price predictions.

Is there a definitively better choice between the two?


Absolutely, you should look at XGBoost, LightGBM, and CatBoost.

XGBoost is the OG and the most feature-rich.

LightGBM is the fastest and what I use for my case (millions of rows of data with over 100 features).

CatBoost could be good depending on the nature of your data, for example if you have a lot of categorical types.

EDIT: Also, they all support GPU training, but I haven't been able to make that faster than just using more CPU cores.
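A minimal sketch for the setting in the question (~1M rows, 15 features, a numeric price target), assuming a pandas DataFrame df with a "price" column; the column name is a placeholder:

    import lightgbm as lgb
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    X, y = df.drop(columns=["price"]), df["price"]
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    model = lgb.LGBMRegressor(
        n_estimators=2000,
        learning_rate=0.05,
        num_leaves=63,
        n_jobs=-1,  # all CPU cores; as noted above, often beats GPU here
    )
    model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              eval_metric="mae",
              callbacks=[lgb.early_stopping(100)])  # stop when validation MAE plateaus
    print(mean_absolute_error(y_val, model.predict(X_val)))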


To add to that, I'd primarily consider LightGBM if you want to tweak it a lot, and CatBoost if you want great out-of-the-box results. Both are significant improvements over XGBoost, especially in training speed. I would only really consider XGBoost if you don't want software primarily developed by either Microsoft or Yandex.

All three are light-years ahead of naive random forest implementations, and all are under very active development.
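A minimal sketch of CatBoost's out-of-the-box handling of categorical features; the DataFrame df and its column names are placeholders:

    from catboost import CatBoostRegressor

    cat_cols = ["region", "product_type"]  # raw string-valued columns, no one-hot needed
    model = CatBoostRegressor(iterations=1000, verbose=200)
    model.fit(df.drop(columns=["price"]), df["price"],
              cat_features=cat_cols)  # CatBoost encodes these natively

The native categorical handling is a big part of the "great out-of-the-box results": no encoding step to get wrong before training even starts.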


Super helpful, thank you!


Exactly what's the problem with scikit-learn's random forest?


Very, very helpful – thanks Brad!


Anyone know of any practical ways to get started with this?


Here is a short intro on the theory of Gradient Boosted Decision Trees: https://developers.google.com/machine-learning/decision-fore...

And here is a practical intro to it, you can run it right in your browser if you open it in Colab: https://www.tensorflow.org/decision_forests/tutorials/beginn...
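For a taste of what that tutorial covers, a minimal sketch of the TensorFlow Decision Forests API, assuming a pandas DataFrame df with a numeric "label" column (the names are placeholders):

    import tensorflow_decision_forests as tfdf

    train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
        df, label="label", task=tfdf.keras.Task.REGRESSION)
    model = tfdf.keras.GradientBoostedTreesModel(task=tfdf.keras.Task.REGRESSION)
    model.fit(train_ds)
    model.summary()  # per-feature importances, number of trees, etc.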


Kaggle.com. Sign up for a competition or download a dataset.

Install Python 3.11 and the libraries scikit-learn, lightgbm, pandas, matplotlib, and numpy.

Ask an LLM to write a Python script that loads the data, fits a model, and summarizes/plots some results (a sketch of such a script follows below).

Jupyter Lab with autoreload and a Python virtual environment are recommended.
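A sketch of the kind of starter script you'd get, assuming a Kaggle-style train.csv with a numeric "target" column (file and column names are placeholders):

    import lightgbm as lgb
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("train.csv")
    X, y = df.drop(columns=["target"]), df["target"]
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    model = lgb.LGBMRegressor(n_estimators=500)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

    lgb.plot_importance(model, max_num_features=15)  # which features the trees split on most
    plt.tight_layout()
    plt.show()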


(temporian developer here)

Here's a simplified version of the approach (i.e., strong feature engineering, then converting the multivariate time series into a panel/tabular dataset and training a boosted-trees model on it), using Temporian (a much-improved alternative to pandas for working with temporal data) and XGBoost: https://temporian.readthedocs.io/en/stable/tutorials/m5_comp...
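For those who want the shape of it without clicking through, a minimal sketch of that pattern (lag/rolling features, then a boosted-trees model), done here with plain pandas and XGBoost rather than Temporian's API; the sales DataFrame and its columns are hypothetical:

    import pandas as pd
    import xgboost as xgb

    # sales: hypothetical DataFrame with columns ["item", "date", "units"]
    sales = sales.sort_values(["item", "date"])
    sales["lag_7"] = sales.groupby("item")["units"].shift(7)  # units one week ago
    sales["roll_28"] = (sales.groupby("item")["units"]
                        .transform(lambda s: s.shift(1).rolling(28).mean()))

    train = sales.dropna(subset=["lag_7", "roll_28"])  # drop rows without enough history
    model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(train[["lag_7", "roll_28"]], train["units"])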


If you're talking about Gradient Boosting, I've a somewhat popular answer on Quora from ages ago [1].

[1] https://qr.ae/pKJPbm
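For a sense of why the algorithm is considered elegant, a minimal from-scratch sketch of gradient boosting with squared error: each tree fits the residuals (the negative gradient) of the model so far and is added with a shrinkage factor:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=2_000, n_features=10, noise=0.1, random_state=0)

    lr, trees = 0.1, []
    pred = np.full_like(y, y.mean(), dtype=float)  # start from the mean prediction
    for _ in range(100):
        residual = y - pred                        # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        trees.append(tree)
        pred += lr * tree.predict(X)               # shrunken additive update

    print("train MSE:", np.mean((y - pred) ** 2))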


Maybe check out this recap on the M5 competition? It has links to notebooks and some of the top solutions.

https://www.kaggle.com/competitions/m5-forecasting-accuracy/...



I'm using this tutorial to get started: https://www.youtube.com/watch?v=Wqmtf9SA_kk


I used to be an XGBoost bro but these days I'm shilling for CatBoost. Anyway, both have lots of examples online. The truth is, without a problem that interests you, there's not much reason to learn about them unless you simply find the gradient boosting algorithm elegant. Otherwise I would take an intro to machine learning course.


> used to be an XGBoost bro

You sound like a moron.


Thanks internet stranger.



