Forecasting with Trees (2021) (amazon.science)
64 points by tosh on March 6, 2024 | 22 comments


I think the skill level required to start grinding boosted trees is relatively low compared to NNs. There's lots you can do without special hardware. It's more democratic. It trains very quickly compared to NNs. It works for big and small data. The implementations are very sophisticated at this stage. You can customise the loss, the splits, the base algorithm. Inference is fast. And so on. Boosted trees have a lot going for them in the black-box model space.
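For illustration, a minimal sketch of customising the loss, using LightGBM's scikit-learn API (the asymmetric objective here is made up for the example, not from the article):

    import numpy as np
    import lightgbm as lgb
    from sklearn.datasets import make_regression

    def asymmetric_l2(y_true, y_pred):
        # Penalize under-prediction 3x more than over-prediction.
        # LightGBM expects the gradient and hessian of the loss w.r.t. y_pred.
        residual = y_pred - y_true
        grad = np.where(residual < 0, 6.0 * residual, 2.0 * residual)
        hess = np.where(residual < 0, 6.0, 2.0)
        return grad, hess

    X, y = make_regression(n_samples=10_000, n_features=15, noise=0.1, random_state=0)
    model = lgb.LGBMRegressor(objective=asymmetric_l2, n_estimators=200)
    model.fit(X, y)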


Additionally, trees have fewer, often computably optimal hyperparameters, whereas DNNs often require extensive hyperparameter tuning and even neural architecture search (how many layers, what activations, what optimization method...). Furthermore, trees are generally more interpretable. There have been some interesting recent papers relating random forests to adaptive smoothers, trying to understand why they beat DNNs on tabular data.
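For example, the grid that matters for a boosted-trees model is small enough to search exhaustively; a minimal sketch, assuming LightGBM and scikit-learn (the parameter values are only illustrative):

    import lightgbm as lgb
    from sklearn.datasets import make_regression
    from sklearn.model_selection import GridSearchCV

    X, y = make_regression(n_samples=5_000, n_features=15, random_state=0)
    grid = {
        "num_leaves": [31, 127],
        "learning_rate": [0.05, 0.1],
        "n_estimators": [200, 500],
    }
    search = GridSearchCV(lgb.LGBMRegressor(), grid, cv=3,
                          scoring="neg_mean_absolute_error")
    search.fit(X, y)
    print(search.best_params_)  # the whole search is a dozen fits, not an architecture search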


Are you saying that the prevalence of trees among well-performing solutions is not due to trees outperforming other architectures, but rather that more people try them, so they show up in winning solutions more often simply because of the adoption rate?


I haven't followed prediction contests for a while because, frankly, the field has moved on (more sideways, actually, with LLMs).

Back when I did follow, until a few years ago, the winning models were ensembles of ensembles (and RF, for example, is itself an ensemble). That the best single models are ensembles, or evolutions of ensembles, is therefore not surprising.

When dealing with numerical data, squeezing blood from a stone, which is what happens in the later stages of a prediction competition, is very rarely worth it in the real world. When the model is not mechanistic but only correlative (almost no model is purely one or the other, anyway), chasing the last decimal place of mean absolute error or a similar metric means building an increasingly complex structure, over which we have little control, on a foundation of sand. All it takes is a little wind, such as a change in the distribution of the data over time (which always happens), and unstable structures are bound to collapse.



This talks about TensorFlow, and I've been looking at scikit-learn's random forest regression.

I have about one million rows of tabular data, with 15 features, to make price predictions.

Is there a definitively better choice between the two?


Absolutely, you should look at XGBoost, LightGBM, and CatBoost.

XGBoost is the OG and the most feature-rich.

LightGBM is the fastest and what I use for my case (millions of rows of data with over 100 features).

CatBoost could be good depending on the nature of your data, for example if you have a lot of categorical types.

EDIT: Also, they all support GPU training, but I haven't been able to make that faster than just using more CPU cores.
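A minimal sketch for the setting in the question (~1M rows, 15 features, a numeric price target), assuming a pandas DataFrame df with a "price" column; the column name is a placeholder:

    import lightgbm as lgb
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    X, y = df.drop(columns=["price"]), df["price"]
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    model = lgb.LGBMRegressor(
        n_estimators=2000,
        learning_rate=0.05,
        num_leaves=63,
        n_jobs=-1,  # all CPU cores; as noted above, often beats GPU here
    )
    model.fit(X_train, y_train,
              eval_set=[(X_val, y_val)],
              eval_metric="mae",
              callbacks=[lgb.early_stopping(100)])  # stop when validation MAE plateaus
    print(mean_absolute_error(y_val, model.predict(X_val)))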


To add to that, I'd primarily consider LightGBM if you want to tweak it a lot, and CatBoost if you want great out-of-the-box results. Both are significant improvements over XGBoost, especially in training speed. I would only really consider XGBoost if you don't want software primarily developed by either Microsoft or Yandex.

All three are light-years ahead of naive random forest implementations, and all are under very active development.
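A minimal sketch of CatBoost's out-of-the-box handling of categorical features; the DataFrame df and its column names are placeholders:

    from catboost import CatBoostRegressor

    cat_cols = ["region", "product_type"]  # raw string-valued columns, no one-hot needed
    model = CatBoostRegressor(iterations=1000, verbose=200)
    model.fit(df.drop(columns=["price"]), df["price"],
              cat_features=cat_cols)  # CatBoost encodes these natively

The native categorical handling is a big part of the "great out-of-the-box results": no encoding step to get wrong before training even starts.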


Super helpful, thank you!


Exactly what's the problem with scikit-learn's random forest?


Very, very helpful – thanks Brad!


Anyone know of any practical ways to get started with this?


Here is a short intro on the theory of Gradient Boosted Decision Trees: https://developers.google.com/machine-learning/decision-fore...

And here is a practical intro to it, you can run it right in your browser if you open it in Colab: https://www.tensorflow.org/decision_forests/tutorials/beginn...
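For a taste of what that tutorial covers, a minimal sketch of the TensorFlow Decision Forests API, assuming a pandas DataFrame df with a numeric "label" column (the names are placeholders):

    import tensorflow_decision_forests as tfdf

    train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
        df, label="label", task=tfdf.keras.Task.REGRESSION)
    model = tfdf.keras.GradientBoostedTreesModel(task=tfdf.keras.Task.REGRESSION)
    model.fit(train_ds)
    model.summary()  # per-feature importances, number of trees, etc.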


Kaggle.com. Sign up for a competition or download a dataset.

Install Python 3.11 and the libraries scikit-learn, lightgbm, pandas, matplotlib, and numpy.

Ask an LLM to write a Python script that loads the data, fits a model, and summarizes/plots some results (a sketch of such a script follows below).

Jupyter Lab with autoreload and a Python virtual environment are recommended.
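A sketch of the kind of starter script you'd get, assuming a Kaggle-style train.csv with a numeric "target" column (file and column names are placeholders):

    import lightgbm as lgb
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("train.csv")
    X, y = df.drop(columns=["target"]), df["target"]
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    model = lgb.LGBMRegressor(n_estimators=500)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

    lgb.plot_importance(model, max_num_features=15)  # which features the trees split on most
    plt.tight_layout()
    plt.show()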


(temporian developer here)

Here's a simplified version of the approach (i.e., strong feature engineering, then converting the multivariate time series into a panel/tabular dataset and training a boosted-trees model on it), using Temporian (a much-improved alternative to pandas for working with temporal data) and XGBoost: https://temporian.readthedocs.io/en/stable/tutorials/m5_comp...
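For those who want the shape of it without clicking through, a minimal sketch of that pattern (lag/rolling features, then a boosted-trees model), done here with plain pandas and XGBoost rather than Temporian's API; the sales DataFrame and its columns are hypothetical:

    import pandas as pd
    import xgboost as xgb

    # sales: hypothetical DataFrame with columns ["item", "date", "units"]
    sales = sales.sort_values(["item", "date"])
    sales["lag_7"] = sales.groupby("item")["units"].shift(7)  # units one week ago
    sales["roll_28"] = (sales.groupby("item")["units"]
                        .transform(lambda s: s.shift(1).rolling(28).mean()))

    train = sales.dropna(subset=["lag_7", "roll_28"])  # drop rows without enough history
    model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(train[["lag_7", "roll_28"]], train["units"])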


If you're talking about Gradient Boosting, I've a somewhat popular answer on Quora from ages ago [1].

[1] https://qr.ae/pKJPbm
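For a sense of why the algorithm is considered elegant, a minimal from-scratch sketch of gradient boosting with squared error: each tree fits the residuals (the negative gradient) of the model so far and is added with a shrinkage factor:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=2_000, n_features=10, noise=0.1, random_state=0)

    lr, trees = 0.1, []
    pred = np.full_like(y, y.mean(), dtype=float)  # start from the mean prediction
    for _ in range(100):
        residual = y - pred                        # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        trees.append(tree)
        pred += lr * tree.predict(X)               # shrunken additive update

    print("train MSE:", np.mean((y - pred) ** 2))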


Maybe check out this recap on the M5 competition? It has links to notebooks and some of the top solutions.

https://www.kaggle.com/competitions/m5-forecasting-accuracy/...



I'm using this tutorial to get started: https://www.youtube.com/watch?v=Wqmtf9SA_kk


I used to be an XGBoost bro but these days I'm shilling for CatBoost. Anyway, both have lots of examples online. The truth is, without a problem that interests you, there's not much reason to learn about them unless you simply find the gradient boosting algorithm elegant. Otherwise I would take an intro to machine learning course.


> used to be an XGBoost bro

You sound like a moron.


Thanks internet stranger.



