
My 2c (probably not exhaustive for what you want to do):

1) Get the statistics/probability basics down. The field is full of people (you can see plenty of analyses on Kaggle) who "do machine learning" but make very silly mistakes, e.g. turning categorical data into a float and using it as a continuous variable when training a model (see the first sketch at the end of this comment).

2) Take a look at traditional machine learning approaches. Nowadays you're swamped by DL (there are a lot of good suggestions in this thread, so I won't chime in), and you miss the fact that, sometimes, a simple decision tree or a dimensionality reduction approach (e.g. PCA or ICA) can yield incredible value in a very short time on huge datasets (see the second sketch at the end of this comment).

I wrote a fairly short post about this when I finished my Georgia Tech program: https://www.franzoni.eu/machine-learning-a-sound-primer/

3) It can take a lot of time to become effective in ML, effective as in: what you _manually create_ performs as well as picking an existing trained model, fine-tuning it, and using it. This can be frustrating: the low-hanging fruit is pretty powerful, and you don't need to understand much about ML algorithms to pick it up.

4) Consider MOOCs or online classes. I took Georgia Tech's OMSCS and can vouch for it: some classes force you to work like a data scientist and read papers, you get "real world" recognition, and you can discuss with your peers, which is useful!
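
On point 1, a minimal sketch of the mistake versus a saner encoding (the "city" column and the numbers are made up for illustration):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({"city": ["Rome", "Paris", "Rome", "Berlin"],
                       "rent": [900, 1200, 950, 1100]})

    # The silly mistake: arbitrary numeric codes imply an ordering and a
    # spacing between categories that simply isn't there.
    bad_X = df["city"].astype("category").cat.codes.to_frame()
    LinearRegression().fit(bad_X, df["rent"])

    # Saner: one-hot encode, so each category gets its own indicator column.
    good_X = pd.get_dummies(df["city"])
    LinearRegression().fit(good_X, df["rent"])

And on point 2, a sketch of the kind of quick win I mean (toy dataset, hyperparameters picked arbitrarily):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # PCA down to 2 components feeding a shallow, fully inspectable tree;
    # the whole pipeline trains in milliseconds.
    pipe = make_pipeline(PCA(n_components=2),
                         DecisionTreeClassifier(max_depth=3))
    print(cross_val_score(pipe, X, y).mean())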



I second learning the statistics/probability basics.

Your first model should always be something that predicts a constant value, or maybe in really complicated cases something like a linear/logistic regression. Then you have a baseline to compare more advanced approaches to. But in order to understand how to use linear regression well, you need to understand how it works in the first place.
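
As a rough sketch of what I mean (dataset and scoring choice are arbitrary here):

    from sklearn.datasets import load_diabetes
    from sklearn.dummy import DummyRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)

    # Baseline: always predict the mean of the training targets.
    print(cross_val_score(DummyRegressor(strategy="mean"), X, y, scoring="r2").mean())
    # Anything fancier has to beat this to justify its complexity.
    print(cross_val_score(LinearRegression(), X, y, scoring="r2").mean())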

Also experiment structure, sampling design, hypothesis testing, etc. will tell you a lot about what conclusions you can and cannot draw from observational data, which is what a lot of ML is about.


While stats and probability are very good, I can't say you need more than a good 101-level course for either. Really you're just looking for some good reasoning skills about how distributions and probability work.

This example: > The field is full of people (you can see plenty of analyses on Kaggle) who "do machine learning" but make very silly mistakes, e.g. turning categorical data into a float and using it as a continuous variable when training a model

Doesn't seem related to stats or probability at all to me, just critical thinking skills.


I prefer the "learn as you go" approach. There's nothing to stop someone from learning some basics and a library like scikit-learn, then learning by working through examples.
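
For example, the classic scikit-learn fit/predict loop is about ten lines (dataset and model chosen arbitrarily here):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit on the training split, evaluate on the held-out split.
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))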

But if anyone wants to become an expert in ML/DS, learning statistics and probability is fundamental. Books like A First Course in Probability, An Introduction to Statistical Learning, and The Elements of Statistical Learning, to name a few, are very important.

A lot of the mistakes made in practice come from a lack of understanding of sampling techniques, of how statistical metrics can be misleading, and so on.
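
A classic example of a misleading metric, sketched with made-up numbers: on a 99%-negative dataset, a model that never predicts the positive class looks great on accuracy and useless on F1.

    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score

    y_true = np.array([0] * 990 + [1] * 10)  # 1% positives
    y_pred = np.zeros(1000, dtype=int)       # always predict "negative"

    print(accuracy_score(y_true, y_pred))             # 0.99
    print(f1_score(y_true, y_pred, zero_division=0))  # 0.0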

The first thing I learned in statistics is the difference between quantitative and qualitative information. If someone knows this before hopping onto Kaggle, they know that categorical features can't be used as continuous features.


Depends on your goal. If you want to read/implement things from a COLT paper, a 101-level stats/probability course won't really cut it.

Though for applied papers that's sometimes enough.

But heed Larry Wasserman's advice: "Using fancy tools like neural nets, boosting, and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a band-aid."


You need to understand that a category cannot be magically transformed into a float. Yeah, maybe not the best example on my part.

For the 101 level, I agree; I'd say you need a good understanding of basic probability and stats rather than a vague understanding of advanced topics.

If you can spot when somebody makes an assertion about a dataset that holds only if the data is normally distributed, without ever checking the actual distribution, you're probably good to go.
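
Concretely, that check takes a couple of lines with scipy (a sketch; the exponential data just stands in for real data that isn't normal):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.exponential(size=500)  # clearly not normal

    # D'Agostino-Pearson normality test: a tiny p-value means the
    # normality assumption doesn't hold.
    stat, p = stats.normaltest(data)
    print(p)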


I am a CS major but I was always bad at math. Can you recommend your favorite resources for learning probability and statistics basics?


“Introduction to Statistical Learning” is probably the go-to resource to start with. There’s a decent open-source book on probability which I use when I need a more in-depth understanding, but I don’t remember the title right now.

One clarification: you don’t need an extremely deep understanding of stats, probability, or linear algebra, imho. If you already took college-level classes, you’re likely fine.


I was a CS major who made a 20-year career in software, and I was always told I was bad at math; I struggled with it all through school. About ten years into my career I began to realize that it wasn't really that I was bad at math, but that the way math is taught just doesn't work for most people. And a lot of the math you were expected to learn is really only directly applicable in very specific circumstances that you might not encounter in your career (which is not to say the mental exercise of learning it wasn't worthwhile!).

So my point is: if it isn't making sense the way you are being taught, go explore other avenues. There's no way these machine learning algorithms would have made sense to me as a 20-something undergrad, but as a 40-something who can explore them via software rather than a whiteboard, they really aren't that complicated to get started with.


As far as MOOCs are concerned, there are many great courses. I compiled a list at https://courseskipper.com/best-machine-learning-courses-for-... and suggest you just read the descriptions and see if anything looks like a good starting point for you.


Can you recommend a good basic stats/probability course? The last one I took was roughly in 1997 ;)


Richard McElreath's Statistical Rethinking is an absolute masterpiece on statistics: https://xcelab.net/rm/statistical-rethinking/


Thanks. BTW, I found there's a much cheaper ($80 -> $27) paperback version published a few days ago.


Can you share the link?


I just searched amazon for the title plus "paperback". Now that I look at it again, it says "by MAN (Author)" whereas the hard cover is "by Richard McElreath (Author)", so it's looking possibly scammy to me now, so caveat emptor...


I received the book today. I'm still not sure that it isn't a scam... The back cover is empty; otherwise, the quality is fine. I don't know if this is associated with Richard McElreath. Good luck to everyone.


MIT 6.041 [1] is a very good course I can recommend. Not sure if MITx 6.431x on edX is the same, but it's the same teacher in any case.

[1] https://www.youtube.com/watch?v=j9WZyLZCBzs&list=PLUl4u3cNGP...


OMSCS is less intense than SCPD: OMSCS has major deliverables roughly every two to three weeks, whereas SCPD has them every week at a similar depth. UTexas' MSDSO is even more relaxed. So if you want to save time, Stanford SCPD it is.


They are probably not silly mistakes. Label encoding can be very useful for tree-based models when the categories are ordinal, or when there is a high number of categories.


They are, most of the time. You get a prediction based on a meaningless float (unless the categories are ordinal, which isn't so common), and the categories can change their assigned numbers from run to run (this happens in lots of analyses) since they're not properly sorted. Crawl a few notebooks; I spotted that error quite often.
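
To make the "not properly sorted" point concrete (a sketch; the colors are made up):

    import pandas as pd

    # Codes are assigned by order of first appearance, so shuffling the
    # rows silently changes the category-to-number mapping.
    print(pd.factorize(pd.Series(["red", "blue", "green"]))[0])  # red=0
    print(pd.factorize(pd.Series(["green", "red", "blue"]))[0])  # green=0

    # Fixing the category order up front makes the codes stable.
    cat = pd.Categorical(["green", "red", "blue"],
                         categories=["blue", "green", "red"])
    print(cat.codes)  # blue=0, green=1, red=2, whatever the row order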


THWG



