
My 2c (probably not exhaustive for what you want to do):

1) Get the statistics/probability basics down. The field is full of people (you can see plenty of analyses on Kaggle) who "do machine learning" but make very silly mistakes, e.g. turning categorical data into a float and using it as a continuous variable when training a model (see the first sketch at the end of this comment).

2) Take a look at traditional machine learning approaches. Nowadays you're swamped by DL (there are a lot of good suggestions in this thread, so I won't chime in), and you miss the fact that, sometimes, a simple decision tree or a dimensionality reduction approach (e.g. PCA or ICA) can yield incredible value in a very short time on huge datasets (see the second sketch at the end of this comment).

I wrote a fairly short post about this when I finished my Georgia Tech program: https://www.franzoni.eu/machine-learning-a-sound-primer/

3) It can take a lot of time to become effective in ML, effective as in: what you _manually create_ performs as well as picking an existing trained model, fine-tuning it, and using it. This can be frustrating: the low-hanging fruit is pretty powerful, and you don't need to understand much about ML algorithms to pick it up.

4) Consider MOOCs or online classes. I took Georgia Tech's OMSCS and can vouch for it: some classes force you to work like a data scientist and read papers, you get "real world" recognition, and you can discuss with your peers, which is useful!
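
On point 1, a minimal sketch of the mistake versus a saner encoding (the "city" column and the numbers are made up for illustration):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({"city": ["Rome", "Paris", "Rome", "Berlin"],
                       "rent": [900, 1200, 950, 1100]})

    # The silly mistake: arbitrary numeric codes imply an ordering and a
    # spacing between categories that simply isn't there.
    bad_X = df["city"].astype("category").cat.codes.to_frame()
    LinearRegression().fit(bad_X, df["rent"])

    # Saner: one-hot encode, so each category gets its own indicator column.
    good_X = pd.get_dummies(df["city"])
    LinearRegression().fit(good_X, df["rent"])

And on point 2, a sketch of the kind of quick win I mean (toy dataset, hyperparameters picked arbitrarily):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # PCA down to 2 components feeding a shallow, fully inspectable tree;
    # the whole pipeline trains in milliseconds.
    pipe = make_pipeline(PCA(n_components=2),
                         DecisionTreeClassifier(max_depth=3))
    print(cross_val_score(pipe, X, y).mean())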



I second learning the statistics/probability basics.

Your first model should always be something that predicts a constant value, or maybe in really complicated cases something like a linear/logistic regression. Then you have a baseline to compare more advanced approaches to. But in order to understand how to use linear regression well, you need to understand how it works in the first place.
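
As a rough sketch of what I mean (dataset and scoring choice are arbitrary here):

    from sklearn.datasets import load_diabetes
    from sklearn.dummy import DummyRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)

    # Baseline: always predict the mean of the training targets.
    print(cross_val_score(DummyRegressor(strategy="mean"), X, y, scoring="r2").mean())
    # Anything fancier has to beat this to justify its complexity.
    print(cross_val_score(LinearRegression(), X, y, scoring="r2").mean())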

Also experiment structure, sampling design, hypothesis testing, etc. will tell you a lot about what conclusions you can and cannot draw from observational data, which is what a lot of ML is about.


While stats and probability are very good, I can't say you need more than a good 101-level course for either. Really you're just looking for some good reasoning skills about how distributions and probability work.

This example: > The field is full of people (you can see plenty of analyses on Kaggle) who "do machine learning" but make very silly mistakes, e.g. turning categorical data into a float and using it as a continuous variable when training a model

Doesn't seem related to stats or probability at all to me, just critical thinking skills.


I prefer the "learn as you go" approach. There's nothing to stop someone from learning some basics and a library like scikit-learn, then learning by working through examples.
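
For example, the classic scikit-learn fit/predict loop is about ten lines (dataset and model chosen arbitrarily here):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit on the training split, evaluate on the held-out split.
    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))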

But if anyone wants to become an expert in ML/DS, learning statistics and probability is fundamental. Books like A First Course in Probability, An Introduction to Statistical Learning, and The Elements of Statistical Learning, to name a few, are very important.

A lot of the mistakes made in practice come from a lack of understanding of sampling techniques, of how statistical metrics can be misleading, and so on.
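
A classic example of a misleading metric, sketched with made-up numbers: on a 99%-negative dataset, a model that never predicts the positive class looks great on accuracy and useless on F1.

    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score

    y_true = np.array([0] * 990 + [1] * 10)  # 1% positives
    y_pred = np.zeros(1000, dtype=int)       # always predict "negative"

    print(accuracy_score(y_true, y_pred))             # 0.99
    print(f1_score(y_true, y_pred, zero_division=0))  # 0.0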

The first thing I learned in statistics is the difference between quantitative and qualitative information. If someone knows this before hopping onto Kaggle, they know that categorical features can't be used as continuous features.


Depends on your goal. If you want to read/implement things from a COLT paper, a 101-level stats/probability course won't really cut it.

Though for applied papers that's sometimes enough.

But heed Larry Wasserman's advice: "Using fancy tools like neural nets, boosting, and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a band-aid."


You need to understand that a category cannot be magically transformed into a float. Yeah, maybe not the best example on my part.

For the 101 level, I agree; I'd say you need a good understanding of basic probability and stats rather than a vague understanding of advanced topics.

If you can spot when somebody makes an assertion about a dataset that holds only if the data is normally distributed, without ever checking the actual distribution, you're probably good to go.
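
Concretely, that check takes a couple of lines with scipy (a sketch; the exponential data just stands in for real data that isn't normal):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.exponential(size=500)  # clearly not normal

    # D'Agostino-Pearson normality test: a tiny p-value means the
    # normality assumption doesn't hold.
    stat, p = stats.normaltest(data)
    print(p)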


I am a CS major but I was always bad at math. Can you recommend your favorite resources for learning probability and statistics basics?


“Introduction to Statistical Learning” is probably the go-to resource to start with. There’s a decent open-source book on probability which I use when I need a more in-depth understanding, but I don’t remember the title right now.

One clarification: you don’t need an extremely deep understanding of stats, probability, or linear algebra, imho. If you already took college-level classes, you’re likely fine.


I was a CS major who made a 20-year career in software, and I was always told I was bad at math; I struggled with it all through school. About ten years into my career I began to realize that it wasn't really that I was bad at math, but that the way math is taught just doesn't work for most people. And a lot of the math you were expected to learn is really only directly applicable in very specific circumstances that you might not encounter in your career (which is not to say the mental exercise of learning it wasn't worthwhile!).

So my point is: if it isn't making sense the way you are being taught, go explore other avenues. There's no way these machine learning algorithms would have made sense to me as a 20-something undergrad, but as a 40-something who can explore them via software rather than a whiteboard, they really aren't that complicated to get started with.


As far as MOOCs are concerned, there are many great courses. I compiled a list at https://courseskipper.com/best-machine-learning-courses-for-... and suggest you just read the descriptions and see if anything looks like a good starting point for you.


Can you recommend a good basic stats/probability course? The last one I took was roughly in 1997 ;)


Richard McElreath's Statistical Rethinking is an absolute masterpiece on statistics: https://xcelab.net/rm/statistical-rethinking/


Thanks. BTW, I found there's a much cheaper ($80 -> $27) paperback version published a few days ago.


Can you share the link?


I just searched amazon for the title plus "paperback". Now that I look at it again, it says "by MAN (Author)" whereas the hard cover is "by Richard McElreath (Author)", so it's looking possibly scammy to me now, so caveat emptor...


I received the book today. I'm still not sure that it isn't a scam... The back cover is empty; otherwise, the quality is fine. I don't know if this is associated with Richard McElreath. Good luck to everyone.


MIT 6.041 [1] is a very good course I can recommend. Not sure if MITx 6.431x on edX is the same, but it's the same teacher in any case.

[1] https://www.youtube.com/watch?v=j9WZyLZCBzs&list=PLUl4u3cNGP...


OMSCS is less intense than SCPD: OMSCS has major deliverables roughly every two to three weeks, whereas SCPD has them every week at a similar depth. UTexas' MSDSO is even more relaxed. So if you want to save time, Stanford SCPD it is.


They are probably not silly mistakes. Label encoding can be very useful for tree-based models when the categories are ordinal, or when there is a high number of categories.


They are, most of the time. You get a prediction based on a meaningless float (unless the categories are ordinal, which isn't so common), and the categories can change their assigned numbers from run to run (this happens in lots of analyses) since they're not properly sorted. Crawl a few notebooks; I spotted that error quite often.
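
To make the "not properly sorted" point concrete (a sketch; the colors are made up):

    import pandas as pd

    # Codes are assigned by order of first appearance, so shuffling the
    # rows silently changes the category-to-number mapping.
    print(pd.factorize(pd.Series(["red", "blue", "green"]))[0])  # red=0
    print(pd.factorize(pd.Series(["green", "red", "blue"]))[0])  # green=0

    # Fixing the category order up front makes the codes stable.
    cat = pd.Categorical(["green", "red", "blue"],
                         categories=["blue", "green", "red"])
    print(cat.codes)  # blue=0, green=1, red=2, whatever the row order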


THWG



