> It turns out that their algorithm makes a successful prediction about 70 per cent of the time. That’s far from perfect but much better than random guessing which is right only half the time.
From the graph, it looks like only about 27% of requests are fulfilled in the best case (jobs). In which case I can do better than 70% just by constantly predicting "no".
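To make the arithmetic concrete, a trivial sketch (the ~27% fulfilment rate is just my reading of the graph, not a number taken from the paper):

```python
# Hypothetical numbers: if only ~27% of requests are fulfilled (best case, jobs),
# always predicting "not fulfilled" is already ~73% accurate.
fulfilled_rate = 0.27                      # my rough read of the graph
always_no_accuracy = 1 - fulfilled_rate
print(f"always-'no' accuracy: {always_no_accuracy:.0%}")  # -> 73%
```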
(I assume that this is just bad reporting. I haven't read the reference.)
Edit: skimmed the referenced article. Average success rate is 24.6%. The 70% they give (well, 67.2%) is "the probability that a classifier will rank a randomly chosen positive instance over a randomly chosen negative one".
They got a ROC AUC score of 0.67. This means that for a randomly chosen denied request and a randomly chosen accepted request, they will give a higher score to the accepted request 67% of times.
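Roughly, on made-up data (not the paper's), the equivalence between the AUC and that pairwise ranking probability looks like this with scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy stand-in data: 1 = request fulfilled, 0 = denied.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
# Scores loosely correlated with the label, so the AUC lands above 0.5.
scores = y_true * 0.5 + rng.normal(size=1000)

auc = roc_auc_score(y_true, scores)

# Pairwise interpretation: the fraction of (fulfilled, denied) pairs in which
# the fulfilled request gets the higher score.
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()

print(f"AUC = {auc:.3f}, pairwise ranking probability = {pairwise:.3f}")
```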
No, reread the parent's statement. The test involves distinguishing exactly one known accepted request and exactly one known rejected request. Probability of an acceptance does not come into play here; hence a random algorithm would choose the accepted request correctly 50% of the time.
What the algorithm performs poorly at is determining whether any single arbitrary request is accepted or rejected; that's the test that would require around an 80% success rate.
"Since the AUC is a portion of the area of the unit square, its value will always be between 0 and 1.0. However, because random guessing produces the diagonal line between (0, 0) and (1, 1), which has an area of 0.5, no realistic classifier should have an AUC less than 0.5."
So it appears that not 0.8 but 0.5 is the randomness threshold; therefore 0.7 is not so bad.
Unfortunately not. IIRC, any classifier with AUC < 0.5 can be improved by just inverting its output. Real classification quality requires being significantly above 0.5 -- 0.8 is a common threshold.
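For example (again toy data, not the study's), flipping the sign of the scores turns an AUC of x into 1 - x:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=1000)
# A worse-than-random scorer: scores anti-correlated with the label.
bad_scores = -0.5 * y_true + rng.normal(size=1000)

auc_bad = roc_auc_score(y_true, bad_scores)        # below 0.5
auc_flipped = roc_auc_score(y_true, -bad_scores)   # equals 1 - auc_bad
print(f"original AUC = {auc_bad:.3f}, inverted AUC = {auc_flipped:.3f}")
```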
Random pet-peeve. I dislike how the graph in the article has markers at each decile.
This is model-generated data, so they could put markers at any number of arbitrary locations. Using markers implies to the viewer that the figure comes from discrete data points, which it does not.
>We find that Reddit users with higher status overall (higher karma) [...] are significantly more likely to receive help
There could be a common cause to these two factors (a certain way of writing, for instance), that could explain the correlation. I can't imagine someone basing a judgement on a stat reported on someone's user page.
GP means that things like being an effective communicator and knowing how to play to the crowd will factor into karma and into asking a question that gets answered.
That doesn't mean score-viewing doesn't happen, but it isn't likely to be the only mechanism at play.
Reddit Enhancement Suite, a ubiquitous browser extension, shows the link karma, comment karma, and account age of any username that you hover over.
So, it's totally possible that, like others have said, people deciding which request to fulfill take karma/account age into consideration to avoid gaming the system with multiple accounts.
That seems fairly self-explanatory. By definition, users with more karma are better acquainted with the site and have a successful track record of appealing to its audience.
I'm sure it could be a factor (nothing's preventing someone from doing that, after all) but my original point was essentially as characterized in this post below https://news.ycombinator.com/item?id=7793607
The tone of the article, I think, was suggestive of a causal relation, which surely doesn't (necessarily or plausibly) hold.
Even if it is a "threshold", it's simply a binary relation and not an explanatory variable at the margin. So, yes, you might be able to weed out some data, but among what's left it's unlikely to differentiate, except as a proxy (as alluded to in the other comments) for something else.
Besides the other reasons, a simple alternative might also be that users with high reputation are usually known to the regulars. If you recognise someone's name from posting a lot of interesting stuff on your favourite website, you'll probably help them out before you help anyone else.
> Althoff and co used a standard machine learning algorithm to comb through all the possible correlations.
What exactly is a "standard machine learning algorithm"?
I'm sure that probably means that they used something from scikit-learn but "comb[ing] through correlations" isn't as simple as clicking the Go button.
The rest of the article does start to get into labeling, holding out a test set, and some of the data cleanup (the real combing work).
I guess I was just hoping for more detail of how it worked and not that it worked. I get that this wasn't meant to be a PhD thesis on supervised machine learning, but the mechanics of data analysis are really interesting as a process of discovery.
Curious to know what others think. How did the balance of how vs that work for you?
It's a logistic regression model, a basic statistical technique which wouldn't have even come under "machine learning" a few years ago. Later they use some kind of LASSO regression to penalise the inclusion of redundant features.
"Combing through the correlations", it seems, literally means calculating the (Pearson) correlation between two variables (success vs. an input feature) and adding some interpretation, as they do on p6 of the arXiv paper. For test/training data, it looks like they just used a 30/70% split rather than k-fold cross-validation and holdout, but I'm sure it makes no difference either way and in this case (as often) is trivial to design and implement. Presumably their AUROC could be increased just by dropping in an SVM or a Random Forest in place of the logistic regression.
From what I've skimmed of the paper you're over-estimating the complexity of the study.
Great comment. Having studied graduate statistics in the stats department and data mining in the CS department, it's amazing how well the CS crowd has rebranded statistics into something you can talk about at a bar without people's eyes glazing over.
Here's a project I put out in January that does the same thing, with 65% accuracy. These guys actually contacted me back then; surprised to see no mention.
Are you suggesting that these people aren't being truthful about their need, or, are you saying it's amazing that internet access is so [relatively] cheap? Or... ?
Internet access costs less per month than a takeaway pizza in my country. Libraries and other community centres provide free access to computers with broadband internet.
One of the surprises for me when, some years ago, I found myself in a developing nation (with unstable power supplies and non-potable tap water) was that every street corner in the town I was staying in had a (hand painted) advert for an internet cafe. It wasn't cheap compared to food prices - but since then globally food prices seem to have gone up and internet prices gone down considerably.
I just wanted to express how bizarre this situation is, not to imply anything about the (possible) reasons. My first thought was: just terminate the internet contract, sell the phone or computer or whatever you are using, and buy yourself 50 kg of rice. That would even be healthier than pizza!
But I decided not to phrase it this way because this is obviously a very simplistic view and the people on the internet will tell you that in full detail even if you are aware of it. They might have free access to internet. It might be a temporary situation and not be economical to sell and later buy back the computer. Corn is much cheaper than rice. You have to cook rice and that requires energy. Rice alone is not healthy. Maybe they just do this to get in contact with others not because of having no money for food. They might need the computer for work. They just got robbed late at night, no money left but they still have a phone.
Further, a single pizza is unlikely to make much difference to one's nutritional status. I see these sorts of things as closer to emotional support than life-saving interventions.
Hmm, these people could be making the more rational choice. Depending on what they do for money, internet access could provide a higher return than food.
Higher return than staying alive? (Ignoring the fact that they can possibly also use the internet to get free stuff, especially food, which would make it a genuinely rational choice.)