Text Matching Using Cosine Similarity (kanoki.org)
102 points by pplonski86 on Dec 27, 2018 | 19 comments


In this world of content marketing served up as thinly sliced salami, it would be nice to see something that takes the time to tell a complete story, as opposed to this.

Right now on the front page there is another article about "deep learning" classifiers, which are competitive with this technology, and it would be nice to see an objective comparison. (E.g., a co-worker built a 0.93-accuracy text classifier in 2004 with bag-of-words and an SVM.)
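
For the curious, here is a rough sketch of that kind of bag-of-words + SVM pipeline in sklearn; the texts and labels are invented toy data, not the co-worker's actual setup:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy corpus; a real benchmark would use a properly labeled dataset.
    texts = ["cheap meds now", "meeting moved to 3pm", "win a free prize", "lunch tomorrow?"]
    labels = ["spam", "ham", "spam", "ham"]

    clf = make_pipeline(CountVectorizer(), LinearSVC())
    clf.fit(texts, labels)
    print(clf.predict(["free meds prize"]))  # expected: ['spam'] on this toy data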


It's nice to use when you don't have labeled data, such as for plagiarism detection. I use it a lot in the security space to see if a user has changed their behavior by looking at the cosine distance between the resources used in one period of time vs. the next.
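
A minimal sketch of that idea, with made-up resource names and windows: represent each time window as a vector of resource-access counts and compare consecutive windows with cosine distance.

    from collections import Counter
    from scipy.spatial.distance import cosine

    def resource_vector(events, vocabulary):
        # Count how often each resource was touched in one time window.
        counts = Counter(events)
        return [counts[r] for r in vocabulary]

    # Hypothetical access logs for two consecutive weeks.
    week1 = ["mail", "mail", "wiki", "crm", "mail"]
    week2 = ["db_admin", "db_admin", "mail", "ssh_gateway"]

    vocabulary = sorted(set(week1) | set(week2))
    drift = cosine(resource_vector(week1, vocabulary),
                   resource_vector(week2, vocabulary))
    print(f"cosine distance between weeks: {drift:.2f}")  # larger means a bigger change in behavior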


Keep in mind that any similarity measure maps a high-dimensional space onto a one-dimensional space. The set of vectors within a given cosine of another vector is an N-dimensional cone whose volume rapidly decays towards 0 for N > 5 (assuming positive cosine). Therefore cosine is not a particularly good metric in high-dimensional spaces.
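
A quick way to see that effect numerically (a rough sketch, not from the article): sample random unit vectors and check how many land within cosine 0.5 of a fixed direction as the dimension grows.

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (2, 5, 20, 100):
        # Normalized Gaussian samples are uniform on the unit sphere.
        x = rng.standard_normal((100_000, n))
        x /= np.linalg.norm(x, axis=1, keepdims=True)
        reference = np.zeros(n)
        reference[0] = 1.0
        frac = np.mean(x @ reference > 0.5)
        print(f"n={n:4d}: fraction with cosine > 0.5 = {frac:.4f}")

The fraction drops off sharply as n grows, which is the cone-volume argument in numbers.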


For n ~ 250k to 1.5M, I've used Manhattan distance (as opposed to Euclidean) in an analysis of neuroimaging functional data... but I am interested in people's takes on choosing a distance metric. There are so many exotic ones out there.
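
For what it's worth, swapping metrics is a one-word change in scipy; a toy sketch with random data, not neuroimaging:

    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(1)
    a = rng.random((3, 1000))  # e.g. three observations with 1000 features each
    b = rng.random((3, 1000))

    print(cdist(a, b, metric="cityblock"))  # Manhattan / L1
    print(cdist(a, b, metric="euclidean"))  # L2
    print(cdist(a, b, metric="cosine"))     # 1 - cosine similarity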


What if my model is really good and maps to the angle between two vectors almost perfectly?


Kind of. w2v is cosine-similarity driven, and its embeddings often go up to 300 dimensions.
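
For concreteness, here is a tiny sketch of that comparison; the 300-dimensional vectors are random stand-ins, not real trained embeddings:

    import numpy as np

    def cosine_similarity(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    rng = np.random.default_rng(42)
    king, queen = rng.standard_normal(300), rng.standard_normal(300)  # stand-ins for trained word vectors
    print(cosine_similarity(king, queen))  # near 0 for random vectors; related words in a trained model score much higher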


> For a novice it looks a pretty simple job of using some Fuzzy string matching tools and get this done.

Nice.

That annoying quip aside, as with many things in data processing, it's case by case. In TF-IDF you lose ordering information by definition. This is probably fine for this use case, but it does mean that if ordering matters, say because a set of stores share the same words in different orders, this will fail to resolve the difference. The author says he did due diligence on the data, but there are other ways this can fall short. For example, ["Walmart", "5280"] compared to ["Store", "5280"] is not going to come out as similar as one would want, due to the down-weighting of the identifying number in TF-IDF. So imo the disadvantage mentioned for using BoW over TF-IDF is actually not a disadvantage sometimes. As with everything, it depends on your problem and data.
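
A rough sketch of that failure mode with sklearn; the store names are invented, and the resulting score depends entirely on what else is in the corpus, since that is what sets the IDF weights:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Invented store names; the rest of the corpus determines the IDF weights.
    corpus = ["Walmart 5280", "Store 5280", "Walmart 1234", "Walmart 9876", "Target 5555"]

    tfidf = TfidfVectorizer()
    vectors = tfidf.fit_transform(corpus)

    # Similarity between "Walmart 5280" and "Store 5280" rides entirely on the
    # shared "5280" token, and its weight is whatever IDF assigns it here.
    print(cosine_similarity(vectors[0], vectors[1])[0, 0])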

To the author: I would hope that in the future you remove statements like "to a novice, it seems easy to use X". There is nothing novice about going into a problem with an idea and trying it if it seems to fit the use case.


One interesting fact to note is that there is a link between the cosine rule and probability; here's an explanation:

https://www.johndcook.com/blog/2010/06/17/covariance-and-law...


The link goes even deeper than that. Cosines have an intimate relationship with Euclidean distance, and the fundamental statistics concept of variance is in turn intimately related to Euclidean distance... and Gaussian distributions (perhaps the most continuous distribution family because of the Central Limit Theorem) are parameterized directly by mean and variance. And Gaussian models have a convenient habit of reducing to straightforward linear algebra as a result. The fact that Gaussian problems also happen to be nicely differentiable is yet another bonus. Oh, and least-squares linear regression (read: fit a line to optimize the Euclidean distance between your prediction vector and the data) is equivalent to a Gaussian maximum likelihood model.
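
A tiny numerical illustration of that last point, with made-up data: the least-squares solution from the normal equations is exactly the Gaussian maximum-likelihood estimate of the coefficients, because maximizing the Gaussian likelihood means minimizing the sum of squared residuals.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(50), rng.standard_normal(50)])  # intercept + one feature
    y = X @ np.array([2.0, -3.0]) + 0.1 * rng.standard_normal(50)

    # Least squares via the normal equations: beta = (X^T X)^{-1} X^T y
    beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

    # Gaussian MLE for y ~ N(X beta, sigma^2 I): minimizing squared residuals gives the same beta.
    beta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(np.allclose(beta_ls, beta_mle))  # True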

Basically, everything boils down to high-school trigonometry, because it's the most natural way to define distances in our world. I still marvel at it.


> The link goes even deeper than that. Cosines have an intimate relationship with Euclidean distance, and the fundamental statistics concept of variance is in turn intimately related to Euclidean distance... and Gaussian distributions (perhaps the most continuous distribution family because of the Central Limit Theorem) are parameterized directly by mean and variance. And Gaussian models have a convenient habit of reducing to straightforward linear algebra as a result. The fact that Gaussian problems also happen to be nicely differentiable is yet another bonus. Oh, and least-squares linear regression (read: fit a line to optimize the Euclidean distance between your prediction vector and the data) is equivalent to a Gaussian maximum likelihood model.

Thanks for this insight! Can you also suggest a book or other materials that I can read to understand this in greater depth?


Note that I meant to write

perhaps the most important continuous distribution family because of the Central Limit Theorem

instead of

perhaps the most continuous distribution family because of the Central Limit Theorem

As for a single book, not really. See my response to the sibling comment for some more insight.


I really hate this kind of numerology. Cosines don't have an intimate relationship with Euclidean distance; the dot product and the norm aren't equivalent https://math.stackexchange.com/questions/528864/is-every-nor.... I also have no idea what you mean by normal distributions being "most continuous" because of the CLT. And what does linearity have to do with continuity? Normal distributions are convenient to work with simply because of the exponential (other exponential families are also nice to work with). Again, I don't know what you mean by normals being nicely differentiable; lots of PDFs are differentiable (in fact most functions are easy to differentiate). Finally, you only have to go beyond an undergrad book to see that most things are actually not high-school trig (e.g. time-dependent models like SDEs or Bayesian inference with intractable marginals).


I'm not talking in generalities. I'm talking about the standard inner product space on R^n that most people use (either implicitly or explicitly) every day.

You are right that the normal distribution is not "the most continuous" because of the CLT. That was a typo. I meant "most important continuous" distribution.

Yes, other exponential-family distributions are convenient to work with, but the Gaussian maximum-likelihood estimate has a closed-form expression defined with just linear algebra. That's really damn nice in my opinion.

The point is that distance (in most real-world applications) is equivalent to the length of the hypotenuse of a right triangle in Euclidean space. So in my opinion, yes, high-school trigonometry governs a lot of advanced mathematics, including in statistics and machine learning.

Call it numerology if you want, I'm just trying to get people excited about math here.


When I was in the middle of the first paragraph I was prompted to sign up for a newsletter, so I gave up reading.


Hacker News has been going downhill for some time now, and articles like this are perfect examples. This article:

- Lacks a snippet of the data in question

- Lacks proper notation ("Vector(A) = [5, 0, 2]"? never seen anything like this)

- Has grammar and formatting mistakes everywhere, even failing to properly copy-paste Wikipedia's end quotes

- Uses a batteries-included sklearn implementation which doesn't go into any interesting details


There are already plenty of blog posts explaining the same topic (e.g., this 2013 post: https://janav.wordpress.com/2013/10/27/tf-idf-and-cosine-sim...). I don't know why people have the urge to write a blog post about simple things. I don't think there is anything new in this post that I can't already find elsewhere.

This is just stupid: people writing about stuff that has been around for some time as if it is new. I can't believe this link is on the front page of HN. Sorry if it seemed harsh, but somebody had to say it.


Your comment is ridiculous on its face. The reduction of your argument is that every topic should be written about only once. Well, thankfully, that's not the case. There is a saying that the best way to learn something is to teach it, and that is what writing a blog post is: teaching. We are all better off having multiple voices describe the same thing in multiple ways. I hope to see your writing on simple things one day too.

Whether or not this specific article should be on the front page of HN is a different question.


My point is: what's the value added by this article? Is it the ease of explanation? The numerical example? The code?

I can find plenty of blog posts which do a better job on all these criteria with a simple Google search.

I don't see any value added with this post.

Maybe the author wrote it to teach whatever he has learnt, but it's not worthy of the HN front page.


The value add is not for people who have already seen the article from 2013 that you linked. It's for people who have never been exposed to this idea before.

You could have added value just as easily by saying something like "this article from 2013 does a good job of explaining the same topic: ..."

But you chose to be rude.



