Semantic Similarity II: Probabilistic Measures

The problem with vector space based measures is that, aside from the cosine, they operate on binary data. The cosine, on the other hand, assumes a Euclidean space which is not well-motivated when dealing with word counts.

A better way of viewing word counts is by representing them as probability distributions.

Then we can compare two probability distributions using the following measures: KL Divergence, Information Radius (Irad) and L1 Norm.

The problem with vector space based measures is that, aside from the cosine, they operate on binary data. The cosine, on the other hand, assumes a Euclidean space which is not well-motivated when dealing with word counts.