Semantic Similarity II: Probabilistic Measures
The problem with vector space based measures is that, aside from the cosine, they operate on binary data. The cosine, on the other hand, assumes a Euclidean space which is not well-motivated when dealing with word counts.
A better way of viewing word counts is by representing them as probability distributions.
Then we can compare two probability distributions using the following measures: KL Divergence, Information Radius (Irad) and L1 Norm.