viernes, 7 de septiembre de 2012


Data Mining

Locality Sensitive Hashing for Documents
Even though we can use minhashing to compress large documents into small
signatures and preserve the expected similarity of any pair of documents, it
still may be impossible to find the pairs with greatest similarity efficiently.
The reason is that the number of pairs of documents may be too large, even if there
are not too many documents.
Example : Suppose we have a million documents, and we use signatures
of length 250. Then we use 1000 bytes per document for the signatures, and
the entire data fits in a gigabyte – less than a typical main memory of a laptop.
However, there are 1,000,000 / 2  or half a trillion pairs of documents. If it takes a
microsecond to compute the similarity of two signatures, then it takes almost
six days to compute all the similarities on that laptop.
If our goal is to compute the similarity of every pair, there is nothing we
can do to reduce the work, although parallelism can reduce the elapsed time.
However, often we want only the most similar pairs or all pairs that are above
some lower bound in similarity. If so, then we need to focus our attention only
on pairs that are likely to be similar, without investigating every pair.
There is a general theory of how to provide such focus, called locality-sensitive hashing
(LSH) or near-neighbor search. In this section we shall consider a specific form
of LSH, designed for the particular problem we have been studying: documents,
represented by shingle-sets, then minhashed to short signatures. In Section 3.6
we present the general theory of locality-sensitive hashing and a number of
applications and related techniques.