Introduction
SketchSortJ (1,2) is software for all-pairs similarity search. It takes data points as input and outputs approximate neighbor pairs whose Jaccard distance (1.0 - Jaccard similarity) is within a given threshold.
First, the input data points are mapped to sketches by minwise independent permutations, also called minhash. Neighbor pairs of sketches within a given Hamming distance are then enumerated by the multiple sorting method (3). Finally, the Jaccard distance is calculated for each such neighbor pair, and the pair is output if its Jaccard distance is no more than a user-specified threshold. Because the sketching step is approximate, some true neighbor pairs can be missed. A theoretical bound on the expected missing edge ratio is derived, which makes it possible to set the parameters so that the empirical missing edge ratio is kept as small as possible.
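The pipeline described above can be illustrated with a small Python sketch. This is not SketchSortJ's implementation: the function names, the linear hash family, and the parameters (num_hashes, max_hamming, threshold) are assumptions made for illustration, and a brute-force comparison of all sketch pairs stands in for the multiple sorting method, which is what makes the real tool fast.

```python
import random
from itertools import combinations

# Mersenne prime 2^31 - 1; item ids are assumed to be smaller than this.
PRIME = 2_147_483_647


def jaccard_distance(a, b):
    """Exact Jaccard distance between two sets: 1.0 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def make_hashes(num_hashes, seed=0):
    """Draw (a, b) coefficients for hash functions h(x) = (a*x + b) mod PRIME.

    These approximate minwise independent permutations (an assumption of
    this sketch, not necessarily the family used by SketchSortJ).
    """
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_sketch(items, hashes):
    """Map a non-empty set of integer ids to a sketch: one minimum per hash."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hashes]


def hamming(s, t):
    """Number of positions at which two equal-length sketches differ."""
    return sum(u != v for u, v in zip(s, t))


def neighbor_pairs(points, threshold, max_hamming, hashes):
    """Return (i, j, distance) for pairs that pass both filters.

    Step 1: sketch every point.
    Step 2: keep pairs whose sketches are within max_hamming
            (brute force here; the real tool uses multiple sorting).
    Step 3: verify candidates with the exact Jaccard distance.
    """
    sketches = [minhash_sketch(p, hashes) for p in points]
    result = []
    for i, j in combinations(range(len(points)), 2):
        if hamming(sketches[i], sketches[j]) > max_hamming:
            continue
        d = jaccard_distance(points[i], points[j])
        if d <= threshold:
            result.append((i, j, d))
    return result
```

For example, jaccard_distance({1, 2, 3}, {2, 3, 4}) is 0.5, because the two sets share 2 of 4 distinct items. Identical sets always receive identical sketches, so they pass the Hamming filter for any max_hamming; pairs that are close but not identical can be filtered out by chance, which is the source of the missing edges mentioned above.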
Quick Start
To compile SketchSort, type the following: