Resource Overview
Search System for Giga-scale N-gram Corpus
SSGNC is a search system designed for N-gram corpora of around 100 GB. The first version was built for the Google N-gram Corpus, so SSGNC originally stood for Search System for Google N-gram Corpus. The system now works with other N-gram corpora as well, and the G currently stands for Giga-scale.
The system uses a kind of inverted index to find the specified N-grams. The index structure natively supports only a simple search that finds N-grams containing at least one of the given tokens, so the system adds filtering functions on top of it to find N-grams containing all the given tokens and to handle queries containing wildcards.
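The two-stage idea described above (an index lookup that returns N-grams containing any of the tokens, followed by a filter that keeps those containing all of them) can be illustrated with a small in-memory sketch. This is not SSGNC's actual implementation or API: the function names `build_index`, `search_any`, and `search_all`, the whitespace tokenization, and the toy corpus are all assumptions made for illustration, and wildcard handling is left out.

```python
from collections import defaultdict


def build_index(ngrams):
    """Map each token to the set of IDs of the N-grams that contain it."""
    index = defaultdict(set)
    for ngram_id, ngram in enumerate(ngrams):
        for token in ngram.split():  # assumption: tokens are whitespace-separated
            index[token].add(ngram_id)
    return index


def search_any(index, tokens):
    """Index lookup only: IDs of N-grams containing at least one token."""
    ids = set()
    for token in tokens:
        ids |= index.get(token, set())
    return ids


def search_all(index, ngrams, tokens):
    """Filtering step: keep only the candidates that contain every token."""
    wanted = set(tokens)
    return sorted(
        i for i in search_any(index, tokens)
        if wanted <= set(ngrams[i].split())
    )


# Hypothetical toy corpus, for illustration only.
ngrams = ["new york city", "york new", "new house", "old york"]
index = build_index(ngrams)

print(search_all(index, ngrams, ["new", "york"]))  # [0, 1]
```

In this sketch the lookup alone returns all four N-grams for the tokens "new" and "york", and the filter narrows the result to the two N-grams that contain both, regardless of token order.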
Search Features
The latest SSGNC can handle the following kinds of queries.
Unordered: Unordered boolean AND query
A query "A B" matches both "A B" and "B A". N-grams containing t