Resource Overview
Includes proper tokenization and models for very high accuracy sentence boundary detection (English only for now). The models are trained on Wall Street Journal news combined with the Brown Corpus, which together are intended to be widely representative of written English. Error rates on held-out news test data are near 0.25%.
This is the source code for the paper "Sentence Boundary Detection and the Problem with the U.S." appearing at NAACL 2009.
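The paper's title alludes to the core difficulty: a period after an abbreviation like "U.S." may or may not end a sentence, so simple punctuation rules misfire. A minimal sketch of such a naive regex splitter (not this package's actual method, which uses trained models) shows the failure mode:

```python
import re

def naive_split(text):
    # Split after '.', '!' or '?' when followed by whitespace and a
    # capital letter -- a common rule-of-thumb heuristic.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

# "U.S." is followed by a capitalized word, so the heuristic wrongly
# breaks the first sentence in two:
parts = naive_split("The U.S. President spoke. Everyone listened.")
print(parts)
# -> ['The U.S.', 'President spoke.', 'Everyone listened.']
```

Resolving this correctly requires context beyond the local punctuation, which is what the trained models in this package provide.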
Code written in Python.
Dan Gillick