According to its original developers, OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects.
Personally, I was most interested in one of its packages, the sentence detector. On the Web you can find many open source tokenizers and parsers that work on a single sentence, but few sentence detectors that split a given paragraph into sentences. In fact, sentence detection cannot be done reliably with a simple regular expression; there are too many confusing cases. Even so, I first tried to devise a regular expression, because of its simplicity, to parse Wiki texts stored in a database. See for instance,
The above regular expression (written in C++ format) works well in most cases, but not for sentences that include special characters such as quotes. So I googled again and found OpenNLP, which works well for me. On this page I explain how to set it up for your own use.
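To make the problem with simple patterns concrete, here is a minimal sketch in Java. It is not the C++ expression mentioned above; the class name NaiveSentenceSplitter and the pattern are my own assumptions, chosen only to show how a "split after a period followed by whitespace" rule breaks on abbreviations and on sentences that end inside quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenNLP.
class NaiveSentenceSplitter {
    // Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    private static final Pattern BOUNDARY = Pattern.compile("(?<=[.!?])\\s+");

    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<>();
        for (String piece : BOUNDARY.split(paragraph.trim())) {
            if (!piece.isEmpty()) {
                sentences.add(piece);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Three real sentences, but the naive rule returns four pieces:
        // it splits after "Dr." and "p.m." and misses the boundary after
        // the closing quote, because '"' is not in the character class.
        String text = "Dr. Smith arrived at 5 p.m. yesterday. "
                + "He said \"Hello.\" Then he left.";
        for (String s : split(text)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

You can keep adding special cases to the pattern, but each fix tends to break another case, which is why a trained detector is attractive.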
I will limit the working environment to my own case: Windows XP. If you are as lucky as I was, it won't take more than an hour to see the first result. I got most of the ideas for the steps below from here.
C:\research\program\library\OpenNLP\opennlp-tools-1.3.0>SET JAVA_HOME=C:\PROGRAM FILES\JAVA\JDK1.6.0_03
That's it. Ant will automatically read build.xml and compile everything into the output directory.
java opennlp.tools.lang.english.SentenceDetector opennlp.models/english/sentdetect/EnglishSD.bin.gz
See the batch file example below. sed is a Unix utility that parses text files and applies textual transformations to a sequential stream of data. You can download sed for Win32 here.
@echo off
echo "<paragraph>" | sed "s/\"//g"
java opennlp.tools.lang.english.SentenceDetector opennlp.models/english/sentdetect/EnglishSD.bin.gz < %1 | sed "s/^\s*/\t<sentence>/" | sed "s/\s*$/<\/sentence>/"
echo "</paragraph>" | sed "s/\"//g"
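If you would rather not install sed, the wrapping step can also be done in a few lines of Java. The sketch below assumes the detector writes one sentence per line, which is what the sed pipeline above relies on. The class name ParagraphWrapper and its wrap method are hypothetical names of mine, not part of OpenNLP. One small difference from the batch file: blank lines are skipped rather than wrapped in an empty sentence tag.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of OpenNLP.
class ParagraphWrapper {
    // Wraps each non-blank line in <sentence> tags, indented by a tab,
    // and surrounds the whole output with <paragraph> tags.
    static String wrap(List<String> sentenceLines) {
        StringBuilder out = new StringBuilder("<paragraph>\n");
        for (String line : sentenceLines) {
            String sentence = line.trim(); // same effect as the two sed trims
            if (sentence.isEmpty()) {
                continue; // unlike the batch file, ignore blank lines
            }
            out.append("\t<sentence>").append(sentence).append("</sentence>\n");
        }
        out.append("</paragraph>\n");
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample lines standing in for the detector's output.
        List<String> lines = Arrays.asList("  This is one sentence.  ",
                                           "This is another.");
        System.out.print(wrap(lines));
    }
}
```

To use it in place of the sed calls, you could read the detector's output line by line from standard input and pass the collected lines to wrap.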