According to its original developers, OpenNLP is an organizational center for open source projects related to natural language processing. Its primary role is to encourage and facilitate the collaboration of researchers and developers on such projects.

Personally, I was most interested in one of its packages, the sentence detector. On the Web you can find many open source tokenizers and parsers that work on a single sentence, but not a sentence detector that splits a given paragraph into sentences. In fact, sentence detection cannot be done reliably with a simple regular expression, because there are many confusing cases. Even so, I first tried to devise a regular expression, because of its simplicity, to parse Wiki texts stored in a database. For instance:

(?:^|\\s|\\/)?([A-Z0-9\\'\\\"\\(\\{].+?[\\.\\?\\!]{1}(?:\\]|\\\"|\\'|\\}|^\\s)?)(?:$)?

The regular expression above (escaped for use in a C++ string) works well in most cases, but not for some sentences that include special characters such as quotes. So I searched Google again and found OpenNLP, which works well for me. On this page I explain how to set it up for your own use.
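To make the limitation concrete, here is a minimal sketch in Java that applies the same expression to a short paragraph. The class name RegexSplitDemo and the sample text are made up for illustration, and the abbreviation case is my own choice of a well-known trouble spot, not one of the cases I hit with the Wiki texts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of OpenNLP.
class RegexSplitDemo {
    // The expression from this page, copied unchanged as a Java string literal.
    static final Pattern SENTENCE = Pattern.compile(
        "(?:^|\\s|\\/)?([A-Z0-9\\'\\\"\\(\\{].+?[\\.\\?\\!]{1}(?:\\]|\\\"|\\'|\\}|^\\s)?)(?:$)?");

    // Collects capture group 1 (the candidate sentence) of every match.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<String>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            sentences.add(m.group(1));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Two sentences, but the period after "Dr" also looks like a sentence end.
        for (String s : split("Dr. Smith arrived. He was late.")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

The paragraph has two sentences, yet the expression returns three pieces ("Dr.", "Smith arrived." and "He was late."), because it cannot tell an abbreviation period from a sentence-ending one.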

I will limit the working environment to my own case: Windows XP. If you are as lucky as I was, it won't take more than an hour to see the first result. I got most of the ideas for the steps below from here.

Things to download

  • If you do not have Java, download the JDK first from here. Be sure to download the first item (e.g. JDK 6 Update 3).
  • Get Ant, which is needed to compile OpenNLP, from here. You do not need to know the details of Ant, but if you are interested, visit [1].
  • Now download OpenNLP. Get the most recent OpenNLP Tools from here.
  • You also need the language models for the tools in OpenNLP. Download everything from here. When doing so, create the same subdirectories as on the source web site.

Things to compile

  1. Install the JDK. Then change your environment variable settings in the Windows System control panel: add JAVA_HOME and set it to the JDK home directory (e.g. c:\program files\java\jdk1.6.0_03). Alternatively, you can set it each time you run OpenNLP by typing the following at the CMD console:
C:\research\program\library\OpenNLP\opennlp-tools-1.3.0>SET JAVA_HOME=C:\PROGRAM FILES\JAVA\JDK1.6.0_03
  2. Compile OpenNLP: the original OpenNLP package does not come with compiled jar and class files, so you need to run Ant to build those binaries. Unzip Ant and OpenNLP, each into its own directory, and run Ant from the OpenNLP directory, for instance:
C:\ANT\ant.exe 

That is it. Ant will automatically read build.xml and compile everything into the output directory.

Testing OpenNLP

  1. To run OpenNLP, you again need to modify an environment variable. This time it is the CLASSPATH variable. Adapt the examples below to your environment.
    • SET CLASSPATH=%CLASSPATH%;C:\research\program\library\OpenNLP\opennlp-tools-1.3.0\output\opennlp-tools-1.3.0.jar
    • SET CLASSPATH=%CLASSPATH%;C:\research\program\library\OpenNLP\opennlp-tools-1.3.0\lib\trove.jar
    • SET CLASSPATH=%CLASSPATH%;C:\research\program\library\OpenNLP\opennlp-tools-1.3.0\lib\maxent-2.4.0.jar
    • SET CLASSPATH=%CLASSPATH%;C:\research\program\library\OpenNLP\opennlp-tools-1.3.0\lib\jwnl-1.3.3.jar
  2. OK, you are all set. Try the command below, for instance. It assumes that the model files are under the opennlp.models subdirectory. After you run the command, type in sample sentences and press Ctrl-c to see the results.
java opennlp.tools.lang.english.SentenceDetector opennlp.models/english/sentdetect/EnglishSD.bin.gz
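If the command cannot find the OpenNLP classes, a mistyped CLASSPATH entry is a likely cause. The following is a small sketch, not part of OpenNLP, that lists class path entries that do not exist on disk. The class name ClassPathCheck and its helper method are hypothetical names of my own.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of OpenNLP.
class ClassPathCheck {
    // Returns every entry of the given class path string that does not exist on disk.
    static List<String> missingEntries(String classPath) {
        List<String> missing = new ArrayList<String>();
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        for (String entry : classPath.split(File.pathSeparator)) {
            if (entry.isEmpty()) {
                continue; // ignore empty entries such as those left by ";;"
            }
            if (!new File(entry).exists()) {
                missing.add(entry);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // java.class.path holds the class path the JVM was actually started with.
        String classPath = System.getProperty("java.class.path");
        for (String entry : missingEntries(classPath)) {
            System.out.println("Missing: " + entry);
        }
    }
}
```

Compile it with javac and run it from the same console in which you typed the SET commands. Any jar it reports as missing points to a path that needs correcting.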

Automating the process

See the batch file example below. It takes the input text file as its first argument and wraps each line of the sentence detector's output in sentence tags inside a paragraph element. sed is a Unix utility that applies textual transformations to a sequential stream of data. Download sed for Win32 here.

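@echo off
REM Usage: pass the input text file as the first argument.
REM The tag is echoed inside quotes so the angle brackets are not treated as redirection; sed then removes the quotes.
echo "<paragraph>" | sed "s/\"//g"
REM Run the sentence detector on the input file and wrap each output line in sentence tags.
java opennlp.tools.lang.english.SentenceDetector opennlp.models/english/sentdetect/EnglishSD.bin.gz < %1 | sed "s/^\s*/\t<sentence>/" | sed "s/\s*$/<\/sentence>/"
echo "</paragraph>" | sed "s/\"//g"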
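If you would rather not install sed, the same post-processing can be sketched in a few lines of Java. This is only an illustration under my own assumptions: the class name ParagraphWrapper is hypothetical, and unlike the sed version it skips blank lines instead of emitting empty sentence elements.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of OpenNLP.
class ParagraphWrapper {
    // Wraps each non-blank line in <sentence> tags inside one <paragraph> element.
    static String wrap(List<String> lines) {
        StringBuilder sb = new StringBuilder("<paragraph>\n");
        for (String line : lines) {
            String sentence = line.trim(); // drop leading and trailing whitespace, as the sed commands do
            if (sentence.isEmpty()) {
                continue; // deviation from the batch file: blank lines are skipped
            }
            sb.append("\t<sentence>").append(sentence).append("</sentence>\n");
        }
        sb.append("</paragraph>");
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Read the sentence detector's output, one line at a time, from standard input.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        List<String> lines = new ArrayList<String>();
        String line;
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
        System.out.println(wrap(lines));
    }
}
```

To use it, pipe the sentence detector's output into java ParagraphWrapper instead of the two sed commands. This assumes the directory containing the compiled ParagraphWrapper class is also on your CLASSPATH.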