HOWTO

 
 
Warning! This HOWTO is now out of date with respect to the maxent implementation, though it should still be helpful for newcomers because it explains a bit about how to use the maxent framework as well as the OpenNLP Maxent implementation. To mention just one of the implementation differences briefly: have a look at the EventStream interface and consider using it instead of the EventCollector class. The opennlp.grok.preprocess.sentdetect package discussed in this HOWTO has been updated to work with maxent 1.2.0 as of Grok version 0.5.2, so you can download Grok and have a look at that to see what is different. I'll see if I can update this document sometime in the near future, but it isn't a high priority just now. If you have any questions, do not hesitate to post them on the help forum.
                                                          Jason, 2001 October 29

We've tried to make it fairly easy to build and use maxent models, but you need two things to start with: 1) an understanding of feature selection for maxent modeling, and 2) Java skills or the ability to read some example Java code and turn it into what you need.  I'll write a very basic summary of what goes on with feature selection.  For more details, refer to some of the papers mentioned here.

Features in maxent are functions from outcomes (classes) and contexts to true or false.  To take an example from Adwait Ratnaparkhi's part of speech tagger, a useful feature might be:

    feature(outcome, context) = { 1   if outcome = DETERMINER
                                {        && currentword(context) = "that"
                                { 0   otherwise

Your job, as a person creating a model of a classification task, is to select the features that will be useful in making decisions.  One thing to keep in mind, especially if you are reading any papers on maxent, is that the theoretical representation of these features is not the same as how they are represented in the implementation.  (Actually, you really don't need to know the theoretical side to start selecting features with opennlp.maxent.) If you are familiar with feature selection for Adwait Ratnaparkhi's maxent implementation, you should have no problems, since our implementation uses features in the same manner as his.  Basically, features like the example above are reduced, for your purposes, to the contextual predicate portion of the feature, i.e. currentword(context)="that" (in the implementation this will further reduce to "current=that" or even just "that").  From this point on, I'll forget theory and discuss features from the perspective of the implementation.  For correctness, though, I'll point out that whenever I say feature, I am actually talking about a contextual predicate which will expand into several features (however, this is entirely hidden from the user, so don't worry if you don't understand it).

So, say you want to implement a program which uses maxent to find names in a text, such as:

He succeeds Terrence D. Daniels, formerly a W.R. Grace vice chairman, who resigned.

If you are currently looking at the word Terrence and are trying to decide if it is a name or not, examples of the kinds of features you might use are "previous=succeeds", "current=Terrence", "next=D.", and "currentWordIsCapitalized".  You might even add a feature that says that "Terrence" was seen as a name before.

Here's how this information translates into the implementation.  Let's assume that you already have a trained model for name finding available, that you have created an instance of the MaxentModel interface using that model, and that you are currently looking at Terrence in the example sentence above.  To ask the model whether it believes that Terrence is a name or not, you send a String[] with all of the features (such as those discussed above) to the model by calling the method:

public double[] eval(String[] context);

The double[] which you get back will contain the probabilities of the various outcomes which the model has assigned based on the features which you sent it.  The indexes of the double[] are actually paired with outcomes.  For example, the outcomes associated with the probabilities might be "TRUE" for index 0 and "FALSE" for index 1.  To find the String name of the outcome associated with a particular index, call the method:

public String getOutcome(int i);

Also, if you have gotten back a double[] after calling eval and are interested only in the outcome which the model assigns the highest probability, you can call the method:

public String getBestOutcome(double[] outcomes);

And this will return the String name of that most likely outcome.
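
To put those pieces together, here is a small sketch of how a query against a trained name-finding model might look.  (The "model" variable is assumed to be a MaxentModel instance, and the feature strings are just the illustrative ones from above.)

 String[] context = {"previous=succeeds", "current=Terrence", "next=D.",
                     "currentWordIsCapitalized"};
 double[] probs = model.eval(context);
 System.out.println("Best outcome: " + model.getBestOutcome(probs));
 // Print every outcome along with the probability the model assigned to it.
 for (int i=0; i<probs.length; i++)
     System.out.println(model.getOutcome(i) + " " + probs[i]);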
 

In order to make the context collection process nicely modularized, you need to implement the ContextGenerator interface:

public interface ContextGenerator {

    /**
     * Builds up the list of contextual predicates given an Object.
     */
    public String[] getContext(Object o);

}

In Grok, the Object that we usually pass is an opennlp.common.util.Pair which contains a StringBuffer and the Integer index of the position we are currently at in the StringBuffer.  However, you can pass whatever Object you like as long as your implementation of ContextGenerator can deal with it and produce a String[] with all of the relevant features (contextual predicates) in it.  An example is given below, from the opennlp.grok.preprocess.sentdetect.SDContextGenerator implementation of the opennlp.maxent.ContextGenerator interface.

 /**
  * Builds up the list of features, anchored around a position within the
  * StringBuffer.
  */
 public String[] getContext(Object o) {
     StringBuffer sb = (StringBuffer)((Pair)o).a;
     int position = ((Integer)((Pair)o).b).intValue();

     int lastIndex = sb.length()-1;

     int prefixStart = PerlHelp.previousSpaceIndex(sb, position);
     int prevStart = PerlHelp.previousSpaceIndex(sb, prefixStart);

     int suffixEnd = PerlHelp.nextSpaceIndex(sb, position, lastIndex);
     int nextEnd = PerlHelp.nextSpaceIndex(sb, suffixEnd, lastIndex);

     String prefix, previous, suffix, next;

     prefix = sb.substring(prefixStart, position).trim();

     previous = sb.substring(prevStart, prefixStart).trim();

     if (position == lastIndex) {
         suffix = "";
         next = "";
     } else {
         suffix = sb.substring(position+1,suffixEnd).trim();
         next = sb.substring(suffixEnd, nextEnd).trim();
     }

     ArrayList collectFeats = new ArrayList();
     if (!prefix.equals(""))   collectFeats.add("x="+prefix);
     if (PerlHelp.capRE.isMatch(prefix)) collectFeats.add("xcap");
     if (!previous.equals("")) collectFeats.add("v="+previous);
     if (!suffix.equals(""))   collectFeats.add("s="+suffix);
     if (!next.equals(""))     collectFeats.add("n="+next);

     String[] context = new String[collectFeats.size()];
     for (int i=0; i<collectFeats.size(); i++)
         context[i] = (String)collectFeats.get(i);

     return context;
}

Basically, it just runs around the StringBuffer collecting features that we thought would be useful for the end-of-sentence detection task.
You might notice some odd things such as "v=" and "n=" --- these are just abbreviations for "previous" and "next".  It is a good idea to use abbreviations for such features since they are generated from the data: when you train your model, there may be several thousand features of the form "previous=X", where X is the word preceding a possible sentence-ending punctuation mark in the training data.  All of these feature names must eventually be saved to disk, and if you use, for example, "v" instead of "previous", you'll save a significant amount of disk space.

The SDContextGenerator and the sentence detection model are then used by the sentDetect method in opennlp.grok.preprocess.sentdetect.SentenceDetectorME as follows (the ContextGenerator instance has the name "cgen"):

 // Note: "probs" (a double[]) and "sents" (a list of sentences) are fields declared elsewhere in SentenceDetectorME.
 public String[] sentDetect(String s) {
     StringBuffer sb = new StringBuffer(s);
     REMatch[] enders = PerlHelp.peqRE.getAllMatches(sb);

     int index = 0;
     String sent;
     for (int i=0; i<enders.length; i++) {
         int j = enders[i].getStartIndex();
         probs = model.eval(cgen.getContext(new Pair(sb,new Integer(j))));
         if (model.getBestOutcome(probs).equals("T")) {
              sent = sb.substring(index, j+1).trim();
              if (sent.length() > 0) sents.add(sent);
              index=j+1;
         }
     }

     if (index < sb.length()) {
         sent = sb.substring(index).trim();
         if (sent.length() > 0) sents.add(sent);
     }

     String[] sentSA = new String[sents.size()];
     for (int i=0; i<sents.size(); i++)
         sentSA[i] = ((String)sents.get(i)).trim();
     sents.clear();
     return sentSA;
}
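
Once a sentence detection model has been trained, using the detector from application code is straightforward.  The following is just an illustrative sketch: it assumes that opennlp.grok.preprocess.sentdetect.EnglishSentenceDetectorME (mentioned again near the end of this HOWTO) extends SentenceDetectorME and can be constructed with no arguments so that it loads its pre-trained English model --- check that class for the actual constructors.

 SentenceDetectorME sdetector = new EnglishSentenceDetectorME();  // assumed no-arg constructor
 String[] sents = sdetector.sentDetect("There once was a man.  He lived happily.");
 // sents should now contain the two sentences as separate strings.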

So that is basically what you need to know to use models! Now, how do you train a new model?  For this, you'll want to implement the EventCollector interface:

public interface EventCollector {
    public Event[] getEvents();
    public Event[] getEvents(boolean evalMode);
}

A class which implements EventCollector should take the data (which it is organizing into events) as an argument to its constructor.  For most packages in opennlp.grok.preprocess, we use java.io.Reader objects, as the following segment of opennlp.grok.preprocess.sentdetect.SDEventCollector shows:
 
public class SDEventCollector implements EventCollector {
    private ContextGenerator cg = new SDContextGenerator();
    private BufferedReader br;
 
    public SDEventCollector(Reader data) {
         br = new BufferedReader(data);
    }
            ...

The getEvents methods required by the interface are then implemented as follows:

 public Event[] getEvents() {
     return getEvents(false);
 }

 public Event[] getEvents(boolean evalMode) {
     ArrayList elist = new ArrayList();
     int numMatches;

     try {
         String s = br.readLine();
         while (s != null) {
             StringBuffer sb = new StringBuffer(s);
             REMatch[] enders = PerlHelp.peqRE.getAllMatches(sb);
             numMatches = enders.length;
             for (int i=0; i<numMatches; i++) {
                 int j = enders[i].getStartIndex();

                 Event e;
                 String[] context =
                     cg.getContext(new Pair(sb, new Integer(j)));

                 if (i == numMatches-1) {
                     e = new Event("T", context);
                 } else {
                     e = new Event("F", context);
                 }

                 elist.add(e);
             }
             s = br.readLine();
         }
     } catch (Exception e) { e.printStackTrace(); }

     Event[] events = new Event[elist.size()];
     for (int i=0; i<events.length; i++)
         events[i] = (Event)elist.get(i);

     return events;
 }

Basically, this just walks through the data, asks the ContextGenerator for contexts, and attaches an outcome to each one to create an opennlp.maxent.Event object.  Notice that we ignore the boolean evalMode in this implementation; this is because the SentenceDetectorME has not yet been set up for the nice automatic evaluation made possible by the Evalable interface and the TrainEval class.  See the opennlp.grok.preprocess.namefind and opennlp.grok.preprocess.postag packages for examples which take advantage of the evaluation code.

Once you have both your ContextGenerator and EventCollector implementations as well as your training data in hand, you can train up a model.  opennlp.maxent has an implementation of Generalized Iterative Scaling (opennlp.maxent.GIS) which you can use for this purpose.  Write some code somewhere to make a call to the method GIS.trainModel, which will ultimately save a model in a location which you have specified.

public static void trainModel(String modelpath, String modelname, DataIndexer di, int iterations) { ... }

The modelpath is the directory where you want the model saved, the modelname is whatever you want to call the model, and iterations is the number of times the training procedure should iterate when finding the model's parameters.  You shouldn't need more than 100 iterations, and when you are first trying to create your model, you'll probably want to use fewer so that you can iron out problems without waiting for all those iterations each time, which can take quite a while depending on the task.  The DataIndexer is an object that pulls in all those events that your EventCollector has gathered and then manipulates them into a format that is much more efficient for the training procedure to work with.  There is nothing complicated here --- you just need to create a DataIndexer with the events and an integer that is the cutoff for the number of times a feature must have been seen in order to be considered in the model.

public DataIndexer(Event[] events, int cutoff){ ... }

You can also call the constructor DataIndexer(Event[] events), which assumes a cutoff of 0.  An example of code which does all of these steps to create a model follows (from opennlp.grok.preprocess.sentdetect.SentenceDetectorME):

public static void main(String[] args) {
     try {
         FileReader datafr = new FileReader(new File(args[0]));
         String outdir = args[1];
         String modelname = args[2];
         DataIndexer di = new DataIndexer(new SDEventCollector(datafr).getEvents(), 3);
         GIS.trainModel(outdir, modelname, di, 100);
     } catch (Exception e) {
         e.printStackTrace();
     }
}

Once the training is done, GIS dumps the model out as two files, one containing the model's parameters in binary format and the other containing information such as the different outcomes, the outcomes which have been associated with particular features, and the features themselves.  They are saved (automatically gzipped) with the names modelname.mep.gz and modelname.mei.gz, respectively ("mep" for maxent parameters and "mei" for maxent info).

Now that you have your models dumped to disk, you can create an instance of opennlp.maxent.MaxentModel by calling the constructor GISModel(String modellocation, String modelname), which assumes that the two files for the model are gzipped with the mei and mep suffixes that GIS saved them with.  So if you had just trained a model called "MyClassificationTask" which is saved in the directory /myproject/classify/ (as the files MyClassificationTask.mep.gz and MyClassificationTask.mei.gz), you would create your model instance by calling GISModel("/myproject/classify/", "MyClassificationTask").  Make sure that you have the trailing directory separator '/' on the location.  (Note: we use a Unix example here, but it should work for other OS types as well.)  Alternatively, you can create the model by using the constructor which takes InputStreams for the parameters and info files: GISModel(InputStream modelinfo, InputStream modelparams).  See opennlp.grok.preprocess.sentdetect.EnglishSentenceDetectorME for an example of this.
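
For instance, loading the example model above and getting it ready for eval calls would look something like the following sketch (the directory and model name are just the illustrative ones from this paragraph):

 // Load the two gzipped model files from /myproject/classify/ and wrap them as a MaxentModel.
 MaxentModel model =
     new GISModel("/myproject/classify/", "MyClassificationTask");
 // The model can now be queried with eval(String[]) as described earlier in this HOWTO.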

That's it! Hopefully, with this little HOWTO and the example implementations available in opennlp.grok.preprocess, you'll be able to get maxent models up and running without too much difficulty.  Please let me know if any parts of this HOWTO are particularly confusing and I'll try to make things more clear.  I would also welcome "patches" to this document if you feel like making changes yourself.  Also, feel free to take the opennlp.grok.preprocess implementations of ContextGenerator and EventCollector and modify and use them for your own purposes (they are LGPL'ed).


Email: jmb@cogsci.ed.ac.uk
2001 October 29