Dawid Weiss
Stemming engine for Polish
http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml
An update... (September 27, 2006)

A major bug has been fixed in the FSA code. All new versions and the source code repository are now available as part of the Morfologik project.

An update... (16 August, 2006)

Added a Lametyzator constructor that accepts a custom dictionary stream, field delimiters and encoding. Added an option for building a stand-alone JAR that does not include the default Polish dictionary.

An update... (26 May, 2006)

Lametyzator has been updated and includes additional API methods needed for a cooperative effort with Marcin Miłkowski (Morfologik).

An update... (February 3rd, 2005)

Lametyzator has been updated: a new dictionary with more inflected forms is available. The API has also changed slightly (it was repackaged). An additional bonus is the tight (but optional) integration with Stempel. If you put Stempel on the CLASSPATH, Lametyzator becomes a hybrid stemmer (also known as Stempelator -- see this technical report for an overview).

An update... (August 7th, 2004)

There is another freely available Polish stemmer -- this one is algorithmic (so it can stem words it doesn't know about). Please check out Andrzej Bialecki's Stempel project -- it is definitely worth a look.

A free stemming engine for the Polish language

A stemmer is usually an algorithm, or more generally a method, for reducing an inflected form of a word to its base form. For instance, English "coincidental" would be converted to "coincidence". While there are several such algorithms for English (Google for "Porter stemmer" or "Lovins stemmer"), there is a significant lack of such methods for Polish. A couple of excellent stemmers exist, but they are commercial. Because stemming usually improves the quality of text mining methods, it is a pity that Polish researchers are left without any to experiment with.

I decided to create this project to fill this gap somehow. It is not an algorithm, as in the case of Porter's stemmer, but a dictionary method. I used existing ispell dictionaries and inflection rules to generate pairs: inflected_form -> base_form. Then I created a very efficient representation of such pairs as a finite state automaton, using Jan Daciuk's FSA package.
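The pair-generation step can be sketched as follows. This is only an illustration under stated assumptions: the rule table, the flag names "A" and "B", and the helper functions are hypothetical, and the rules are toy suffix substitutions, not the real ispell affix syntax (which also carries conditions on the word ending).

```python
# Hypothetical, simplified suffix rules keyed by flag:
# each rule is (suffix to strip from the base form, suffix to append).
# This is NOT the ispell affix file format, only a toy stand-in for it.
RULES = {
    "A": [("a", "i"), ("a", "ę"), ("a", "ą")],  # toy rules for nouns ending in -a
    "B": [("", "u"), ("", "em"), ("", "e")],    # toy rules for consonant-final nouns
}


def expand(base, flags):
    """Yield (inflected_form, base_form) pairs for one dictionary entry."""
    yield base, base  # the base form maps to itself
    for flag in flags:
        for strip, add in RULES[flag]:
            if base.endswith(strip):
                stem = base[: len(base) - len(strip)]
                yield stem + add, base


def build_pairs(entries):
    """Expand every (base_form, flags) entry into a sorted list of unique pairs."""
    pairs = set()
    for base, flags in entries:
        pairs.update(expand(base, flags))
    return sorted(pairs)


pairs = build_pairs([("cela", "A"), ("cel", "B")])
for inflected, base in pairs:
    print(inflected, "->", base)
```

In the real project the resulting list of pairs is what gets compiled into the automaton; here it is simply printed.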

Stemming is done by traversing the automaton, which is very efficient, but of course has some drawbacks. The most important drawback is that words outside the dictionary are not stemmed at all. I would like to extend the stemmer somehow, so that it would accept proper names and words outside the dictionary, but this is not easy considering the complex Polish inflection rules.
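A minimal sketch of lookup by traversal, assuming a plain (unminimized) trie rather than the compressed automaton produced by Daciuk's FSA package. The class and function names are hypothetical. It shows the two properties described above: a lookup follows one transition per character, and a word that is not in the dictionary yields no result.

```python
class Node:
    """One state of the trie."""

    def __init__(self):
        self.children = {}  # character -> Node
        self.bases = []     # base forms of the word that ends in this state


def build_trie(pairs):
    """Insert every (inflected_form, base_form) pair into a trie."""
    root = Node()
    for inflected, base in pairs:
        node = root
        for ch in inflected:
            node = node.children.setdefault(ch, Node())
        if base not in node.bases:
            node.bases.append(base)
    return root


def stem(root, word):
    """Return the known base forms of `word`, or [] if it is not in the dictionary."""
    node = root
    for ch in word:  # at most len(word) transitions
        node = node.children.get(ch)
        if node is None:
            return []  # fell off the trie: unknown word
    return list(node.bases)


root = build_trie([("celu", "cel"), ("celem", "cel"), ("celę", "cela")])
print(stem(root, "celem"))
print(stem(root, "komputerem"))
```

A minimized automaton additionally shares common suffixes between words, which is where most of the compression comes from; the trie above shares only prefixes.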

The idea

The idea behind my Polish stemmer is very basic. Let's suppose we have a term A and we are looking for its base form B. Now, if we only had a mapping of all terms TERM->BASE FORM, we would be set.

Problems and solutions

Despite the simplicity of the above idea, it brings a number of new problems to solve:

  1. How to store the mapping A->B, if we know it can contain thousands, or even millions of pairs?

    I used a deterministic automaton for storing the mapping. More information about automata, and an implementation that builds one from a set of words, can be found on Jan Daciuk's web site.

    My work in this area consisted of writing a Java interpreter for the dictionaries produced by Daciuk's FSA package. The entire dictionary is compressed from 44 MB to only 1.5 MB.

  2. How to make use of the mapping efficiently?

    The data structure used for storing the mapping (a deterministic automaton) can be traversed very efficiently. In general, it takes at most N steps to determine whether a word is in the dictionary, where N is the number of characters in the word.

  3. Where to take the mappings from (a dictionary of inflected forms)?

    This is a difficult problem. Most dictionaries are commercial and cannot be distributed or used in open algorithms or products. It should be stated that a simple collection of words is not enough; the core problem lies in finding the mapping between an inflected form and its base form. Not many dictionaries, even commercial ones, provide this information.

    I decided to use the inflection rules provided with the Polish dictionaries of ispell. These rules describe how to convert the base form of a term into a set of inflected terms. In ispell this information is used mainly for spell checking; I use it to generate the mappings mentioned above.

  4. When will the dictionary contain all the possible terms in Polish?

    Never. A language is a living thing that constantly evolves: new words are added, and inflection rules sometimes change. The most difficult problem for the dictionary is identifying proper names: people's names, everyday objects, etc. These terms also have complex inflection in Polish, and most of them are not included in ispell's dictionaries.

    Zipf's law implies that even a very large dictionary will cover only part of the vocabulary of an arbitrary text. For instance, in the Rzeczpospolita corpus, comprising about 87 million words (884 thousand unique terms), as many as 613 thousand terms occur fewer than 5 times. In other words, 69% of the unique terms in the corpus are very low-frequency terms.

  5. Is there a solution for terms having more than one base form? For example, the term 'celi' may be an inflection of the base term 'cela' (prison cell) or 'cel' (target).

    Currently the dictionary returns all possible (and known) base forms of a word. Choosing the right base form is not possible without understanding the meaning of the inflected form. Stemmers usually limit their functionality to returning all possible base forms, which are then fed to a syntactic analyzer.
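The 69% figure quoted under problem 4 is the ratio of rare terms to unique terms, computed from the counts given there. The short calculation below reproduces it and also derives an upper bound on how much running text those rare terms can account for, assuming only that "less than 5 times" means at most 4 occurrences each. The variable names are illustrative, and the corpus size is the approximate figure from the text.

```python
corpus_words = 87_000_000  # approximate corpus size, as quoted in the text
unique_terms = 884_000     # distinct terms in the corpus
rare_terms = 613_000       # terms occurring fewer than 5 times

# Share of the vocabulary (unique terms) that is rare.
rare_vocabulary_share = rare_terms / unique_terms

# Each rare term occurs at most 4 times, so this bounds its share of running text.
max_rare_tokens = rare_terms * 4
max_rare_token_share = max_rare_tokens / corpus_words

print(f"rare share of unique terms: {rare_vocabulary_share:.1%}")         # 69.3%
print(f"rare share of running words: at most {max_rare_token_share:.1%}")  # 2.8%
```

So the rare terms dominate the vocabulary while contributing only a small fraction of the running words; a dictionary can miss most of the distinct terms and still recognize most of the tokens in a text.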

As one can clearly see, my stemmer has many drawbacks. I think its main advantage is the fact that it exists. There are many stemming engines much more sophisticated than mine, created by people much more knowledgeable in the Polish language than myself, but none of them is, to my knowledge, available free of charge. This is understandable, since Polish stemmers usually include quite complex dictionaries, which take a great deal of effort to build. Scientists have the right to sell them. My stemmer can be used as a cheap alternative, for student research, or in proof-of-concept projects.

Demo

Due to low interest, the on-line demo is no longer available.

Download

Lametyzator/Stempelator and the FSA code are now part of the Morfologik project.

The stemmer is available free of charge now and so it will be in the future. Please follow these guidelines, however:

  1. Please notify me by e-mail if you have downloaded or used the program. My e-mail address is: dawid.weiss@cs.put.poznan.pl. Nothing is as motivating as feedback from real users.
  2. Commercial use note: Jan Daciuk's FSA package is not free. The dictionaries I produced with the FSA package are freely distributable; however, if you want to build your own automata, please contact Jan Daciuk for permission.
  3. Please put a reference to my name in your product if you used my stemmer. I would also suggest putting Jan Daciuk's name there and a reference to the Polish ispell dictionary.

To-dos (future plans)

I don't claim ownership of the ideas below. If you solve them or decide to work on them, please let me know so that I can use your results.

  1. Perform experiments on real Polish texts to check how effective the stemmer is.
  2. Check the coverage of the stemmer on a language corpus.
  3. Add new words to the mapping (and to ispell, maybe).
  4. Add an option for fuzzy analysis of terms based on suffixes present in the dictionary. I have already written such a quasi-stemmer, and it worked quite well in Carrot. If added to this stemmer, proper names could be recognized more effectively.
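
The suffix-based idea in the last item could look roughly like the sketch below. It is not the quasi-stemmer used in Carrot, whose details are not described here; the functions, the `max_len` parameter and the longest-suffix-first strategy are all assumptions made for illustration. It learns (strip, add) suffix transformations from known pairs and applies them to words the dictionary does not contain.

```python
def build_suffix_table(pairs, max_len=3):
    """Map each inflected-form suffix to the base-form endings seen with it."""
    table = {}
    for inflected, base in pairs:
        # Length of the common prefix of the inflected and base forms.
        i = 0
        while i < min(len(inflected), len(base)) and inflected[i] == base[i]:
            i += 1
        strip, add = inflected[i:], base[i:]
        if 0 < len(strip) <= max_len:
            table.setdefault(strip, set()).add(add)
    return table


def guess_bases(table, word, max_len=3):
    """Guess base forms of an unknown word, trying the longest suffix first."""
    for n in range(min(max_len, len(word) - 1), 0, -1):
        suffix = word[-n:]
        if suffix in table:
            stem = word[:-n]  # never empty, because n <= len(word) - 1
            return sorted(stem + add for add in table[suffix])
    return []  # no known suffix matches


known_pairs = [("celi", "cela"), ("celę", "cela"), ("celu", "cel"), ("celem", "cel")]
table = build_suffix_table(known_pairs)
print(guess_bases(table, "Adamem"))
print(guess_bases(table, "xyz"))
```

Such guesses are only heuristics: a suffix learned from one inflection class is applied blindly to words of other classes, so the output would need filtering or ranking in practice.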

Acknowledgements

I would like to thank the following individuals for help, resources or both:

Known bugs

No software is ideal (maybe with the exception of Mr. Knuth's software :). The known limitations are listed below; please let me know if you think anything else doesn't work as expected.

  1. [Mirosław Prywata] Some prefixes in ispell create the negation of a term, like społeczny -> a-społeczny.
  2. [Mirosław Prywata] The K rule in ispell creates a word of a different category, for instance: beton->betoniarka (concrete->concrete mixer).

(c) Dawid Weiss. All rights reserved unless stated otherwise.