|
Dawid Weiss
Stemming engine for Polish
http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml

An update... (September 27, 2006)
A major bug has been fixed in the FSA code. All new versions and the source code
repository are now available as part of the Morfologik project.

An update... (16 August, 2006)
Added a Lametyzator constructor which allows a custom dictionary stream, field delimiters and encoding. Added an option for building a stand-alone JAR that does not include the default Polish dictionary.

An update... (26 May, 2006)
Lametyzator has been updated and includes additional API methods needed
for a cooperative effort with Marcin Miłkowski (Morfologik).

An update... (February 3rd, 2005)
Lametyzator has been updated: a new dictionary is available (more inflected
forms). Also, the API has changed a little bit (repackaged). An additional
bonus is the tight (but optional) integration with
Stempel.

An update... (August 7th, 2004)
There is another freely available Polish stemmer -- this one algorithmic (so
it can stem words it doesn't know about). Please check out Andrzej Bialecki's
Stempel project.

A free stemming engine for the Polish language

A stemmer is usually an algorithm, or more generally a method, for reducing an inflected form of a word to its base form. For instance, English "coincidental" would be converted to "coincidence". While there are several such algorithms for English (Google for "Porter stemmer" or "Lovins stemmer"), there is a significant lack of such methods for Polish. A couple of excellent stemmers exist, but they are commercial. Because stemming usually improves the quality of text mining methods, it is a pity Polish researchers are left without any to experiment with. I decided to create this project to fill this gap somehow.

It is not an algorithm, as in the case of Porter's stemmer, but a dictionary method. I used existing ispell dictionaries and inflection rules to generate pairs: inflected_form -> base_form. Then I created a very efficient representation of such pairs as a finite state automaton, using Jan Daciuk's FSA package. Stemming is done by traversing the automaton, which is very efficient, but of course has some drawbacks. The most important drawback is that none of the words outside the dictionary are stemmed. I would like to extend the stemmer somehow, so that it would accept proper names and words outside the dictionary, but it is not easy considering the complex Polish inflection rules.

The idea

The idea behind my Polish stemmer is very basic. Let's suppose we have a term A and we are looking for its base form B. Now, if we only had a mapping of all terms TERM -> BASE FORM, we would be set.

Problems and solutions

Despite the simplicity of the above idea, it brings a number of new problems to solve:
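As a rough illustration of the TERM -> BASE FORM mapping described under "The idea", here is a minimal Python sketch. It is not Lametyzator's code or API. It uses a plain character trie as a simplified stand-in for the finite state automaton built with Jan Daciuk's FSA package, which is far more compact. The class names and the handful of Polish word pairs are hypothetical, hand-picked examples, not taken from the ispell dictionary.

```python
from typing import Dict, Optional


class TrieNode:
    """One state of the lookup structure; outgoing edges are labeled with characters."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Set only on states where a known inflected form ends.
        self.base_form: Optional[str] = None


class DictionaryStemmer:
    """Hypothetical dictionary stemmer: stores inflected_form -> base_form pairs."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, inflected: str, base: str) -> None:
        # Walk (and create) one state per character of the inflected form.
        node = self.root
        for ch in inflected:
            node = node.children.setdefault(ch, TrieNode())
        node.base_form = base

    def stem(self, word: str) -> Optional[str]:
        # Traverse the structure character by character.
        node = self.root
        for ch in word:
            nxt = node.children.get(ch)
            if nxt is None:
                return None  # word is outside the dictionary: no stem
            node = nxt
        return node.base_form  # None if only a prefix of a known form matched


# Hand-picked sample pairs (assumed for illustration only).
stemmer = DictionaryStemmer()
for inflected, base in [
    ("dom", "dom"),
    ("domu", "dom"),
    ("domem", "dom"),
    ("domy", "dom"),
    ("kota", "kot"),
]:
    stemmer.add(inflected, base)

print(stemmer.stem("domem"))  # dom
print(stemmer.stem("kotem"))  # None
```

The second lookup shows the main drawback mentioned earlier: "kotem" is a valid inflected form, but because it was never added to the dictionary, the traversal fails and no base form is returned.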
As one can clearly see, my stemmer has many drawbacks. I think its main advantage lies in the fact that it exists. There are many stemming engines much more sophisticated than mine, created by people much more knowledgeable in the Polish language than myself, but none of them is, to my knowledge, available free of charge. This is understandable, since Polish stemmers usually include quite complex dictionaries, which take a great deal of effort to build. Scientists have the right to sell them. My stemmer can be used as a cheap alternative, for student research, or in proof-of-concept projects.

Demo

Due to low interest, the on-line demo is no longer available.

Download

Lametyzator/Stempelator and FSA code are now part of the
Morfologik project.

The stemmer is available free of charge now and will remain so in the future. Please follow these guidelines, however:
Todo's (future plans)

I don't claim ownership of the ideas below. If you solve them or decide to work on them, please let me know so that I can use your results.
Acknowledgements

I would like to thank the following individuals for help, resources or both:

Known bugs

No software is ideal (maybe with the exception of Mr. Knuth's software :). The known limitations are listed below; please feel free to let me know if you think anything else doesn't work as expected.
|