SEMINTEC - Semantically-enabled data mining techniques
Project description
The SEMINTEC project investigates data mining techniques that take into account background knowledge about a given domain, supplied in the form of an ontology.
This prior knowledge (the ontology) can steer the search process in directions that are interesting for the given domain and prune the hypothesis space. It can also give more insight into, and a better understanding of, the obtained results.
In most current knowledge discovery methods, the background knowledge is implicit or lacks a formal structure or semantics, so in practice it can only be taken into account by the human analyst. We aim to use explicitly provided domain knowledge with formal, well-defined semantics during the knowledge discovery process. In particular, our goals are:
- developing algorithms for Semantic Web Mining,
- developing algorithms for data mining in relational databases that use ontologies as prior knowledge,
- developing algorithms that discover patterns represented in highly expressive languages (subsets of SWRL, e.g. OWL DLP, DL-safe rules).
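As a minimal illustration of how prior knowledge can prune the hypothesis space, the Python sketch below discards candidate patterns that require one individual to belong to two classes declared disjoint in the ontology. The class names and the disjointness axiom are hypothetical and are not taken from the SEMINTEC ontology, and the sketch is not the project's algorithm.

```python
from itertools import combinations

# Hypothetical classes and disjointness axiom, invented for illustration only.
CLASSES = ["Client", "Man", "Woman", "AccountOwner"]
DISJOINT = {frozenset({"Man", "Woman"})}


def is_satisfiable(pattern):
    """Return False if the pattern puts one individual into two disjoint classes."""
    return not any(
        frozenset(pair) in DISJOINT for pair in combinations(sorted(pattern), 2)
    )


def candidates(classes, size):
    """Enumerate all conjunctions of `size` distinct classes."""
    return [frozenset(combo) for combo in combinations(classes, size)]


all_pairs = candidates(CLASSES, 2)
# Patterns that contradict the background knowledge are never evaluated on the data.
kept = [p for p in all_pairs if is_satisfiable(p)]
print(len(all_pairs), "candidates,", len(kept), "kept after pruning")
```

The check here is purely syntactic; a real system would presumably delegate satisfiability testing to a reasoner.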
Personnel involved
Joanna Józefowska, Ph.D., Dr. Habil., Associate Professor
Agnieszka Ławrynowicz, Ph.D. Student
Tomasz Łukaszewski, Ph.D., Assistant Professor
Funding
The project is partially funded by the Polish Ministry of Scientific Research and Information Technology (grant number KBN 3T11F 025 28).
Downloads
Financial ontology
For our work on Semantic Web Mining we need a complete knowledge base. However, very few ontologies with an assertional component are available online, whereas ontologies with only the terminological component are easy to find.
We therefore decided to use the existing financial dataset known from the PKDD'99 Discovery Challenge
(http://lisp.vse.cz/pkdd99/DATA/data_berka.zip)
and, on the basis of the relational schema and the problem description,
we manually created a simple ontology in Protégé.
We then imported the text files from the original dataset into a relational database, converted part of the database into instances of the ontology, and manually appended these instances to the ontology file created in Protégé.
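The conversion of relational rows into ontology instances can be pictured with the following Python sketch. It is only an illustration under stated assumptions: the sample rows, the column names, the namespace, and the class and property names (Account, Client, hasAccount) are all hypothetical and are not taken from the actual dataset export or from the SEMINTEC ontology.

```python
import csv
import io

# Hypothetical semicolon-separated sample; columns and values are invented.
SAMPLE = "account_id;client_id\n1;10\n1;11\n2;12\n"

NS = "http://example.org/financial#"  # placeholder namespace (assumption)
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def rows_to_triples(text):
    """Turn each row into class assertions and one property assertion."""
    triples = []
    typed = set()  # individuals that already have a class assertion
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        account = NS + "account" + row["account_id"]
        client = NS + "client" + row["client_id"]
        for individual, cls in ((account, "Account"), (client, "Client")):
            if individual not in typed:
                typed.add(individual)
                triples.append((individual, RDF_TYPE, NS + cls))
        # Hypothetical object property linking a client to an account.
        triples.append((client, NS + "hasAccount", account))
    return triples


triples = rows_to_triples(SAMPLE)
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")  # N-Triples style output
```

Emitting plain triples keeps the sketch independent of any particular ontology library; the resulting assertions would still have to be merged with the terminological part of the ontology.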
Short description of the financial domain
The financial dataset domain describes a bank that offers services (such as managing accounts and offering loans) to private persons. The data describes the accounts of bank clients, the loans granted, the credit cards issued, etc. One client can have several accounts, and several clients can operate a single account. Several credit cards can be issued to an account, but at most one loan can be granted for it. Some additional demographic data about clients, such as age, sex or address, is also publicly available. More information about the original dataset can be found at http://lisp.vse.cz/pkdd99/Challenge/berka.htm
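The cardinalities described above can be summarised in a small Python data model. This is a sketch for illustration only: the class, field and method names are hypothetical and do not mirror the ontology's vocabulary.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Account:
    account_id: int
    clients: Set[int] = field(default_factory=set)   # several clients per account
    cards: List[str] = field(default_factory=list)   # several cards per account
    loan: Optional[str] = None                       # at most one loan per account

    def grant_loan(self, loan_id: str) -> None:
        """Attach a loan, rejecting a second one for the same account."""
        if self.loan is not None:
            raise ValueError(f"account {self.account_id} already has a loan")
        self.loan = loan_id


# One client (id 10) using two accounts; account 1 is shared with client 11.
first, second = Account(1), Account(2)
first.clients.update({10, 11})
second.clients.add(10)
first.cards.extend(["card-a", "card-b"])
first.grant_loan("loan-1")
```

Calling `first.grant_loan("loan-2")` at this point would raise `ValueError`, which reflects the "at most one loan" restriction.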
Compared with the original dataset, our ontology currently contains no data about transactions and only part of the demographic data. The current ontology is in the OWL DLP fragment. In the future we plan to publish ontologies in more expressive languages here.
The ontology can be downloaded here (3.24 MB).
Other, smaller ontologies based on the financial dataset (containing only information about gold credit card holders):
the one used in the experiments for EKAW2006 can be downloaded here, and
another one, without disjunctions, here.
SEMINTEC-ARM (SEMINTEC-Association Rule Miner)
Source code in Java: semintec_src.zip
Javadoc: doc.zip
Example input file with execution setup: exampleSemintecSetup.xml
The implementation depends on KAON2.
The implemented method is described in the RR'2008 proceedings (see the publications list).
More detailed documentation and experimental results are coming soon.
Publications & presentations
- Józefowska J., Ławrynowicz A., Łukaszewski T. On Reducing Redundancy in Mining Relational Association Rules from the Semantic Web, The Second International Conference on Web Reasoning and Rule Systems (RR'2008), LNCS 5341 Springer, Karlsruhe, 2008
- Józefowska J., Ławrynowicz A., Łukaszewski T. Combining Answer Caching with Smartcall Optimization in Mining Frequent DL-safe Queries, Late Breaking Papers Session, 18th International Conference on Inductive Logic Programming, ILP'2008, Prague
- Józefowska J., Ławrynowicz A., Łukaszewski T. Materialized views in mining ontology instances, Proceedings of the Poster Track of the 5th European Semantic Web Conference (ESWC2008), Tenerife (Spain), 2008
- Józefowska J., Ławrynowicz A., Łukaszewski T. A study of the SEMINTEC approach to frequent pattern mining. In Proc. PriCKL 2007, ECML/PKDD'2007 Workshop on Prior Conceptual Knowledge in Machine Learning and Knowledge Discovery, Warszawa, 41-52
- Józefowska J., Ławrynowicz A., Łukaszewski T. Frequent pattern discovery in OWL DLP knowledge bases, Lecture Notes in Artificial Intelligence, LNAI 4248, Steffen Staab, Vojtech Svatek (eds.), Managing Knowledge in a World of Networks, EKAW 2006, Podebrady, Czech Republic, 287-302, presentation (.pdf)
- Ławrynowicz A. Pattern discovery from ABoxes of OWL DLP knowledge bases (Poster), Fourth European Summer School on Ontological Engineering and the Semantic Web (SSSW-06), Cercedilla, Spain, July 2006, the best poster award
- Ławrynowicz A. Pattern discovery from the ontological layer of the Semantic Web (Poster), KnowledgeWeb PhD Symposium 2006 (KWEPSY2006), Budva, Montenegro, 17th June 2006, collocated with 3rd European Semantic Web Conference, ESWC'2006, thanks to Hoppers@KWeb funding
- Józefowska J., Ławrynowicz A., Łukaszewski T. Faster frequent pattern mining from the Semantic Web, New Trends in Intelligent Information Processing and Web Mining, Proceedings of the International IIS: IIPWM'06 Conference, Advances in Soft Computing, Springer Verlag 2006, 121-130
- Józefowska J., Ławrynowicz A., Łukaszewski T. Towards discovery of frequent patterns in description logics with rules, RuleML 2005 (Rules and Rule Markup Languages for the Semantic Web), Galway, Ireland, 2005, A. Adi, S. Stoutenburg, S. Tabet (eds.), LNCS 3791, Springer Verlag, 84-97
Acknowledgements
For our experiments we use the KAON2 engine.
We would like to thank Boris Motik for his support.
Institute of Computing Science, Poznan University of Technology, Last modified 3rd November 2008