Publications

The copyright of the papers below is owned by the respective publishers. Personal use of the electronic versions provided here is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the publishers.

Author | Title | Year | Journal/Proceedings | Reftype | File
Czyzewski, A., Krawiec, F., Brzezinski, D., Porebski, P.J. and Minor, W. Detecting anomalies in X-ray diffraction images using convolutional neural networks 2021 Expert Systems with Applications
Vol. 174, pp. 114740 
article Link 
Abstract: Our understanding of life is based upon the interpretation of macromolecular structures and their dynamics. Almost 90% of currently known macromolecular models originated from electron density maps constructed using X-ray diffraction images. Even though diffraction images are critical for structure determination, due to their vast amounts and noisy, non-intuitive nature, their quality is rarely inspected. In this paper, we use recent advances in machine learning to automatically detect seven types of anomalies in X-ray diffraction images. For this purpose, we utilize a novel X-ray beam center detection algorithm, propose three different image representations, and compare the predictive performance of general-purpose classifiers and deep convolutional neural networks (CNNs). In benchmark tests on a set of 6,311 X-ray diffraction images, the proposed CNN achieved between 87% and 99% accuracy depending on the type of anomaly. Experimental results show that the proposed anomaly detection system can be considered suitable for early detection of sub-optimal data collection conditions and malfunctions at X-ray experimental stations.
BibTeX:
@article{X-ray_diffraction_annomalies_CNN,
  author = {Adam Czyzewski and Faustyna Krawiec and Dariusz Brzezinski and Przemyslaw Jerzy Porebski and Wladek Minor},
  title = {Detecting anomalies in X-ray diffraction images using convolutional neural networks},
  journal = {Expert Systems with Applications},
  year = {2021},
  volume = {174},
  pages = {114740},
  doi = {https://doi.org/10.1016/j.eswa.2021.114740}
}
Brzezinski, D., Porebski, P.J., Kowiel, M., Macnar, J. and Minor, W. Recognizing and validating ligands with CheckMyBlob 2021 Nucleic Acids Research  article Open Access
Abstract: Structure-guided drug design depends on the correct identification of ligands in crystal structures of protein complexes. However, the interpretation of the electron density maps is challenging and often burdened with confirmation bias. Ligand identification can be aided by automatic methods such as CheckMyBlob, a machine learning algorithm that learns to generalize ligand descriptions from sets of moieties deposited in the Protein Data Bank. Here, we present the CheckMyBlob web server, a platform that can identify ligands in unmodeled fragments of electron density maps or validate ligands in existing models. The server processes PDB/mmCIF and MTZ files and returns a ranking of 10 most likely ligands for each detected electron density blob along with interactive 3D visualizations. Additionally, for each prediction/validation, a plugin script is generated that enables users to conduct a detailed analysis of the server results in Coot. The CheckMyBlob web server is available at https://checkmyblob.bioreproducibility.org.
BibTeX:
@article{CheckMyBlob_Web_Server,
  author = {Brzezinski, Dariusz and Porebski, Przemyslaw J and Kowiel, Marcin and Macnar, Joanna M and Minor, Wladek},
  title = {Recognizing and validating ligands with CheckMyBlob},
  journal = {Nucleic Acids Research},
  year = {2021},
  note = {gkab296},
  doi = {https://doi.org/10.1093/nar/gkab296}
}
Brzezinski, D., Kowiel, M., Cooper, D.R., Cymborowski, M., Grabowski, M., Wlodawer, A., Dauter, Z., Shabalin, I.G., Gilski, M., Rupp, B., Jaskolski, M. and Minor, W. Covid-19.bioreproducibility.org: A web resource for SARS-CoV-2-related structural models 2021 Protein Science
Vol. 30(1) 
article Open Access  
Abstract: The COVID-19 pandemic has triggered numerous scientific activities aimed at understanding the SARS-CoV-2 virus and ultimately developing treatments. Structural biologists have already determined hundreds of experimental X-ray, cryo-EM, and NMR structures of proteins and nucleic acids related to this coronavirus, and this number is still growing. To help biomedical researchers, who may not necessarily be experts in structural biology, navigate through the flood of structural models, we have created an online resource, covid19.bioreproducibility.org, that aggregates expert-verified information about SARS-CoV-2-related macromolecular models. In this article, we describe this web resource along with the suite of tools and methodologies used for assessing the structures presented therein.
BibTeX:
@article{covid_server,
  author = {Dariusz Brzezinski and Marcin Kowiel and David R. Cooper and Marcin Cymborowski and Marek Grabowski and Alexander Wlodawer and Zbigniew Dauter and Ivan G. Shabalin and Miroslaw Gilski and Bernhard Rupp and Mariusz Jaskolski and Wladek Minor},
  title = {Covid-19.bioreproducibility.org: A web resource for SARS-CoV-2-related structural models},
  journal = {Protein Science},
  year = {2021},
  volume = {30},
  number = {1},
  doi = {https://doi.org/10.1002/pro.3959}
}
Brzezinski, D., Minku, L.L., Pewinski, T., Stefanowski, J. and Szumaczuk, A. The impact of data difficulty factors on classification of imbalanced and concept drifting data streams 2021 Knowl. Inf. Syst.
Vol. 63(6), pp. 1429-1469 
article Open Access 
Abstract: Class imbalance introduces additional challenges when learning classifiers from concept drifting data streams. Most existing work focuses on designing new algorithms for dealing with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors such as minority class decomposition and presence of unsafe types of examples. Despite often being present in real-world data, the interactions between concept drifts and local data difficulty factors have not been investigated in concept drifting data streams yet. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations, on predictions of representative online classifiers. Experimental results reveal the high influence of new considered factors and their local drifts, as well as differences in existing classifiers’ reactions to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address challenges posed by imbalanced data streams.
BibTeX:
@article{Imbalanced_Stream_Difficulty_Factors,
  author = {Dariusz Brzezinski and Leandro L. Minku and Tomasz Pewinski and Jerzy Stefanowski and Artur Szumaczuk},
  title = {The impact of data difficulty factors on classification of imbalanced and concept drifting data streams},
  journal = {Knowl. Inf. Syst.},
  year = {2021},
  volume = {63},
  number = {6},
  pages = {1429--1469},
  doi = {https://doi.org/10.1007/s10115-021-01560-w}
}
Grabowski, M., Macnar, J.M., Cymborowski, M., Cooper, D.R., Shabalin, I.G., Gilski, M., Brzezinski, D., Kowiel, M., Dauter, Z., Rupp, B., Wlodawer, A., Jaskolski, M. and Minor, W. Rapid response to emerging biomedical challenges and threats 2021 IUCrJ
Vol. 8(3), pp. 395-407 
article Open Access 
Abstract: As part of the global mobilization to combat the present pandemic, almost 100000 COVID-19-related papers have been published and nearly a thousand models of macromolecules encoded by SARS-CoV-2 have been deposited in the Protein Data Bank within less than a year. The avalanche of new structural data has given rise to multiple resources dedicated to assessing the correctness and quality of structural data and models. Here, an approach to evaluate the massive amounts of such data using the resource https://covid19.bioreproducibility.org is described, which offers a template that could be used in large-scale initiatives undertaken in response to future biomedical crises. Broader use of the described methodology could considerably curtail information noise and significantly improve the reproducibility of biomedical research.
BibTeX:
@article{Rapid_response,
  author = {Grabowski, Marek and Macnar, Joanna M. and Cymborowski, Marcin and Cooper, David R. and Shabalin, Ivan G. and Gilski, Miroslaw and Brzezinski, Dariusz and Kowiel, Marcin and Dauter, Zbigniew and Rupp, Bernhard and Wlodawer, Alexander and Jaskolski, Mariusz and Minor, Wladek},
  title = {Rapid response to emerging biomedical challenges and threats},
  journal = {IUCrJ},
  year = {2021},
  volume = {8},
  number = {3},
  pages = {395--407},
  doi = {https://doi.org/10.1107/S2052252521003018}
}
Jaskolski, M., Dauter, Z., Shabalin, I.G., Gilski, M., Brzezinski, D., Kowiel, M., Rupp, B. and Wlodawer, A. Crystallographic models of SARS-CoV-2 3CLpro: in-depth assessment of structure quality and validation 2021 IUCrJ
Vol. 8(2), pp. 238-256 
article Open Access 
Abstract: The appearance at the end of 2019 of the new SARS-CoV-2 coronavirus led to an unprecedented response by the structural biology community, resulting in the rapid determination of many hundreds of structures of proteins encoded by the virus. As part of an effort to analyze and, if necessary, remediate these structures as deposited in the Protein Data Bank (PDB), this work presents a detailed analysis of 81 crystal structures of the main protease 3CLpro, an important target for the design of drugs against COVID-19. The structures of the unliganded enzyme and its complexes with a number of inhibitors were determined by multiple research groups using different experimental approaches and conditions; the resulting structures span 13 different polymorphs representing seven space groups. The structures of the enzyme itself, all determined by molecular replacement, are highly similar, with the exception of one polymorph with a different inter-domain orientation. However, a number of complexes with bound inhibitors were found to pose significant problems. Some of these could be traced to faulty definitions of geometrical restraints for ligands and to the general problem of a lack of such information in the PDB depositions. Several problems with ligand definition in the PDB itself were also noted. In several cases extensive corrections to the models were necessary to adhere to the evidence of the electron-density maps. Taken together, this analysis of a large number of structures of a single, medically important protein, all determined within less than a year using modern experimental tools, should be useful in future studies of other systems of high interest to the biomedical community.
BibTeX:
@article{SARS-CoV-2_3CLPro_review,
  author = {Jaskolski, Mariusz and Dauter, Zbigniew and Shabalin, Ivan G. and Gilski, Miroslaw and Brzezinski, Dariusz and Kowiel, Marcin and Rupp, Bernhard and Wlodawer, Alexander},
  title = {Crystallographic models of SARS-CoV-2 3CLpro: in-depth assessment of structure quality and validation},
  journal = {IUCrJ},
  year = {2021},
  volume = {8},
  number = {2},
  pages = {238--256},
  doi = {https://doi.org/10.1107/S2052252521001159}
}
Grabowski, M., Cooper, D.R., Brzezinski, D., Macnar, J.M., Shabalin, I.G., Cymborowski, M., Otwinowski, Z. and Minor, W. Synchrotron radiation as a tool for macromolecular X-Ray Crystallography: A XXI century perspective 2021 Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms
Vol. 489, pp. 30-40 
article Link 
Abstract: Intense X-rays available at powerful synchrotron beamlines provide macromolecular crystallographers with an incomparable tool for investigating biological phenomena on an atomic scale. The resulting insights into the mechanisms underlying biological processes have played an essential role and shaped biomedical sciences during the last 30 years, considered the “golden age” of structural biology. In this review, we analyze selected aspects of the impact of synchrotron radiation on structural biology. Synchrotron beamlines have been used to determine over 70% of all macromolecular structures deposited into the Protein Data Bank (PDB). These structures were deposited by over 13,000 different research groups. Interestingly, despite the impressive advances in synchrotron technologies, the median resolution of macromolecular structures determined using synchrotrons has remained constant throughout the last 30 years, at about 2 Å. Similarly, the median times from the data collection to the deposition and release have not changed significantly. We describe challenges to reproducibility related to recording all relevant data and metadata during the synchrotron experiments, including diffraction images. Finally, we discuss some of the recent opinions suggesting a diminishing importance of X-ray crystallography due to impressive advances in Cryo-EM and theoretical modeling. We believe that synchrotrons of the future will increasingly evolve towards a life science center model, where X-ray crystallography, Cryo-EM, other experimental and computational resources, and knowledge are encompassed within a versatile research facility. The recent response of crystallographers to the COVID-19 pandemic suggests that X-ray crystallography conducted at synchrotron beamlines will continue to play an essential role in structural biology and drug discovery for years to come.
BibTeX:
@article{SynchrotronRadiation,
  author = {Marek Grabowski and David R. Cooper and Dariusz Brzezinski and Joanna M. Macnar and Ivan G. Shabalin and Marcin Cymborowski and Zbyszek Otwinowski and Wladek Minor},
  title = {Synchrotron radiation as a tool for macromolecular X-Ray Crystallography: A XXI century perspective},
  journal = {Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms},
  year = {2021},
  volume = {489},
  pages = {30--40},
  doi = {https://doi.org/10.1016/j.nimb.2020.12.016}
}
Shabalin, I.G., Czub, M.P., Majorek, K.A., Brzezinski, D., Grabowski, M., Cooper, D.R., Panasiuk, M., Chruszcz, M. and Minor, W. Molecular determinants of vascular transport of dexamethasone in COVID-19 therapy 2020 IUCrJ
Vol. 7(6) 
article Open Access  
Abstract: Dexamethasone, a widely used corticosteroid, has recently been reported as the first drug to increase the survival chances of patients with severe COVID-19. Therapeutic agents, including dexamethasone, are mostly transported through the body by binding to serum albumin. Here, the first structure of serum albumin in complex with dexamethasone is reported. Dexamethasone binds to drug site 7, which is also the binding site for commonly used nonsteroidal anti-inflammatory drugs and testosterone, suggesting potentially problematic binding competition. This study bridges structural findings with an analysis of publicly available clinical data from Wuhan and suggests that an adjustment of the dexamethasone regimen should be further investigated as a strategy for patients affected by two major COVID-19 risk factors: low albumin levels and diabetes.
BibTeX:
@article{dexamethasone_albumin_covid,
  author = {Ivan G. Shabalin and Mateusz P. Czub and Karolina A. Majorek and Dariusz Brzezinski and Marek Grabowski and David R. Cooper and Mateusz Panasiuk and Maksymilian Chruszcz and Wladek Minor},
  title = {Molecular determinants of vascular transport of dexamethasone in COVID-19 therapy},
  journal = {IUCrJ},
  year = {2020},
  volume = {7},
  number = {6},
  doi = {https://doi.org/10.1107/s2052252520012944}
}
Wlodawer, A., Dauter, Z., Shabalin, I., Gilski, M., Brzezinski, D., Kowiel, M., Minor, W., Rupp, B. and Jaskolski, M. Ligand-centered assessment of SARS-CoV-2 drug target models in the Protein Data Bank 2020 The FEBS Journal article Open Access
Abstract: A bright spot in the SARS-CoV-2 (CoV-2) coronavirus pandemic has been the immediate mobilization of the biomedical community, working to develop treatments and vaccines for COVID-19. Rational drug design against emerging threats depends on well-established methodology, mainly utilizing X-ray crystallography, to provide accurate structure models of the macromolecular drug targets and of their complexes with candidates for drug development. In the current crisis the structural biological community has responded by presenting structure models of CoV-2 proteins and depositing them in the Protein Data Bank (PDB), usually without time embargo and before publication. Since the structures from the first-line research are produced in an accelerated mode, there is an elevated chance of mistakes and errors, with the ultimate risk of hindering, rather than speeding-up, drug development. In the present work, we have used model-validation metrics and examined the electron density maps for the deposited models of CoV-2 proteins and a sample of related proteins available in the PDB as of 1 April 2020. We present these results with the aim of helping the biomedical community establish a better-validated pool of data. The proteins are divided into groups according to their structure and function. In most cases, no major corrections were necessary. However, in several cases significant revisions in the functionally sensitive area of protein-inhibitor complexes or for bound ions justified correction, re-refinement, and eventually re-versioning in the PDB. The re-refined coordinate files and a tool for facilitating model comparisons are available at https://covid-19.bioreproducibility.org.
BibTeX:
@article{Sars_cov_2_ligand_assessment,
  author = {Wlodawer, Alexander and Dauter, Zbigniew and Shabalin, Ivan and Gilski, Miroslaw and Brzezinski, Dariusz and Kowiel, Marcin and Minor, Wladek and Rupp, Bernhard and Jaskolski, Mariusz},
  title = {Ligand-centered assessment of SARS-CoV-2 drug target models in the Protein Data Bank},
  journal = {The FEBS Journal},
  year = {2020},
  url = {https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/febs.15366},
  doi = {https://doi.org/10.1111/febs.15366}
}
Brzezinski, D., Dauter, Z., Minor, W. and Jaskolski, M. On the evolution of the quality of macromolecular models in the PDB 2020 The FEBS Journal article Open Access
Abstract: Crystallographic models of biological macromolecules have been ranked using the quality criteria associated with them in the Protein Data Bank (PDB). The outcomes of this quality analysis have been correlated with time and with the journals that published papers based on those models. The results show that the overall quality of PDB structures has substantially improved over the last ten years, but this period of progress was preceded by several years of stagnation or even depression. Moreover, the study shows that the historically observed negative correlation between journal impact and the quality of structural models presented therein seems to disappear as time progresses.
BibTeX:
@article{EvolutionOfPdbQuality,
  author = {Brzezinski, Dariusz and Dauter, Zbigniew and Minor, Wladek and Jaskolski, Mariusz},
  title = {On the evolution of the quality of macromolecular models in the PDB},
  journal = {The FEBS Journal},
  year = {2020},
  url = {https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/febs.15314},
  doi = {https://doi.org/10.1111/febs.15314}
}
Kowiel, M., Brzezinski, D., Gilski, M. and Jaskolski, M. Conformation-dependent restraints for polynucleotides: the sugar moiety 2020 Nucleic Acids Research
Vol. 48(2), pp. 962-973 
article Open Access
Abstract: Stereochemical restraints are commonly used to aid the refinement of macromolecular structures obtained by experimental methods at lower resolution. The standard restraint library for nucleic acids has not been updated for over two decades and needs revision. In this paper, geometrical restraints for nucleic acids sugars are derived using information from high-resolution crystal structures in the Cambridge Structural Database. In contrast to the existing restraints, this work shows that different parts of the sugar moiety form groups of covalent geometry dependent on various chemical and conformational factors, such as the type of ribose or the attached nucleobase, and ring puckering or rotamers of the glycosidic (χ) or side-chain (γ) torsion angles. Moreover, the geometry of the glycosidic link and the endocyclic ribose bond angles are functionally dependent on χ and sugar pucker amplitude (τm), respectively. The proposed restraints have been positively validated against data from the Nucleic Acid Database, compared with an ultrahigh-resolution Z-DNA structure in the Protein Data Bank, and tested by re-refining hundreds of crystal structures in the Protein Data Bank. The conformation-dependent sugar restraints presented in this work are publicly available in REFMAC, PHENIX and SHELXL format through a dedicated RestraintLib web server with an API function.
BibTeX:
@article{NarSugarMoiety,
  author = {Kowiel, Marcin and Brzezinski, Dariusz and Gilski, Miroslaw and Jaskolski, Mariusz},
  title = {Conformation-dependent restraints for polynucleotides: the sugar moiety},
  journal = {Nucleic Acids Research},
  year = {2020},
  volume = {48},
  number = {2},
  pages = {962--973},
  url = {https://doi.org/10.1093/nar/gkz1122},
  doi = {https://doi.org/10.1093/nar/gkz1122}
}
Brzezinski, D., Stefanowski, J., Susmaga, R. and Szczech, I. On the Dynamics of Classification Measures for Imbalanced and Streaming Data 2019 IEEE Transactions on Neural Networks and Learning Systems  article Accepted Manuscript
Abstract: As each imbalanced classification problem comes with its own set of challenges, the measure used to evaluate classifiers must be individually selected. To help researchers make this decision in an informed manner, experimental and theoretical investigations compare general properties of measures. However, existing studies do not analyze changes in measure behavior imposed by different imbalance ratios. Moreover, several characteristics of imbalanced data streams, such as the effect of dynamically changing class proportions, have not been thoroughly investigated from the perspective of different metrics. In this paper, we study measure dynamics by analyzing changes of measure values, distributions, and gradients with diverging class proportions. For this purpose, we visualize measure probability mass functions and gradients. Additionally, we put forward a histogram-based normalization method that provides a unified, probabilistic interpretation of any measure over datasets with different class distributions. The results of analyzing eight popular classification measures show that the effect class proportions have on each measure is different, and should be taken into account when evaluating classifiers. Apart from highlighting imbalance-related properties of each measure, our study shows a direct connection between class ratio changes and certain types of concept drift, which could be influential in designing new types of classifiers and drift detectors for imbalanced data streams.
BibTeX:
@article{classification_measure_dynamics,
  author = {Dariusz Brzezinski and Jerzy Stefanowski and Robert Susmaga and Izabela Szczech},
  title = {On the Dynamics of Classification Measures for Imbalanced and Streaming Data},
  journal = {IEEE Transactions on Neural Networks and Learning Systems},
  year = {2019},
  doi = {https://doi.org/10.1109/TNNLS.2019.2899061}
}
Gilski, M., Zhao, J., Kowiel, M., Brzezinski, D., Turner, D.H. and Jaskolski, M. Accurate geometrical restraints for Watson–Crick base pairs 2019 Acta Crystallographica Section B: Structural Science, Crystal Engineering and Materials
Vol. 75(2), pp. 235-245 
article Open Access  
Abstract: Geometrical restraints provide key structural information for the determination of biomolecular structures at lower resolution by experimental methods such as crystallography or cryo-electron microscopy. In this work, restraint targets for nucleic acids bases are derived from three different sources and compared: small-molecule crystal structures in the Cambridge Structural Database (CSD), ultrahigh-resolution structures in the Protein Data Bank (PDB) and quantum-mechanical (QM) calculations. The best parameters are those based on CSD structures. After over two decades, the standard library of Parkinson et al. [(1996), Acta Cryst. D52, 57–64] is still valid, but improvements are possible with the use of the current CSD database. The CSD-derived geometry is fully compatible with Watson–Crick base pairs, as comparisons with QM results for isolated and paired bases clearly show that the CSD targets closely correspond to proper base pairing. While the QM results are capable of distinguishing between single and paired bases, their level of accuracy is, on average, three fourths that for the CSD-derived targets when gauged by root-mean-square deviations from ultrahigh-resolution structures in the PDB. Nevertheless, the accuracy of QM results appears sufficient to provide stereochemical targets for synthetic base pairs where no reliable experimental structural information is available. To enable future tests for this approach, QM calculations are provided for isocytosine, isoguanine and the iCiG base pair.
BibTeX:
@article{watson–crick_base_pairs_restraints,
  author = {Miroslaw Gilski and Jianbo Zhao and Marcin Kowiel and Dariusz Brzezinski and Douglas H. Turner and Mariusz Jaskolski},
  title = {Accurate geometrical restraints for Watson–Crick base pairs},
  journal = {Acta Crystallographica Section B: Structural Science, Crystal Engineering and Materials},
  year = {2019},
  doi = {https://doi.org/10.1107/S2052520619002002}
}
Kowiel, M., Brzezinski, D., Porebski, P., Shabalin, I., Jaskolski, M. and Minor, W. Automatic Recognition of Ligands in Electron Density by Machine Learning 2019 Bioinformatics
Vol. 35(3), pp. 452-461 
article Link
Abstract:

Motivation
The correct identification of ligands in crystal structures of protein complexes is the cornerstone of structure-guided drug design. However, cognitive bias can sometimes mislead investigators into modeling fictitious compounds without solid support from the electron density maps. Ligand identification can be aided by automatic methods, but existing approaches are based on time-consuming iterative fitting.

Results
Here we report a new machine learning algorithm called CheckMyBlob that identifies ligands from experimental electron density maps. In benchmark tests on portfolios of up to 219,931 ligand binding sites containing the 200 most popular ligands found in the Protein Data Bank, CheckMyBlob markedly outperforms the existing automatic methods for ligand identification, in some cases doubling the recognition rates, while requiring significantly less time. Our work shows that machine learning can improve the automation of structure modeling and significantly accelerate the drug screening process of macromolecule-ligand complexes.

Availability
Code and data are available on GitHub at https://github.com/dabrze/CheckMyBlob.

BibTeX:
@article{CheckMyBlob,
  author = {Marcin Kowiel and Dariusz Brzezinski and Przemyslaw Porebski and Ivan Shabalin and Mariusz Jaskolski and Wladek Minor},
  title = {Automatic Recognition of Ligands in Electron Density by Machine Learning},
  journal = {Bioinformatics},
  year = {2019},
  volume = {35},
  number = {3},
  pages = {452--461},
  url = {http://dx.doi.org/10.1093/bioinformatics/bty626},
  doi = {https://doi.org/10.1093/bioinformatics/bty626}
}
Lango, M., Brzezinski, D. and Stefanowski, J. ImWeights: Classifying Imbalanced Data Using Local and Neighborhood Information 2018 2nd International Workshop on Learning with Imbalanced Domains: Theory and Applications, LIDTA 2018, Dublin, Ireland, September 10--14, 2018, Proceedings  inproceedings Pdf
Abstract: Preprocessing methods for imbalanced data transform the training data to a form more suitable for learning classifiers. Most of these methods either focus on local relationships between single training examples or analyze the global characteristics of the data, such as the class imbalance ratio in the data set. However, they do not sufficiently exploit the combination of both these views. In this paper, we put forward a new data preprocessing method called ImWeights, which weights training examples according to their local difficulty (safety) and the vicinity of larger minority clusters (gravity). Experiments with real-world data sets show that ImWeights outperforms local and global preprocessing methods, while being the least memory intensive. The introduced notion of minority cluster gravity opens new lines of research for specialized preprocessing methods and classifier modifications for imbalanced data.
BibTeX:
@inproceedings{ImWeights,
  author = {Lango, Mateusz and Brzezinski, Dariusz and Stefanowski, Jerzy},
  title = {ImWeights: Classifying Imbalanced Data Using Local and Neighborhood Information},
  booktitle = {2nd International Workshop on Learning with Imbalanced Domains: Theory and Applications, LIDTA 2018, Dublin, Ireland, September 10--14, 2018, Proceedings},
  year = {2018}
}
Brzezinski, D., Stefanowski, J., Susmaga, R. and Szczȩch, I. Visual-based analysis of classification measures and their properties for class imbalanced problems 2018 Information Sciences
Vol. 462, 242-261
article Accepted Manuscript
Abstract: With a plethora of available classification performance measures, choosing the right metric for the right task requires careful thought. To make this decision in an informed manner, one should study and compare general properties of candidate measures. However, analysing measures with respect to complete ranges of their domain values is a difficult and challenging task. In this study, we attempt to support such analyses with a specialized visualisation technique, which operates in a barycentric coordinate system using a 3D tetrahedron. Additionally, we adapt this technique to the context of imbalanced data and put forward a set of measure properties, which should be taken into account when examining a classification performance measure. As a result, we compare 22 popular measures and show important differences in their behaviour. Moreover, for parametric measures such as the Fβ and IBAα(G-mean), we analytically derive parameter thresholds that pinpoint the changes in measure properties. Finally, we provide an online visualisation tool that can aid the analysis of measure variability throughout their entire domains.
BibTeX:
@article{INS_Tetrahedron,
  author = {Dariusz Brzezinski and Jerzy Stefanowski and Robert Susmaga and Izabela Szczȩch},
  title = {Visual-based analysis of classification measures and their properties for class imbalanced problems},
  journal = {Information Sciences},
  year = {2018},
  volume = {462},
  pages = {242--261},
  url = {https://www.sciencedirect.com/science/article/pii/S0020025518304602},
  doi = {https://doi.org/10.1016/j.ins.2018.06.020}
}
Lango, M., Brzezinski, D., Firlik, S. and Stefanowski, J. Discovering Minority Sub-clusters and Local Difficulty Factors from Imbalanced Data 2017 Discovery Science: 20th International Conference, DS 2017, Proceedings. LNCS Vol. 10558, pp. 324-339, Springer. inproceedings Pdf  
Abstract: Learning classifiers from imbalanced data is particularly challenging when class imbalance is accompanied by local data difficulty factors, such as outliers, rare cases, class overlapping, or minority class decomposition. Although these issues have been highlighted in previous research, there have been no proposals of algorithms that simultaneously detect all the aforementioned difficulties in a dataset. In this paper, we put forward two extensions to popular clustering algorithms, ImKmeans and ImScan, and one novel algorithm, ImGrid, that attempt to detect minority sub-clusters, outliers, rare cases, and class overlapping. Experiments with artificial datasets show that ImGrid, which uses a Bayesian test to join similar neighboring regions, is able to re-discover simulated clusters and types of minority examples on par with competing methods, while being the least sensitive to parameter tuning.
BibTeX:
@inproceedings{DiscoveringMinoritySubclusters,
  author = {Lango, Mateusz and Brzezinski, Dariusz and Firlik, Sebastian and Stefanowski, Jerzy},
  title = {Discovering Minority Sub-clusters and Local Difficulty Factors from Imbalanced Data},
  booktitle = {Discovery Science: 20th International Conference, DS 2017, Proceedings},
  publisher = {Springer},
  year = {2017},
  volume = {10558},
  pages = {324--339}
}
Brzezinski, D., Stefanowski, J., Susmaga, R. and Szczech, I. Tetrahedron: Barycentric Measure Visualizer 2017 Machine Learning and Knowledge Discovery in Databases, Proceedings of ECML PKDD 2017, Part III. LNCS Vol. 10536, pp. 419-422, Springer. inproceedings Pdf
Demo
Abstract: Each machine learning task comes equipped with its own set of performance measures. For example, there is a plethora of classification measures that assess predictive performance, a myriad of clustering indices, and equally many rule interestingness measures. Choosing the right measure requires careful thought, as it can influence model selection and thus the performance of the final machine learning system. However, analyzing and understanding measure properties is a difficult task. Here, we present Tetrahedron, a web-based visualization tool that aids the analysis of complete ranges of performance measures based on a two-by-two contingency matrix. The tool operates in a barycentric coordinate system using a 3D tetrahedron, which can be rotated, zoomed, cut, parameterized, and animated. The application is capable of visualizing predefined measures (86 currently), as well as helping prototype new measures by visualizing user-defined formulas.
BibTeX:
@inproceedings{TetrahedronDemo,
  author = {Dariusz Brzezinski and Jerzy Stefanowski and Robert Susmaga and Izabela Szczech},
  title = {Tetrahedron: Barycentric Measure Visualizer},
  booktitle = {Machine Learning and Knowledge Discovery in Databases, Proceedings of ECML PKDD 2017, Part III},
  series = {Lecture Notes in Computer Science},
  volume = {10536},
  publisher = {Springer},
  year = {2017}
}
Piernik, M., Brzezinski, D., Morzy, T. and Morzy, M. Using Network Analysis to Improve Nearest Neighbor Classification of Non-network Data 2017 Foundations of Intelligent Systems, LNCS Vol. 10352, pp. 105-115. Proceedings of the 23rd International Symposium Intelligent Information Systems. inproceedings Pdf  
Abstract: The nearest neighbor classifier is a powerful, straightforward, and very popular approach to solving many classification problems. It also enables users to easily incorporate weights of training instances into its model, allowing users to highlight more promising examples. Instance weighting schemes proposed to date were based either on attribute values or external knowledge. In this paper, we propose a new way of weighting instances based on network analysis and centrality measures. Our method relies on transforming the training dataset into a weighted signed network and evaluating the importance of each node using a selected centrality measure. This information is then transferred back to the training dataset in the form of instance weights, which are later used during nearest neighbor classification. We consider four centrality measures appropriate for our problem and empirically evaluate our proposal on 30 popular, publicly available datasets. The results show that the proposed instance weighting enhances the predictive performance of the nearest neighbor algorithm.
BibTeX:
@inproceedings{NetworkNN,
  author = {Maciej Piernik and Dariusz Brzezinski and Tadeusz Morzy and Mikolaj Morzy},
  title = {Using Network Analysis to Improve Nearest Neighbor Classification of Non-network Data},
  booktitle = {Foundations of Intelligent Systems - Proceedings of the 23rd International Symposium, {ISMIS} 2017},
  publisher = {Springer},
  year = {2017},
  series = {Lecture Notes in Computer Science},
  volume = {10352},
  pages = {105--115},
  doi = {10.1007/978-3-319-60438-1_11}
}
Brzezinski, D., Grudziński, Z. and Szczęch, I. Bayesian Confirmation Measures in Rule-based Classification 2017 New Frontiers in Mining Complex Patterns, LNCS Vol. 10312, pp. 39-53. Post-Proceedings of the 5th International Workshop on New Frontiers in Mining Complex Patterns inproceedings Pdf  
Abstract: With the rapid growth of available data, learning models are also growing in size. As a result, end-users are often faced with classification results that are hard to understand. This problem also involves rule-based classifiers, which usually concentrate on predictive accuracy and produce too many rules for a human expert to interpret. In this paper, we tackle the problem of pruning rule classifiers while retaining their descriptive properties. For this purpose, we analyze the use of confirmation measures as representatives of interestingness measures designed to select rules with desirable descriptive properties. To perform the analysis, we put forward the CM-CAR algorithm, which uses interestingness measures during rule pruning. Experiments involving 20 datasets show that out of 12 analyzed confirmation measures c1, F, and Z are best for rule pruning and sorting. The obtained results can be used to devise new classifiers that optimize confirmation measures during model training.
BibTeX:
@InProceedings{ConfirmationInClassification,
  Title                    = {Bayesian Confirmation Measures in Rule-based Classification},
  Author                   = {Dariusz Brzezinski and Zbigniew Grudziński and Izabela Szczęch},
  Booktitle                = {New Frontiers in Mining Complex Patterns},
  Year                     = {2017},
  Editor                   = {Appice, Annalisa and Ceci, Michelangelo and Loglisci, Corrado and Masciari, Elio and Ras, Zbigniew W.},
  Pages                    = {39--53},
  Publisher                = {Springer},
  Series                   = {Lecture Notes in Computer Science},
  Volume                   = {10312}
}
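Two of the confirmation measures named in the abstract above, F and Z, can be sketched in Python for a rule "if E then H". The sketch uses definitions common in the confirmation-measure literature, which may differ in detail from the formulation used in the paper; the cell naming `a, b, c, d` and the function names are assumptions of this example, and CM-CAR itself is not reproduced here.

```python
def _check(a: float, b: float, c: float, d: float) -> None:
    """Require non-negative counts and non-empty marginals for E, H and not-H."""
    if min(a, b, c, d) < 0:
        raise ValueError("counts must be non-negative")
    if a + b == 0 or a + c == 0 or b + d == 0:
        raise ValueError("E, H and not-H must each cover at least one example")


def measure_f(a: float, b: float, c: float, d: float) -> float:
    """F = (P(E|H) - P(E|not H)) / (P(E|H) + P(E|not H)).

    a = #(E and H), b = #(E and not H), c = #(not E and H), d = #(not E and not H).
    """
    _check(a, b, c, d)
    p_e_h = a / (a + c)
    p_e_not_h = b / (b + d)
    return (p_e_h - p_e_not_h) / (p_e_h + p_e_not_h)


def measure_z(a: float, b: float, c: float, d: float) -> float:
    """Z rescales P(H|E) - P(H) by its largest attainable magnitude."""
    _check(a, b, c, d)
    n = a + b + c + d
    p_h = (a + c) / n
    p_h_e = a / (a + b)
    diff = p_h_e - p_h
    # Confirmation is scaled by 1 - P(H), disconfirmation by P(H).
    return diff / (1 - p_h) if diff >= 0 else diff / p_h


if __name__ == "__main__":
    # A rule whose premise E makes the conclusion H more likely.
    print(measure_f(40, 10, 10, 40), measure_z(40, 10, 10, 40))
```

Both functions return values in [-1, 1], with 0 meaning that E and H are statistically independent, so rules can be sorted or pruned by thresholding either value.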
Brzezinski, D. and Stefanowski, J. Stream Classification 2017 Encyclopedia of Machine Learning and Data Mining inbook Pdf  
Abstract: Stream classification is a variant of incremental learning of classifiers that has to satisfy requirements specific for massive streams of data: restrictive processing time, limited memory, and one scan of incoming examples. Additionally, stream classifiers often have to be adaptive, as they usually act in dynamic, non-stationary environments where data and target concepts can change over time. To fulfill these requirements new solutions include dedicated data management and forgetting mechanisms, concept drift detectors that monitor the underlying changes in the stream, effective online single classifiers, and adaptive ensembles that continuously react to changes in the streams.
BibTeX:
@InBook{Stream_Classification_ML_Encyclopedia,
  Title                    = {Stream Classification},
  Author                   = {Stefanowski, Jerzy and Brzezinski, Dariusz},
  Editor                   = {Sammut, Claude and Webb, Geoffrey I.},
  Pages                    = {1191--1199},
  Publisher                = {Springer US},
  Year                     = {2017},
  Address                  = {Boston, MA},
  Abstract                 = {Stream classification is a variant of incremental learning of classifiers that has to satisfy requirements specific for massive streams of data: restrictive processing time, limited memory, and one scan of incoming examples. Additionally, stream classifiers often have to be adaptive, as they usually act in dynamic, non-stationary environments where data and target concepts can change over time. To fulfill these requirements new solutions include dedicated data management and forgetting mechanisms, concept drift detectors that monitor the underlying changes in the stream, effective online single classifiers, and adaptive ensembles that continuously react to changes in the streams.},
  Booktitle                = {Encyclopedia of Machine Learning and Data Mining},
  Doi                      = {10.1007/978-1-4899-7687-1_908},
  ISBN                     = {978-1-4899-7687-1},
  Url                      = {http://dx.doi.org/10.1007/978-1-4899-7687-1_908}
}
Brzezinski, D. and Stefanowski, J. Prequential AUC: Properties of the Area Under the ROC Curve for Data Streams with Concept Drift 2017 Knowledge and Information Systems
Vol. 52(2), 531-562
article Open Access
Abstract: Modern data-driven systems often require classifiers capable of dealing with streaming imbalanced data and concept changes. The assessment of learning algorithms in such scenarios is still a challenge, as existing online evaluation measures focus on efficiency, but are susceptible to class ratio changes over time. In case of static data, the area under the Receiver Operating Characteristics curve, or simply AUC, is a popular measure for evaluating classifiers both on balanced and imbalanced class distributions. However, the characteristics of AUC calculated on time-changing data streams have not been studied. This paper analyzes the properties of our recent proposal, an incremental algorithm that uses a sorted tree structure with a sliding window to compute AUC with forgetting. The resulting evaluation measure, called prequential AUC, is studied in terms of: visualization over time, processing speed, differences compared to AUC calculated on blocks of examples, and consistency with AUC calculated traditionally. Simulation results show that the proposed measure is statistically consistent with AUC computed traditionally on streams without drift and comparably fast to existing evaluation procedures. Finally, experiments on real-world and synthetic data showcase characteristic properties of prequential AUC compared to classification accuracy, G-mean, Kappa, Kappa M, and recall when used to evaluate classifiers on imbalanced streams with various difficulty factors.
BibTeX:
@article{,
  author = {Dariusz Brzezinski and Jerzy Stefanowski},
  title = {Prequential AUC: Properties of the Area Under the ROC Curve for Data Streams with Concept Drift},
  journal = {Knowledge and Information Systems},
  doi = {10.1007/s10115-017-1022-8},
  year = {2017},
  number = {2},
  pages = {531--562},
  volume = {52}
}
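The idea of computing AUC over a sliding window with forgetting, as described in the abstract above, can be sketched in Python. This is a deliberately simple version that recomputes AUC from the window contents by pairwise comparison; it is not the sorted-tree incremental algorithm of the paper, and the class name `SlidingWindowAUC` and its methods are hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowAUC:
    """AUC over the most recent `window_size` (score, label) pairs.

    Naive O(d^2) recomputation per query, kept for clarity only.
    """

    def __init__(self, window_size: int) -> None:
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        # A bounded deque drops the oldest example automatically (forgetting).
        self.window: Deque[Tuple[float, int]] = deque(maxlen=window_size)

    def update(self, score: float, label: int) -> None:
        """Add a scored example; label is 1 (positive) or 0 (negative)."""
        self.window.append((score, label))

    def auc(self) -> float:
        """Fraction of positive-negative pairs ranked correctly; ties count 0.5."""
        pos = [s for s, y in self.window if y == 1]
        neg = [s for s, y in self.window if y == 0]
        if not pos or not neg:
            return float("nan")  # undefined without both classes in the window
        wins = 0.0
        for p in pos:
            for q in neg:
                if p > q:
                    wins += 1.0
                elif p == q:
                    wins += 0.5
        return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    evaluator = SlidingWindowAUC(window_size=4)
    for score, label in [(0.9, 1), (0.8, 0), (0.7, 1), (0.1, 0), (0.2, 1)]:
        evaluator.update(score, label)
        print(evaluator.auc())
```

In a prequential (test-then-train) loop, `update` would be called with the classifier's score for each incoming example before that example is used for training.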
Kowiel, M., Brzezinski, D. and Jaskolski, M. Conformation-dependent restraints for polynucleotides: I. Clustering of the geometry of the phosphodiester group 2016 Nucleic Acids Research
Vol. 44(17), 8479-8489
article Open Access  
Abstract: The refinement of macromolecular structures is usually aided by prior stereochemical knowledge in the form of geometrical restraints. Such restraints are also used for the flexible sugar-phosphate backbones of nucleic acids. However, recent highly accurate structural studies of DNA suggest that the phosphate bond angles may have inadequate description in the existing stereochemical dictionaries. In this paper, we analyze the bonding deformations of the phosphodiester groups in the Cambridge Structural Database, cluster the studied fragments into six conformation-related categories, and propose a revised set of restraints for the O-P-O bond angles and distances. The proposed restraints have been positively validated against data from the Nucleic Acid Database and an ultrahigh-resolution Z-DNA structure in the Protein Data Bank. Additionally, the manual classification of PO4 geometry is compared with geometrical clusters automatically discovered by machine learning methods. The machine learning cluster analysis provides useful insights and a practical example for general applications of clustering algorithms for automatic discovery of hidden patterns of molecular geometry. Finally, we describe the implementation and application of a public-domain web server for automatic generation of the proposed restraints.
BibTeX:
@article{NAR_PO4,
  author = {Marcin Kowiel and Dariusz Brzezinski and Mariusz Jaskolski},
  title = {Conformation-dependent restraints for polynucleotides: I. Clustering of the geometry of the phosphodiester group},
  journal = {Nucleic Acids Research},
  year = {2016},
  number = {17},
  pages = {8479--8489},
  volume = {44}
}
Brzezinski, D. and Stefanowski, J. Ensemble Diversity in Evolving Data Streams 2016 Discovery Science: 19th International Conference, DS 2016, Bari, Italy, October 19-21, 2016, Proceedings. LNCS Vol. 9956, pp. 229-244, Springer. inproceedings Pdf  
Abstract: While the diversity of ensembles has been studied in the context of static data, it has not yet received comparable research interest for evolving data streams. This paper aims at analyzing the impact of concept drift on diversity measures calculated for streaming ensembles. We consider six popular diversity measures and adapt their calculations to data stream requirements. A comprehensive series of experiments reveals the potential of each measure for visualizing ensemble performance over time. Measures highlighted as capable of depicting sudden and virtual drifts over time are used as a basis for detecting changes with the Page-Hinkley test. Experimental results demonstrate that the kappa interrater agreement, disagreement, and double fault measures, although designed to quantify diversity, provide a means of detecting changes competitive to that using classification accuracy.
BibTeX:
@inproceedings{EnsembleDiversity_DS2016,
  Title = {Ensemble Diversity in Evolving Data Streams},
  Author = {Dariusz Brzezinski and Jerzy Stefanowski},
  Booktitle = {Discovery Science: 19th International Conference, DS 2016. Proceedings},
  Year = {2016},
  Pages = {229--244},
  Publisher = {Springer},
  Series = {Lecture Notes in Computer Science},
  Volume = {9956}
}
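The Page-Hinkley test mentioned in the abstract above can be sketched in Python as a detector of an upward shift in the mean of a monitored value. Which value is monitored (a diversity measure, an error rate) and the parameter values below are assumptions of this example, not settings taken from the paper.

```python
class PageHinkley:
    """Page-Hinkley test for an increase in the mean of a stream of values."""

    def __init__(self, delta: float = 0.005, threshold: float = 50.0) -> None:
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm threshold (often called lambda)
        self.n = 0                  # number of values seen so far
        self.mean = 0.0             # running mean of the values
        self.cum = 0.0              # cumulative deviation from the running mean
        self.min_cum = 0.0          # smallest cumulative deviation seen so far

    def update(self, x: float) -> bool:
        """Consume one value; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


if __name__ == "__main__":
    detector = PageHinkley(delta=0.005, threshold=5.0)
    stream = [0.0] * 100 + [1.0] * 50  # abrupt increase after 100 values
    for i, value in enumerate(stream):
        if detector.update(value):
            print("change signalled at index", i)
            break
```

After an alarm, a fresh `PageHinkley` instance would typically be created so that monitoring restarts from the new concept.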
Lango, M., Brzezinski, D. and Stefanowski, J. PUT at SemEval-2016 Task 4: The ABC of Twitter Sentiment Analysis 2016 Proceedings of the 10th International Workshop on Semantic Evaluation inproceedings Pdf  
Abstract: This paper describes a classification system that participated in SemEval-2016 Task 4: Sentiment Analysis in Twitter. The proposed approach competed in subtasks A, B, and C, which involved tweet polarity classification, tweet classification according to a two-point scale, and tweet classification according to a five-point scale. Our system is based on an ensemble consisting of Random Forests, SVMs, and Gradient Boosting Trees, and involves the use of a wide range of features including: ngrams, Brown clustering, sentiment lexicons, Wordnet, and part-of-speech tagging. The proposed system achieved 14th, 6th, and 3rd place in subtasks A, B, and C, respectively.
BibTeX:
@inproceedings{semeval2016_PUT,
  author = {Mateusz Lango and Dariusz Brzezinski and Jerzy Stefanowski},
  title = {PUT at SemEval-2016 Task 4: The ABC of Twitter Sentiment Analysis},
  booktitle = {Proceedings of the 10th International Workshop on Semantic Evaluation},
  year = {2016},
  pages = {126--132}
}
Piernik, M., Brzezinski, D. and Morzy, T. Clustering XML documents by patterns 2016 Knowledge and Information Systems
Vol. 46(1), 185-212
article Open Access
Abstract: Now that the use of XML is prevalent, methods for mining semi-structured documents have become even more important. In particular, one of the areas that could greatly benefit from in-depth analysis of XML’s semi-structured nature is cluster analysis. Most of the XML clustering approaches developed so far employ pairwise similarity measures. In this paper, we study clustering algorithms, which use patterns to cluster documents without the need for pairwise comparisons. We investigate the shortcomings of existing approaches and establish a new pattern-based clustering framework called XPattern, which tries to address these shortcomings. The proposed framework consists of four steps: choosing a pattern definition, pattern mining, pattern clustering, and document assignment. The framework’s distinguishing feature is the combination of pattern clustering and document-cluster assignment, which allows documents to be grouped according to their characteristic features rather than their direct similarity. We experimentally evaluate the proposed approach by implementing an algorithm called PathXP, which mines maximal frequent paths and groups them into profiles. PathXP was found to match, in terms of accuracy, other XML clustering approaches, while requiring less parametrization and providing easily interpretable cluster representatives. Additionally, the results of an in-depth experimental study lead to general suggestions concerning pattern-based XML clustering.
BibTeX:
@article{XPattern,
  author = {Piernik, Maciej and Brzezinski, Dariusz and Morzy, Tadeusz},
  title = {Clustering XML documents by patterns},
  journal = {Knowledge and Information Systems},
  publisher = {Springer London},
  year = {2016},
  number = {1},
  pages = {185--212},
  volume = {46},
  url = {http://dx.doi.org/10.1007/s10115-015-0820-0},
  doi = {10.1007/s10115-015-0820-0}
}
Brzezinski, D. and Piernik, M. Structural XML Classification in Concept Drifting Data Streams 2015 New Generation Computing
Vol. 33(4), 345-366
article Accepted manuscript  
Abstract: Classification of large, static collections of XML data has been intensively studied in the last several years. Recently, however, the data processing paradigm is shifting from static to streaming data, where documents have to be processed online using limited memory and class definitions can change with time in an event called concept drift. As most existing XML classifiers are capable of processing only static data, there is a need to develop new approaches dedicated to streaming environments. In this paper, we propose a new classification algorithm for XML data streams called XSC. The algorithm uses incrementally mined frequent subtrees and a tree-subtree similarity measure to classify new documents in an associative manner. The proposed approach is experimentally evaluated against eight state-of-the-art stream classifiers on real and synthetic data. The results show that XSC performs significantly better than competitive algorithms in terms of accuracy and memory usage.
BibTeX:
@article{XmlStreamNGC,
  author = {Dariusz Brzezinski and Maciej Piernik},
  title = {Structural XML Classification in Concept Drifting Data Streams},
  journal = {New Generation Computing},
  year = {2015},
  volume = {33},
  number = {4},
  pages = {345--366}
}
Brzezinski, D. Block-based and Online Ensembles for Concept-drifting Data Streams 2015 School: Poznan University of Technology  phdthesis Pdf  
Abstract: This thesis investigates classification methods for concept-drifting data streams. We propose the Accuracy Updated Ensemble (AUE) algorithm, which reacts equally well to sudden, gradual, and recurring drifts in block-based data streams. Furthermore, the thesis analyzes three generic strategies for transforming block-based ensemble classifiers into online learners: a) using a sliding window, b) adding a single incremental classifier, c) using a drift detector. Based on this analysis, we propose an online ensemble algorithm called Online Accuracy Updated Ensemble (OAUE), which reacts to several types of changes online. Finally, the thesis reviews existing evaluation methods used to assess stream classifiers and proposes a new measure, called Prequential AUC, which is capable of evaluating classifiers online, on class-imbalanced data streams. All of the proposed algorithms are experimentally compared against competitive classification and evaluation methods.
BibTeX:
@phdthesis{BrzezPhd2015,
  author = {Dariusz Brzezinski},
  title = {Block-based and Online Ensembles for Concept-drifting Data Streams},
  school = {Poznan University of Technology},
  year = {2015}
}
Brzezinski, D. and Stefanowski, J. Prequential AUC for Classifier Evaluation and Drift Detection in Evolving Data Streams 2015 New Frontiers in Mining Complex Patterns, LNCS Vol. 8983, pp. 87-101. Post-Proceedings of the 3rd International Workshop on New Frontiers in Mining Complex Patterns, Nancy, France inproceedings Pdf  
Abstract: Detecting and adapting to concept drift makes learning data stream classifiers a difficult task. It becomes even more complex when the distribution of classes in the stream becomes imbalanced. Currently, proper assessment of classifiers for such data is still a challenge, as existing evaluation measures either do not take into account class imbalance or are unable to indicate class ratio changes in time. In this paper, we advocate the use of the area under the ROC curve (AUC) in imbalanced data stream settings and propose an incremental algorithm that uses a sorted tree structure with a sliding window to compute AUC using constant time and memory. Additionally, we experimentally verify that this algorithm is capable of correctly evaluating classifiers on imbalanced streams and can be used as a basis for detecting sudden changes in class definitions and imbalance ratio.
BibTeX:
@inproceedings{nfmcp2014,
  author = {Brzezinski, Dariusz and Stefanowski, Jerzy},
  title = {Prequential {AUC} for Classifier Evaluation and Drift Detection in Evolving Data Streams},
  booktitle = {New Frontiers in Mining Complex Patterns},
  publisher = {Springer International Publishing},
  year = {2015},
  editor = {Appice, Annalisa and Ceci, Michelangelo and Loglisci, Corrado and Manco, Giuseppe and Masciari, Elio and Ras, Zbigniew W.},
  volume = {8983},
  series = {Lecture Notes in Computer Science},
  pages = {87--101},
  doi = {10.1007/978-3-319-17876-9_6},
  isbn = {978-3-319-17875-2},
  keywords = {AUC; Data stream; Class imbalance; Concept drift},
  language = {English},
  url = {http://dx.doi.org/10.1007/978-3-319-17876-9_6}
}
Piernik, M., Brzezinski, D., Morzy, T. and Lesniewska, A. XML Clustering: A Review of Structural Approaches 2015 Knowledge Engineering Review
Vol. 30(3), 297-323
article Accepted manuscript  
Abstract: With its presence in data integration, chemistry, biological and geographic systems, XML has become an important standard not only in computer science. A common problem among the mentioned applications involves structural clustering of XML documents, an issue that has been thoroughly studied and led to the creation of a myriad of approaches. In this paper, we present a comprehensive review of structural XML clustering. First, we provide a basic introduction to the problem and highlight the main challenges in this research area. Subsequently, we divide the problem into three subtasks and discuss the most common document representations, structural similarity measures, and clustering algorithms. Additionally, we present the most popular evaluation measures, which can be used to estimate clustering quality. Finally, we analyze and compare 23 state-of-the-art approaches and arrange them in an original taxonomy. By providing an up-to-date analysis of existing structural XML clustering algorithms, we hope to showcase methods suitable for current applications and draw lines of future research.
BibTeX:
@article{XmlSurvey,
  author = {Maciej Piernik and Dariusz Brzezinski and Tadeusz Morzy and Anna Lesniewska},
  title = {XML Clustering: A Review of Structural Approaches},
  journal = {Knowledge Engineering Review},
  year = {2015},
  volume = {30},
  pages = {297--323},
  number = {3}
}
Krempl, G., Zliobaite, I., Brzezinski, D., Hüllermeier, E., Last, M., Lemaire, V., Noack, T., Shaker, A., Sievi, S., Spiliopoulou, M. and Stefanowski, J. Open Challenges for Data Stream Mining Research 2014 SIGKDD Explorations, Vol. 16(1), 1-10 article Pdf  
Abstract: Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered as one of the main sources of what is called big data. While predictive modeling for data streams and big data have received a lot of attention over the last decade, many research approaches are typically designed for well-behaved controlled problem settings, overlooking important challenges imposed by real-world applications. This article presents a discussion on eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve such problems as: protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analysis of complex data, and evaluation of stream mining algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions concerning lines of future research in data stream mining.
BibTeX:
@article{DataStreamsOpenChallenges,
  author = {Georg Krempl and Indre Zliobaite and Dariusz Brzezinski and Eyke Hüllermeier and Mark Last and Vincent Lemaire and Tino Noack and Ammar Shaker and Sonja Sievi and Myra Spiliopoulou and Jerzy Stefanowski},
  title = {Open Challenges for Data Stream Mining Research},
  journal = {SIGKDD Explorations},
  year = {2014},
  volume = {16},
  number = {1},
  pages = {1--10}
}
Brzezinski, D. and Piernik, M. Adaptive XML Stream Classification using Partial Tree-edit Distance 2014 Foundations of Intelligent Systems - 21st International Symposium, ISMIS 2014, Roskilde, Denmark, June 25-27, 2014. Proceedings, pp. 10-19 inproceedings Pdf  
Abstract: XML classification finds many applications, ranging from data integration to e-commerce. However, existing classification algorithms are designed for static XML collections, while modern information systems frequently deal with streaming data that needs to be processed on-line using limited resources. Furthermore, data stream classifiers have to be able to react to concept drifts, i.e., changes of the stream’s underlying data distribution. In this paper, we propose XStreamClass, an XML classifier capable of processing streams of documents and reacting to concept drifts. The algorithm combines incremental frequent tree mining with partial tree-edit distance and associative classification. XStreamClass was experimentally compared with four state-of-the-art data stream ensembles and provided best average classification accuracy on real and synthetic datasets simulating different drift scenarios.
BibTeX:
@inproceedings{XmlStreamPTED,
  author = {Dariusz Brzezinski and Maciej Piernik},
  title = {Adaptive {XML} Stream Classification using Partial Tree-edit Distance},
  year = {2014},
  booktitle = {Foundations of Intelligent Systems},
  editor = {Andreasen, Troels and Christiansen, Henning and Cubero, Juan-Carlos and Ras, Zbigniew},
  volume = {8502},
  series = {Lecture Notes in Computer Science},
  pages = {10--19},
  publisher = {Springer International Publishing},
  doi = {10.1007/978-3-319-08326-1_2},
  isbn = {978-3-319-08325-4},
  keywords = {XML; data stream; classification; concept drift},
  url = {http://dx.doi.org/10.1007/978-3-319-08326-1_2}
}
Brzezinski, D. and Stefanowski, J. Combining block-based and online methods in learning ensembles from concept drifting data streams 2014 Information Sciences, Vol. 265, 50-67 article Accepted manuscript  
Abstract: Most stream classifiers are designed to process data incrementally, run in resource-aware environments, and react to concept drifts, i.e., unforeseen changes of the stream's underlying data distribution. Ensemble classifiers have become an established research line in this field, mainly due to their modularity which offers a natural way of adapting to changes. However, in environments where class labels are available after each example, ensembles which process instances in blocks do not react to sudden changes sufficiently quickly. On the other hand, ensembles which process streams incrementally, do not take advantage of periodical adaptation mechanisms known from block-based ensembles, which offer accurate reactions to gradual and incremental changes. In this paper, we analyze if and how the characteristics of block and incremental processing can be combined to produce new types of ensemble classifiers. We consider and experimentally evaluate three general strategies for transforming a block ensemble into an incremental learner: online component evaluation, the introduction of an incremental learner, and the use of a drift detector. Based on the results of this analysis, we put forward a new incremental ensemble classifier, called Online Accuracy Updated Ensemble, which weights component classifiers based on their error in constant time and memory. The proposed algorithm was experimentally compared with four state-of-the-art online ensembles and provided best average classification accuracy on real and synthetic datasets simulating different drift scenarios.
BibTeX:
@article{BrzezStefINS2014,
  author = {Dariusz Brzezinski and Jerzy Stefanowski},
  title = {Combining block-based and online methods in learning ensembles from concept drifting data streams},
  journal = {Information Sciences},
  doi = {10.1016/j.ins.2013.12.011},
  year = {2014},
  volume = {265},
  pages = {50--67}
}
Brzezinski, D. and Stefanowski, J. Reacting to Different Types of Concept Drift: The Accuracy Updated Ensemble Algorithm 2014 IEEE Transactions on Neural Networks and Learning Systems
Vol. 25(1), 81-94
article Accepted manuscript  
Abstract: Data stream mining has been receiving increasing attention due to its presence in a wide range of applications such as sensor networks, banking, and telecommunication. One of the most important challenges in learning from data streams is reacting to concept drift, i.e., unforeseen changes of the stream's underlying data distribution. Several classification algorithms that cope with concept drift have been put forward; however, most of them specialize in one type of change. In this paper, we propose a new data stream classifier, called the Accuracy Updated Ensemble (AUE2), which aims at reacting equally well to different types of drift. AUE2 combines accuracy-based weighting mechanisms known from block-based ensembles with the incremental nature of Hoeffding Trees. The proposed algorithm was experimentally compared with 11 state-of-the-art stream methods, including single classifiers, block-based and online ensembles, and hybrid approaches in different drift scenarios. Out of all the compared algorithms, AUE2 provided best average classification accuracy while proving to be less memory consuming than other ensemble approaches. Experimental results show that AUE2 can be considered suitable for scenarios involving many types of drift as well as static environments.
BibTeX:
@article{BrzezStefIEEE2014,
  author = {Dariusz Brzezinski and Jerzy Stefanowski},
  title = {Reacting to Different Types of Concept Drift: The Accuracy Updated Ensemble Algorithm},
  journal = {IEEE Transactions on Neural Networks and Learning Systems},
  year = {2014},
  volume = {25},
  number = {1},
  pages = {81--94},
  doi = {10.1109/TNNLS.2013.2251352}
}
Brzezinski, D. and Stefanowski, J. Classifiers for Concept-drifting Data Streams: Evaluating Things That Really Matter 2013 ECML PKDD 2013 Workshop on Real-World Challenges for Data Stream Mining, September 27th, Prague, Czech Republic  inproceedings Pdf 
Abstract: When evaluating the performance of a classifier for concept-drifting data streams, two factors are crucial: prediction accuracy and the ability to adapt. The first factor could be analyzed by a simple error-rate, which can be calculated using a holdout test set, chunks of examples, or incrementally after each example [1]. More recently, Gama [2] proposed prequential accuracy as a means of evaluating data stream classifiers and enhancing drift detection methods. For imbalanced data streams, Bifet and Frank [3] proposed the use of the Kappa statistic with a sliding window to assess the classifier’s predictive abilities. However, all of the aforementioned measures, when averaged over an entire stream, loose information about the classifier’s reactions to drifts. For example, an algorithm which has very high accuracy in periods of concept stability, but drastically looses on accuracy when drifts occur can still be characterized by higher overall accuracy than an algorithm which has lower accuracy between drifts, but reacts very well to changes. If we want our algorithm to react quickly to, e.g., market changes, we should choose the second algorithm, but to do so we would have to analyze the entire graphical plot of the classifier’s prequential accuracy, which cannot be easily automated and requires user interaction. To evaluate the second factor, the ability to adapt, separate methods are needed. Some researchers evaluate the classifier’s ability to adapt by comparing drift reaction times [4]. It is important to notice that in order to calculate reaction times, usually a human expert needs to determine moments when drifts start and stop. To automate the assessment of adaptability, Shaker and Hullermeier [5] proposed an approach, called recovery analysis, which uses synthetic datasets to calculate a classifier’s reaction time. A different evaluation method, which uses artificially generated datasets was proposed by Zliobaite [6]. 
The author put forward three controlled permutation techniques that create datasets which can help inform about the robustness of a classifier to variations in changes. However, approaches such as [5], which calculate absolute or relative drift reaction times, require external knowledge about drifts in real streams or the use of synthetic datasets and, therefore, can only be used offline. Furthermore, reaction times are always calculated separately from accuracy, which makes choosing the best classifier a difficult task. Similarly, controlled permutations require generating artificial datasets and, thus, are limited to offline use during model selection rather than on deployed models working online on real streams. The number of discussed approaches shows that the evaluation of data stream classifiers is an important topic and there is a need to develop new methods specifically for non-stationary environments. However, all of the proposed evaluation measures concentrate on a single factor instead of combining information about accuracy and adaptability. Furthermore, many methods require the creation of artificial datasets or complex user interaction, which makes these methods difficult to use online, on a deployed data stream classifier. With these challenges in mind, we propose a new aggregated measure, which: combines information about accuracy and adaptability; works online and does not require the creation of artificial datasets; can be averaged over an entire dataset; and can be parametrized according to application-related costs, which define the importance of accuracy compared to drift reactions.
BibTeX:
@inproceedings{realstream2013,
  author = {Brzezinski, Dariusz and Stefanowski, Jerzy},
  title = {Classifiers for Concept-drifting Data Streams: Evaluating Things That Really Matter},
  booktitle = {ECML PKDD 2013 Workshop on Real-World Challenges for Data Stream Mining, September 27th, Prague, Czech Republic},
  year = {2013},
  pages = {10--14},
  url = {http://www.cs.put.poznan.pl/dbrzezinski/publications/RealStream2013.pdf}
}
Brzezinski, D. & Stefanowski, J. From Block-based Ensembles to Online Learners In Changing Data Streams: If- and How-To 2012 ECML PKDD 2012 Workshop on Instant Interactive Data Mining, September 24th, Bristol, UK  inproceedings Pdf
Presentation 
Abstract: Ensemble classifiers have become an established research line in the field of mining time-changing data streams. However, in environments where class labels are available after each example, ensembles which process instances in blocks do not react to sudden changes sufficiently quickly. On the other hand, existing online ensemble algorithms, which process streams incrementally, do not take advantage of periodical weighting mechanisms known from block-based ensembles, which offer accurate reactions to gradual and recurring changes. In this paper, we analyze if and how the characteristics of block and incremental processing can be combined to produce accurate ensemble classifiers. We propose and experimentally evaluate three strategies for transforming a block ensemble into an online learner: the use of a sliding window, an additional incrementally trained ensemble member, and a drift detector. The obtained results verify which of these approaches is most effective and what characteristics of block processing are most beneficial in online environments.
BibTeX:
@inproceedings{iid2012,
  author = {Brzezinski, Dariusz and Stefanowski, Jerzy},
  title = {From Block-based Ensembles to Online Learners In Changing Data Streams: If- and How-To},
  booktitle = {ECML PKDD 2012 Workshop on Instant Interactive Data Mining, September 24th, Bristol, UK},
  year = {2012},
  url = {http://www.cs.put.poznan.pl/dbrzezinski/publications/IID2012.pdf}
}
Brzezinski, D., Lesniewska, A., Morzy, T. & Piernik, M. XCleaner: A New Method for Clustering XML Documents by Structure 2011 Control and Cybernetics
Vol. 40(3)
article Pdf  
Abstract: The XML format has become a popular method of data representation
in many different domains like databases or the Web. The
growth of popularity of XML results from its ability to encode structural
information about data records. However, this structural characteristic
of XML data sets also implies a variety of new data mining problems. In
this paper, structure-based clustering of XML documents is considered.
We propose a new clustering algorithm that uses maximal frequent
subtrees of a set of documents in order to divide these documents into
structurally similar groups of data. The algorithm is implemented and
evaluated on real and synthetic data sets.
BibTeX:
@article{BrzezinskiSMICC,
  author = {Dariusz Brzezinski and Anna Lesniewska and Tadeusz Morzy and Maciej Piernik},
  title = {XCleaner: A New Method for Clustering XML Documents by Structure},
  journal = {Control and Cybernetics},
  year = {2011},
  volume = {40},
  number = {3},
  url = {http://www.cs.put.poznan.pl/dbrzezinski/publications/XCleaner2011.pdf}
}
Brzezinski, D. & Stefanowski, J. Accuracy Updated Ensemble for Data Streams with Concept Drift 2011 Hybrid Artificial Intelligent Systems, 6th International Conference, HAIS 2011, Wrocław, Poland, May 23-25, 2011. Proceedings, Part II, pp. 155-163 inproceedings Pdf
Presentation 
Abstract: In this paper we study the problem of constructing accurate block-based ensemble classifiers from time evolving data streams. AWE is the best-known representative of these ensembles. We propose a new algorithm called Accuracy Updated Ensemble (AUE), which extends AWE by using online component classifiers and updating them according to the current distribution. Additional modifications of weighting functions solve problems with undesired classifier exclusion seen in AWE. Experiments with several evolving data sets show that, while still requiring constant processing time and memory, AUE is more accurate than AWE.
BibTeX:
@inproceedings{DBLP:conf/hais/BrzezinskiS11,
  author    = {Dariusz Brzezinski and
               Jerzy Stefanowski},
  title     = {Accuracy Updated Ensemble for Data Streams with Concept
               Drift},
  booktitle = {HAIS (2)},
  year      = {2011},
  pages     = {155--163},
  ee        = {http://dx.doi.org/10.1007/978-3-642-21222-2_19},
  crossref  = {DBLP:conf/hais/2011-2},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
@proceedings{DBLP:conf/hais/2011-2,
  editor    = {Emilio Corchado and
               Marek Kurzynski and
               Michal Wozniak},
  title     = {Hybrid Artificial Intelligent Systems - 6th International
               Conference, HAIS 2011, Wroclaw, Poland, May 23-25, 2011,
               Proceedings, Part II},
  booktitle = {HAIS (2)},
  publisher = {Springer},
  series    = {Lecture Notes in Computer Science},
  volume    = {6679},
  year      = {2011},
  isbn      = {978-3-642-21221-5},
  ee        = {http://dx.doi.org/10.1007/978-3-642-21222-2},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
Brzezinski, D., Lesniewska, A., Morzy, T. & Piernik, M. XCleaner: A New Method for Clustering XML Documents by Structure 2010 Congress of Young IT Scientists 2010 other Poster 
Brzezinski, D. Mining Data Streams with Concept Drift 2010 School: Poznan University of Technology  mastersthesis Pdf  
Abstract: In today’s information society, computer users are used to gathering and sharing data anytime and
anywhere. This concerns applications such as social networks, banking, telecommunication, health
care, research, and entertainment, among others. As a result, a huge amount of data related to
all human activity is gathered for storage and processing purposes. These data sets may contain
interesting and useful knowledge represented by hidden patterns, but due to the volume of the
gathered data it is impossible to manually extract that knowledge. That is why data mining and
knowledge discovery methods have been proposed to automatically acquire interesting, non-trivial,
previously unknown and ultimately understandable patterns from very large data sets [26, 14].
Typical data mining tasks include association mining, classification, and clustering, which all have
been perfected for over two decades.
A recent report [35] estimated that the digital universe in 2007 was 281 billion gigabytes large
and it is forecast that it will reach 5 times that size by 2011. The same report states that by
2011 half of the produced data will not have a permanent home. This is partially due to a new
class of emerging applications - applications in which data is generated at very high rates in the
form of transient data streams. Data streams can be viewed as a sequence of relational tuples (e.g.,
call records, web page visits, sensor readings) that arrive continuously in time-varying, possibly
unbounded streams. Due to their speed and size it is impossible to store them permanently [45].
Data stream application domains include network monitoring, security, telecommunication data
management, web applications, and sensor networks. The introduction of this new class of applications
has opened an interesting line of research problems including novel approaches to knowledge
discovery called data stream mining.
Current research in data mining is mainly devoted to static environments, where patterns
hidden in data are fixed and each data tuple can be accessed more than once. The most popular
data mining task is classification, defined as generalizing a known structure to apply it to new
data [26]. Traditional classification techniques give great results in static environments; however,
they fail to successfully process data streams because of two factors: their overwhelming volume
and their distinctive feature - concept drift. Concept drift is a term used to describe changes
in the learned structure that occur over time. These changes mainly involve substitutions of
one classification task with another, but also include steady trends and minor fluctuations of the
underlying probability distributions [54]. For most traditional classifiers the occurrence of concept
drift leads to a drastic drop in classification accuracy. That is why recently, new classification
algorithms dedicated to data streams have been proposed.
The recognition of concept drift in data streams has led to sliding-window approaches that
model a forgetting process, which makes it possible to limit the amount of processed data and to react to
changes. Different approaches to mining data streams with concept drift include instance selection
methods, drift detection, ensemble classifiers, option trees and using Hoeffding boundaries to
estimate classifier performance.
Recently, a framework called Massive Online Analysis (MOA) for implementing algorithms
and running experiments on evolving data streams has been developed [12, 11]. It includes a
collection of offline and online data stream mining methods as well as tools for their evaluation.
MOA is a new environment that can facilitate and consequently accelerate the development of new
time-evolving stream classifiers.
The aim of this thesis is to review and compare single classifier and ensemble approaches to data
stream mining. We test time and memory costs, as well as classification accuracy, of representative
algorithms from both approaches. The experimental comparison of one of the algorithms, called
Accuracy Weighted Ensemble, with other selected classifiers has, to our knowledge, not been previously
done. Additionally, we propose and evaluate a new algorithm called Accuracy Diversified
Ensemble, which selects, weights, and updates ensemble members according to the current stream
distribution. For our experiments we use the Massive Online Analysis environment and extend
it by attribute filtering and data chunk evaluation procedures. We also verify the framework’s
capability to become the first commonly used software environment for research on learning from
evolving data streams.
BibTeX:
@mastersthesis{BrzezMs2010,
  author = {Dariusz Brzezinski},
  title = {Mining Data Streams with Concept Drift},
  school = {Poznan University of Technology},
  year = {2010},
  url = {http://www.cs.put.poznan.pl/dbrzezinski/publications/ConceptDrift.pdf}
}
Brzezinski, D., Lesniewska, A., Morzy, T. & Piernik, M. Clustering XML Documents by Patterns 2010 III Polish National Conference on Data Processing Technologies, KKNTPD 2010, Poznan, Poland, 2010. Proceedings, pp. 297-308 inproceedings  
Abstract: With the rapidly growing data resources on the Internet, XML is
one of the most important standards, providing invaluable help for document
management. Not only does it provide enhancements to document exchange
and storage, but it is also helpful with a variety of information retrieval tasks.
Document clustering is one of the most interesting research areas that utilize
XML's semi-structural nature. In this paper, we propose a new XML clustering
method that relies solely on document structure. We propose sets of patterns
as cluster representations and pattern-document links as a way of assigning
documents to clusters. The presented approach is tested over real and synthetic data sets with promising results.
BibTeX:
@inproceedings{BrzezKKNTPD2010,
  author = {Dariusz Brzezinski and Anna Lesniewska and Tadeusz Morzy and Maciej Piernik},
  title = {Clustering XML Documents by Patterns},
  booktitle = {KKNTPD 2010 - III Polish National Conference on Data Processing Technologies},
  year = {2010}
}

Contact

Dariusz Brzeziński, Ph.D., D.Sc.
Institute of Computing Science
Poznan University of Technology
Piotrowo 2,
60-965 Poznan, Poland

BTiCW, room 2.7.13/5
(+48) 61 665-30-57
Tuesday 9:30-11:15