Invited talk by Prof. Norman Paton (University of Manchester, UK)

Automating Data Preparation: Can We? Should We? Must We?


Abstract:
Obtaining value from data through analysis often requires significant prior effort on data preparation. Data preparation covers the discovery, selection, integration and cleaning of existing data sets into a form that is suitable for analysis. Data preparation, also known as data wrangling or extract-transform-load, is reported as taking 80% of the time of data scientists. How can this time be reduced? Can it be reduced by automation? There have been significant results on the automation of individual steps within the data wrangling process, and there are now a few proposals for end-to-end automation. This paper reviews the state of the art and asks the following questions: Can we automate data preparation - what techniques are already available? Should we - which data preparation activities are likely to be carried out better by software than by human experts? Must we - which data preparation challenges cannot realistically be addressed by manual approaches?

Bio:
Norman Paton is a Professor of Computer Science at the University of Manchester, where he co-leads the Information Management Group. He has had a range of roles at Manchester, including as Head of School. He works principally on databases and distributed information management. Current research interests include pay-as-you-go data integration, data wrangling and adaptive systems. He has been an investigator on over 40 research grants from the UK research councils, the EU and industry, and has published around 250 refereed articles.


Invited talk by Michal Bodziony (IBM Poland)

ETL in Big Data Architectures: IBM Approach to Design and Optimization of ETL Workflows


Abstract:
Optimisation of data integration workflows is an increasingly complex process. In the Big Data world, with hybrid cloud architectures and a plurality of non-relational data sources, legacy solutions are no longer sufficient. To ensure efficient data analysis, many new techniques have to be applied: machine learning, data visualisation, cloud-optimised data movement, and so on.
There are many dimensions in which ETL itself can be optimised, to name a few: execution time, resource consumption, simplicity of maintenance, reusability and, finally, total cost of ownership (which is affected by all the others). On the other hand, optimisation can be achieved by many means: data source tuning, hardware upgrades, parallelisation, pushdown mechanisms and other ELT definition rewritings. Many such optimisations can already be performed automatically with the proper tooling, but many methods are still waiting to be invented and implemented.
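To make the pushdown idea concrete, the following is a minimal sketch in Python, assuming a relational source reachable through the standard sqlite3 module; the orders table and the function names are hypothetical illustrations, not part of any IBM tooling. It contrasts extracting all rows and filtering in the integration engine with an ELT-style rewrite that pushes the filter and aggregation down into the source query, the kind of rewriting an optimiser can perform automatically.

    import sqlite3

    def total_naive(conn: sqlite3.Connection) -> float:
        """ETL-style: extract every row, then filter and aggregate in the engine."""
        rows = conn.execute("SELECT region, amount FROM orders").fetchall()
        return sum(amount for region, amount in rows if region == "EU")

    def total_pushdown(conn: sqlite3.Connection) -> float:
        """ELT-style rewrite: the filter and aggregation run at the source,
        so only a single scalar crosses the wire."""
        (total,) = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE region = 'EU'"
        ).fetchone()
        return total

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)",
                         [("EU", 10.0), ("US", 20.0), ("EU", 5.0)])
        # Both strategies compute the same result; only where the work happens differs.
        assert total_naive(conn) == total_pushdown(conn) == 15.0

Both functions return the same value; the rewrite changes only where the computation runs, trading engine-side processing and data movement for work delegated to the source.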
The application of ML to the auto-generation of integration flows or to the optimisation of integration definitions looks very promising. With good data governance, integration can become less expensive and offer a better user experience.

Bio:
Michal Bodziony is a senior performance specialist at IBM. During 20 years of professional experience he has played the roles of software architect and performance architect in several projects, always with a focus on performance. For several years he drove the architecture of the Optim Performance Manager tooling; then, for a few years, he was a Performance Architect for IBM PureData System for Analytics (a.k.a. Netezza). In recent years he has been involved in the development of IBM Unified Governance & Integration, focused on ETL performance and overall portfolio security. He is the author of several patents and many publications, mostly focused on performance optimisation.