11-10-2017 The new semester has begun :)

The aim and the scope of the course

The aim of the course: To get to know the latest technologies and algorithms for mining massive datasets.

The scope of the course: We will learn about MapReduce and scalable algorithms for:

The course is mainly based on parts of the Mining of Massive Datasets book.

Main information about the course

Time and place

Schedule of lectures

11-10-2017 Mining massive data sets [pdf]
18-10-2017 Classification and regression I [pdf]
25-10-2017 Classification and regression II [pdf]
08-11-2017 Classification and regression III [pdf]
22-11-2017 Classification and regression IV [pdf]
29-11-2017 Classification and regression V [pdf]
06-12-2017 Evolution of database systems [pdf]
13-12-2017 MapReduce in Spark [pdf]
20-12-2017 Algorithms in MapReduce [pdf]
03-01-2018 Multi-dimensional Index Structures [pdf]
10-01-2018 Finding Similar Items I [pdf]
17-01-2018 Finding Similar Items II [pdf]

Schedule of labs

11-10-2017 Bonferroni's principle [pdf]
18-10-2017 Solving problems by simulations [pdf]
25-10-2017 Classification and regression - Naive Bayes I [pdf]
08-11-2017 Classification and regression - Naive Bayes II [pdf] [code]
15-11-2017 Classification and regression - Naive Bayes III [pdf]
22-11-2017 Classification and regression - Testing classifiers [pdf] [code]
29-11-2017 Classification and regression - Decision boundary [pdf]
06-12-2017 Classification and regression - Cross-validation [pdf] [code]
13-12-2017 Dimensional modeling and data transformation [pdf] [report-1.pdf] [report-1.tex]
20-12-2017 MapReduce in Spark [pdf] [matrix M] [vector x] [vector v]
03-01-2018 MapReduce in Spark: Relational algebra and matrix multiplication [pdf] [data for matrix multiplication]
10-01-2018 Nearest Neighbor Search [pdf] [result.txt]
17-01-2018 Approximate Nearest Neighbor Search [pdf]


Test: 75% (min. 50%)
Labs: 25% (min. 50%)
Regular tasks and homework: 100% (min. 50%)

Bonus points for all: up to 10 percentage points.


90% 5.0
80% 4.5
70% 4.0
60% 3.5
50% 3.0
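
As an illustration only, the weights, minimums, and scale listed above could be combined as in the following Python sketch. It is not an official rule of the course and rests on assumptions the page does not state: that the bonus is added to the weighted total, that thresholds are inclusive, and that a result below a minimum means the course is not passed. The function name final_grade and its parameters are hypothetical.

```python
from typing import Optional

# Weights and thresholds as listed on this page.
TEST_WEIGHT = 0.75
LABS_WEIGHT = 0.25
MIN_COMPONENT = 50.0  # minimum percentage required in each component
MAX_BONUS = 10.0      # bonus is capped at 10 percentage points

# (threshold in percent, grade), checked from the highest threshold down.
SCALE = [(90.0, 5.0), (80.0, 4.5), (70.0, 4.0), (60.0, 3.5), (50.0, 3.0)]


def final_grade(test_pct: float, labs_pct: float, bonus: float = 0.0) -> Optional[float]:
    """Return the grade, or None if the course is not passed.

    Assumptions (not stated on the page): the bonus is added to the
    weighted total, and every threshold is inclusive.
    """
    # Each component must reach its own minimum first.
    if test_pct < MIN_COMPONENT or labs_pct < MIN_COMPONENT:
        return None
    # Clamp the bonus to the range [0, MAX_BONUS] before adding it.
    capped_bonus = min(max(bonus, 0.0), MAX_BONUS)
    total = TEST_WEIGHT * test_pct + LABS_WEIGHT * labs_pct + capped_bonus
    for threshold, grade in SCALE:
        if total >= threshold:
            return grade
    return None
```

For example, under these assumptions a test result of 80% and a lab result of 60% give a weighted total of 75%, which falls in the 70% band of the scale.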


J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2014.

H. Garcia-Molina, J. D. Ullman, J. Widom, Database Systems: The Complete Book. Second Edition. Pearson Prentice Hall, 2009.

J. Lin, Ch. Dyer, Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers, 2010.

T. Hastie, R. Tibshirani, J. Friedman, Elements of Statistical Learning: Second Edition. Springer, 2009.

Ch. Lam, Hadoop in Action, Manning Publications Co., 2011.