29-01-2019 The quiz repetition will take place on Wednesday, 14:00, room 13CW.
29-01-2019 The results from the first quiz with the final evaluation can be found here.
29-01-2019 The last lab with Krzysztof Dembczyński will take place on Wednesday, January 30, 12:00 (the time has changed!!!), lab 43.
15-01-2019 The final quiz will take place on Monday, January 21, 13:30, room 2CW (quiz repetition on Wednesday, January 30).
Remember that 1/20 is not 0.2 :)
14-01-2019 Practice questions for the quiz can be found here.
11-01-2019 An additional lecture to prepare for the final quiz will take place on Tuesday, January 15, 18:30, room L122BT.
28-11-2018 An additional lecture will take place on Tuesday, December 18, 18:30, room L122BT.
26-11-2018 Because of the NIPS conference, the lecture and all labs are cancelled next week (from 03-12-2018 to 07-12-2018).
29-10-2018 A sheet with current scores is available here.
23-10-2018 The next lecture has been moved from Monday, October 29, to Tuesday, October 30, 18:30, room L122BT.
17-10-2018 Office hours are cancelled on Thursday, October 18, because of the PP-RAI conference (please email me if you want to meet).
08-10-2018 The new semester has finally begun :)

The aim and the scope of the course

The aim of the course: To get to know technologies and algorithms for processing massive datasets.

The scope of the course: We will learn how to organize, store, access, and process massive datasets.

The course is based on parts of the Mining of Massive Datasets book.

Information about the course

Time and Place

Schedule of Lectures

08-10-2018 Processing of massive data sets [pdf]
15-10-2018 Evolution of database systems [pdf]
22-10-2018 Dimensional modeling [pdf]
29-10-2018 Processing of massive data sets I [pdf]
05-11-2018 Processing of massive data sets II [pdf]
19-11-2018 MapReduce in Spark I [pdf]
26-11-2018 MapReduce in Spark II [pdf]
10-12-2018 Approximate query processing I [pdf]
17-12-2018 Approximate query processing II [pdf]
18-12-2018 Finding similar items I [pdf]
07-01-2019 Finding similar items II [pdf]
14-01-2019 Finding similar items III [pdf]

Schedule of Labs

08-10-2018 Bonferroni's principle [pdf]
15-10-2018 Solving problems by simulations [pdf]
22-10-2018 Data transformation [pdf] [docker example]
29-10-2018 Dimensional modeling [pdf]
05-11-2018 Data transformation in bash [pdf]
19-11-2018 MapReduce in Spark I [pdf] [matrix M] [vector x] [vector v]
26-11-2018 MapReduce in Spark II [pdf] [data for matrix multiplication]
26-11-2018 MapReduce in Spark III - A bonus exercise [pdf]
10-12-2018 Bloom filters [pdf] [code]
17-12-2018 Nearest neighbor search [pdf] [facts-nns.csv.gz] [nearestneighbors-100-2018.result]
07-01-2019 Minhash signatures [pdf]
14-01-2019 Approximate nearest neighbor search [pdf]


Test: 75% (min. 50%)
Labs: 25% (min. 50%)
Regular exercises and homework: 100% (min. 50%)

Bonus points for all: up to 10 points.

A sheet with current scores is available here.

90% 5.0
80% 4.5
70% 4.0
60% 3.5
50% 3.0
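
The weights and the grade scale above can be combined into a short calculation. The sketch below is only an illustration and rests on several assumptions not stated on this page: the percentages in the scale are read as lower bounds, each of the two components must reach its 50% minimum on its own, and a result below the scale is reported as `None`. Bonus points are left out because the page does not say how they are applied. The names `GRADE_SCALE` and `final_grade` are hypothetical.

```python
# Hypothetical helper; names and the handling of failing scores are assumptions.
# Scale from the page, read as lower bounds: (minimum score in %, grade).
GRADE_SCALE = [(90, 5.0), (80, 4.5), (70, 4.0), (60, 3.5), (50, 3.0)]


def final_grade(test_pct, labs_pct):
    """Return (weighted score in %, grade); grade is None if not passed."""
    # Weights from the page: test 75%, labs 25%.
    score = 0.75 * test_pct + 0.25 * labs_pct
    # Assumption: each component must reach its 50% minimum separately.
    if test_pct < 50 or labs_pct < 50:
        return score, None
    # Pick the first (highest) threshold that the score reaches.
    for threshold, grade in GRADE_SCALE:
        if score >= threshold:
            return score, grade
    return score, None
```

For example, under these assumptions a test result of 80% and a lab result of 60% give a weighted score of 75%, which falls in the 4.0 band.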


J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2014.

R. Kimball, M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, John Wiley & Sons, 2002.

H. Garcia-Molina, J. D. Ullman, J. Widom, Database Systems: The Complete Book, Second Edition, Pearson Prentice Hall, 2009.

J. Lin, Ch. Dyer, Data-Intensive Text Processing with MapReduce, Morgan and Claypool Publishers, 2010.