News

05-09-2019 The last chance to pass the course: Thursday, September 12, 11:00-12:00, office room 2 CW (Institute of Computing Science).
29-06-2019 The last chance for students below 50 points is on Thursday, July 4, 10:00-12:00, office room 2 CW (Institute of Computing Science).
29-06-2019 The sheets below with scores have been updated. Please verify your final score.
19-06-2019 A sheet with (current) final scores is available here.
A sheet with scores from the first quiz is available here.
A sheet with current scores from labs is available here.
28-05-2019 The first quiz will take place on Saturday, June 15, 15:10, room CW 8.
The second quiz will take place on Saturday, June 29, 9:00, room CW 13.
The last lab meeting will take place on Saturday, June 29, 14:00, room CW 45.
Practice quiz questions can be found here.
23-02-2019 The new semester has finally begun :)

The aim and the scope of the course

The aim of the course: To get to know technologies and algorithms for processing massive datasets.

The scope of the course: We will learn how to organize, store, access, and process massive datasets.

The course is mainly based on the first 4 chapters of the Mining of Massive Datasets book.


Information about the course

Schedule of Lectures

23-02-2019 Introduction to massive datasets [pdf]
02-03-2019 Processing of massive datasets [pdf]
02-03-2019 Distributed processing and MapReduce [pdf]
30-03-2019 MapReduce in Spark [pdf]
30-03-2019 Approximate query processing [pdf]
25-05-2019 Finding similar items I [pdf]
25-05-2019 Finding similar items II [pdf]

Schedule of Labs

02-03-2019 Dimensional modeling and data transformation in bash [pdf] [unique_tracks.zip] [triplets_sample_20p.zip]
30-03-2019 MapReduce in Spark [pdf] [all-shakespeare.zip] [matrix M] [vector x] [vector v] [data for matrix multiplication]
12-05-2019 Hash functions and Bloom filters [pdf] [facts-nns.csv.gz]
15-06-2019 Nearest neighbor search [pdf] [facts-nns.csv.gz] [nearestneighbors-100-2018.result]

Evaluation

Lecture:
Test: 75% (min. 50%)
Labs: 25% (min. 50%)
Labs:
Regular exercises and homework: 100% (min. 50%)

Bonus points for all: up to 10 points.

Scale:
90% 5.0
80% 4.5
70% 4.0
60% 3.5
50% 3.0
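
The rules above can be sketched as a small calculation. This is only an illustration under stated assumptions, not the official grading procedure: the function name final_grade is hypothetical, scores are taken as percentages from 0 to 100, bonus points are assumed to be added to the weighted total, and 2.0 is assumed to be the failing grade for anything that does not reach the 3.0 threshold.

```python
# Scale from the course page: minimum total (in %) required for each grade.
SCALE = [(90, 5.0), (80, 4.5), (70, 4.0), (60, 3.5), (50, 3.0)]

FAIL = 2.0  # assumption: the page does not name the failing grade


def final_grade(test_pct, labs_pct, bonus=0.0):
    """Return the lecture grade for a test score and a labs score (both in %)."""
    # Each component has its own 50% minimum.
    if test_pct < 50 or labs_pct < 50:
        return FAIL
    # Bonus is capped at 10 points; adding it to the total is an assumption.
    bonus = min(max(bonus, 0.0), 10.0)
    # Weighted total: test counts for 75%, labs for 25%.
    total = 0.75 * test_pct + 0.25 * labs_pct + bonus
    for threshold, grade in SCALE:
        if total >= threshold:
            return grade
    return FAIL


if __name__ == "__main__":
    print(final_grade(80, 60))           # weighted total 75
    print(final_grade(80, 60, bonus=5))  # weighted total 80
```

With these assumptions, a student with 80% on the test and 60% from labs has a weighted total of 75 and lands in the 4.0 band; 5 bonus points would lift the total to 80.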

Bibliography

J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2014, http://www.mmds.org.

R. Kimball, M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, John Wiley & Sons, 2002.

H. Garcia-Molina, J. D. Ullman, J. Widom, Database Systems: The Complete Book, Second Edition, Pearson Prentice Hall, 2009.