News

17-01-2019 The final exam will take place on Monday 13:00, January 28, room L122 BT.
The last lab meeting will take place on Monday 17:00, January 28, lab 43.
19-11-2018 A sheet with current scores is available here.
08-01-2019 We swap two lectures with the course "Enterprise distributed systems":
- First change: 22-11-2018, 13:30 swapped with 13-12-2018, 11:45
- Second change: 06-12-2018, 13:30 swapped with 20-12-2018, 11:45
19-11-2018 We swap two lectures with the course "Enterprise distributed systems" (see the two changes listed above).
31-10-2018 The lab tasks from today are mandatory for both groups; however, they will not be evaluated during the next meeting.
Instead, the tasks will be extended during the next meeting, and the final evaluation will cover both meetings.
17-10-2018 Office hours cancelled on Thursday, October 18, because of the PP-RAI conference (please email me if you want to meet).
04-10-2018 The new semester has begun :)

The aim and the scope of the course

The aim of the course: To get to know the latest technologies and algorithms for mining massive datasets.

The scope of the course: We will learn about scalable algorithms for:

The course is mainly based on parts of the Mining of Massive Datasets book.


Main information about the course

Time and place


Schedule of lectures

04-10-2018 Mining massive data sets [pdf]
11-10-2018 Classification and regression I [pdf]
17-10-2018 Classification and regression II [pdf]
08-11-2018 Classification and regression III [pdf]
15-11-2018 Classification and regression IV [pdf]
29-11-2018 Classification and regression V [pdf]
13-12-2018 Evolution of database systems [pdf]
13-12-2018 MapReduce [pdf]
20-12-2018 MapReduce in Spark [pdf]
10-01-2019 Finding Similar Items I [pdf]
16-01-2019 Finding Similar Items II [pdf]
17-01-2019 Finding Similar Items III [pdf]

Schedule of labs

10-10-2018 Bonferroni's principle [pdf]
18-10-2018 Solving problems by simulations [pdf]
25-10-2018 Classification and regression - Introduction to scikit-learn [pdf]
31-10-2018 Classification and regression - Naive Bayes I [pdf] [code]
07-11-2018 Classification and regression - Naive Bayes II [pdf]
14-11-2018 Classification and regression - Testing classifiers [pdf] [code]
29-11-2018 Classification and regression - Decision boundary [pdf]
06-12-2018 Classification and regression - Cross-validation [pdf] [code]
13-12-2018 Processing of massive datasets [pdf] [unique_tracks.zip] [triplets_sample_20p.zip]
19-12-2018 Multidimensional modeling and data transformation in bash [pdf]
20-12-2018 MapReduce in Spark I [pdf] [all-shakespeare.zip] [matrix M] [vector x] [vector v]
03-01-2019 MapReduce in Spark II [pdf] [data for matrix multiplication]
10-01-2019 Exact and approximate nearest neighbor search [pdf] [facts-nns.csv.gz] [result.txt]

Evaluation

Lecture:
- Test: 75% (min. 50%)
- Labs: 25% (min. 50%)
Labs:
- Regular tasks and homework: 100% (min. 50%)

Bonus points for all: up to 10 percentage points.

A sheet with current scores is available here.

Scale:
90% 5.0
80% 4.5
70% 4.0
60% 3.5
50% 3.0
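
The rules above can be read as a small calculation. The sketch below is only an illustration of one possible reading, not the official grading procedure. It assumes that the lecture score is the weighted sum of the test (75%) and lab (25%) percentages, that each component must reach its 50% minimum on its own, that bonus points are added to the weighted total and capped at 10, and that the thresholds in the scale are inclusive. The function name `lecture_grade` and its parameters are hypothetical.

```python
from typing import Optional

# Thresholds from the "Scale" table: (minimum percentage, grade).
SCALE = [(90, 5.0), (80, 4.5), (70, 4.0), (60, 3.5), (50, 3.0)]


def lecture_grade(test_pct: float, labs_pct: float,
                  bonus_pts: float = 0.0) -> Optional[float]:
    """Return the lecture grade, or None if a minimum is not met."""
    # Assumption: each component must reach its 50% minimum separately.
    if test_pct < 50 or labs_pct < 50:
        return None
    # Weights from the Evaluation section: test 75%, labs 25%.
    total = 0.75 * test_pct + 0.25 * labs_pct
    # Assumption: the bonus is clamped to [0, 10] and added to the total.
    total += min(max(bonus_pts, 0.0), 10.0)
    # Assumption: thresholds are inclusive (exactly 90% gives 5.0).
    for threshold, grade in SCALE:
        if total >= threshold:
            return grade
    return None


if __name__ == "__main__":
    print(lecture_grade(80, 60))      # weighted total 75
    print(lecture_grade(80, 60, 5))   # weighted total 75 plus 5 bonus points
    print(lecture_grade(49, 100))     # test below its minimum
```

Under these assumptions, a test result of 80% and a lab result of 60% give a weighted total of 75%, which falls in the 4.0 band; five bonus points would lift it to 80% and the 4.5 band.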

Bibliography

J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2014, http://infolab.stanford.edu/~ullman/mmds.html.

H. Garcia-Molina, J. D. Ullman, J. Widom, Database Systems: The Complete Book, Second Edition, Pearson Prentice Hall, 2009.

J. Lin, Ch. Dyer, Data-Intensive Text Processing with MapReduce, Morgan and Claypool Publishers, 2010, http://lintool.github.com/MapReduceAlgorithms/.

T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Second Edition, Springer, 2009, http://www-stat.stanford.edu/~tibs/ElemStatLearn/.

Ch. Lam, Hadoop in Action, Manning Publications Co., 2011.