Mining of Massive Datasets

News

17-01-2019	The final exam will take place on Monday 13:00, January 28, room L122 BT.
	The last lab meeting will take place on Monday 17:00, January 28, lab 43.
19-11-2018	A sheet with current scores is available here.
08-01-2019	We swap two lectures with the course "Enterprise distributed systems":
	- First change: 22-11-2018, 13:30 swapped with 13-12-2018, 11:45
	- Second change: 06-12-2018, 13:30 swapped with 20-12-2018, 11:45
19-11-2018	We swap two lectures with the course "Enterprise distributed systems":
31-10-2018	The lab tasks from today are mandatory for both groups; however, they will not be evaluated during the next meeting.
	Instead, the tasks will be extended during the next meeting and the final evaluation will concern both meetings.
17-10-2018	Office hours cancelled on Thursday, October 18, because of the PP-RAI conference (please email me if you want to meet).
04-10-2018	The new semester has begun :)

The aim of the course: To get to know the latest technologies and algorithms for mining massive datasets.

The scope of the course: We will learn about scalable algorithms for:

The course is mainly based on parts of the Mining of Massive Datasets book.

04-10-2018	Mining massive data sets [pdf]
11-10-2018	Classification and regression I [pdf]
17-10-2018	Classification and regression II [pdf]
08-11-2018	Classification and regression III [pdf]
15-11-2018	Classification and regression IV [pdf]
29-11-2018	Classification and regression V [pdf]
13-12-2018	Evolution of database systems [pdf]
13-12-2018	MapReduce [pdf]
20-12-2018	MapReduce in Spark [pdf]
10-01-2019	Finding Similar Items I [pdf]
16-01-2019	Finding Similar Items II [pdf]
17-01-2019	Finding Similar Items III [pdf]

10-10-2018	Bonferroni's principle [pdf]
18-10-2018	Solving problems by simulations [pdf]
25-10-2018	Classification and regression - Introduction to scikit-learn [pdf]
31-10-2018	Classification and regression - Naive Bayes I [pdf] [code]
07-11-2018	Classification and regression - Naive Bayes II [pdf]
14-11-2018	Classification and regression - Testing classifiers [pdf] [code]
29-11-2018	Classification and regression - Decision boundary [pdf]
06-12-2018	Classification and regression - Cross-validation [pdf] [code]
13-12-2018	Processing of massive datasets [pdf] [unique_tracks.zip] [triplets_sample_20p.zip]
19-12-2018	Multidimensional modeling and data transformation in bash [pdf]
20-12-2018	MapReduce in Spark I [pdf] [all-shakespeare.zip] [matrix M] [vector x] [vector v]
03-01-2019	MapReduce in Spark II [pdf] [data for matrix multiplication]
10-01-2019	Exact and approximate nearest neighbor search [pdf] [facts-nns.csv.gz] [result.txt]

Lecture:

Test :	75%	(min. 50%)
Labs :	25%	(min. 50%)

Labs:

Regular tasks and homework :	100%	(min. 50%)

Bonus points for all: up to 10 percentage points.

A sheet with current scores is available here.

Scale:

J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2014, http://infolab.stanford.edu/~ullman/mmds.html.

H. Garcia-Molina, J. D. Ullman, J. Widom, Database Systems: The Complete Book. Second Edition. Pearson Prentice Hall, 2009.

J. Lin, Ch. Dyer, Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers, 2010, http://lintool.github.com/MapReduceAlgorithms/.

T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning. Second Edition. Springer, 2009, http://www-stat.stanford.edu/~tibs/ElemStatLearn/.

Ch. Lam, Hadoop in Action, Manning Publications Co., 2011.