Processing of Massive Datasets

News

29-01-2019	The quiz retake will take place on Wednesday, January 30, 14:00, room 13CW.
29-01-2019	The results of the first quiz, together with the final evaluation, can be found here.
29-01-2019	The last labs with Krzysztof Dembczyński will take place on Wednesday, January 30, 12:00 (note that the time has changed), lab 43.
15-01-2019	The final quiz will take place on Monday, January 21, 13:30, room 2CW (quiz retake on Wednesday, January 30).
	Remember that 1/20 is not 0.2 :)
14-01-2019	Practice quiz questions can be found here.
11-01-2019	An additional lecture in preparation for the final quiz will take place on Tuesday, January 15, 18:30, room L122BT.
28-11-2018	An additional lecture will take place on Tuesday, December 18, 18:30, room L122BT.
26-11-2018	Because of the NIPS conference, the lecture and all labs are cancelled next week (from 03-12-2018 to 07-12-2018).
29-10-2018	A sheet with current scores is available here.
23-10-2018	The next lecture has been moved from Monday, October 29, to Tuesday, October 30, 18:30, room L122BT.
17-10-2018	Office hours are cancelled on Thursday, October 18, because of the PP-RAI conference (please email me if you want to meet).
08-10-2018	The new semester has finally begun :)

The aim of the course: To get to know technologies and algorithms for processing massive datasets.

The scope of the course: We will learn how to organize, store, access, and process massive datasets.

The course is based on parts of the Mining of Massive Datasets book.

Office hours: Thursday, 10:00-12:00, office room 2 CW (Institute of Computing Science)

Lectures

08-10-2018	Processing of massive data sets [pdf]
15-10-2018	Evolution of database systems [pdf]
22-10-2018	Dimensional modeling [pdf]
29-10-2018	Processing of massive data sets I [pdf]
05-11-2018	Processing of massive data sets II [pdf]
19-11-2018	MapReduce in Spark I [pdf]
26-11-2018	MapReduce in Spark II [pdf]
10-12-2018	Approximate query processing I [pdf]
17-12-2018	Approximate query processing II [pdf]
18-12-2018	Finding similar items I [pdf]
07-01-2019	Finding similar items II [pdf]
14-01-2019	Finding similar items III [pdf]

Labs

08-10-2018	Bonferroni's principle [pdf]
15-10-2018	Solving problems by simulations [pdf]
22-10-2018	Data transformation [pdf] [unique_tracks.zip] [triplets_sample_20p.zip] [docker example]
29-10-2018	Dimensional modeling [pdf]
05-11-2018	Data transformation in bash [pdf]
19-11-2018	MapReduce in Spark I [pdf] [all-shakespeare.zip] [matrix M] [vector x] [vector v]
26-11-2018	MapReduce in Spark II [pdf] [data for matrix multiplication]
26-11-2018	MapReduce in Spark III - A bonus exercise [pdf]
10-12-2018	Bloom filters [pdf] [code]
17-12-2018	Nearest neighbor search [pdf] [facts-nns.csv.gz] [nearestneighbors-100-2018.result]
07-01-2019	Minhash signatures [pdf]
14-01-2019	Approximate nearest neighbor search [pdf]

Lecture:

Test:	75%	(min. 50%)
Labs:	25%	(min. 50%)
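
The weighting above can be read as a weighted average with a pass threshold on each component. The sketch below is one possible reading, not the official grading procedure. It assumes that both scores are percentages between 0 and 100 and that "(min. 50%)" means each component must reach at least 50% on its own. It ignores bonus points and the grade scale. The function name `lecture_score` and its parameters are hypothetical.

```python
from typing import Optional

TEST_WEIGHT = 0.75  # "Test: 75%" from the table above
LABS_WEIGHT = 0.25  # "Labs: 25%" from the table above
MIN_PERCENT = 50.0  # "(min. 50%)", assumed to apply to each component separately


def lecture_score(test_percent: float, labs_percent: float) -> Optional[float]:
    """Return the weighted score in percent, or None if a minimum is not met."""
    for value in (test_percent, labs_percent):
        if not 0.0 <= value <= 100.0:
            raise ValueError("scores must be percentages between 0 and 100")
    # Assumption: failing either minimum means no passing score at all.
    if test_percent < MIN_PERCENT or labs_percent < MIN_PERCENT:
        return None
    return TEST_WEIGHT * test_percent + LABS_WEIGHT * labs_percent
```

Under these assumptions, `lecture_score(80, 60)` gives 0.75 * 80 + 0.25 * 60 = 75.0, while `lecture_score(90, 40)` gives `None` because the labs score is below the assumed minimum.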

Labs:

Regular exercises and homework:	100%	(min. 50%)

Bonus points for all: up to 10 points.

A sheet with current scores is available here.

Scale:

J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2014, http://www.mmds.org.

R. Kimball, M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, John Wiley & Sons, 2002.

H. Garcia-Molina, J. D. Ullman, J. Widom, Database Systems: The Complete Book, Second Edition, Pearson Prentice Hall, 2009.

J. Lin, C. Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010, http://lintool.github.com/MapReduceAlgorithms/.