Big Data Frameworks

582740
5
Distributed Systems and Data Communication
Advanced studies

Exam

08.05.2015 09.00 B123
Year Semester Date Period Language Person in charge
2015 spring 10.03-28.04. 4-4 English Sasu Tarkoma

Lectures

Time Room Lecturer Date
Tue 12-14 D122 Sasu Tarkoma 10.03.2015-28.04.2015

Exercise groups

Group: 1
Time Room Instructor Date Notes
Fri 10-12 D122 Mohammad Hoque 13.03.2015-13.03.2015
Fri 10-12 D122 Mohammad Hoque 16.03.2015-24.04.2015
Wed 10-12 D122 Mohammad Hoque 29.04.2015-29.04.2015

General

This course examines current and emerging Big Data frameworks with a focus on Data Science applications. The course starts with an introduction to MapReduce-based systems and then focuses on Spark and the Berkeley Data Analytics Stack (BDAS) architecture. The course covers traditional MapReduce processing, streaming operation, machine learning, and SQL integration. The course consists of lectures and assignments.

The course has an IRCnet channel #tkt-bdf.

Assignments are given by Ella Peltonen, Eemil Lagerspetz, and Mohammad Hoque.

Completing the course

The course consists of the lectures and the course assignments. The assignments are based on the Spark Big Data framework and the Scala programming language. 

Exercises

Instead of the first week's exercise session, we have a Spark coding tutorial on Friday 13.3. at 10-12. Please bring your laptop if you have one. You can install the latest Spark version beforehand.

The Scala & Spark Tutorial 13.03.2015 slides are available here: http://is.gd/bigdatascala

The first exercise set is now out: link. The deadline is strictly 19.3. at 2pm; return your answers via Moodle. The first exercises will be discussed on Friday 20.3.

You can find the answers for Exercise Set 1 here (not yet complete).

The second exercise set is available here. The deadline is 26.3. at 2pm; please return your answers via Moodle. These exercises will be discussed on Friday 27.3., when there will also be a Q&A for exercise set three. Some hints are included in the exercise set. Extended deadline: 2.4. at 2pm. The maximum number of points is 5 if you use this opportunity. You can pick and do the 5 exercises that you are sure of, or do all 6 if you are not sure about one of them.

Access answers for Exercise Set 2 here (Not yet complete).

The third exercise set is now published. The deadline is 9.4. at 2pm; please return your answers via Moodle. These exercises will be discussed on Friday 10.4., after Easter. Because of the Easter break, there will be no exercise session on 3.4. Extended deadline: 16.4. at 2pm. The maximum number of points is 5 if you use this opportunity. Please return the entire solution set, including the exercises you are happy with from the first round.

On Friday 17.4., there is a Q&A session instead of the exercise session. Prepare your questions beforehand.

The fourth (and last) exercise set is published. The deadline is 23.4. at 2pm; return your answers via Moodle as always. These exercises will be discussed on Friday 24.4. Note that there will be no extension for this last exercise set.

Tentative lecture outline

10.3. Introduction and the Big Data Challenge

17.3. MapReduce and Spark: Overview. MapReduce details. 

24.3. MapReduce Optimizations and Algorithms. Spark Internals. Spark background material (not presented).

31.3. Distributed algorithms for Big Data. Elastic Data Processing by Lirim Osmani. Developing Spark Algorithms by Eemil Lagerspetz.

7.4. Easter break

14.4. MLbase, MLlib, GraphX, and Streaming Spark

21.4. Two industry presentations (Nokia and F-Secure) on Big Data and Spark

28.4. Spark and bioinformatics. Summary.

 
Course exam
 
Separate exams (for those who have completed the assignments) 
 
 
 
 

Literature and material

The course is based on the lectures, the assignments, and additional material available on the Web. The key source of information is the Apache Spark website:

http://spark.apache.org

http://spark.apache.org/docs/latest/programming-guide.html

 

Reading list (for the exam):

1. MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat. Originally OSDI 2004. CACM Volume 51 Issue 1, January 2008.
 
 
3. HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu et al. In VLDB'10: The 36th International Conference on Very Large Data Bases, Singapore, 24-30 September, 2010.
 
4. MLbase: A Distributed Machine-learning System. Tim Kraska et al. CIDR 2013. 
 
Additional material (not directly part of exam material):
 
 
 
 
 
9. Feng Li, Beng Chin Ooi, M. Tamer Özsu, and Sai Wu. 2014. Distributed data management using MapReduce. ACM Comput. Surv. 46, 3, Article 31 (January 2014), 42 pages.