Big Data Frameworks
Exam
Year | Semester | Date | Period | Language | Instructor in charge |
---|---|---|---|---|---|
2015 | spring | 10.03-28.04. | 4-4 | English | Sasu Tarkoma |
Lectures
Time | Room | Lecturer | Date |
---|---|---|---|
Tue 12-14 | D122 | Sasu Tarkoma | 10.03.2015-28.04.2015 |
Exercise groups
Time | Room | Instructor | Date | Notes |
---|---|---|---|---|
Fri 10-12 | D122 | Mohammad Hoque | 13.03.2015-13.03.2015 | |
Fri 10-12 | D122 | Mohammad Hoque | 16.03.2015-24.04.2015 | |
Wed 10-12 | D122 | Mohammad Hoque | 29.04.2015-29.04.2015 | |
General
This course examines current and emerging Big Data frameworks with a focus on Data Science applications. The course starts with an introduction to MapReduce-based systems and then focuses on Spark and the Berkeley Data Analytics Stack (BDAS) architecture. The course covers traditional MapReduce processes, streaming operation, machine learning, and SQL integration. The course consists of lectures and assignments.
The course has an IRCnet channel #tkt-bdf.
Assignments are given by Ella Peltonen, Eemil Lagerspetz, and Mohammad Hoque.
Completing the course
The course consists of the lectures and the course assignments. The assignments are based on the Spark Big Data framework and the Scala programming language.
Exercises
Instead of an exercise session in the first week, there is a Spark coding tutorial on Friday 13.3. at 10-12. Please bring your laptop if you have one. You can install the latest Spark version beforehand.
The Scala & Spark Tutorial 13.03.2015 slides are available here: http://is.gd/bigdatascala
The first exercise set is now out: link. The deadline is strictly 19.3. at 2pm; return your answers via Moodle. The first exercises will be discussed on Friday 20.3.
You can find the answers for Exercise set 1 here (not yet complete).
The second exercise set is available here. The deadline is 26.3. at 2pm; please return your answers via Moodle. These exercises will be discussed on Friday 27.3., when there will also be a Q&A for exercise set three. Some hints are included in the exercise set. Extended deadline: 2.4. at 2pm. The maximum number of points is 5 if you use the extension. You can pick the 5 exercises you are sure of, or do all 6 if you are unsure about one of them.
The answers for Exercise set 2 are available here (not yet complete).
The third exercise set is now published. The deadline is 9.4. at 2pm; please return your answers via Moodle. These exercises will be discussed on Friday 10.4., after Easter. Because of the Easter break, there is no exercise session on 3.4. Extended deadline: 16.4. at 2pm. The maximum number of points is 5 if you use the extension. Please return the entire solution set, including the exercises you are happy with from the first round.
On Friday 17.4., there is a Q&A session instead of the exercise session. Prepare your questions beforehand.
The fourth (and last) exercise set is published. The deadline is 23.4. at 2pm; return your answers via Moodle as always. These exercises will be discussed on Friday 24.4. Note that there will be no extension for this last exercise set.
Tentative lecture outline
10.3. Introduction and the Big Data Challenge
17.3. MapReduce and Spark: Overview. MapReduce details.
24.3. MapReduce Optimizations and Algorithms. Spark Internals. Spark background material (not presented).
31.3. Distributed algorithms for Big Data: Elastic Data Processing by Lirim Osmani, Developing Spark Algorithms by Eemil Lagerspetz
7.4. Easter break
14.4. MLbase, MLlib, GraphX, and Spark Streaming
21.4. Two industry presentations (Nokia and F-Secure) on Big Data and Spark
28.4. Spark and bioinformatics. Summary
- Results of the 16.6.2015 exam
- 18.9.2015 16:00 in B123
- 1.12.2015 16:00 in B123
Literature and material
The course is based on the lectures, assignments, and additional material available on the Web. The key source of information is the Apache Spark website:
http://spark.apache.org/docs/latest/programming-guide.html
Reading list (for the exam):