Big Data Frameworks

Distributed Systems and Data Communication
Advanced studies


13.05.2016 09.00 A111
Year Semester Date Period Language Person in charge
2016 spring 15.03-03.05. 4-4 English Mohammad Hoque


Time Room Lecturer Date
Tue 12-14 C222 Mohammad Hoque 15.03.2016-03.05.2016


Group: 1
Time Room Instructor Date Notes
Thu 10-12 C222 Mohammad Hoque 14.03.2016-06.05.2016

Registration for this course starts on Tuesday, 16 February, at 9.00.

Information for international students

This course examines current and emerging Big Data frameworks, with a focus on Data Science applications. The course begins with an introduction to Data Science. It then focuses on the internals of the Berkeley data analysis framework, Spark, and on Big Data machine learning (ML) pipelines.

The course consists of lectures and assignments. At the end of the course there will be a final exam. The assignments are based on the Spark data analysis framework and the Scala programming language.

Completing the course


The first week's exercise session, on Thursday 17.3. at 10-12, is a Spark coding tutorial. Please bring your laptop with you if you have one. You can install the latest Spark version beforehand. The Spark instructions are available here. We will be using Spark 1.6.0 with Scala 2.10.x. Last year's instruction slides are available here; Scala By Example is here. All exercises must be submitted via Moodle.


Set 1. Self-Assessment and Scala Review (deadline 24.03.2016; check the second link). The purpose is to familiarize yourself with the basic mathematical and statistical terms used in this course. Please skip the Python quiz in the Self-Assessment.

         (a) Self-Assessment

         (b) Understanding the Data Set and Scala Review Exercise

Set 2. Spark Preliminary (A Set of Spark Exercises)

Set 3. Linear Algebra and (A Set of Spark Exercises, Deadline 20.04.2016)

Set 4. Advanced Spark Application Optimization and Classification


15.03.2016  General Info, Course Overview - Data Science

22.03.2016  MapReduce Paradigm and Spark Internals

29.03.2016  Easter Break

05.04.2016  Spark Programming and Algorithms, by Dr. Eemil Lagerspetz

12.04.2016  Machine Learning on Big Data - Part I (Prediction)

19.04.2016  Shuffling, Partitioning and Closures

26.04.2016  Spark MLlib and Streaming Spark Internals

03.05.2016  Data Processing and Exam




Literature and material

Reading List for Exam

(1) Lecture Slides

(2) Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Originally OSDI 2004. CACM Volume 51 Issue 1, January 2008.

(3) Matei Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.

(4) Tim Kraska et al. MLbase: A Distributed Machine-learning System. CIDR 2013.

(5) Feng Li, Beng Chin Ooi, M. Tamer Özsu, and Sai Wu. Distributed Data Management Using MapReduce. ACM Comput. Surv. 46, 3, Article 31 (January 2014), 42 pages.