Big Data Frameworks

Basic information

Course code: 582740

Credits: 5

Specialisation: Distributed systems and data communications

Level: Advanced studies

Description:

Exam

08.05.2015 09.00 B123

Year	Semester	Date	Period	Language	Person in charge
2015	spring	10.03-28.04.	4-4	English	Sasu Tarkoma

Lectures

Time	Room	Lecturer	Date
Tue 12-14	D122	Sasu Tarkoma	10.03.2015-28.04.2015

Exercise groups

Group: 1
Time	Room	Instructor	Date
Fri 10-12	D122	Mohammad Hoque	13.03.2015-13.03.2015
Fri 10-12	D122	Mohammad Hoque	16.03.2015-24.04.2015
Wed 10-12	D122	Mohammad Hoque	29.04.2015-29.04.2015

General

This course examines current and emerging Big Data frameworks with a focus on Data Science applications. It starts with an introduction to MapReduce-based systems and then focuses on Spark and the Berkeley Data Analytics Stack (BDAS) architecture. The course covers traditional MapReduce processing, streaming operation, machine learning, and SQL integration. The course consists of lectures and assignments.
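As a rough illustration of the MapReduce model mentioned above, the following sketch counts words in three explicit phases: map, shuffle, and reduce. It is not taken from the course material. It uses plain Python rather than the Spark and Scala stack used in the assignments, runs on a single machine, and all function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical names chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as a framework would do
    between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data frameworks", "big data challenge"])
print(counts)
```

In a real framework the map and reduce calls would run in parallel on different machines and the shuffle would move data over the network; here everything runs in one process so that only the data flow is visible.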

The course has an IRCnet channel #tkt-bdf.

Assignments are given by Ella Peltonen, Eemil Lagerspetz, and Mohammad Hoque.

Completing the course

The course consists of lectures and course assignments. The assignments are based on the Spark Big Data framework and the Scala programming language.

Exercises

Instead of the first week exercise session, we have a Spark coding tutorial on Friday 13.3. at 10-12. Please bring your laptop with you, if you have one. You can install the latest Spark version beforehand.

The Scala & Spark Tutorial 13.03.2015 slides are available here: http://is.gd/bigdatascala

The first exercise set is now out: link. The deadline is strictly 19.3. at 2pm; return your answers via Moodle. The first exercises will be discussed on Friday 20.3.

You can find the answers for Exercise set 1 here (not yet complete).

The second exercise set is available there. The deadline is 26.3. at 2pm; please return your answers via Moodle. These exercises will be discussed on Friday 27.3., when there will also be a Q&A for exercise set three. Some hints are included in the exercise set. Extended deadline: 2.4. at 2pm. The maximum number of points is 5 if you use the extension. You can pick and do 5 exercises that you are sure of, or do all 6 if you are not sure about one of them.

Access answers for Exercise Set 2 here (Not yet complete).

The third exercise set is now published. The deadline is 9.4. at 2pm; please return your answers via Moodle. These exercises will be discussed on Friday 10.4., after Easter. Because of the Easter break, there will be no exercise session on 3.4. Extended deadline: 16.4. at 2pm. The maximum number of points is 5 if you use the extension. Please return the entire solution set, including the exercises you were already happy with in the first round.

On Friday 17.4., there is a Q&A session instead of the exercise session. Prepare your questions beforehand.

The fourth (and last) exercise set is published. The deadline is 23.4. at 2pm; return your answers via Moodle as always. These exercises will be discussed on Friday 24.4. Note that there will be no extension for this last exercise set.

Tentative lecture outline

10.3. Introduction and the Big Data Challenge

17.3. MapReduce and Spark: Overview. MapReduce details.

24.3. MapReduce Optimizations and Algorithms. Spark Internals. Spark background material (not presented).

31.3. Distributed algorithms for Big Data: Elastic Data Processing by Lirim Osmani, Developing Spark Algorithms by Eemil Lagerspetz

7.4. Easter break

14.4. MLbase, MLlib, GraphX, and Spark Streaming

21.4. Two industry presentations (Nokia and F-Secure) on Big Data and Spark

28.4. Spark and bioinformatics. Summary

Course exam

- Results of the 8.5. exam.

Separate exams (for those who have completed the assignments)

Results of the 16.6.2015 exam
18.9.2015 16:00 in B123
1.12.2015 16:00 in B123

Literature and material

The course is based on the lectures, assignments, and additional material available on the Web. The key source of information is the Apache Spark Web site:

http://spark.apache.org

http://spark.apache.org/docs/latest/programming-guide.html

Reading list (for the exam):

1. MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat.

Originally OSDI 2004. CACM Volume 51 Issue 1, January 2008.

2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia et al. NSDI 2012.

3. HaLoop: Efficient Iterative Data Processing on Large Clusters. Yingyi Bu et al. In VLDB'10: The 36th International Conference on Very Large Data Bases, Singapore, 24-30 September, 2010.

4. MLbase: A Distributed Machine-learning System. Tim Kraska et al. CIDR 2013.

Additional material (not directly part of exam material):

5. https://developer.yahoo.com/hadoop/tutorial/module4.htm

6. http://www.slideshare.net/liancheng/dtcc-14-spark-runtime-internals?next_slideshow=1

7. http://horicky.blogspot.fi/2013/12/spark-low-latency-massively-parallel.html

8. http://mesos.apache.org/documentation/latest/mesos-architecture/

9. Feng Li, Beng Chin Ooi, M. Tamer Özsu, and Sai Wu. 2014. Distributed data management using MapReduce. ACM Comput. Surv. 46, 3, Article 31 (January 2014), 42 pages.

Department of Computer Science [pre 2018 site]

University of Helsinki

Faculty of Science