Big Data Frameworks

Basic information

Course code: 582740

Credits: 5

Specialization: Distributed Systems and Data Communications

Level: Advanced studies

Description:

Exam

13.05.2016 09.00 A111

Year	Semester	Date	Period	Language	Person in charge
2016	spring	15.03-03.05.	4-4	English	Mohammad Hoque

Lectures

Time	Room	Lecturer	Date
Tue 12-14	C222	Mohammad Hoque	15.03.2016-03.05.2016

Exercise groups

Group: 1
Time	Room	Instructor	Date	Notes
Thu 10-12	C222	Mohammad Hoque	14.03.2016-06.05.2016

Note:

Registration for this course starts on Tuesday, 16 February, at 9.00.

Information for international students

This course examines current and emerging Big Data frameworks with a focus on data science applications. The course begins with an introduction to data science. It then focuses on the internals of the Berkeley Data Analysis Framework, Spark, and on Big Data machine learning (ML) pipelines. The course consists of lectures and assignments.

General

The course consists of lectures and assignments. At the end of the course there will be a final exam. The assignments are based on the Spark data analysis framework and the Scala programming language.

Completing the course

Exercises

The first week's exercise session is a Spark coding tutorial on Thursday 17.3. at 10-12. Please bring your laptop if you have one. You can install the latest Spark version beforehand. The Spark instructions are available here. We will be using Spark 1.6.0 with Scala 2.10.x. Last year's instruction slides are available here: http://is.gd/bigdatascala. Scala By Example is here. All exercises must be submitted via Moodle.
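
For those who prefer to build exercise code with sbt rather than work only in the Spark shell, the versions named above could be pinned roughly as follows. This is only a sketch: the use of sbt and the project name are assumptions and not part of the course instructions, and 2.10.6 is just one release in the 2.10.x line.

```scala
// build.sbt (sketch; the use of sbt is an assumption, not a course requirement)
name := "bdf-exercises"      // hypothetical project name

scalaVersion := "2.10.6"     // one release of the 2.10.x line used on the course

// %% appends the Scala binary version, so this resolves to spark-core_2.10
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
```

Keeping the Scala version on the 2.10.x line matters because Spark artifacts are published per Scala binary version, and mixing binary versions typically fails at link time.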

Set 1. Self-Assessment and Scala Review (Deadline 24.03.2016, check the second link). The purpose is to familiarize yourself with the basic mathematical and statistical terms used in this course. Please skip the Python quiz in the self-assessment.

(a) Self-Assessment

(b) Understanding Data Set and Scala Review Exercise

Set 2. Spark Preliminary (A Set of Spark Exercises)

Set 3. Linear Algebra and (A Set of Spark Exercises, Deadline 20.04.2016)

Set 4. Advanced Spark Application Optimization and Classification

Lectures

15.03.2016 General Info, Course Overview - Data Science

22.03.2016 MapReduce Paradigm and Spark Internals

29.03.2016 Easter Break

05.04.2016 Spark Programming and Algorithms by Dr. Eemil Lagerspetz

12.04.2016 Machine Learning on Big Data - Part I (Prediction)

19.04.2016 Shuffling, Partitioning, and Closures

26.04.2016 Spark MLlib and Streaming Spark Internals

03.05.2016 Data Processing and Exam

Literature and material

Reading List for Exam

(1) Lecture Slides

(2) Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Originally OSDI 2004. CACM Volume 51 Issue 1, January 2008. http://dl.acm.org/citation.cfm?id=1327492.

(3) Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia et al. NSDI (2012).
http://usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

(4) MLbase: A Distributed Machine-learning System. Tim Kraska et al. CIDR 2013. http://www.cs.ucla.edu/~ameet/mlbase.pdf

(5) Feng Li, Beng Chin Ooi, M. Tamer Özsu, and Sai Wu. 2014. Distributed data management using MapReduce. ACM Comput. Surv. 46, 3, Article 31 (January 2014), 42 pages.

Department of Computer Science [pre 2018 site]

University of Helsinki

Faculty of Science