Big Data Frameworks

Basic information

Course code: 582740

Credit units: 5

Subprogramme: Networking and Services

Level: Advanced studies

Description:

Exam

13.05.2016 09.00 A111

Year	Semester	Date	Period	Language	In charge
2016	spring	15.03-03.05.	4-4	English	Mohammad Hoque

Lectures

Time	Room	Lecturer	Date
Tue 12-14	C222	Mohammad Hoque	15.03.2016-03.05.2016

Exercise groups

Group: 1
Time	Room	Instructor	Date	Observe
Thu 10-12	C222	Mohammad Hoque	14.03.2016-06.05.2016

Note:

Registration for this course starts on Tuesday 16th of February at 9.00.

Information for international students

This course examines current and emerging Big Data frameworks with a focus on data science applications. It begins with an introduction to data science, then focuses on the internals of the Berkeley Data Analysis Framework, Spark, and Big Data Machine Learning (ML) pipelines. The course consists of lectures and assignments.

General

The course consists of lectures and assignments, with a final exam at the end of the course. The assignments are based on the Spark data analysis framework and the Scala programming language.

Completing the course

Exercises

The first week's exercise session, on Thursday 17.3. at 10-12, is a Spark coding tutorial. Please bring your laptop if you have one. You can install the latest Spark version beforehand. The Spark instructions are available here. We will be using Spark 1.6.0 with Scala 2.10.x. Last year's instruction slides are available here: http://is.gd/bigdatascala. Scala By Example is here. All exercises must be submitted via Moodle.

Set 1. Self-Assessment and Scala Review (deadline 24.03.2016; check the second link). The purpose is to familiarize yourself with the basic mathematical and statistical terms used in this course. Please skip the Python quiz in the Self-Assessment.

(a) Self-Assessment

(b) Understanding Data Set and Scala Review Exercise

Set 2. Spark Preliminary (A Set of Spark Exercises)

Set 3. Linear Algebra and (A Set of Spark Exercises, Deadline 20.04.2016)

Set 4. Advanced Spark Application Optimization and Classification

Lectures

15.03.2016 General Info, Course Overview - Data Science

22.03.2016 MapReduce Paradigm and Spark Internals

29.03.2016 Easter Break

05.04.2016 Spark Programming and Algorithms by Dr. Eemil Lagerspetz

12.04.2016 Machine Learning on Big Data - Part I (Prediction)

19.04.2016 Shuffling, Partitioning, and Closures

26.04.2016 Spark MLlib and Streaming Spark Internals

03.05.2016 Data Processing and Exam

Literature and material

Reading List for Exam

(1) Lecture Slides

(2) Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Originally OSDI 2004. CACM Volume 51 Issue 1, January 2008. http://dl.acm.org/citation.cfm?id=1327492.

(3) Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia et al. NSDI (2012).
http://usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

(4) MLbase: A Distributed Machine-learning System. Tim Kraska et al. CIDR 2013. http://www.cs.ucla.edu/~ameet/mlbase.pdf

(5) Feng Li, Beng Chin Ooi, M. Tamer Özsu, and Sai Wu. 2014. Distributed data management using MapReduce. ACM Comput. Surv. 46, 3, Article 31 (January 2014), 42 pages.

