Big Data Frameworks
Exam
Year | Semester | Date | Period | Language | Person in charge |
---|---|---|---|---|---|
2016 | spring | 15.03-03.05. | 4-4 | English | Mohammad Hoque |
Lectures
Time | Room | Lecturer | Date |
---|---|---|---|
Tue 12-14 | C222 | Mohammad Hoque | 15.03.2016-03.05.2016 |
Exercise groups
Time | Room | Instructor | Date | Notes |
---|---|---|---|---|
Thu 10-12 | C222 | Mohammad Hoque | 14.03.2016-06.05.2016 | |
Registration for this course starts on Tuesday 16th of February at 9.00.
Information for international students
This course examines current and emerging Big Data frameworks with a focus on data science applications. The course begins with an introduction to data science. It then focuses on the internals of the Berkeley data analysis framework, Spark, and on Big Data machine learning (ML) pipelines. The course consists of lectures and assignments.
General
The course consists of lectures and assignments. At the end of the course there will be a final exam. The assignments are based on the Spark data analysis framework and the Scala programming language.
Completing the course
Exercises
In the first week's exercise session, on Thursday 17.3. at 10-12, we have a Spark coding tutorial. Please bring your laptop if you have one. You can install the latest Spark version beforehand. The Spark instructions are available here. We will be using Spark 1.6.0 with Scala 2.10.x. Last year's instruction slides are available here: http://is.gd/bigdatascala. Scala By Example is here. All exercises must be submitted via Moodle.
Set 1. Self-Assessment and Scala Review (Deadline 24.03.2016; check the second link). The purpose is to familiarize yourself with the basic mathematical and statistical terms for this course. Please skip the Python quiz in the self-assessment.
(a) Self-Assessment
(b) Understanding Data Set and Scala Review Exercise
Set 2. Spark Preliminary (A Set of Spark Exercises)
Set 3. Linear Algebra and (A Set of Spark Exercises, Deadline 20.04.2016)
Set 4. Advanced Spark Application Optimization and Classification
Lectures
15.03.2016 General Info, Course Overview - Data Science
22.03.2016 MapReduce Paradigm, and Spark Internals
29.03.2016 Easter Break
05.04.2016 Spark Programming and Algorithms by Dr. Eemil Lagerspetz
12.04.2016 Machine Learning on Big Data - Part I (Prediction)
19.04.2016 Shuffling, Partitioning and Closures
26.04.2016 Spark MLlib and Streaming Spark Internals
03.05.2016 Data Processing and Exam
Literature and material
Reading List for Exam
(1) Lecture Slides
(2) Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Originally OSDI 2004. CACM Volume 51 Issue 1, January 2008. http://dl.acm.org/citation.cfm?id=1327492.
(3) Matei Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012. http://usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
(4) Tim Kraska et al. MLbase: A Distributed Machine-learning System. CIDR 2013. http://www.cs.ucla.edu/~ameet/mlbase.pdf
(5) Feng Li, Beng Chin Ooi, M. Tamer Özsu, and Sai Wu. Distributed Data Management Using MapReduce. ACM Comput. Surv. 46, 3, Article 31 (January 2014), 42 pages.