Big Data Frameworks
Exam
Year | Term | Dates | Period | Language | Person in charge |
---|---|---|---|---|---|
2017 | spring | 14.03-04.05. | 4-4 | English | Mohammad Hoque |
Lectures
Time | Room | Lecturer | Dates |
---|---|---|---|
Tue 12-14 | C222 | Mohammad Hoque | 14.03.2017-11.04.2017 |
Tue 12-14 | C222 | Mohammad Hoque | 25.04.2017-02.05.2017 |
Exercise groups
Time | Room | Instructor | Dates | Notes |
---|---|---|---|---|
Thu 10-12 | C222 | Mohammad Hoque | 16.03.2017-06.04.2017 | |
Thu 10-12 | C222 | Mohammad Hoque | 20.04.2017-04.05.2017 | |
Registration for this course starts on Tuesday 16th of February at 9.00.
Information for international students
This course examines current and emerging Big Data frameworks with a focus on data science applications. The course begins with an introduction to data science, then focuses on the internals of the Berkeley Data Analysis Framework, Spark, and Big Data machine learning (ML) pipelines. The course consists of lectures and assignments. The learning goals of this course are outlined here.
General
The course consists of lectures and assignments. At the end of the course there will be a final exam. The assignments are based on the Spark data analysis framework and the Scala programming language.
Completing the course
Exercise Sessions (In Progress)
In the first week's exercise session, we have a Spark coding tutorial on Thursday 16.3. at 10-12. Please bring your laptop with you to every exercise session, if you have one. You can install the latest Spark version beforehand. The Spark instructions are available here. We will be using Spark 2.0.0 or higher with Scala 2.10.x or higher and Python. Students can choose either Scala or Python. Scala By Example is here, and another nice tutorial is here.
16.03.2017 Basic Scala/Python and Spark programming
Goal: The goal is to get introduced to the Spark cluster environment. The students will learn how to set up a cluster with Spark and set up the Spark programming environment. They will learn to use the Python and Scala interactive shells to write code. Next, they will set up and configure Eclipse for programming with Spark. Finally, the students will experiment with some basic Spark programs written in Scala/Python.
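A typical first Spark program counts words using the functional style of `flatMap`, `map`, and `reduceByKey`. As a rough sketch of that style only, not actual Spark API code, the same computation can be written in plain Python so it runs without a cluster (the input lines here are made up):

```python
from collections import Counter

# A tiny "dataset" of text lines, standing in for an RDD loaded from a file.
lines = [
    "spark makes big data simple",
    "big data needs big tools",
]

# flatMap: split each line into individual words.
words = [w for line in lines for w in line.split()]

# map + reduceByKey: count occurrences per word.
counts = Counter(words)

print(counts["big"])  # 3
```

In real Spark the same pipeline would be distributed across the cluster, but the shape of the computation is the same.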
23.03.2017 Spark SQL Dataframes and examples
Goal: The students will experiment with some basic Spark programs written in Scala/Python. There will be 15-20 exercises: a few exercises on loading data from different sources, a number of exercises on Spark configuration and partitioning, 5-7 exercises on basic Spark transformations and actions, and finally some exercises on Spark SQL.
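A key distinction in these exercises is that Spark transformations are lazy, while actions trigger the actual computation. As an illustrative sketch only, assuming no Spark installation, Python generators show the same lazy behaviour:

```python
log = []

def mapped(xs, f):
    # A lazy "transformation": nothing runs until the result is consumed.
    for x in xs:
        log.append(x)
        yield f(x)

data = range(5)
squares = mapped(data, lambda x: x * x)  # build the pipeline: no work happens yet
assert log == []                         # the transformation has not executed

result = list(squares)                   # an "action" forces evaluation
assert result == [0, 1, 4, 9, 16]
assert log == [0, 1, 2, 3, 4]
```

Spark exploits the same idea to plan and optimize a whole chain of transformations before any data is touched.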
30.03.2017 Spark SQL Dataframes and Spark Optimizations
Goal: The students will experiment with some basic Spark programs written in Scala/Python. There will be 15-20 exercises: a few exercises on loading data from different sources, a number of exercises on Spark configuration and partitioning, 5-7 exercises on basic Spark transformations and actions, and finally some exercises on Spark SQL.
06.04.2017 Distributed File Systems and Streaming
Goal: The students will configure the HDFS and Tachyon file systems on their computers and then use them from a Spark program. We will also have some hands-on work with Spark Streaming. At the end there will be some example solutions for Assignment 2.
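Spark Streaming processes a live stream as a sequence of small batches (discretized streams). A minimal plain-Python sketch of this micro-batching idea, with a hypothetical helper `micro_batches` that groups by count rather than by time interval as real Spark Streaming does, is:

```python
from collections import Counter

def micro_batches(events, batch_size):
    """Group a stream of events into fixed-size micro-batches."""
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # emit the final partial batch

stream = ["a", "b", "a", "c", "a", "b", "c"]
per_batch = [Counter(b) for b in micro_batches(stream, 3)]

print(per_batch[0]["a"])  # 2
```

Each batch is then handled with the ordinary Spark batch machinery, which is what makes the streaming API feel like regular RDD code.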
13.04.2017 Easter Break
20.04.2017 Discussion on previous exercises
02.05.2017 Spark Machine Learning Algorithms and discussion on previous exercises (at 12:00PM)
Assignments (In Progress)
All the assignments must be submitted via Moodle. The assignments can be written in either Scala or Python.
Set 1. Self-Assessment and Spark Reading (Deadline 22.03.2017, 11:59PM). The purpose is to familiarize yourself with the basic mathematical and statistical terms for this course. Please skip the Python quiz in the Self-Assessment. Reading material: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. This will also help you understand the next lecture.
(a) Self-Assessment
(b) Understanding Data Set and Spark Basics
Set 2. Spark Preliminary (A Set of Spark Exercises and familiarizing with some important terms, Deadline 31.03.2017 11:59PM)
Set 3. Spark Optimizations (A Set of Spark Exercises towards optimizing the performance of Spark applications, Deadline 09.04.2017 11:55PM)
Set 4. Machine Learning and Time Series Applications (A Set of Basic Spark Exercises, Deadline 23.04.2017 11:55PM)
Set 5. Advanced Spark Applications: Machine Learning (Three exercises on Machine Learning, Deadline 07.05.2017 11:55 PM)
Lectures (In Progress)
14.03.2017 General Info, Course and Scala Overview By Dr. Eemil Lagerspetz (Please bring your laptops)
21.03.2017 Big Data Frameworks Overview and MapReduce
28.03.2017 Spark Programming and Optimizations (by Dr. Eemil Lagerspetz; please bring your laptops)
04.04.2017 Spark Internals and File Systems
11.04.2017 Spark Streaming
18.04.2017 Easter Break
25.04.2017 Machine Learning on Big Data
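The MapReduce model covered in the 21.03 lecture splits a job into a map phase, a shuffle that groups intermediate values by key, and a reduce phase. A minimal word-count sketch in plain Python (the function names are illustrative, not Hadoop API):

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in the document.
    return [(w, 1) for w in document.split()]

def shuffle(pairs):
    # Shuffle: group values by key across all mapper outputs.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {k: sum(vs) for k, vs in groups.items()}

docs = ["big data frameworks", "data beats algorithms"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))

print(counts["data"])  # 2
```

In a real MapReduce framework, mappers and reducers run in parallel on many machines and the shuffle moves data across the network; the logical structure, however, is exactly this.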
Literature and material
Exam Materials (The following are the tentative reading materials. They will also be discussed in the course materials.)
- Lecture slides
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center (http://static.usenix.org/events/nsdi11/tech/full_papers/Hindman_new.pdf)
- HaLoop: Efficient Iterative Data Processing on Large Clusters
- MLbase: A Distributed Machine-learning System.
- Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks
- The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale