Big Data Frameworks

582740
5
Distributed systems and data communications
Advanced studies

Exam

12.05.2017 09.00 B123
Year Semester Date Period Language Person in charge
2017 spring 14.03-04.05. 4-4 English Mohammad Hoque

Lectures

Time Room Lecturer Date
Tue 12-14 C222 Mohammad Hoque 14.03.2017-11.04.2017
Tue 12-14 C222 Mohammad Hoque 25.04.2017-02.05.2017

Exercise groups

Group: 1
Time Room Instructor Date Notes
Thu 10-12 C222 Mohammad Hoque 16.03.2017-06.04.2017
Thu 10-12 C222 Mohammad Hoque 20.04.2017-04.05.2017

Registration for this course starts on Tuesday 16th of February at 9.00.

Information for international students

This course examines current and emerging Big Data frameworks with a focus on data science applications. The course begins with an introduction to data science. It then focuses on the internals of the Berkeley Data Analysis Framework, Spark, and Big Data Machine Learning (ML) pipelines. The course consists of lectures and assignments. The learning goals of this course are outlined here.

General

The course consists of lectures and assignments. At the end of the course there will be a final exam. The assignments are based on the Spark data analysis framework and the Scala programming language.

Completing the course

Exercise Sessions (In Progress)

In the first week's exercise session, we have a Spark coding tutorial on Thursday 16.3. at 10-12. Please bring your laptop to every exercise session, if you have one. You can install the latest Spark version beforehand. The Spark instructions are available here. We will be using Spark 2.0.0 or higher with Scala 2.10.x or higher and Python. Students can choose either Scala or Python. Scala By Example is here and another nice tutorial is here.

16.03.2017 Basic Scala/Python and Spark programming

Goal: The goal is to get introduced to the Spark cluster environment. The students will learn how to set up a cluster with Spark and will set up the Spark programming environment. They will learn to use the Python and Scala interactive shells to write code. Next, they will set up or configure Eclipse for programming with Spark. Finally, the students will experiment with some basic Spark programs written in Scala/Python.

Spark Python Instructions

Spark Scala Instructions

Spark Examples

 

23.03.2017 Spark SQL Dataframes and examples 

Goal: The students will experiment with some basic Spark programs written in Scala/Python. There will be 15-20 exercises: a few exercises on loading data from different sources, a number of exercises on Spark configuration and partitioning, 5-7 exercises on basic Spark transformations and actions, and finally some exercises on Spark SQL.
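
The difference between transformations and actions can be previewed without a Spark installation. The following is a minimal plain-Python sketch, not the Spark API: the class name MiniRDD and its methods are hypothetical stand-ins chosen for illustration. Transformations only record a step, and the data is processed when an action is called.

```python
# Plain-Python sketch (NOT the Spark API): transformations are recorded
# lazily, and nothing is computed until an action is called.

class MiniRDD:
    """Hypothetical stand-in for an RDD, for illustration only."""

    def __init__(self, source, steps=()):
        self._source = list(source)   # the input data
        self._steps = tuple(steps)    # recorded transformations (the lineage)

    # Transformations: return a new MiniRDD and compute nothing.
    def map(self, fn):
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._steps + (("filter", pred),))

    # Actions: replay the recorded steps and return a result.
    def collect(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                items = map(fn, items)
            else:
                items = filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())


numbers = MiniRDD(range(1, 11))
# Two transformations are chained; no element has been touched yet.
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# The action triggers the actual computation.
result = even_squares.collect()
print(result)
print(even_squares.count())
```

In Spark itself the recorded steps are also what allows a lost partition to be recomputed, which is the subject of the Resilient Distributed Datasets paper in the reading list.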

 

30.03.2017 Spark SQL Dataframes and Spark Optimizations 

Goal: The students will experiment with some basic Spark programs written in Scala/Python. There will be 15-20 exercises: a few exercises on loading data from different sources, a number of exercises on Spark configuration and partitioning, 5-7 exercises on basic Spark transformations and actions, and finally some exercises on Spark SQL.

 

06.04.2017 Distributed File Systems and Streaming

Goal: The students will configure the HDFS and Tachyon file systems on their computers and then use them in a Spark program. We will also have some hands-on exercises with Spark Streaming. At the end there will be some example solutions for Assignment 2.
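
As a preview of the streaming part, the sketch below illustrates the micro-batch idea behind discretized streams: incoming events are cut into fixed-length intervals and each interval is processed as a small batch job. It is plain Python rather than Spark Streaming code, and the function names, the sample events, and the 2-second interval are assumptions made for the example. Intervals with no events are simply skipped in this sketch.

```python
# Plain-Python sketch of micro-batching (NOT the Spark Streaming API).
from collections import Counter, defaultdict


def discretize(events, batch_seconds):
    """Group (timestamp, line) events into fixed-length batch intervals."""
    batches = defaultdict(list)
    for timestamp, line in events:
        batches[int(timestamp // batch_seconds)].append(line)
    # Return the non-empty batches in time order.
    return [batches[i] for i in sorted(batches)]


def word_count(lines):
    """The per-batch job: count the words in one batch of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)


# Hypothetical sample input: (arrival time in seconds, text line).
events = [
    (0.5, "spark streaming"),
    (1.2, "spark"),
    (2.1, "hdfs"),
    (3.9, "spark hdfs"),
]

per_batch = [word_count(batch) for batch in discretize(events, batch_seconds=2)]
print(per_batch)
```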

13.04.2017 Easter Break

20.04.2017 Discussion on previous exercises

02.05.2017 Spark Machine Learning Algorithms and discussion on previous exercises (at 12:00PM)

Assignments (In Progress)

All assignments must be submitted via Moodle. The assignments can be written in either Scala or Python.

 

Set 1. Self-Assessment and Spark Reading (Deadline 22.03.2017, 11:59PM). The purpose is to become familiar with the basic mathematical and statistical terms for this course. Please skip the Python Quiz in the Self-Assessment. Reading material: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. This will also help in understanding the next lecture.

         (a) Self-Assessment

         (b) Understanding Data Set and Spark Basics

Set 2. Spark Preliminary (A set of Spark exercises for becoming familiar with some important terms, Deadline 31.03.2017 11:59PM)

Set 3. Spark Optimizations (A set of Spark exercises on optimizing the performance of Spark applications, Deadline 09.04.2017 11:55PM)

Set 4. Machine Learning and Time Series Applications (A Set of Basic Spark Exercises, Deadline 23.04.2017 11:55PM)

Set 5. Advanced Spark Applications: Machine Learning (Three exercises on Machine Learning, Deadline 07.05.2017 11:55 PM)

 

Lectures (In Progress)

14.03.2017  General Info, Course and Scala Overview by Dr. Eemil Lagerspetz (please bring your laptops)

21.03.2017  Big Data Frameworks Overview and MapReduce  

28.03.2017  Spark Programming (by Dr. Eemil Lagerspetz; please bring your laptops) and Optimizations

04.04.2017  Spark Internals and File Systems

11.04.2017  Spark Streaming

18.04.2017  Easter Break

25.04.2017  Machine Learning on Big Data
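
For readers new to the MapReduce model covered in the 21.03.2017 lecture, here is a minimal single-machine sketch of its three phases using word count. It is plain Python, not Hadoop or Spark code; the function names and the sample documents are assumptions made for the example, and a real framework would run the map and reduce calls in parallel across a cluster.

```python
# Plain-Python sketch of the MapReduce phases (not Hadoop or Spark code).
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


# Hypothetical sample input.
documents = ["big data frameworks", "Big Data machine learning", "data streams"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```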

 

Literature and material

Exam Materials (The following are tentative reading materials. They will also be discussed in the course materials.)

  1. Lecture slides
  2. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
  3. http://static.usenix.org/events/nsdi11/tech/full_papers/Hindman_new.pdf
  4. HaLoop: Efficient Iterative Data Processing on Large Clusters
  5. MLbase: A Distributed Machine-learning System
  6. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks
  7. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
  8. Discretized Streams: Fault-Tolerant Streaming Computation at Scale