Spark Code Camp

582738
2
Distributed Systems and Data Communication
Advanced studies
Year  Semester  Date          Period  Language  Person in charge
2014  Summer    25.08-29.08.  6-6     English   Sasu Tarkoma

Lectures

Time       Room  Lecturer      Date
Mon 9-12   C220  Sasu Tarkoma  25.08.2014-25.08.2014
Fri 12-14  C222  Sasu Tarkoma  29.08.2014-29.08.2014

General

Spark Code Camp 25.-29.8.

Spark Code Camp is a one-week project that introduces students to the Apache Spark cluster computing environment, https://spark.apache.org/. The NODES group in our department uses Spark in several projects that need fast distributed data analysis, e.g., in Carat http://carat.cs.berkeley.edu/, and Spark is also becoming popular in industry.

During the code camp, students will implement a small Spark project in groups of two or three. In addition to the Spark tutorials available online, there will be a short lecture about the system at the beginning of the week. The code camp ends with a demo session on Friday. Help with questions will also be available during the week.

The groups can define their own topics, or pick one of the following ideas:

  • A distributed implementation of a well-known data mining algorithm, such as frequent itemset mining or association rule mining. Many traditional algorithms can also be useful in a distributed environment if you can find a way to optimize them that avoids heavy use of shared memory or communication.
  • A distributed implementation of a machine learning algorithm. The MLlib project (https://spark.apache.org/mllib/) already provides some distributed machine learning implementations. You can study them and implement an algorithm that is still missing (see http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms).
  • Spark Streaming. Stream processing is increasingly important as data sets grow rapidly. In this project, the streaming system should update some simple statistics (e.g. average, variance, and element count) as new data arrives.
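As a starting point for the streaming idea, here is a minimal sketch in plain Java (no Spark dependency) of how count, average, and variance can be kept up to date as values arrive. The class name RunningStats and its methods are hypothetical and not part of Spark. The sketch uses Welford's update for single values and the pairwise merge formula of Chan et al. for combining summaries. Because the merge only needs three numbers per summary, it is the kind of step that could be used to combine per-batch or per-partition results in a distributed setting; how you wire it into Spark is left to the project.

```java
/** Running count, mean and population variance, updatable one value at a time. */
final class RunningStats {
    private long count = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the current mean

    /** Fold one new observation into the summary (Welford's update). */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    /** Combine with a summary built elsewhere, e.g. from another batch. */
    RunningStats merge(RunningStats other) {
        RunningStats out = new RunningStats();
        out.count = count + other.count;
        if (out.count == 0) {
            return out; // both summaries are empty
        }
        double delta = other.mean - mean;
        out.mean = mean + delta * other.count / out.count;
        out.m2 = m2 + other.m2 + delta * delta * count * other.count / out.count;
        return out;
    }

    long count() { return count; }

    double mean() { return mean; }

    /** Population variance; NaN when no values have been seen. */
    double variance() { return count == 0 ? Double.NaN : m2 / count; }

    public static void main(String[] args) {
        // Two batches arriving one after the other (made-up example data).
        double[][] batches = { {2, 4, 4}, {4, 5, 5, 7, 9} };
        RunningStats total = new RunningStats();
        for (double[] batch : batches) {
            RunningStats partial = new RunningStats();
            for (double x : batch) {
                partial.add(x);
            }
            total = total.merge(partial);
            System.out.printf("count=%d mean=%.3f variance=%.3f%n",
                    total.count(), total.mean(), total.variance());
        }
    }
}
```

Keeping only the count, the mean, and the sum of squared deviations means old data never has to be stored or re-read, which matches the low-communication goal mentioned in the first idea.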

After the code camp, the NODES group can offer job opportunities as research assistants for particularly successful students.

Lecturers: Ella Peltonen and Eemil Lagerspetz, forename.surname@cs.helsinki.fi

Course assistants: Paula Lehtola and Mika Viinamäki

The code camp has a channel #tkt-spark on IRCnet for questions and free discussion. All the course instructions will be given in the starting lecture.

Lecture slides from Monday

Example project and links to the setup materials

Completing the course

Prerequisites are good skills in Java or Scala and readiness for independent and group work. There is no examination, and the course is graded pass or fail. Each group must submit a short document about its work (2-3 pages, including a description of the work and lessons learned) and participate in the demo session on Friday. The documentation language is English, but help is also available in Finnish.

Schedule

Introduction lecture: Monday 25.8. 9:00 - 11:00, classroom C220
During Monday, email Ella the names of your group members and a one-sentence description of your topic.

Question sessions: daily from Monday to Friday, 9-16, in computer room B221

Demo session: Friday 29.8. at 12:15 in room C222, about 10 minutes per group
We highly recommend that every group member attend. If this is a problem, contact Ella *beforehand*.

Document deadline: Sunday 31.8. 23:59, submitted by email to Ella
Late submissions are not accepted.