Algorithms and Systems for Big Data Management

582761
5
Ohjelmistojärjestelmät
Syventävät opinnot
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. This course will examine the current and emerging algorithms and systems for big data management. Included topics are: data sampling and sketch algorithms, NoSQL systems (e.g., key-value stores, column-oriented data stores, and document stores), XML and JSON databases. The course will consist of lectures and assignments.

Koe

21.12.2016 08.00 CK112
Vuosi Lukukausi Päivämäärä Periodi Kieli Vastuuhenkilö
2016 syksy 31.10-14.12. 2-2 Englanti Jiaheng Lu

Luennot

Aika Huone Luennoija Päivämäärä
Ma 12-14 C222 Jiaheng Lu 31.10.2016-14.12.2016
Ke 12-14 C222 Jiaheng Lu 31.10.2016-14.12.2016

Information for international students

This course is entirely in English.

Yleistä

We are in the era of “big data”. Data sets grow fast in size because they are increasingly being gathered by cheap and numerous information-sensing mobile devices, remote sensing, software logs, cameras, microphones, and wireless sensor networks. Most big data environments go beyond relational databases and traditional data warehouse platforms. The increasing focus on big data is shaping new algorithms and techniques. This course will mainly discuss some selected algorithms and systems on big data management, including  data sketches algorithms, NoSQL database systems and semi-structured data stores for XML and JSON documents, GFS and MapReduce framework.

At the end of this course, you will be able to:
 * Recognize different data elements in your own work and in everyday life problems;
 * Identify the frequent data operations required for various types of data;
 * Select a data model to suit the characteristics of your data;
 * Understand sketch techniques to handle streaming data, including Count-Min, Count-Sketch and FM Sketch;
 Understand GFS, Mapreduce and Bigtable technology
 * Gain hands-on experience for Hadoop mapreduce programming
 * Differentiate between a traditional Database Management System and a Big Data Management System
 * Appreciate why there are so many data management systems

This course is for those new to data science.  Prerequisite courses consist of an introductory course in programming (Concepts of Programming) and a course in math (Math for CS: Discrete Math). Knowledge for a traditional relational database is recommended, but not compulsory.

Feedback: Your feedback on this course is really appreciated. Please use this anonymous feedback form

 

Kurssin suorittaminen

The course consists of lectures, three exercises, two study groups and an examThe grading is based on the sum of the points from the exercises (max. 40 points) and the exam (max. 60 points).

  • 50 points is required to pass and gives the lowest grade 1
  • 85 points or more gives the highest grade 5.
Lectures:  Attending lectures is not obligatory but it is useful. Lecture notes covering key facts will be posted on this page, but there will be additional examples and explanations during the lectures. We will also explain the answers of questions in self-assessment forms in lectures.
 
Exercises:   The students should solve the problems at home and be prepared to present their solutions at the exercise session. The students are required to solve ALL problems. Give yourself as much time as possible to complete each assignment so you can complete it to the best of your ability and you don't have to rush or worry about incurring late submission penalties. Avoid leaving completing an assignment to the last minute/hour/day just in case unexpected problems occur.
 
Please enroll yourself to Moodle page to download the exercises.
 
Exercise 1.  Due: 15.11. 2016
Exercise 2.  Due: 29.11. 2016 Extended to 06.12
Exercise 3.  Due: 13.12. 2016 Extended to 19.12
 
 
Study groups:  The students read some material in advance and then discuss the material in groups during the meeting. Attending the study group meetings is mandatory. But if you have reason (such as emergency, personal illness or others) for being absent , please let the lecturer know before the meeting. 
 
 
 
Course exam:  The exam covers the lectures (including self-assessment questions) and the exercises.  The exam lasts 2.5 hours. No notes or other material is allowed in the exam.
Examination time:  21.12 Wednesday 8.00AM  Room  CK112
      
 

Renewal Exam

The renewal exam requires participation to the course and can be taken only if eligible for the course exam:

(1) Participation to all two study groups during the course; and 

(2) At least 10 solved exercise questions.

 
Separate exams
 
The separate exams do not require course participation and the grade is based on the exam score only. The separate exams cover lectures (including self assessment questions), exercises and study groups.
 
The first separate/renewal exam will be held on the 27th of January, 2017. 16.00- , B123
 

Kirjallisuus ja materiaali

All essential content can be found in the lecture notes and other material that will be posted here during the course.

 

  Date Topics Slides References Exercises session (Wednesday)
1 31.10 Introdcution to big data management Introduction [1,2,3,4]
2 07.11 Data model, relational, XML and graph Data Model Self-assessment Study group: Big data
3 14.11 JSON and Graph model  JSON Self-assessment Exercise(I)
4 21.11 Data sketches I Countmin Sketch Self-assessment [6,7,8] Study group: NoSQL databases
5 28.11 Data sketches II FMSketch Self-assessment 
Q&A 
6 05.12 MapReduce  MapReduce Self-assessment  Exercise(II)
7 12.12 GFS and Bigtable GFS Self-assessment Exercise(III)

List of associated papers:

(1) Cheikh Kacfah Emani, Nadine Cullot, Christophe Nicolle: Understandable Big Data: A survey. Computer Science Review 17: 70-81 (2015) [PDF paper]

(2) H. V. Jagadish: Big Data and Science: Myths and Reality. Big Data Research 2(2): 49-52 (2015)  [PDF paper]
 
(3) Stéphane Marchand-Maillet, Birgit Hofreiter: Big Data Management and Analysis for Business Informatics - A Survey. Enterprise Modelling and Information Systems Architectures 9(1): 90-105 (2014) [PDF paper]
 
(4) Ruogu Fang, Samira Pouyanfar, Yimin Yang, Shu-Ching Chen, S. S. Iyengar: Computational Health Informatics in the Big Data Age: A Survey. ACM Comput. Surv. 49(1): 12 (2016) [PDF paper]
 
(5) Renzo Angles, Claudio Gutiérrez: Survey of graph database models. ACM Comput. Surv. 40(1) (2008) [PDF paper]
 
(6) Graham Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1): 58-75 (2005) [PDF paper]
 
(7) Moses Charikar, Kevin C. Chen, Martin Farach-Colton: Finding Frequent Items in Data Streams. ICALP 2002: 693-703 [PDF paper]
 
(8) Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985) [PDF paper]