Seminar on Big Data Management

58316103
3
Software systems
Advanced studies
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. The seminar will cover selected topics about challenges of big data management, including big data platform, querying, exploration, analysis, sampling, and cloud data management, as well as big data applications. This course will mainly use papers from recent database conferences, like SIGMOD and PVLDB.
Year | Semester | Date | Period | Language | Teacher in charge
2016 | Spring | 18.01-31.05. | 3-4 | English | Jiaheng Lu

Lectures

Time | Room | Lecturer | Date
Mon 12-14 | B222 | Jiaheng Lu | 18.01.2016-29.02.2016
Mon 12-14 | B222 | Jiaheng Lu | 14.03.2016-02.05.2016

Information for international students

This seminar is held in English.

Also see the global learning objectives of seminars at the department: http://www.cs.helsinki.fi/en/courses/seminaarien-ennakkoilmoittautuminen/matrix

 

General

We are in the era of “big data”. Data sets grow fast in size because they are increasingly being gathered by cheap and numerous information-sensing mobile devices, remote sensing, software logs, cameras, microphones, and wireless sensor networks. Most big data environments go beyond relational databases and traditional data warehouse platforms. The increasing focus on collecting and analyzing big data is shaping new platforms and techniques. This seminar will mainly discuss new research papers in different subfields of big data management, including data querying, exploration, sampling, sharing, cleansing, big data benchmarking and applications.

 

Completing the course

Students complete this seminar by actively participating in its work: the work methods include studying scientific sources, writing reports and giving presentations, reading the reports of other participants and evaluating them, and actively following presentations.

Grading  The grade is based on each student's own written work (1/3), oral presentation (1/3), and commentary as an opponent on the presentations and reports of others together with general activity (1/3). To pass the seminar, each of these components must be passed. Active attendance at seminar meetings is obligatory. Absence from at most two meetings is accepted, but absences will affect grading.

Presentation  Information about preparing your presentation

1. Introduction: please make a clear introduction to your talk.
 1.1  Why you are interested in this topic: what kind of problems do you hope to solve?
 1.2  How has the problem been studied before?
 1.3  What is the application of this problem for big data?
 
2. Related work:
   2.1   Make sure you leave sufficient time to present all related prior work. Do not assume that the audience knows the prior work.
   2.2   Present it on an intuitive level.
 
3. Main algorithms and contributions
   3.1 Show the main solutions of the paper(s).
   3.2 Present it with examples. The examples are quite important for understanding.
 
4. Your own comments and conclusion
   4.1 Present your own comments about the paper(s).
   4.2 It would be very good to identify the weak points of the paper(s) after your critical thinking.
 
Report  Please read this guideline for report writing carefully. The scoring scheme for the report is available here.

This seminar will use the EasyChair system to manage the reports and reviews. The submission website is  https://easychair.org/conferences/?conf=sbdm2016

The deadline for the first version of the report is 7 Mar 2016 (extended to 11 Mar 2016).
The deadline for the final version is 2 May 2016.
 
 

Review  You need to review two reports written by your peers. Please download the review form [here] and submit your feedback through the EasyChair system: https://easychair.org/conferences/?conf=sbdm2016

 1. Please describe the main topic of the report.
 2. Do you think that the introduction motivates the topic and places it in context?
 3. Is the structure of the report clear and cohesive?
 4. Is the language good and polished? Are there enough illustrative figures and examples?
 5. Which parts of the report do you not understand or have difficulties with?
 6. Does the report draw clear conclusions?
 7. Where and how does the student's own thinking show in the report (as opposed to just repeating material from the original sources)?
 8. List three things that you personally found most interesting in the report.
 9. Suggested improvements. List at least three suggestions on how to improve the report.
 10. More detailed comments and suggestions for improvement.
 
The deadline for submitting the review report is 21 Mar 2016 (extended to 31 Mar 2016).
 
 

Schedule

  Introductory lecture: Monday 18 January 2016, 12-14 B222

  Office hours: Monday 15-17, Exactum A236

 

Schedule
Date | Title | Presenter | Opponent
18/01 | Introduction to Seminar [Slides][Objective] | Jiaheng Lu |
25/01 | Big data and NoSQL databases [Slides] | Jiaheng Lu |
01/02 | GFS, MapReduce and Bigtable [Slides] | Jiaheng Lu |
08/02 | New trends of big data management in 2016 [Slides] | Jiaheng Lu |
15/02 | No meeting (prepare for presentation) | |
22/02 | Cloud data management [Slides1] [Slides2] | Atthia Abrar, Zhen Shi | Fynn Marlin Leitow, Sandeep Panchamukhi
29/02 | Graph data management [Slides] | Fynn Marlin Leitow | Frans Ojaba
07/03 | No meeting this week | |
14/03 | Big data benchmark [Slides] | Frans Ojaba | Heli Helskyaho
14/03 | Spatial big data [Slides] | Joe Niemi | Atthia Abrar
21/03 | Big data benchmark [Slides] | Sandeep Panchamukhi | Akkas Haider
21/03 | Data exploration [Slides] | Heli Helskyaho | Zhen Shi
28/03 | No meeting this week | |
04/04 | Data sampling [Slides1] [Slides2] | Juhani Heliö, Risto Tuomainen | Joe Niemi, Lidia Pivovarova
11/04 | Report discussion | |
18/04 | Big data application on health and smart cities [Slides1] [Slides2] | Suleyman Akbas, Shiva Ram Shrestha | Risto Tuomainen, Soumyajit Mondal
25/04 | Knowledge base [Slides] | Lidia Pivovarova | Shiva Ram Shrestha
25/04 | Q&A on report writing | |

 

Literature and material

 
Please enter your topic preferences here:
 
 
Please enter your available time for presentation and opponent here:
 
 
 
 The deadline for selecting your preferences is Friday, 29 January 2016.
 
  You may also propose your own topic based on a recently published research article that falls within big data management. In this case, please email the teacher with more details on the topic you are interested in.
 
    The term "Big Data" tends to be used in multiple ways, often referring both to the type of data being managed and to the technology used to store and process it. Big Data is increasingly being defined by the 4 Vs, which serve as a reasonable test of whether a Big Data approach is the right one to adopt for a new area of data management. The Vs are Volume, Velocity, Variety and Value. Each of the following topics addresses one or more of the Vs.
 
 
 
Big data survey (Volume, Velocity, Variety and Value)
(1)  Cheikh Kacfah Emani, Nadine Cullot, Christophe Nicolle: Understandable Big Data: A survey. Computer Science Review 17: 70-81 (2015) [PDF paper]
(2) H. V. Jagadish: Big Data and Science: Myths and Reality. Big Data Research 2(2): 49-52 (2015)  [PDF paper]
 
 
Hadoop and Spark platforms  (Volume, Velocity, Variety)
 
Hadoop and Spark are two open-source platforms for big data processing.
 
(1) Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, Chen Wang: MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs. PVLDB 7(13): 1319-1330 (2014) [PDF paper]
(2) Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, Fatma Özcan: Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics. PVLDB 8(13): 2110-2121 (2015) [PDF paper]
 
Cloud data management (Volume, Velocity)
 
Cloud data management is about deploying database systems in the cloud. How do the typical properties of commercially available cloud computing platforms affect the choice of data management applications to deploy in the cloud?
 
(1) Adam Silberstein, Russell Sears, Wenchao Zhou, Brian F. Cooper: A batch of PNUTS: experiences connecting cloud batch and serving systems. SIGMOD Conference 2011: 1101-1112 [PDF paper]
(2) Daniel J. Abadi: Data Management in the Cloud: Limitations and Opportunities. IEEE Data Eng. Bull. 32(1): 3-12 (2009) [PDF paper]
 
 
Data sampling (Volume, Velocity)
 
It is not always possible to store big data in full, and it is often faster to work with a compact summary. Data sampling is one technique for processing big data in this way.
 
(1) Ying Yan, Liang Jeff Chen, Zheng Zhang: Error-bounded Sampling for Analytics on Big Sparse Data. PVLDB 7(13): 1508-1519 (2014) [PDF paper]
(2) S. Acharya, P. B. Gibbons, V. Poosala: Congressional samples for approximate answering of group-by queries. SIGMOD Conference 2000 [PDF paper]
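
To illustrate the idea of working with a compact sample when the full data cannot be stored, here is a minimal sketch of reservoir sampling. This is a standard textbook technique and is not the method of either paper listed above; the function name `reservoir_sample` and its parameters are illustrative assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    The stream is read once and only k items are kept in memory,
    so its total length does not need to be known in advance.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 values from a "stream" of one million integers.
    print(reservoir_sample(range(1_000_000), k=5, seed=42))
```

Each item ends up in the sample with equal probability, which is why aggregates computed on the sample can serve as estimates for the full data set.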
 
Graph data management  (Volume, Variety)
 
Graph data management has long been a topic of interest for database researchers. The topic gained renewed interest recently, motivated by the rapid emergence of new application domains including social networks and the Web of data. 
 
(1) Yu Liu, Jiaheng Lu, Hua Yang, Xiaokui Xiao, Zhewei Wei: Towards Maximum Independent Sets on Massive Graphs. PVLDB 8(13): 2122-2133 (2015) [PDF paper]
(2) Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, Pradeep Dubey: Navigating the maze of graph analytics frameworks using massive graph datasets. SIGMOD Conference 2014: 979-990 [PDF paper]
(3) Philippe Cudré-Mauroux, Sameh Elnikety: Graph Data Management Systems for New Application Domains. PVLDB 4(12): 1510-1511 (2011) [PDF paper]
 
Data exploration  (Volume, Variety)
 
Data exploration is about efficiently extracting knowledge from big data even if we do not know exactly what we are looking for. 
 
(1) Marcello Buoncristiano, Giansalvatore Mecca, Elisa Quintarelli, Manuel Roveri, Donatello Santoro, Letizia Tanca: Database Challenges for Exploratory Computing. SIGMOD Record 44(2): 17-22 (2015)   [PDF paper]
(2) Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri: Overview of Data Exploration Techniques. SIGMOD Conference 2015: 277-281  [PDF paper]
 
 
Approximate string processing  (Variety)
 
String data is ubiquitous in big data. Approximate string processing tolerates errors in string matching to address the challenge of data variety.
 
(1) Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang: String similarity measures and joins with synonyms. SIGMOD Conference 2013: 373-384 [PDF paper]
(2) Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008: 257-266  [PDF paper]
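
As a small illustration of error-tolerant string matching, the sketch below computes the edit (Levenshtein) distance and uses it for a brute-force approximate search. It is only a baseline under stated assumptions: the papers above study far more efficient filtering and join algorithms, and the names `edit_distance` and `approximate_search` are hypothetical.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, keeping two rows of the DP table."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def approximate_search(query: str, candidates: List[str], max_distance: int) -> List[str]:
    """Return the candidates within max_distance edits of the query."""
    return [c for c in candidates if edit_distance(query, c) <= max_distance]


if __name__ == "__main__":
    names = ["helsinky", "helsingfors", "helsinki", "tampere"]
    print(approximate_search("helsinki", names, max_distance=1))
```

Comparing the query against every candidate costs time proportional to the size of the collection, which is the bottleneck that filtering techniques aim to avoid.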
 
Data cleansing  (Volume, Variety and Value)
 
Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.
 
(1) Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye: KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. SIGMOD Conference 2015: 1247-1261 [PDF paper]
(2) Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin: BigDansing: A System for Big Data Cleansing. SIGMOD Conference 2015: 1215-1230  [PDF paper]
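
The detect-then-correct-or-remove process described above can be sketched with a toy example. The record fields (`name`, `age`) and the validation rules are assumptions made purely for illustration; they are not taken from KATARA or BigDansing.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def clean_records(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Split records into (cleaned, rejected) using two toy rules."""
    cleaned: List[Record] = []
    rejected: List[Record] = []
    for record in records:
        name = (record.get("name") or "").strip()
        age_text = (record.get("age") or "").strip()
        # Detect and remove: a record without a name cannot be repaired here.
        if not name:
            rejected.append(record)
            continue
        # Detect and remove: age must be a plain integer in a plausible range.
        is_number = age_text.isascii() and age_text.isdigit()
        if not is_number or not 0 <= int(age_text) <= 130:
            rejected.append(record)
            continue
        # Correct: normalise whitespace and capitalisation in the name.
        cleaned.append({"name": " ".join(name.split()).title(), "age": str(int(age_text))})
    return cleaned, rejected


if __name__ == "__main__":
    raw = [
        {"name": "  ada   lovelace ", "age": "36"},
        {"name": "", "age": "20"},
        {"name": "Bob", "age": "-5"},
    ]
    good, bad = clean_records(raw)
    print(good)
    print(bad)
```

Keeping the rejected records separate, rather than silently dropping them, makes it possible to inspect or repair them later.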
 
Knowledge base   (Volume, Variety and Value)
 
A knowledge base (KB) contains a set of concepts, instances, and relationships. KBs are important for processing and analyzing big data.
 
(1) Omkar Deshpande, Digvijay S. Lamba, Michel Tourn, Sanjib Das, Sri Subramaniam, Anand Rajaraman, Venky Harinarayan, AnHai Doan: Building, maintaining, and using knowledge bases: a report from the trenches. SIGMOD Conference 2013: 1209-1220 [PDF paper]
(2) Albert Weichselbraun, Stefan Gindl, Arno Scharl: Enriching semantic knowledge bases for opinion mining in big data applications. Knowl.-Based Syst. 69: 78-85 (2014) [PDF paper]
(3) Maria Pershina, Mohamed Yakout, Kaushik Chakrabarti: Holistic entity matching across knowledge graphs. Big Data 2015: 1585-1590 [PDF paper]
 
 
 
Big data benchmark   (Volume, Velocity, Variety)
 
Big data benchmarking aims to create standard benchmarks that assist in the evaluation of different big data systems.
 
(1) Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears: Benchmarking cloud serving systems with YCSB. SoCC 2010: 143-154  [PDF paper]
(2) Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker: A comparison of approaches to large-scale data analysis. SIGMOD Conference 2009: 165-178  [PDF paper]
 
 
 Big data applications  (Volume, Velocity, Variety and Value)
 
Big data has many applications in different areas, such as science and research, public health, customer relationship management, machine and device performance analysis, optimizing cities and countries, and finance and banking.
 
 
(1) Paul Suganthan G. C., Chong Sun, Krishna Gayatri K., Haojun Zhang, Frank Yang, Narasimhan Rampalli, Shishir Prasad, Esteban Arcaute, Ganesh Krishnan, Rohit Deep, Vijay Raghavendra, AnHai Doan: Why Big Data Industrial Systems Need Rules and What We Can Do About It. SIGMOD Conference 2015: 265-276 [pdf paper]
(2) Javier Andréu Pérez, Carmen C. Y. Poon, Robert D. Merrifield, Stephen T. C. Wong, Guang-Zhong Yang: Big Data for Health. IEEE J. Biomedical and Health Informatics 19(4): 1193-1208 (2015)  [pdf paper]
(3) Jae-Gil Lee, Minseo Kang: Geospatial Big Data: Challenges and Opportunities. Big Data Research 2(2): 74-81 (2015) [pdf paper]
(4) Taruna Seth, Vipin Chaudhary: Big Data in Finance. Big Data - Algorithms, Analytics, and Applications 2015: 329-356 [pdf paper]
(5) Kesheng Wu, E. Wes Bethel, Ming Gu, David Leinweber, Oliver Rübel: A big data approach to analyzing market volatility. Algorithmic Finance 2(3-4): 241-267 (2013) [pdf paper]