Seminar on Big Data Management

58316103
3
Software Systems
Advanced studies
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. The seminar will cover selected topics on the challenges of big data management, including big data platforms, querying, exploration, analysis, sampling, and cloud data management, as well as big data applications. This course will mainly use papers from recent database venues such as SIGMOD and PVLDB.
Year | Semester | Date | Period | Language | Person in charge
2017 | spring | 16.01-01.05. | 3-4 | English | Jiaheng Lu

Lectures

Time | Room | Lecturer | Date
Mon 12-14 | C220 | Jiaheng Lu | 16.01.2017-27.02.2017
Mon 12-14 | C220 | Jiaheng Lu | 13.03.2017-24.04.2017

Information for international students

This seminar will be entirely in English.

General

We are in the era of “big data”. Data sets grow quickly because they are increasingly gathered by cheap and numerous information-sensing mobile devices, remote sensing, software logs, cameras, microphones, and wireless sensor networks. Most big data environments go beyond relational databases and traditional data warehouse platforms. The increasing focus on collecting and analyzing big data is shaping new platforms and techniques. This seminar will mainly discuss recent research papers in different subfields of big data management, including data querying, exploration, sampling, sharing, cleansing, big data benchmarking, and applications. Please enroll on the Moodle page of this seminar: https://moodle.helsinki.fi/course/view.php?id=22994

Completing the course

Students complete this seminar by actively participating in its work: studying scientific sources, writing reports, giving presentations, reading and evaluating the reports of other participants, and actively attending presentations.

Grading: The grade is based on each student's own written work (1/3), oral presentation (1/3), and commentary as an opponent on the presentations and reports of others, together with general activeness (1/3). To pass the seminar, each of these components must be passed. (Active) attendance at seminar meetings is obligatory. Absence from at most two meetings is accepted (and will affect grading).

Report: The deadline for the first version of the report is Monday, 13 March 2017.

The deadline for the review of the reports is Monday, 3 April 2017.

The deadline for the second (final) version of the report is Monday, 1 May 2017.

Please read the feedback from two peers and the teacher carefully and revise your report accordingly.

Please submit your report to the Moodle page: https://moodle.helsinki.fi/course/view.php?id=22994. Read the instructions on the Moodle page before you prepare your report.

Presentation: Download the feedback form for assessing students' presentations.

Literature and material

 
Please enter your topic preferences here:
 
 
Please enter your available time for presentation and opponent here:
 
 
 
 The deadline for selecting your preferences is 30 January 2017, Monday.
 
You may also propose your own topic based on a recently published research article that falls within big data management. In this case, please email the teacher with more details on the topic you are interested in.
 
Schedule
 
Date | Topic | Presenter | Opponent
16.01 | Introduction to the seminar [slides] | Jiaheng Lu |
23.01 | Information for preparing your presentation [slides] | Jiaheng Lu |
30.01 | Information for preparing your report [Guidelines] [Questions] | Jiaheng Lu |
06.02 | New trends in big data management [Slides] [Questions] | Jiaheng Lu |
13.02 | No meeting (prepare for presentation) | |
20.02 | Big data application [Slides] | Lavas Ilkka | Davoudi Amin
27.02 | Big data survey and application [slides1][slides2][slides3] | Ture Tsegaye Beka, Davoudi Amin, Goetsch Peter | Lavas Ilkka, Chistiakov Artem, Khan Md. Nazmul Haque
13.03 | Hadoop and Spark systems [slides1][slides2] | Alcantara Beltran Jose, Kämäri Hannu | Roy Suravi Saha, Zhou Ziye
20.03 | Big data benchmarking and JSON data management [slides1][slides2] | Khan Md. Nazmul Haque, Zuñiga Corrales Wladimir | Alcantara Beltran Jose, Kämäri Hannu
27.03 | Data exploration [slides1][slides2] | Li Xin, Zhou Ziye | Huang Biyun, Shubham Kapoor
03.04 | Data cleansing and knowledge base [slides1][slides2][slides3] | Huang Biyun, Juhani Ojares, Roy Suravi Saha | Lamminmäki Juho Kalevi, Halin Mikko Johannes, Sore Shewangizaw Dogda
10.04 | Cloud data management [slides1][slides2] | Chai Xuegang, Shubham Kapoor | Zuñiga Corrales Wladimir, Goetsch Peter
17.04 | No meeting (public holiday) | |
24.04 | Graph data and data sampling [slides1][slides2][slides3] | Halin Mikko Johannes, Sore Shewangizaw Dogda, Lamminmäki Juho Kalevi | Chai Xuegang, Ture Tsegaye Beka, Juhani Ojares

 

 
 
 The following topics and papers will be presented in this seminar.
 
 
Big data survey (Volume, Velocity, Variety and Value)
(1)  Cheikh Kacfah Emani, Nadine Cullot, Christophe Nicolle: Understandable Big Data: A survey. Computer Science Review 17: 70-81 (2015) [PDF paper]
(2) H. V. Jagadish: Big Data and Science: Myths and Reality. Big Data Research 2(2): 49-52 (2015)  [PDF paper]
 
 
Hadoop and Spark platforms  (Volume, Velocity, Variety)
 
Hadoop and Spark are two open-source platforms for big data processing.
 
(1) Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, Chen Wang: MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs. PVLDB 7(13): 1319-1330 (2014) [PDF paper]
(2) Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, Fatma Özcan: Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics. PVLDB 8(13): 2110-2121 (2015) [PDF paper]
 
Cloud data management (Volume, Velocity)
 
Cloud data management concerns deploying and operating database systems in the cloud.
 
(1)  Yue Wang, Alexandra Meliou, Gerome Miklau: Lifting the Haze off the Cloud: A Consumer-Centric Market for Database Computation in the Cloud. PVLDB 10(4): 373-384 (2016)  [PDF paper]
(2) Adam Silberstein, Russell Sears, Wenchao Zhou, Brian F. Cooper: A batch of PNUTS: experiences connecting cloud batch and serving systems. SIGMOD Conference 2011: 1101-1112 [PDF paper]
(3) Daniel J. Abadi: Data Management in the Cloud: Limitations and Opportunities. IEEE Data Eng. Bull. 32(1): 3-12 (2009) [PDF paper]
 
 
Data sampling (Volume, Velocity)
 
It is not always possible to store big data in full, and it is often faster to work with a compact summary. Data sampling addresses this by processing a representative subset of the data instead of the whole data set.
 
(1) Ying Yan, Liang Jeff Chen, Zheng Zhang: Error-bounded Sampling for Analytics on Big Sparse Data. PVLDB 7(13): 1508-1519 (2014) [PDF paper]
(2) S. Acharya, P. B. Gibbons, V. Poosala: Congressional samples for approximate answering of group-by queries. SIGMOD Conference 2000 [PDF paper]
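
As a small illustration of the sampling idea for this topic, the sketch below keeps a fixed-size uniform sample from a stream whose length is not known in advance (classic reservoir sampling). It is a generic textbook technique chosen for illustration, not the method of either paper listed above; the function name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of at most k items from a stream.

    Hypothetical helper for illustration: the stream is read once and
    only k items are held in memory at any time.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 values from a million-element stream without storing it.
    print(reservoir_sample(range(1_000_000), k=5, seed=42))
```

The sample can then stand in for the full data set when computing approximate aggregates, at the cost of a sampling error that depends on k.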
 
Graph data management  (Volume, Variety)
 
Graph data management has long been a topic of interest for database researchers. The topic gained renewed interest recently, motivated by the rapid emergence of new application domains including social networks and the Web of data. 
 
(1) Yu Liu, Jiaheng Lu, Hua Yang, Xiaokui Xiao, Zhewei Wei: Towards Maximum Independent Sets on Massive Graphs. PVLDB 8(13): 2122-2133 (2015) [PDF paper]
(2) Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, Pradeep Dubey: Navigating the maze of graph analytics frameworks using massive graph datasets. SIGMOD Conference 2014: 979-990 [PDF paper]
(3) Philippe Cudré-Mauroux, Sameh Elnikety: Graph Data Management Systems for New Application Domains. PVLDB 4(12): 1510-1511 (2011) [PDF paper]
 
Data exploration  (Volume, Variety)
 
Data exploration is about efficiently extracting knowledge from big data even if we do not know exactly what we are looking for. 
 
(1) Marcello Buoncristiano, Giansalvatore Mecca, Elisa Quintarelli, Manuel Roveri, Donatello Santoro, Letizia Tanca: Database Challenges for Exploratory Computing. SIGMOD Record 44(2): 17-22 (2015)   [PDF paper]
(2) Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri: Overview of Data Exploration Techniques. SIGMOD Conference 2015: 277-281  [PDF paper]
 
 
Approximate string processing  (Variety)
 
String data is ubiquitous in big data. Approximate string processing tolerates errors in string matching to address the challenge of data variety.
 
(1) Jiaheng Lu, Chunbin Lin, Wei Wang, Chen Li, Haiyong Wang: String similarity measures and joins with synonyms. SIGMOD Conference 2013: 373-384 [PDF paper]
(2) Chen Li, Jiaheng Lu, Yiming Lu: Efficient Merging and Filtering Algorithms for Approximate String Searches. ICDE 2008: 257-266  [PDF paper]
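
To make the idea of error-tolerant matching concrete, the sketch below computes the Levenshtein edit distance between two strings and uses it for a naive approximate search. This is a simple baseline for illustration only; the papers listed above study far more efficient filtering and join algorithms. The names `edit_distance` and `approx_search` are assumptions made for this example.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def approx_search(query: str, candidates: List[str], max_dist: int) -> List[str]:
    """Return the candidates within max_dist edits of the query (naive scan)."""
    return [c for c in candidates if edit_distance(query, c) <= max_dist]


if __name__ == "__main__":
    names = ["helsinky", "helsingki", "london"]
    print(approx_search("helsinki", names, max_dist=1))
```

The naive scan compares the query against every candidate, which is exactly the cost that indexing and filtering techniques aim to avoid on large collections.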
 
Data cleansing  (Volume, Variety and Value)
 
Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database.
 
(1) Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, Yin Ye: KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. SIGMOD Conference 2015: 1247-1261 [PDF paper]
(2) Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin: BigDansing: A System for Big Data Cleansing. SIGMOD Conference 2015: 1215-1230  [PDF paper]
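
The detect, correct, and remove steps in the definition above can be sketched with a few hand-written rules. This is a toy, rule-based example under stated assumptions: the record fields (`name`, `city`, `age`), the valid age range, and the function names are all hypothetical, and the systems in the listed papers are far more general.

```python
from typing import Dict, List, Optional


def parse_age(text: str) -> Optional[int]:
    """Return the age as an int, or None if it is malformed or implausible."""
    try:
        age = int(text)
    except ValueError:
        return None
    return age if 0 <= age <= 120 else None  # assumed plausible range


def clean_records(records: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Correct simple formatting errors, then drop invalid and duplicate records."""
    cleaned: List[Dict[str, object]] = []
    seen = set()
    for rec in records:
        # Correct: trim whitespace and normalize capitalization of the city.
        name = (rec.get("name") or "").strip()
        city = (rec.get("city") or "").strip().title()
        age = parse_age((rec.get("age") or "").strip())
        # Detect and remove: missing name or invalid age.
        if not name or age is None:
            continue
        # Detect and remove: duplicates that differ only in letter case.
        key = (name.lower(), city.lower(), age)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "city": city, "age": age})
    return cleaned


if __name__ == "__main__":
    raw = [
        {"name": " Alice ", "city": "helsinki", "age": "34"},
        {"name": "alice", "city": "Helsinki", "age": "34"},
        {"name": "Bob", "city": "espoo", "age": "-5"},
    ]
    print(clean_records(raw))
```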
 
Knowledge base   (Volume, Variety and Value)
 
A knowledge base (KB) contains a set of concepts, instances, and relationships. KBs are very important for processing and analyzing big data.
 
(1) Omkar Deshpande, Digvijay S. Lamba, Michel Tourn, Sanjib Das, Sri Subramaniam, Anand Rajaraman, Venky Harinarayan, AnHai Doan: Building, maintaining, and using knowledge bases: a report from the trenches. SIGMOD Conference 2013: 1209-1220 [PDF paper]
(2) Albert Weichselbraun, Stefan Gindl, Arno Scharl: Enriching semantic knowledge bases for opinion mining in big data applications. Knowl.-Based Syst. 69: 78-85 (2014) [PDF paper]
(3) Maria Pershina, Mohamed Yakout, Kaushik Chakrabarti: Holistic entity matching across knowledge graphs. Big Data 2015: 1585-1590 [PDF paper]
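
The three ingredients named above (concepts, instances, and relationships) can be pictured with a very small in-memory structure. This is only a sketch for intuition: the class `TinyKB`, its methods, and the sample facts are assumptions made for this example and do not reflect how the systems in the listed papers are built.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class TinyKB:
    """A toy knowledge base: instances belong to concepts, and
    relationships are stored as (subject, predicate, object) triples."""

    def __init__(self) -> None:
        self.concepts_of: Dict[str, Set[str]] = defaultdict(set)
        self.relations: Set[Tuple[str, str, str]] = set()

    def add_instance(self, instance: str, concept: str) -> None:
        self.concepts_of[instance].add(concept)

    def add_relation(self, subject: str, predicate: str, obj: str) -> None:
        self.relations.add((subject, predicate, obj))

    def instances_of(self, concept: str) -> List[str]:
        """All instances recorded under the given concept, sorted by name."""
        return sorted(i for i, cs in self.concepts_of.items() if concept in cs)

    def related(self, subject: str, predicate: str) -> List[str]:
        """All objects linked to the subject by the given predicate."""
        return sorted(o for s, p, o in self.relations
                      if s == subject and p == predicate)


if __name__ == "__main__":
    kb = TinyKB()
    kb.add_instance("Helsinki", "City")
    kb.add_instance("Finland", "Country")
    kb.add_relation("Helsinki", "capital_of", "Finland")
    print(kb.instances_of("City"), kb.related("Helsinki", "capital_of"))
```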
 
 
 
Big data benchmark   (Volume, Velocity, Variety)
 
Big data benchmarking aims to create standard benchmarks that assist in the evaluation of different big data systems.
 
(1) Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears: Benchmarking cloud serving systems with YCSB. SoCC 2010: 143-154  [PDF paper]
(2) Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker: A comparison of approaches to large-scale data analysis. SIGMOD Conference 2009: 165-178 [PDF paper]
 
 
 Big data applications  (Volume, Velocity, Variety and Value)
 
Big data has many applications in different areas, such as science and research, public health, customer relationship management, machine and device performance analysis, optimizing cities and countries, and finance and banking.
 
 
(1) Paul Suganthan G. C., Chong Sun, Krishna Gayatri K., Haojun Zhang, Frank Yang, Narasimhan Rampalli, Shishir Prasad, Esteban Arcaute, Ganesh Krishnan, Rohit Deep, Vijay Raghavendra, AnHai Doan: Why Big Data Industrial Systems Need Rules and What We Can Do About It. SIGMOD Conference 2015: 265-276 [PDF paper]
(2) Javier Andréu Pérez, Carmen C. Y. Poon, Robert D. Merrifield, Stephen T. C. Wong, Guang-Zhong Yang: Big Data for Health. IEEE J. Biomedical and Health Informatics 19(4): 1193-1208 (2015) [PDF paper]
(3) Jae-Gil Lee, Minseo Kang: Geospatial Big Data: Challenges and Opportunities. Big Data Research 2(2): 74-81 (2015) [PDF paper]
(4) Taruna Seth, Vipin Chaudhary: Big Data in Finance. Big Data - Algorithms, Analytics, and Applications 2015: 329-356 [PDF paper]
(5) Kesheng Wu, E. Wes Bethel, Ming Gu, David Leinweber, Oliver Rübel: A big data approach to analyzing market volatility. Algorithmic Finance 2(3-4): 241-267 (2013) [PDF paper]