Data Integration and Large-Scale Analysis WS2024/25

(VU, 706.520 Data Integration and Large-Scale Analysis)

DIA is a 5 ECTS course for bachelor and master students, applicable to the bachelor programs Computer Science or Software Engineering and Management, as well as the master catalog 'Data Science'. The course covers major data integration architectures, key techniques for data integration and cleaning, and methods for large-scale (i.e., distributed) data storage and analysis.

Lectures

In detail, the course covers the following topics, which also reflect the course calendar. All slides will be made available prior to the individual lectures, which take place Fridays at 3pm in HS-i5 or virtually.

Data Integration and Preparation

In the first part of this course, we will explore essential techniques and methodologies for preparing and managing large volumes of data. These techniques form the foundation for building reliable AI models by ensuring that the data is clean, well-structured, and ready for advanced analytics.

01 Introduction and Overview

October 11 - Introduction to data integration concepts and overview of the course.

02 Data Warehousing, ETL, and SQL/OLAP

October 18 - Learn about data warehousing and data preparation techniques.

03 Message-oriented Middleware, EAI, and Replication

October 25 - Explore middleware concepts, enterprise application integration, and data replication.

04 Schema Matching and Mapping

November 08 - Learn about schema matching techniques and data mapping strategies.

05 Entity Linking and Deduplication

November 15 - Study entity linking and methods for deduplication.

06 Data Cleaning and Data Fusion

November 22 - Techniques for data cleaning and data fusion for integrated systems.

Large-Scale Data Management and Analysis

In the second part of this course, we will cover technologies and frameworks designed to handle heterogeneous data at scale. You will learn how to use these tools to derive insights and analytics from distributed data systems.

07 Cloud Computing Fundamentals

November 29 - Introduction to cloud computing principles and technologies.

08 Cloud Resource Management and Scheduling

December 06 - Learn about resource management and scheduling in cloud environments.

09 Distributed Data Storage

December 13 - Explore distributed data storage systems and techniques.

10 Distributed, Data-Parallel Computation

December 20 - Understand distributed and data-parallel computation methods.

11 Distributed Stream Processing

January 10 - Learn about real-time distributed stream processing techniques.

12 Distributed Machine Learning Systems

January 17 - Study distributed machine learning frameworks and systems.