ICAI: The Labs – Data Management for Machine Learning
This ICAI: The Labs session focuses on data management for machine learning. Two labs, AIRLab (Ahold Delhaize & UvA) and the AI for Fintech Lab (ING & TU Delft), present their work on data management. Sebastian Schelter (University of Amsterdam) and Asterios Katsifodimos (Delft University of Technology) will each present their work on data management for machine learning and discuss the challenges and recent developments in the field.
12.00 (noon): Sebastian Schelter (UvA) from the AIRLab, together with Ahold Delhaize, about ‘Towards Automated Validation and Inspection of Machine Learning Pipelines’
12.20: Asterios Katsifodimos (TU Delft) from the AI for Fintech Lab, together with ING, about ‘Valentine: Evaluating Matching Techniques for Dataset Discovery’
12.40: Discussion: what’s next in data management for machine learning?
‘Towards Automated Validation and Inspection of Machine Learning Pipelines’
Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often very brittle with respect to their input data, which leads to concerns about their reliability, accountability, and fairness. We will introduce some of the practical problems in this area and give an overview of “mlinspect”, a library developed at AIRLab Amsterdam that enables the lightweight lineage-based inspection of ML preprocessing pipelines. We show how mlinspect can be used to detect data distribution bugs in an ML pipeline. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries, such as estimator/transformer pipelines, can handle both relational and matrix data, and does not require manual code instrumentation.
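To make the idea of a data distribution bug concrete, here is a minimal sketch in plain Python. It does not use mlinspect or its API; the function names, the toy records, and the `age_group` column are all assumptions made for illustration. It compares the share of each group before and after a filter step, which is the kind of change such an inspection might flag.

```python
from collections import Counter


def group_ratios(rows, column):
    """Return the share of rows belonging to each value of `column`."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: n / total for group, n in counts.items()}


def max_ratio_change(before, after, column):
    """Largest absolute change in any group's share between two row sets."""
    ratios_before = group_ratios(before, column)
    ratios_after = group_ratios(after, column)
    return max(
        abs(ratios_after.get(group, 0.0) - ratio)
        for group, ratio in ratios_before.items()
    )


# Hypothetical toy data: a filter on income shifts the age distribution.
records = [
    {"age_group": "young", "income": 20},
    {"age_group": "young", "income": 25},
    {"age_group": "old", "income": 60},
    {"age_group": "old", "income": 70},
]
filtered = [row for row in records if row["income"] >= 25]

change = max_ratio_change(records, filtered, "age_group")
print(f"largest change in group share: {change:.3f}")
```

In this toy case the filter looks harmless but lowers the share of the "young" group from one half to one third; a preprocessing step that silently does this on real data can affect the fairness of the resulting model.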
PAPER and link to code
CIDR 2021: https://ssc.io/pdf/mlinspect.pdf
SIGMOD 2021: https://ssc.io/pdf/mlinspect-demo.pdf
Sebastian Schelter is an Assistant Professor with the University of Amsterdam, conducting research at the intersection of data management and machine learning. Before joining UvA, Sebastian was a Faculty Fellow with the Center for Data Science at New York University and a Senior Applied Scientist at Amazon Research.
‘Valentine: Evaluating Matching Techniques for Dataset Discovery’
Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. This process has traditionally been handled with schema matching techniques. After 20 years of research in schema matching, we are still missing a benchmark for schema matching, as well as proper datasets and proper evaluation metrics! In this talk I will present Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine now includes implementations of 7 seminal schema matching methods that we either implemented from scratch (due to the absence of open-source code) or imported from open repositories. Finally, Valentine offers a data fabrication toolbox for constructing test datasets with ground truth. I will conclude my talk with insights from a very large set of experiments we have been performing at TU Delft, focusing on the strengths and weaknesses of existing techniques. These insights can serve as a guide for employing schema matching in future dataset discovery methods.
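As a rough illustration of what establishing correspondences between datasets can mean, the following minimal Python sketch pairs columns of two tables by the Jaccard similarity of their value sets. It is a simple instance-based baseline written for this page, not code from Valentine and not necessarily one of the methods it evaluates; the table contents, column names, and the 0.5 threshold are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_columns(source, target, threshold=0.5):
    """Return (source column, target column, score) triples whose value
    overlap reaches the threshold, best matches first."""
    matches = []
    for s_name, s_values in source.items():
        for t_name, t_values in target.items():
            score = jaccard(s_values, t_values)
            if score >= threshold:
                matches.append((s_name, t_name, score))
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Hypothetical tables, each represented as a mapping from column name to values.
customers = {
    "cust_city": ["Delft", "Amsterdam", "Utrecht"],
    "cust_id": [1, 2, 3],
}
orders = {
    "city": ["Delft", "Amsterdam", "Leiden"],
    "order_no": [10, 11, 12],
}

matches = match_columns(customers, orders)
for s_name, t_name, score in matches:
    print(f"{s_name} <-> {t_name} (score {score:.2f})")
```

Here only `cust_city` and `city` are matched, because they share two of four distinct values. Real matching methods also use column names, data types, and value distributions, which is why a common benchmark with ground truth is needed to compare them.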
Asterios Katsifodimos is an assistant professor at the Delft University of Technology. Before TU Delft, Asterios worked at the SAP Innovation Center, and as a postdoc at TU Berlin.