Work Packages: Mercury Machine Learning Lab

WP1: Reinforcement learning: Model-based Exploration

New PhD student, supervised by Oliehoek & Spaan

When learning to optimize decisions, there is an inherent trade-off between exploration and exploitation. Moreover, many real-world settings are highly non-stationary, e.g., due to (market) trends. The core question we want to answer is: how can we perform effective exploration in non-stationary environments? Exploration is a core question in all of RL, even in stationary environments. Bayesian RL methods provide a promising starting point, as they incorporate prior information and explore optimally with respect to that prior. We will explore how to scale up such methods and allow for deep forms of exploration.
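As a minimal sketch of the kind of Bayesian starting point we have in mind, the Python snippet below runs Thompson sampling on a two-armed Bernoulli bandit whose best arm switches halfway through; the discounting of the posterior counts, the drift pattern, and all parameter values are illustrative assumptions, not the method this project will develop.

    import numpy as np

    rng = np.random.default_rng(0)

    def discounted_thompson(arm_probs, horizon=5000, gamma=0.99):
        """Thompson sampling on a Bernoulli bandit; a discount factor
        down-weights old evidence so the posterior can track drift."""
        k = len(arm_probs(0))
        alpha = np.ones(k)  # Beta posterior pseudo-counts (successes)
        beta = np.ones(k)   # Beta posterior pseudo-counts (failures)
        total = 0.0
        for t in range(horizon):
            # Sample one plausible mean per arm from the current posterior
            # and play the arm whose sample is highest; exploration comes
            # for free from posterior uncertainty.
            arm = int(np.argmax(rng.beta(alpha, beta)))
            reward = float(rng.random() < arm_probs(t)[arm])
            total += reward
            # Discount all counts toward the prior (pseudo-count 1), then
            # add the new observation: recent evidence dominates, so the
            # agent resumes exploring after the environment drifts.
            alpha = gamma * alpha + (1 - gamma)
            beta = gamma * beta + (1 - gamma)
            alpha[arm] += reward
            beta[arm] += 1.0 - reward
        return total

    # Hypothetical non-stationary environment: the best arm switches halfway.
    probs = lambda t: [0.7, 0.3] if t < 2500 else [0.3, 0.7]
    print(discounted_thompson(probs))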

WP2: Reinforcement learning: Learning from Parallel Interactions

New PhD student, supervised by Oliehoek & Spaan

Some RL problems in the Mercury Machine Learning Lab have the property that many interactions (e.g., with customers) take place in parallel. These trials are often related, but not identical to one another. Standard RL techniques cannot easily exploit such experience; instead, we need to learn from experience gathered by other trials running in parallel. We will study how to perform model-based parallel RL under resource constraints such as budget or capacity, for instance by modelling it as a multiagent RL problem.
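To make the setting concrete, here is a minimal illustrative sketch (not a proposed method) in which many parallel trials pool their experience into one shared posterior, and a per-round budget limits how many trials may act; a hierarchical model capturing that trials are related but not identical would be a natural refinement.

    import numpy as np

    rng = np.random.default_rng(1)

    K_TRIALS, N_ARMS, BUDGET, ROUNDS = 20, 3, 5, 200
    # Related but not identical trials: each has its own reward probabilities.
    true_p = rng.uniform(0.2, 0.8, size=(K_TRIALS, N_ARMS))

    # Shared posterior pooled across all parallel trials (illustrative choice:
    # one global Beta posterior per arm).
    alpha = np.ones(N_ARMS)
    beta = np.ones(N_ARMS)

    for _ in range(ROUNDS):
        # Resource constraint: only BUDGET of the K parallel trials may act
        # this round; chosen at random here for simplicity.
        active = rng.choice(K_TRIALS, size=BUDGET, replace=False)
        for trial in active:
            arm = int(np.argmax(rng.beta(alpha, beta)))  # Thompson sample
            reward = float(rng.random() < true_p[trial, arm])
            # Experience from every parallel trial updates the shared model.
            alpha[arm] += reward
            beta[arm] += 1.0 - reward

    print("pooled posterior means:", alpha / (alpha + beta))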

WP3: Bridging online and offline evaluation

New PhD student, supervised by prof. dr. M. de Rijke

Implicit feedback from user behavior (e.g., clicks, dwell times, purchases, scroll patterns) is an attractive source of training data for search engines and recommender systems. Online evaluation in the form of A/B tests may harm the user experience and is expensive in terms of engineering and logistics overhead. Off-policy evaluation offers an alternative: interaction data collected with a production system is used to estimate the performance of new ranking policies. How can we make sure that off-policy estimators of search engine and recommender system performance take into account important contextual aspects that may lead to biased evaluation results if ignored? While position and selection bias have been relatively well studied, the impact and mitigation of other dimensions, such as non-stationarity, inventory with limited and variable capacity, and allocation requirements, are not as well understood. This is where you can make important contributions!
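As a concrete point of reference, the sketch below implements the standard inverse propensity scoring (IPS) off-policy estimator and its self-normalized variant on toy logged data; the clicks and propensities are made up for illustration, and none of the contextual complications mentioned above are handled.

    import numpy as np

    def ips_estimate(rewards, logging_probs, target_probs):
        """Inverse propensity scoring: re-weight logged rewards by how much
        more (or less) often the target policy would have shown each item
        than the logging (production) policy did."""
        w = target_probs / logging_probs  # importance weights
        return float(np.mean(w * rewards))

    def snips_estimate(rewards, logging_probs, target_probs):
        """Self-normalized IPS: trades a little bias for lower variance."""
        w = target_probs / logging_probs
        return float(np.sum(w * rewards) / np.sum(w))

    # Toy log: clicks observed under the production ranker, with the
    # propensity of each shown item under both policies (illustrative data).
    clicks = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
    p_log = np.array([0.5, 0.3, 0.4, 0.6, 0.2])
    p_new = np.array([0.6, 0.1, 0.5, 0.3, 0.4])
    print(ips_estimate(clicks, p_log, p_new), snips_estimate(clicks, p_log, p_new))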

WP4: Avoiding filter bubbles by correcting for selection bias

New PhD student, supervised by dr. J. Mooij

Data-driven recommender systems attempt to maximize the probability that a user likes a proposed item. Often, the probability that a user clicks on a proposed item is maximized instead as a proxy. When such a recommender system is retrained from new data obtained while it was active, or even trained online, it risks creating filter bubbles. A filter bubble is essentially the result of an unstable positive feedback loop in which items with a higher click probability are displayed more often to users, and hence clicked upon even more relative to other items. Filter bubbles are often undesirable. The goal of this project is to develop and test methods that correct for such undesired feedback
loops. The key to avoiding such filter bubbles is to correct for the bias of non-uniform exposure in the past. This relates to more general problems of causal domain adaptation and correcting for selection bias when making causal predictions, but additional challenges are imposed by the online, non-i.i.d. setting.
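A minimal sketch of the basic correction idea, for illustration only: re-weight logged interactions by the inverse of their past exposure probability, so that items the old system rarely displayed are not systematically down-weighted during retraining. The loss, the clipping threshold, and the toy data below are assumptions made for the example, not the methods this project will develop.

    import numpy as np

    def exposure_corrected_loss(clicks, predictions, exposure_probs, clip=0.05):
        """Click log-loss re-weighted by inverse past exposure: items the old
        system rarely showed get up-weighted, so the retrained model does not
        simply reinforce what was already displayed most often."""
        w = 1.0 / np.maximum(exposure_probs, clip)  # clipping caps the variance
        eps = 1e-9
        ll = clicks * np.log(predictions + eps) \
            + (1 - clicks) * np.log(1 - predictions + eps)
        return float(-np.mean(w * ll))

    # Toy logged impressions: items shown often (high exposure) vs. rarely.
    clicks = np.array([1.0, 0.0, 1.0, 1.0])
    preds = np.array([0.8, 0.4, 0.6, 0.7])
    exposure = np.array([0.9, 0.9, 0.1, 0.1])  # past display propensities
    print(exposure_corrected_loss(clicks, preds, exposure))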

WP5: Domain generalization and domain adaptation

New PhD student, supervised by dr. J. Mooij

Observational and interventional predictions obtained by data-driven machine learning methods do not always generalize well to other domains. This project addresses how one can optimally use data gathered from different contexts to obtain predictions for other contexts, accounting for confounding and selection-bias effects. We will exploit connections with causality to address such domain generalization and adaptation problems. This should result in methods that make well-justified modeling assumptions, are statistically well founded, and are computationally practical. Our focus will be on making progress on several cases of direct practical relevance to Booking.com, but the methods we develop will be generally applicable and innovative from a theoretical perspective.
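For concreteness, the sketch below handles one textbook special case, covariate shift, by estimating density ratios with a domain classifier and re-weighting the source data; the simulated data and model choices are illustrative assumptions, and the project itself targets the harder confounded and selection-biased settings described above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)

    # Source and target domains differ in their input distribution
    # (simulated covariate shift), while the labeling mechanism is shared.
    X_src = rng.normal(0.0, 1.0, size=(1000, 2))
    y_src = (X_src[:, 0] + 0.5 * X_src[:, 1] + rng.normal(0, 0.3, 1000) > 0).astype(int)
    X_tgt = rng.normal(1.0, 1.0, size=(1000, 2))  # unlabeled target data

    # Estimate density ratios p_tgt(x) / p_src(x) via a domain classifier.
    X_dom = np.vstack([X_src, X_tgt])
    d = np.concatenate([np.zeros(1000), np.ones(1000)])
    dom_clf = LogisticRegression().fit(X_dom, d)
    p_tgt = dom_clf.predict_proba(X_src)[:, 1]
    weights = p_tgt / (1.0 - p_tgt)

    # Re-weighted source training approximates training on the target domain.
    model = LogisticRegression().fit(X_src, y_src, sample_weight=weights)

    # Held-out check on the target domain (labels simulated the same way).
    y_tgt = (X_tgt[:, 0] + 0.5 * X_tgt[:, 1] + rng.normal(0, 0.3, 1000) > 0).astype(int)
    print("target accuracy:", model.score(X_tgt, y_tgt))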

WP6: Training NLP Models for Better Generalisation

New PhD student, supervised by I. Titov & W. Aziz

Despite major progress in many NLP applications (e.g., machine translation, question answering, or dialog systems), existing NLP models fall short of human-level generalization. This project will develop methods for training NLP models that explicitly target generalization across multiple related tasks. We will exploit advances in meta-learning and transfer learning. We will also investigate interpretable representations (e.g., in the form of words and phrases in natural language) as a way to encourage transfer and information sharing across tasks, while also achieving transparency and controllability.
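As a small illustration of one ingredient, multi-task training with shared representations, the PyTorch sketch below trains a shared encoder with one classification head per task, so the representation must carry information that transfers across tasks; the tasks, architecture, and hyperparameters are placeholder choices, not the models this project will build.

    import torch
    import torch.nn as nn

    VOCAB, DIM = 1000, 64
    TASKS = {"sentiment": 2, "topic": 5}  # task name -> number of classes

    # Shared encoder plus one small task-specific head per task.
    embed = nn.Embedding(VOCAB, DIM)
    encoder = nn.LSTM(DIM, DIM, batch_first=True)
    heads = nn.ModuleDict({t: nn.Linear(DIM, n) for t, n in TASKS.items()})
    params = [*embed.parameters(), *encoder.parameters(), *heads.parameters()]
    opt = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def step(task, tokens, labels):
        """One update: shared encoding, task-specific classification;
        gradients flow through both the head and the shared encoder."""
        _, (h, _) = encoder(embed(tokens))  # final hidden state as sentence repr.
        loss = loss_fn(heads[task](h[-1]), labels)
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()

    # Toy batches: random token ids stand in for real text from two tasks.
    for task in TASKS:
        tokens = torch.randint(0, VOCAB, (8, 12))
        labels = torch.randint(0, TASKS[task], (8,))
        print(task, step(task, tokens, labels))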