I've released code for my newly-published TMLR paper, "Towards Backwards-Compatible Data with Confounded Domain Adaptation", solving a key problem in using AI for biology (and science, more generally). How can you combine datasets across settings, when "what you measure" and "how you measure it" are confounded? I propose a domain adaptation method, with the goal of transforming data from different settings ("domains") to correct for technical variation. The transformed data can then be used for a wide variety of downstream tasks (prediction, exploratory data analysis, inference).
Typical domain adaptation methods try to match the data distributions from different domains, so they all look alike. But what if the domains are truly different? You shouldn't make fresh and old tissue samples look alike if the fresh ones tend to be from cancer tissue! It's even trickier because the confounding biological variable is often unavailable for test samples that you need to transform. You need to be able to transform old tissue samples to look fresh, without knowing whether they're cancerous, if that's what you need to predict!
This paper formalizes the problem and offers a concrete, flexible solution: https://openreview.net/forum?id=GSp2WC7q0r . I begin by formalizing the problem as a small modification of generalized label shift (which combines label shift and covariate shift). I then propose a general framework for solving it: minimize the expected divergence between the conditional distributions of the different domains, given the confounder. This reduces the problem to learning a conditional generative model on your data.
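To give a feel for the idea, here's a toy sketch of conditional matching, not the paper's method: instead of a learned conditional generative model, it just matches per-condition means and variances, and it assumes the confounder `z` is observed in both domains (which, as noted above, often isn't true at test time). All names here are hypothetical.

```python
import numpy as np

def conditional_moment_transform(x_src, z_src, x_tgt, z_tgt):
    """Toy conditional matching: shift/scale source features so that,
    within each confounder level z, their mean and variance match the
    target domain's. A stand-in for matching conditional distributions;
    the paper instead learns a conditional generative model."""
    x_out = x_src.astype(float).copy()
    for z in np.unique(z_src):
        s = x_src[z_src == z].astype(float)
        t = x_tgt[z_tgt == z].astype(float)
        mu_s, sd_s = s.mean(axis=0), s.std(axis=0) + 1e-8
        mu_t, sd_t = t.mean(axis=0), t.std(axis=0)
        # Standardize within the condition, then re-express in
        # target-domain units: the conditional moments now agree.
        x_out[z_src == z] = (s - mu_s) / sd_s * sd_t + mu_t
    return x_out
```

Matching the domains unconditionally (pooling over `z`) would wrongly erase real differences in the confounder's distribution across domains; conditioning on `z` is what keeps the biology intact while removing the technical variation.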
Finally, I provide easy-to-use software that "just works", and show results on a wide variety of datasets: single-cell data integration, gene expression batch correction, and even image color adaptation.