Addressing the bias in reference datasets for healthcare is essential for ensuring equitable healthcare outcomes across diverse populations. My research group has been working to understand what biases currently exist across the data reference resources that are routinely used to make health inferences. This research is in the process of being published, and I have been quite vocal wherever I could about what this means for underserved populations. It is about time that action is taken to address the underrepresentation of many global populations. That said, the solutions are not easy to attain. There needs to be a huge effort for these disparities to be reduced. A lot of the time I wonder what would be needed for these gaps to be closed. Below I provide some very high-level solutions, but most importantly, there needs to be greater awareness of what this means for all humanity. Here are some suggestions for actions and solutions to consider in order to address these biases:
1. Diversification of Reference Datasets
There is a huge need for advocacy and investment in the expansion of current datasets to include underrepresented populations. This can involve both the collection of new data and the re-analysis of existing datasets with a focus on diversity.
2. Inclusive Research and Development
Policies need to be encouraged and funding mechanisms promoted that require or incentivise research projects to include diverse populations in their studies. This could include standards for diversity in clinical trials and genomic research.
3. Community Engagement and Collaboration
Further direct work with communities that are underrepresented in reference datasets needs to be carried out. This includes building trust, understanding specific healthcare needs, and involving community members in research processes. I recently read a great article on research done with Oceanians. I really liked how respectful the authors were of the participants' values and culture. Of course, one could criticise the fact that hardly any Indigenous people were involved in the science, but I think it is at least a start.
4. Educational Initiatives
Educational programs need to be developed, aimed at both the public and healthcare professionals, to raise awareness of the importance of dataset diversity. These could also include training for researchers on how to conduct inclusive studies.
5. Ethical Guidelines and Regulation
There needs to be advocacy for the establishment of ethical guidelines and regulatory standards that address dataset bias. Do any such standards exist? I am not aware of any myself. Such standards could help ensure that datasets used in healthcare research and application are representative and equitable.
6. Use of Synthetic Data
This one I am not so sure of, but it is worth considering the use of synthetic data as a means to address gaps in datasets where collecting real-world data is challenging. Synthetic data must be carefully validated, however, to ensure it does not perpetuate existing biases.
7. International Collaboration
It is important to foster international collaborations to ensure that datasets are not only diverse within countries but also globally representative. Many biogeographical regions which are used as ancestry categories include a huge amount of diversity that may not necessarily represent existing biological adaptations or traits. Another point that I have read from ThankGod Ebenezer really caught my attention. I quote ThankGod here: “How can Africa, a highly diverse region, contribute less genomic data? Why are most genomes of African origin sequenced outside Africa? Why are African genomes in global databases where no African institution is signatory? Where are other African genomes sequenced in Africa?” He’s got a point: where are the local scientists involved? As mentioned by Fatumo et al., genomics professionals are more likely to tilt towards national and regional alliances in the kind of work they pursue. Therefore, trying to extrapolate work being done in North America and Europe to other countries is not always effective because of the existence of this affinity bias.
8. Open Data Initiatives
Open data initiatives that make diverse datasets available to researchers worldwide should be supported. Although the UK Biobank does not have a lot of diversity, it is a champion of access, open to any bona fide researcher regardless of their location. This is not the case for some very important datasets that are classified as “diverse”, which I hereby omit on purpose. We need open access to data so we can help democratise education and skills development across the globe.
9. Transparency and Accountability
It is paramount to encourage transparency in how datasets are collected, analysed, and used. This includes clear documentation of the demographic representation within datasets and the methodologies used in studies. For instance, in a recent publication in which I am involved, the Global Alliance for Genomics and Health provided a compelling example of transparency about the composition of the genomics workforce involved in the organisation.
In conclusion, incorporating these solutions and actions into global policy could provide a comprehensive roadmap for addressing the ethical challenges posed by biased reference datasets in healthcare. Highlighting specific case studies or “use cases” where disparities in data representation have directly impacted communities can also make a compelling argument for the need for urgent and concerted action. Developing such “use cases” affecting underrepresented populations is something we are in the process of publishing. Our hope is that literature like this will shed light on how unequal data representation is affecting the lives of global communities who are unable to benefit from current precision medicine advancements.
If you cannot wait for the paper, I suggest you watch or listen to the presentation below, where I give a current overview of my research on how reference datasets for healthcare are all incredibly biased. In it, I chart datasets covering genome-wide association studies, pharmacogenomics, clinical trials and direct-to-consumer genetic testing, and measure the degree to which data from diverse populations are missing.
You can watch this lecture on YouTube or listen to it as a podcast.