Data collection biases.
Data generation bias.
What it is
Data generation bias is bias introduced when data (i) come from limited acquisition sources, (ii) are collected under different standard processes, or (iii) are duplicated through repeat collection or acquisition.
Why it occurs
It occurs when data are collected from only one or a limited number of acquisition types, without covering all acquisition types.
Why we care
If AI algorithms are trained (or tested) with only one or a limited number of acquisition types, they may not generalize to all acquisition types.
A real-life example relevant to MIDRC
One example is CT scanners from different manufacturers. An example of the duplication type of data generation bias is when cases are collected by two different data repositories and mistakenly used in both training and testing.
How to mitigate it
For (i) and (ii), either collect a sufficient number of cases from all acquisition types, or train specifically for one acquisition type and use the model only for that acquisition type. For (iii), use GUIDs (globally unique identifiers) so that data can be linked to avoid duplicate use.
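Mitigation (iii) can be sketched as a simple identifier check. This is a minimal sketch that assumes every case carries a GUID that stays the same across repositories; the identifiers and function names are hypothetical and used only for illustration.

```python
def find_shared_guids(train_guids, test_guids):
    """Return the sorted GUIDs that appear in both the training and test sets."""
    return sorted(set(train_guids) & set(test_guids))


def drop_shared_from_training(train_guids, test_guids):
    """Remove training cases that also appear in the test set, and collapse repeats."""
    test_set = set(test_guids)
    seen = set()
    kept = []
    for guid in train_guids:
        if guid in test_set or guid in seen:
            continue  # skip cases reused in testing or already kept once
        seen.add(guid)
        kept.append(guid)
    return kept


# Hypothetical identifiers; real GUIDs would come from the data repositories.
train = ["case-001", "case-002", "case-003", "case-002"]
test = ["case-003", "case-004"]

shared = find_shared_guids(train, test)
clean_train = drop_shared_from_training(train, test)
```

Here `shared` lists the cases that were mistakenly placed on both sides of the split, and `clean_train` is the training list with those cases and any repeated entries removed.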
Summary
Data generation bias is bias introduced when data (i) come from limited acquisition sources, (ii) are collected under different standard processes, or (iii) are duplicated through repeat collection or acquisition.
References
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8515002/
Belongs under
Data collection biases
Synthetic data bias.
What it is
Arises from the addition of synthetic or redundant data samples to a dataset.
Why it occurs
Synthetic data is often offered as a mitigation technique to reduce the bias that may be present in real data sets. However, since synthetic data is typically generated using a procedure or algorithm that itself needs to be trained, the biases inherent in a real data set may creep into the synthetic data. In addition, the procedures used for generating synthetic data may introduce additional biases.
Why we care
One should not assume that synthetic data can solve data bias. One should probe whether synthetic data is helpful in mitigating bias for each individual task.
A real-life example relevant to MIDRC
If the synthetic data generation is based on too few samples, and those samples all contain a particular type of bias, then the bias will creep into the generated synthetic data. A hypothetical example relevant to MIDRC would be the generation of synthetic pediatric chest radiographs from a sample of actual chest radiographs of healthy children while the intended population is COVID+ pediatric patients.
How to mitigate it
Carefully consider (e.g., with a list of known or potential biases) the kinds of biases synthetic data generation may be adding to a small or already-biased data set.
Summary
Synthetic data is often offered as a mitigation technique to reduce the bias that may be present in real data, but do not assume that synthetic data generation can automatically remedy biased data collection.
References
https://www.nature.com/articles/s41551-021-00751-8.pdf: "Synthetic data in machine learning for medicine and healthcare".
Belongs under
Data collection biases
Exclusion bias.
What it is
(i) When specific groups of user populations are excluded from training, testing or subsequent analyses, especially if the exclusion conditions differ for different types of outcomes (e.g., for actually positive and actually negative cases in a classification task). (ii) It can also occur when you exclude some features from the dataset used to train an AI model.
Why it occurs
(i) Population groups may be systematically excluded and hence be underrepresented in the training of an AI model. This is especially important since it can result in distortions when patients with a particular condition are excluded differently for different types of outcomes. For example, for a classification task, patients with a certain condition are excluded when the case is negative but not excluded when the case is positive. (ii) People mistakenly think that some features are irrelevant, so a subset of the data is systematically excluded due to a particular attribute.
Why we care
The model may not be able to perform accurately in a subset of the population that is excluded from the data collection. This may result in biased algorithm training if exclusion bias happens in the training data, or it may result in incorrect conclusions if exclusion bias happens in the test data.
A real-life example relevant to MIDRC
Hypothetically, portable x-ray images may be excluded from the data set for a certain task. Since patients imaged with portable x-ray are likely to be sicker, this may distort the representation of patients with more severe disease.
How to mitigate it
(i) Ensure that the training sample is representative of the population the AI model is intended to be applied to. Make sure that exclusion conditions are carefully examined. (ii) Perform sufficient analysis before discarding features from the training dataset. Consider adding randomly selected data to the training sample.
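One way to examine exclusion conditions is to compare how often cases are excluded for each outcome. The sketch below is an illustration under stated assumptions: the record layout (`outcome` and `excluded` fields) and the records themselves are made up, and a difference in rates is a prompt for closer review rather than proof of bias.

```python
from collections import defaultdict


def exclusion_rate_by_outcome(cases):
    """Return {outcome: fraction of cases excluded} for a list of case dicts."""
    totals = defaultdict(int)
    excluded = defaultdict(int)
    for case in cases:
        totals[case["outcome"]] += 1
        if case["excluded"]:
            excluded[case["outcome"]] += 1
    return {outcome: excluded[outcome] / totals[outcome] for outcome in totals}


# Made-up records for illustration only.
cases = [
    {"outcome": "positive", "excluded": False},
    {"outcome": "positive", "excluded": False},
    {"outcome": "positive", "excluded": True},
    {"outcome": "positive", "excluded": False},
    {"outcome": "negative", "excluded": True},
    {"outcome": "negative", "excluded": True},
    {"outcome": "negative", "excluded": False},
    {"outcome": "negative", "excluded": False},
]

rates = exclusion_rate_by_outcome(cases)
```

For these made-up records the exclusion rate is 0.25 for positive cases and 0.5 for negative cases, which is the kind of differential exclusion that should be described and justified.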
Summary
If exclusions are not carefully described and justified, severe bias may occur both in model training and testing. Exclusion bias is a form of sample selection bias caused by choosing non-random data, where a subset of the data is systematically excluded.
References
Horwitz RI, Feinstein AR. Exclusion bias and the false relationship of reserpine and breast cancer. Arch Intern Med 1985;145:1873–5.
M. Delgado-Rodriguez, Bias, Journal of Epidemiology & Community Health. 58 (2004) 635–641
Belongs under
Data collection biases, and data preparation and annotation biases
Institutional / systemic bias.
What it is
A tendency for the procedures and practices of particular institutions to operate in ways which result in certain social groups being advantaged or favored and others being disadvantaged, devalued, or treated differently. This need not be the result of any conscious prejudice or discrimination but rather of the majority simply following existing rules or norms. This is related to data collection and model development especially when the reference standard for a case is defined by how that case is managed by the institution.
Why it occurs
There may be established practices in an institution that result in different social groups being managed differently. These practices may sometimes not be conspicuous unless a special effort is made to highlight them.
Why we care
This is a type of bias that may result in the most open and most disconcerting types of inequities in a developed model.
A real-life example relevant to MIDRC
There may be many examples, some of them more substantial in model deployment, but the example here is directly related to data collection. Assume that data is being collected to develop a model for patient admission to ICU. Assume that the model will be trained and validated with the reference standard defined as “patient admitted/not admitted to ICU”. If there is a bias in admitting patients to the ICU, this bias will taint the reference standard and then be propagated to the developed model.
How to mitigate it
Carefully review patient management at all institutions that provide data to minimize institutional bias.
Summary
The collected data, and therefore the developed or tested models, may show a bias if different social groups are managed differently at an institution. One needs to carefully review patient management at all institutions that provide data to minimize institutional bias.
References
D. Chandler, R. Munday, A Dictionary of Media and Communication, Oxford University Press, 2011
Belongs under
Data collection and model deployment biases
Popularity / patient-based bias.
What it is
Popularity bias is often noted with AI systems that recommend movies, etc. Popularity bias may cause changes in the data collected based on the trends of the day. The same could happen in medical imaging AI.
Why it occurs
Humans make medical decisions based on both facts and perceptions.
Why we care
If findings and statistics are skewed based on current trends, long-term analyses (without temporal corrections for those trends) may give biased findings.
A real-life example relevant to MIDRC
An example might be in women deciding to be imaged (or not) for a screening mammogram. Newspaper articles might sway a woman in her decision making.
How it is measured
See ref [1].
How to mitigate it
Ref [1]: "The effect of having few data items responsible for the majority of ratings in a Collaborative Filtering recommendation, and the complement of having majority of items responsible for few ratings given by the users, are usually referred as popularity bias. The effect is known as reflecting the preference of users for popular items, but also as a consequence of methods and metrics normally applied by these systems. Variational Autoencoders (VAE) are considered today the state-of-the-art for collaborative filtering recommenders, and can handle big and sparse data entries with robustness and high accuracy. A methodology is proposed here for characterizing the popularity bias in Movielens and Netflix datasets, and when applying VAE for generating recommendations based on them. As a first step, the long tail model is applied for segmenting items and users in three different classes (Short Head, Medium Tail and Long Tail), depending on the proportion of interactions they are associated with. In addition, a real recommendation scenario is presented for measuring the proportion of unpopular items appearing among the suggestions provided by VAE. We consider characterizing the popularity in details as a very first step for providing recommenders with the desired serendipity effect, and expanding the knowledge of these systems about new and unpopular items with few ratings."
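The long-tail segmentation mentioned in the quoted abstract can be sketched as follows. This is not the method of ref [1]: the cut points (`head_share`, `mid_share`), the function name, and the counts are assumptions chosen for illustration only.

```python
def segment_by_popularity(interaction_counts, head_share=0.5, mid_share=0.3):
    """Assign each item to 'short_head', 'medium_tail', or 'long_tail'.

    Items are ranked by interaction count, most popular first. An item's
    class depends on the share of all interactions already covered by the
    items ranked above it. The cut points are illustrative assumptions.
    """
    total = sum(interaction_counts.values())
    if total == 0:
        return {}
    ranked = sorted(interaction_counts.items(), key=lambda kv: kv[1], reverse=True)
    segments = {}
    covered = 0
    for item, count in ranked:
        share_before = covered / total
        if share_before < head_share:
            segments[item] = "short_head"
        elif share_before < head_share + mid_share:
            segments[item] = "medium_tail"
        else:
            segments[item] = "long_tail"
        covered += count
    return segments


# Hypothetical interaction counts per item.
counts = {"A": 50, "B": 20, "C": 15, "D": 5, "E": 5, "F": 5}
segments = segment_by_popularity(counts)
```

With these counts, item A alone forms the short head, B and C fall in the medium tail, and D, E, and F make up the long tail. Tracking how these proportions change over time is one possible way to see whether a trend is concentrating the collected data.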
Summary
Popularity bias occurs when current trends influence patients' decision making on medical imaging, which subsequently affects data collection.
References
https://www.semanticscholar.org/paper/On-Measuring-Popularity-Bias-in-Collaborative-Data-Borges-Stefanidis/05109d22ad30e4b5f11f0b72c762ee19374cdd6a
Belongs under
Data collection biases
Population bias.
What it is
Arises when statistics, demographics, and user characteristics differ between the original target population and the user population represented in the actual dataset or platform.
Why it occurs
It occurs when the characteristics of a training data population differ from the characteristics of a testing data population. The characteristics may include biological differences, demographic differences, social differences (such as the impact of socioeconomic differences on access to health care), or technical differences in image acquisition that correlate with demographic differences.
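A mismatch of this kind can be checked by comparing group proportions in a dataset against reference proportions for the intended population. The sketch below is a minimal illustration; the age groups, the reference proportions, and the function name are all made up.

```python
from collections import Counter


def proportion_gaps(dataset_groups, target_proportions):
    """Return {group: dataset proportion minus target proportion}."""
    counts = Counter(dataset_groups)
    n = len(dataset_groups)
    groups = set(counts) | set(target_proportions)
    return {g: counts[g] / n - target_proportions.get(g, 0.0) for g in groups}


# Made-up age groups: one entry per case in the dataset.
dataset_groups = ["18-39"] * 2 + ["40-64"] * 5 + ["65+"] * 3
# Made-up reference proportions for the intended population.
target = {"18-39": 0.4, "40-64": 0.4, "65+": 0.2}

gaps = proportion_gaps(dataset_groups, target)
most_off = max(gaps, key=lambda g: abs(gaps[g]))
```

A negative gap means the group is under-represented in the dataset relative to the reference; in this made-up example the 18-39 group has the largest gap.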
Why we care
Algorithms can be overtrained to one group, reducing their ability to be applied in a useful way to other groups. This affects decisions drawn from algorithms developed with population bias, for example by increasing the number of false-positive or false-negative classification decisions. It may also foster a lack of confidence in the algorithm's utility for some populations and discourage further development and resourcing.
A real-life example relevant to MIDRC
Some socioeconomic groups may not have sought care for COVID-19 as soon as some others, and this may have made their cases more advanced before imaging was conducted. Some populations may not have had CT scanning as a part of their care, and thus their data may not be included in training models, when otherwise this would have made the dataset more diverse. More significant and serious outcomes from COVID-19 have been observed in some age ranges, races, and/or ethnicities than others.
How to mitigate it
Cross-population modeling [2].
Summary
Population bias in medical imaging occurs when the characteristics of a population from which data is used for training an algorithm differ from the characteristics of a population for which data is used for testing or for decision-making from an algorithm. The characteristics may include biological differences, demographic differences, social differences (such as the impact of socioeconomic differences on access to health care), or technical differences in image acquisition that correlate with demographic differences.
References
https://www.nature.com/articles/s41467-020-17971-2 (interesting example of attempting to make a generalizable AI model for diagnosing COVID-19 from CT images.)
https://link.springer.com/content/pdf/10.1007/s10916-020-01562-1.pdf
Obuchowski et al. define "population bias" and "population precision" in "Quantitative imaging biomarkers: A review of statistical methods for computer algorithm comparisons" (2015, Statistical Methods in Medical Research).
Belongs under
Data collection, model development, and model evaluation biases
Temporal bias.
What it is
Temporal bias is (a) bias that arises from differences in populations and behaviors over time, (b) bias that arises from the use of data that is not representative of diagnostic clinical data, or (c) bias that arises from the correlation of reader performance and state of knowledge of the disease.
Why it occurs
A disease such as COVID-19 can evolve over time.
Why we care
Algorithms with temporal bias may not be generalizable over time, including global course of disease, individual patient trajectory, and state of clinical knowledge.
A real-life example relevant to MIDRC
A disease such as COVID-19 can evolve over time and thus a database can include subjects with different variants of the disease. Vaccinated subjects with COVID-19 could potentially have different disease features than unvaccinated ones, including the period of their disease trajectory. Clinicians doing reader studies earlier in the course of a disease after its novel appearance may have different skills and abilities in assessing the disease than those same clinicians (or others) later on.
How it is measured
Temporal validation (e.g., applying the model to a set of more recent cases after development using earlier cases), temporal validation plots (see Rapacciuolo et al.), and ICC (in the case of temporal bias (c) above).
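Temporal validation as described above can be sketched with a date-based split. This is an illustration under stated assumptions: the record fields (`acquired`, `label`, `predicted`), the cutoff date, and the cases are hypothetical, and the predictions stand in for the output of a model developed on the earlier cases.

```python
from datetime import date


def temporal_split(cases, cutoff):
    """Split cases into (development, validation) by acquisition date.

    Cases acquired before `cutoff` go to development; cases acquired on or
    after it are held out for temporal validation.
    """
    development = [c for c in cases if c["acquired"] < cutoff]
    validation = [c for c in cases if c["acquired"] >= cutoff]
    return development, validation


def accuracy(cases):
    """Fraction of cases whose prediction matches the reference label."""
    return sum(c["predicted"] == c["label"] for c in cases) / len(cases)


# Made-up cases with predictions from a hypothetical model.
cases = [
    {"acquired": date(2020, 4, 1), "label": 1, "predicted": 1},
    {"acquired": date(2020, 6, 15), "label": 0, "predicted": 0},
    {"acquired": date(2020, 9, 30), "label": 1, "predicted": 1},
    {"acquired": date(2020, 11, 20), "label": 0, "predicted": 0},
    {"acquired": date(2021, 3, 10), "label": 1, "predicted": 0},
    {"acquired": date(2021, 7, 5), "label": 0, "predicted": 0},
    {"acquired": date(2021, 8, 22), "label": 1, "predicted": 1},
    {"acquired": date(2021, 12, 1), "label": 1, "predicted": 0},
]

development, validation = temporal_split(cases, cutoff=date(2021, 1, 1))
early_accuracy = accuracy(development)
late_accuracy = accuracy(validation)
```

A clear drop from `early_accuracy` to `late_accuracy`, as in this made-up data, would suggest that the model does not generalize over time. Accuracy is used only for brevity; any performance metric suited to the task could be compared in the same way.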
How to mitigate it
Yuan et al. [1] on case-control design: uniform sampling trajectory, well-defined baseline dates.
Summary
Temporal bias is (a) bias that arises from differences in populations and behaviors over time, (b) bias that arises from the use of data that is not representative of diagnostic clinical data, or (c) bias that arises from the correlation of reader performance and state of knowledge of the disease. These are problematic because algorithms may not be generalizable over time, including global course of disease and individual patient trajectory and state of clinical knowledge.
References
Yuan W, Beaulieu-Jones BK, Yu K-H, Lipnick SL, Palmer N, Loscalzo J, et al. Temporal bias in case-control design: preventing reliable predictions of the future. Nat Commun. 2021 Dec;12(1):1107.
Altman DG, Royston P. What do we mean by validating a prognostic model? Statist Med. 2000 Feb 29;19(4):453–73.
Rapacciuolo G, Roy DB, Gillings S, Purvis A. Temporal validation plots: quantifying how well correlative species distribution models predict species’ range changes over time. Methods Ecol Evol. 2014 May;5(5):407–20.
Riley WA, Craven T, Furberg CD. Assessment of temporal bias in longitudinal measurements of carotid intimal-medial thickness in the asymptomatic carotid artery progression study (ACAPS). Ultrasound in Medicine & Biology. 1996;22(4):405–11.
Belongs under
Data collection, model development, model evaluation, and model deployment biases
Sampling / representation / selection bias.
What it is
The patient data used for training/tuning/testing a machine learning algorithm are not representative of the patient population to which the algorithm is intended to be applied.
Why it occurs
Data collection is by convenience and availability, without sufficiently considering the clinical task and patient population. It may occur in at least two ways: (a) the shortcut-over-signal problem, in which the algorithm learns from confounding factors instead of the pathology of interest; (b) the distribution shift problem, in which the algorithm learns from the pathology of interest but, because of distribution shift, its performance on the intended population deviates systematically from estimates on study samples.
Why we care
Such data collection bias would result in performance estimates that are not generalizable to the intended patient population of the AI/ML algorithm, thereby leaving effectiveness and trustworthiness of AI/ML questionable.
A real-life example relevant to MIDRC
Data missingness correlates with disease severity; single institution as data source for a particular task.
How to mitigate it
Collection from multiple sources. Curation by matching key characteristics of the intended patient population.
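Curation by matching a key characteristic can be sketched as stratified subsampling. This is a minimal sketch, not a prescribed procedure: the characteristic (`view`), the target proportions, and the function name are hypothetical, and only a single characteristic is matched.

```python
import random
from collections import defaultdict


def match_target_mix(cases, key, target_proportions, seed=0):
    """Subsample `cases` so the mix of `key` values follows `target_proportions`.

    The subset size is the largest one for which every group has enough
    available cases; per-group quotas are rounded down to whole cases.
    """
    by_group = defaultdict(list)
    for case in cases:
        by_group[case[key]].append(case)
    # Largest total size supported by the scarcest group.
    size = min(len(by_group[g]) / p for g, p in target_proportions.items() if p > 0)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    subset = []
    for group, p in target_proportions.items():
        quota = int(size * p + 1e-9)  # small tolerance for float rounding
        subset.extend(rng.sample(by_group[group], quota))
    return subset


# Made-up cases: 6 portable and 4 fixed acquisitions.
cases = [{"id": i, "view": "portable"} for i in range(6)]
cases += [{"id": i, "view": "fixed"} for i in range(6, 10)]

subset = match_target_mix(cases, "view", {"portable": 0.5, "fixed": 0.5})
```

Here the subset contains 8 cases, 4 of each type. Subsampling discards data, so when cases are scarce, weighting cases instead of dropping them may be preferable.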
Summary
The patient data used for training/tuning/testing a machine learning algorithm are not representative of the patient population to which the algorithm is intended to be applied. Such bias often arises from collection of data by convenience and availability, without sufficiently considering the clinical task and patient population.
References
DeGrave, A.J., Janizek, J.D. & Lee, SI. AI for radiographic COVID-19 detection selects shortcuts over signal. Nat Mach Intell 3, 610–619 (2021). https://doi.org/10.1038/s42256-021-00338-7
Belongs under
Data collection and model evaluation biases