
Debiasing Machine Learning


In his article “Diversity And Inclusion In AI”, Steve Nouri notes an ongoing debate about inclusion and diversity in AI systems that affect millions of people from all backgrounds. As machine learning, a subfield of AI, becomes foundational to these AI systems, debiasing machine learning models is critical for inclusive AI. There are several ways to debias machine learners.

Reducing inequitable biases

According to Mirjam Pot, Nathalie Kieusseyan and Barbara Prainsack, biases in machine learning can be inequitable. They argue that some are damaging and should be addressed, while others may be unproblematic and even desirable because they can contribute to overcoming inequities. Cognitive biases can be present in datasets and potentially get automated through ML technologies. For example, cognitive biases influence the labelling of images: culturally influenced cognitive biases about people or groups may affect what radiologists, or healthcare practitioners more generally, see and do not see in images. Through the interpretation and labelling of images, healthcare practitioners’ cognitive biases can translate into data biases. Healthcare data is likely to reflect the discrimination of certain groups of patients along the lines of socioeconomic status, gender, race and other social categories. Datasets may be qualitatively biased if they include data about structural misdiagnoses that specifically affect patients belonging to a particular social group, such as the socioeconomically disadvantaged, women or racial minorities. An algorithm developed with this data has a higher probability of incorrectly detecting disease in misdiagnosis-prone populations. In the same way, developers’ cognitive biases can translate into machine biases.

Though we are all biased, we still have to take responsibility for our blind spots and try to overcome the implicit biases that we are aware of: the motto of “fairness through awareness” means that critical scrutiny of one’s own implicit biases is the first step towards preventing discriminatory practice. We have a moral obligation to mitigate the harmful consequences of implicit and unconscious bias and to avoid structural discrimination. However, awareness is not enough. It is crucial to include a wide range of experiences and perspectives in the process of generating data and developing technologies. We need to ask who is generating data, analysing datasets and building technologies, and for whose benefit.

While social and cultural factors shape some cognitive biases, others pertain to how humans process information. Culturally influenced cognitive biases are inequitable when we could have avoided them altogether, for example through greater diversity amongst software developers and healthcare professionals. Cognitive biases rooted in how humans process information can also be inequitable if we do not consider mitigating actions, such as reducing workload and stress.

There are qualitative and quantitative biases. Qualitative biases arise when, for example, poorly labelled images lead to inequitable outcomes. Quantitative biases occur when the whole dataset is skewed because it does not adequately represent all patient sub-populations in terms of numbers, for example through the underrepresentation of the elderly, ethnic minorities or women.
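As a rough illustration only (the numbers and the reference population shares below are invented), here is a minimal sketch of how one might check a dataset for this kind of quantitative underrepresentation:

import pandas as pd

# Toy check for quantitative (representation) bias, with made-up reference shares.
dataset = pd.DataFrame({"sex": ["F", "M", "M", "M", "F", "M", "M", "M"]})
reference_share = {"F": 0.51, "M": 0.49}   # assumed population shares, for illustration

observed_share = dataset["sex"].value_counts(normalize=True)
for group, expected in reference_share.items():
    observed = observed_share.get(group, 0.0)
    print(f"{group}: dataset {observed:.0%} vs population {expected:.0%}")

A gap such as 25% versus 51% for one group would flag the kind of skew described above; the same check extends to age bands, ethnicity or any other recorded attribute.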

As long as humans hold cognitive biases, these biases may also shape practices of data generation and, ultimately, the data itself. Developers, knowingly and unknowingly, influence and shape the ML algorithms they build with their ways of thinking and their biases. A developer’s previous experience also plays a role: someone who has already developed similar technologies may program the new technology so that it matches their previous results. Decisions about whether to include or exclude certain variables from an algorithm, and how to weigh them, are prone to these biases.

We need to understand that biases in Machine Learning are not merely technological problems to solve with technological means. People who have access to radiology services in the first place influence the composition of the patient dataset used for research in imaging. There are significant differences in access to healthcare among social groups. The people who have the most access to radiology services are most likely the ones benefitting the most from the application of the Machine Learning technology because they are represented in the algorithm’s training data.

We should view bias as a social problem and analyse its causes and implications via a framework of equity in healthcare. A framework can also help distinguish between problematic and unproblematic biases. Not all types of biases are equally problematic. We consider some biases “good” and desirable since they can overcome present inequities in healthcare.

There is recent acknowledgement in the field of computer science that cognitive biases on the side of programmers may impact the machines they build. There has been empirical evidence that the context in which programmers are socialised impacts the technologies they build. Developers’ cognitive biases can translate into machine biases.

Automation bias could affect research and practice in radiology tremendously when humans overestimate the validity or the predictive power of information produced by an automated system such as an ML algorithm. When we apply automated decision support, such overdependence on technology can backfire. Healthcare workers may refrain from using their critical judgement. Clinicians may be tempted to defer to the results produced by the machine because the machine seems more trustworthy, “safer” to rely on, or less biased than human action. There can be social or other pressures on decision makers to go along with what the machine suggests. Automation may undermine the epistemic authority of clinicians, even in their own eyes. Povyakalo and colleagues found that while automated support improved the detection performance of radiologists with less advanced image interpretation skills, it was detrimental to the performance of radiologists with advanced image interpretation skills.

Justice-oriented management of technologies and potential biases consists of education and realistic communication to radiologists, and to healthcare professionals in general, about the workings of the technology as well as its particular capabilities, limitations and risks, including justice issues. This has to be an ongoing process that keeps healthcare professionals regularly informed about new developments and findings. We need to allow users to evaluate the outcomes of ML algorithms so that they can comprehend the everyday added value of the technology for their clinical work and can detect and understand errors and pitfalls. Healthcare professionals should engage with patients and make them feel included in decision-making. We should take seriously healthcare professionals’ and patients’ concerns about potential biases when working with ML technologies or being subjected to automated decision-making. Monitoring and performance controls have to be implemented regularly after the initial approval of ML technologies, for as long as the technology is in use, for regulatory purposes.

How we reduce inequitable biases in radiological services in healthcare, as proposed by Mirjam Pot, Nathalie Kieusseyan and Barbara Prainsack, is transferable and applicable to the broader healthcare industry and to other industries, including financial services, technology, retail and pharmaceuticals.

Debiasing rule-based machine learning models

In his book “The Road to Conscious Machines: The Story of AI”, Michael Wooldridge explained that knowledge-based AI depended on an essential new idea: an AI system should explicitly capture and represent human knowledge about a problem. Rules were the most common scheme adopted for knowledge representation. A rule in the context of AI or machine learning captures a discrete chunk of knowledge in the form of an “if…then…” expression.
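As a rough sketch of what such a rule can look like in code (a toy example with made-up attribute names, not an excerpt from any particular system):

# A minimal "if ... then ..." rule over attribute-value pairs (illustrative names only).
rule = {
    "antecedent": {"fever": "high", "cough": "yes"},   # the "if" part
    "consequent": ("diagnosis", "flu"),                # the "then" part
}

def rule_fires(rule, instance):
    # The rule fires when every condition in the antecedent matches the instance.
    return all(instance.get(attr) == value for attr, value in rule["antecedent"].items())

patient = {"fever": "high", "cough": "yes", "age": 34}
if rule_fires(rule, patient):
    attribute, value = rule["consequent"]
    print(f"Rule suggests {attribute} = {value}")

Each such rule captures one discrete chunk of knowledge, and a rule-based model is simply a collection of such rules plus a strategy for resolving conflicts between them.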

Tomáš Kliegr, Štepán Bahník and Johannes Fürnkranz propose methods for debiasing rule-based machine learning models:



- Conjunction fallacy and representativeness heuristic: Charness et al. found that the number of participants committing the fallacy decreases under monetary incentives, from 58% to 33%. However, Zizzo et al. suggested that unless we simplify the decision problem, neither monetary incentives nor feedback may decrease the fallacy rate; reduced task complexity is a precondition for monetary incentives and feedback to work. Gigerenzer and Goldstein, and Gigerenzer and Hoffrage, demonstrated that we can reduce the rate of fallacies or eliminate them by presenting the problems in terms of frequencies instead of probabilities, although this needs further experimental validation with a between-subject methodology. Wang et al. propose demonstrating prototypes of patient instances for each diagnosis. In rule learning, this would mean extending the user interface with a functionality that shows instances in the data supporting the rule as well as false positives, i.e. instances matching the antecedent but not the rule consequent.

- Misunderstanding of “and”: Sides et al. believe that “and” stops being ambiguous when we use it to connect propositions rather than categories. For example, we can use “and also” in our wording to reduce the misunderstanding of “and”. We may also prefer representations that visually express the semantics of “and”, such as decision trees, over rules, which do not give us such guidance.

- Averaging heuristic: Zizzo et al. stated that prior knowledge of probability theory and a direct reminder of how we combine probabilities are potent tools for decreasing the incidence of the conjunction fallacy. Juslin et al. suggest that log probabilities, requiring additive rather than multiplicative integration, improve probabilistic reasoning. However, log probabilities may require additional training and lead to increased mental effort for the user. The associated costs of using logarithm formats may outweigh the benefits.

- Disjunction fallacy: people prefer a specific category to a more general category absorbing it, but only if they consider the specific category representative. A possible mitigation is making analysts aware of the taxonomical relations between individual attributes and their values. In some cases, we can also explain to analysts the benefit of the larger supporting sample associated with more general attributes.

- Base-rate neglect: we can introduce users to the value of lift, which we can interpret as the rule’s improvement over the base rate. We calculate lift as the ratio of the rule’s confidence to the probability of the head (consequent) of the rule, which corresponds to the base rate (see the worked example after this list). Gigerenzer and Hoffrage argue that representations in terms of natural frequencies, rather than conditional probabilities, facilitate the computation of the probability of causes. Consistently presenting natural frequencies to analysts, in addition to percentages, would foster correct understanding.

- Insensitivity to sample size: Kachelmeier and Messier Jr found in their experiment that giving auditors a formula for computing an appropriate sample size for substantive tests of details, based on a case’s description and tolerable error, led them to select larger sample sizes than intuitive judgement alone. A user of an association rule learning algorithm can specify the minimum support threshold. The rule learning interface should also inform the user of the effects of the chosen support threshold on the accuracy of the confidence estimates of the resulting rules. For algorithms and workflows where the user cannot influence the support of a discovered rule, the relevant information should be available as part of the rule learning results. Another possible debiasing technique is presenting support as an absolute number. There is a hypothesis that a frequency format supplementing the standard ratio presentation of support and confidence may reduce the incidence of this bias, though not remove it completely.

- Confirmation bias and positive test strategy: research has found that delaying final judgement and slowing down work may decrease confirmation bias. The user interface for rule learning should therefore allow the user to save or mark interesting rules and permit the user to review and edit the model later on, as in EasyMiner, for example. Wolf and Britt argue that providing “balanced” instructions, in which evidence both for and against a hypothesis is present, reduces confirmation bias from 50% in the control group to a significantly lower 27.5%. Barberia et al. empirically validated the assumption that educating users about cognitive illusions can be an effective debiasing method for a positive test strategy. Similarly, we could consider providing explicit guidance combined with modifications of the user interface presenting the rule learning results. For instance, we could show prior probabilities of diagnoses when explaining decisions of machine learning systems in the medical domain. Some systems, including EasyMiner, allow showing a detailed contingency table for the selected rule (see the worked example after this list); even when the displayed information does not include the prior probabilities, it is sufficient for the user to compute them. We should incorporate the prior probabilities in the primary description of the rule alongside the main interest measures such as support and confidence.

- Availability heuristic: alerting an analyst to the reason why they easily recall instances matching the conditions in the rule under consideration should decrease the impact of the availability heuristic, provided that reason is irrelevant to the task at hand.

- Reiteration effect, effects of validity and illusory truth: ensuring that the output from rule learning does not contain redundant rules may suppress the reiteration effect on the algorithmic level; we can use pruning algorithms for this. Another possible method is presenting the results of rule learning in several layers, where we first display only clusters of rules summarising multiple sub-rules. We can use a recent meta-learning method proposed by Berka to summarise multiple rules. Hess and Hagen argue that it is necessary to explain to analysts which rules share the same source, i.e., the overlap in their coverage of specific instances. Additionally, recently proposed techniques using domain knowledge to filter or explain rules, such as the expert deduction rules proposed by Rauch, may improve explanations. The widely accepted recommendation is that to correct misinformation it is best to address it directly, repeating the misinformation and the arguments against it, for example in incremental machine learning settings where we revise the learning results when new data arrives, or when mining with formal domain knowledge.

- Mere exposure effect: there might not be any directly applicable debiasing techniques yet. However, we can use some conditions known to decrease the mere exposure effect in machine learning interfaces. We can avoid subliminal exposure by changing the operation mode of the corresponding user interfaces. One rule learning user interface respecting these principles is the EasyMiner system, where the user can formulate the mining task as a query against the data. This minimises the number of rules the system discovers and exposes users to.

- Overconfidence and underconfidence: some recent research focuses on the hypothesis that the feeling of confidence reflects factors indirectly related to choice processes. In the rule learning context, reducing the number of rules and removing some conditions in the remaining rules presents less information and may, in turn, decrease overconfidence. We can achieve this by using feature selection or an externally set maximum antecedent length. We can also consider removing from the output rules and conditions that do not pass a statistical significance test. Research on debiasing overconfidence suggests the importance of educating experts on the principles of subjective probability judgement and the associated biases. In the rule learning context, the user interface should make rules and knowledge that stand in an “unexpectedness” or “exception” relation to the rule in question easily accessible.

- Recognition heuristic: changes to user interfaces to encourage “slowing down” could help to address this bias. In real-world machine learning tasks, we can make information on the meaning of individual attributes or literals easily accessible to suppress the application of the recognition heuristic.

- Information bias: informing people about the diagnosticity of considered questions does not perfectly remove the information bias. Communicating the importance of attributes can assist the analyst in the task definition phase.

- Ambiguity aversion: one method we can use to reduce ambiguity aversion is accountability. In the rule learning context, this could mean instructing analysts to justify why they considered a specific discovered rule compelling. We may only require the explanation if we have automatically detected a conflict with existing knowledge, for instance using the approach based on the deduction rules proposed by Rauch. We can also alleviate this bias by making the description of the meaning of the conditions easily accessible to the analyst.

- Confusion of the inverse: Edgell et al. conclude from their study that it is difficult to address the confusion of the inverse fallacy effectively. Werner et al. point to a problem with the use of language liable to misinterpretation in statistical textbooks teaching fundamental concepts such as independence. We may conclude that representations of rules should aim at an unambiguous wording of the implication construct. Diaz et al. also recommend improved teaching of probability in the next generation of textbooks aimed at the data science audience as a possible solution.

- Context and tradeoff contrast effects: removing rules with very low support and very high confidence, or those with irrelevantly low confidence but high support, may help debias the analysts. In some cases, the influence of context can improve communication. Gamberger and Lavrac introduced supporting factors to complement the explanation delivered by conventionally learned rules, and Kononenko reports that medical experts found these supporting factors increased the plausibility of the found rules.

- Negativity bias: putting a higher weight on negative information may be a valid heuristic in some cases. Where it is not, we should detect suspect attributes or values in the data preprocessing phase and replace them with more neutral-sounding alternatives.

- Primacy effect: Mumma and Wilson examine three types of debiasing techniques. The bias inoculation intervention includes direct training on the applicable bias or biases, consisting of information on the bias, strategies for adjustment, and several practical assignments. The second technique is the consider-the-opposite debiasing strategy, which sorts the information according to diagnosticity before reviewing it. The third strategy is taking notes when revising each cue before making the final judgement. They found bias inoculation to be the least effective, whereas consider-the-opposite and note-taking worked equally well. In addition, a possible debiasing strategy is to present the most relevant rule first. We can also order the conditions within the rules by predictive power. However, simply reordering the rules output by these algorithms may not work when the rules compose a rule list that the system automatically processes for prediction purposes. The user interface can assist the analyst in annotating the individual rules, following the note-taking debiasing strategy.

- Weak evidence effect: Martire suggested that numerical expressions of evidence are most suitable for expressing uncertainty; in the rule learning context, we usually communicate this through rule confidence. Martire also found that a high level of miscommunication is associated with low-strength verbal expressions. Consequently, in the machine learning context, we suggest considering an intentional omission of weak predictors from rules, either directly by the rule learner or as part of feature selection.

- Unit bias: in the scope of rule learning, informing analysts about the discriminatory power of the individual conditions may alleviate unit bias.
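To make some of the measures mentioned above concrete, here is a small worked example (all counts are invented) showing how support, confidence, the base rate of the consequent and lift relate to the counts behind a rule’s contingency table:

# Hypothetical counts for a rule "IF antecedent THEN consequent" over 1,000 records.
n_total = 1000        # all records
n_antecedent = 200    # records matching the antecedent
n_both = 150          # records matching both antecedent and consequent (true positives)
n_consequent = 300    # records matching the consequent (gives the base rate)

support = n_both / n_total                # 0.15
confidence = n_both / n_antecedent        # 0.75
base_rate = n_consequent / n_total        # 0.30, the prior probability of the consequent
lift = confidence / base_rate             # 2.5: the rule does 2.5x better than the base rate
false_positives = n_antecedent - n_both   # 50 records match the antecedent but not the consequent

print(f"support={support:.2f}, confidence={confidence:.2f}, "
      f"base rate={base_rate:.2f}, lift={lift:.1f}, false positives={false_positives}")

Reporting the base rate and the false positives alongside confidence is exactly the kind of additional context that the base-rate neglect and confirmation bias items above call for.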

Recommendations for rule learning algorithms and software

There are several measures for machine learning practitioners to potentially suppress the effect of cognitive biases on the comprehension of rule-based models:

1. Adherence to conversational rules (maxims):


  • Ensuring that the automatically generated explanations of machine learning models do not breach the conversational rules developed by Grice.






2. Representation of a rule:


  • Syntactic elements: based on the results of the experiments, we need to present AND unambiguously in the rule learning context. Research has shown that AND ceases to be ambiguous when we use it to connect propositions rather than categories. We should make the communication of the implication construct IF THEN connecting antecedent and consequent explicit. Another crucial syntactic construct is negation (NOT). We should avoid using it because its processing demands more cognitive effort and because a specific piece of information that was negated may not be remembered in the long term.

  • Conditions: conditions, typically formed as attribute-value pairs, either consist of words with semantics relevant to the user or of codes that are not directly meaningful. A lack of understanding of the attributes and values appearing in rules may trigger or strengthen many biases. Making information on the conditions in the rules, including their predictive power, easily accessible can prove to be an effective debiasing technique. Under the quantity conversational maxim, we should not include redundant conditions or conditions with low relevance. When conditions contain words with negative valence, we need to review them very carefully, since we pay more attention to negative information, which is often given a higher weight than positive information.


People tend to pay more attention to the information that they are exposed to first. We can sort the conditions by strength so that machine learning software can conform to the manner maxim. The output could also visually delimit conditions in the rules based on their significance or predictive power.


  • Interestingness measures: we should communicate the values using numerical expressions; we need to consider carefully the use of alternative verbal expressions replacing specific numerical values, since there is some evidence that they are prone to miscommunication.


We typically represent rule interest measures as probabilities (confidence) or ratios (lift), whereas results in cognitive science indicate that we understand natural frequencies better.

We also tend to ignore base rates and sample sizes. We should present confidence (reliability) intervals for the values of measures of interest, where applicable.
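As one possible way of following both recommendations, here is a minimal sketch (the counts are invented, and the choice of a Wilson score interval is my own assumption rather than a prescription from the work discussed here) that reports a rule’s confidence as a natural frequency together with a reliability interval:

import math

def wilson_interval(successes, trials, z=1.96):
    # Approximate 95% Wilson score interval for a proportion.
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2))
    return (centre - half_width, centre + half_width)

# Hypothetical rule: 40 records match the antecedent, 30 of them also match the consequent.
matched, correct = 40, 30
low, high = wilson_interval(correct, matched)
print(f"confidence = {correct} out of {matched} ({correct / matched:.0%}), "
      f"95% interval roughly {low:.0%} to {high:.0%}")

The “30 out of 40” phrasing is the natural-frequency presentation, and the interval makes the small sample size visible instead of leaving a bare 75%.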

3. Rule models:


  • Model size: Poursabzi-Sangdeh shows in an experiment that people can simulate the results of a smaller regression model composed of two coefficients better than a larger model consisting of eight. This suggests that removing unnecessary variables can improve model interpretability, even though the experiment did not find a difference in trust in the model based on the number of coefficients it included. Rule models often contain redundant rules or redundant conditions within rules. Such redundancies may result in several biases, which may cause misinterpretation of the model. We can use various pruning techniques to reduce the size of a rule model (a simple sketch follows at the end of this subsection). Using learning algorithms that allow the user to set or influence the resulting model size may have the same effect.


Another approach is to dismiss strong yet obvious rules that merely confirm common knowledge, using domain knowledge or constraints set by the user. Diminishing redundant and obvious rules can improve adherence to the manner and quantity conversational maxims.


  • Rule grouping: the review argues that presenting clusters of similar rules may help to lessen cognitive biases caused by reiteration.

  • Rule ordering: we should sort the presented rules by strength. Nonetheless, due to the paucity of relevant research, it is unclear which specific definition of rule strength can lead to the best results regarding bias mitigation.
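To illustrate one very simple pruning idea mentioned under model size above (my own minimal sketch, not an algorithm from the cited review): drop a rule whenever a simpler rule with the same consequent and a subset of its conditions is at least as confident.

def prune_redundant_rules(rules):
    # Keep a rule only if no simpler rule with the same consequent and a strict subset
    # of its conditions reaches at least the same confidence.
    # Each rule is a dict: {"conditions": set, "consequent": ..., "confidence": float}.
    kept = []
    for rule in rules:
        redundant = any(
            other is not rule
            and other["consequent"] == rule["consequent"]
            and other["conditions"] < rule["conditions"]      # strict subset of conditions
            and other["confidence"] >= rule["confidence"]
            for other in rules
        )
        if not redundant:
            kept.append(rule)
    return kept

rules = [
    {"conditions": {"fever=high"}, "consequent": "flu", "confidence": 0.80},
    {"conditions": {"fever=high", "age=young"}, "consequent": "flu", "confidence": 0.78},
]
print(prune_redundant_rules(rules))   # the longer rule is dropped: it adds a condition but no accuracy

Real pruning methods are more sophisticated, but even this naive filter removes the kind of redundancy that feeds the reiteration and overconfidence effects discussed earlier.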


4. User Engagement

Some results indicate that increasing user interaction may help counter some biases:


  • Domain knowledge: selectively presenting domain knowledge opposing the considered rule may help to invoke the “consider-the-opposite” debiasing strategy. Research has demonstrated that the plausibility of a model relies on compliance with monotonicity constraints. We suggest that user interfaces make background information on discovered rules easily accessible.

  • Eliciting rule annotation: activating the deliberate “System 2” is one of the most widely used debiasing strategies. An example could be requiring accountability, e.g., through visual interfaces motivating users to annotate selected rules, which triggers the “note-taking” debiasing strategy. The created annotations would need to be checked or used, either by a human or algorithmically. Giving people additional time to think about the problem has, under some circumstances, seemed to be an effective debiasing strategy. We can make the selection process two-stage and allow the user to revise the selected rules.

  • User search for rules rather than scrolling: repeated rules may affect users via the mere exposure effect even if they are exposed to them only for a short moment, e.g., when scrolling through a rule list. User interfaces should offer alternatives to scrolling through discovered rules, for example search facilities, to reduce this effect. An option for improving diversity in the preview initially shown to the user is clustering the rules and displaying a representation of each cluster.


5. Bias inoculation


  • Education about specific biases, for instance through brief tutorials, has been shown to decrease the fallacy rate. This is called bias inoculation in the literature.

  • Research has shown that providing explicit guidance and education on formal logic, hypothesis testing and critical assessment of information can reduce fallacy rates in some tasks. Nevertheless, we cannot recommend psychoeducational methods as a sole or sufficient measure because their effects are still controversial.


Automatic Debiased Machine Learning Of Causal And Structural Effects

We can apply automatic debiased machine learning to any regression learner, including neural nets, random forests, Lasso, boosting and other high-dimensional methods. We can deploy these methods to infer causal and structural parameters that depend on regression functions, including policy effects, average derivatives, decompositions, treatment effects and other economic structural parameters.

To reduce regularisation and model selection bias, we use a Neyman orthogonal moment function, for which there is no first-order effect of the regression on the expected moment function. We construct the orthogonal moment function (orthogonal in the sense that the relevant inner product, or dot product, is zero) by adding to an identifying moment function the nonparametric influence function of the regression on that identifying moment. The orthogonal moment function depends on another unknown function besides the regression, the Riesz representer α. A Lasso minimum distance learner of α is automatic and nonparametric because it relies only on the identifying moment function. We use the structure of the identifying moment function to approximate α as a linear combination of a dictionary of known functions. We then use the Lasso learner of α and a regression learner in the orthogonal moment function to construct an automatic debiased machine learner (Auto-DML) of the parameters of interest. Debiased machine learning estimators are helpful for several objects of interest, including policy effects, average derivatives, bounds on average equivalent variation and any other linear function of a regression, and the framework also allows the identifying moment functions to be nonlinear in the regression.
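As a sketch of the general shape of this construction for a linear functional of a regression (the notation below is mine and simplified to the linear case):

\[
\theta_0 = \mathbb{E}\big[\, m(W, \gamma_0) \,\big], \qquad
\mathbb{E}\big[\, m(W, \gamma) \,\big] = \mathbb{E}\big[\, \alpha_0(X)\,\gamma(X) \,\big] \ \text{for all } \gamma ,
\]
\[
\psi(W; \theta, \gamma, \alpha) \,=\, m(W, \gamma) \,+\, \alpha(X)\big(Y - \gamma(X)\big) \,-\, \theta .
\]

Here \(\gamma\) is the regression of \(Y\) on \(X\), \(\alpha\) is the Riesz representer, and the term \(\alpha(X)(Y - \gamma(X))\) is the influence-function correction described above; because of it, small estimation errors in \(\gamma\) have no first-order effect on the expected moment, which is what Neyman orthogonality means.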

We can also apply the Auto-DML to any regression learner, including neural nets, random forests, Lasso, and other high-dimensional learners in the orthogonal moment function.

The estimators of the parameters of interest use cross-fitting: the authors average the orthogonal moment functions over groups of observations, where the regression and Riesz representer learners for each group are fitted on all observations not in that group, and each observation enters the average only for its own group. Cross-fitting removes a source of bias and eliminates any need for Donsker conditions on the regression learner.
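A minimal sketch of such a cross-fitting scheme, assuming user-supplied learners fit_gamma and fit_alpha and an identifying moment m (all three are hypothetical placeholders rather than functions from a specific library):

import numpy as np
from sklearn.model_selection import KFold

def cross_fit_debiased_estimate(X, Y, m, fit_gamma, fit_alpha, n_splits=5, seed=0):
    # Cross-fitted average of the orthogonal moment m(x, gamma) + alpha(x) * (y - gamma(x)).
    # fit_gamma(X, Y) and fit_alpha(X, Y) are assumed to return prediction functions
    # gamma(x) and alpha(x); m(x, gamma) is the identifying moment for one observation.
    psi = np.zeros(len(Y))
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        gamma = fit_gamma(X[train_idx], Y[train_idx])   # regression learner, fitted out of fold
        alpha = fit_alpha(X[train_idx], Y[train_idx])   # Riesz representer learner, fitted out of fold
        for i in test_idx:
            psi[i] = m(X[i], gamma) + alpha(X[i]) * (Y[i] - gamma(X[i]))
    theta_hat = psi.mean()                              # debiased estimate of the parameter
    std_error = psi.std(ddof=1) / np.sqrt(len(Y))       # plug-in standard error
    return theta_hat, std_error

Because every observation is evaluated with learners fitted on the other folds, overfitting in the regression or in the Riesz representer does not leak into the averaged moment.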

The combination of cross-fitting and orthogonal moment functions for debiased machine learning follows Chernozhukov et al. (2018). The Auto-DML of Chernozhukov, Newey, and Robins (2018), Chernozhukov, Newey, and Singh (2018), and of this work, innovates by not requiring an explicit formula for the bias correction that earlier approaches needed.

It builds on ideas from classical semi- and nonparametric estimation theory developed for low-dimensional regressions with traditional smoothing methods, which had not previously been applied to the current high-dimensional setting.

This work is the first to present a framework for direct estimation of the Riesz representer of a broad class of linear and nonlinear functionals in a high-dimensional setting, without requiring strong Donsker class assumptions.

Summary

The parameter of interest for this automatic debiasing method depends on a high-dimensional and/or nonparametric regression, and the method only needs the form of the object of interest. The regression learner can be anything that converges in mean square at a fast enough rate. The authors demonstrate root-n consistency and asymptotic normality and give a consistent asymptotic variance estimator for a range of causal and structural estimators, including nonlinear functionals of a regression. They apply these methods to estimate the average treatment effect and find similar results for Lasso, neural net and random forest regressions. They also estimate a correlated random slopes specification for consumer demand from scanner data and find estimates similar to fixed-slope elasticities.

Conclusion

As the age of AI becomes the new reality, debiasing machine learners will be crucial to ensuring true diversity and inclusion around the world, now and in the future. Equity in this age begins with diminishing biases in machine learning as much as possible. Debiasing starts with human developers and only then with the machine learning models themselves. There are several ways of approaching this problem: viewing problematic bias as a social problem and analysing its causes and implications via a framework of equity, debiasing rule-based machine learning models, and using Auto-DML. We will discuss further debiasing techniques and methods in future articles.




References:

'The Road to Conscious Machines: The Story of AI' by Michael Wooldridge
