Encoding biases in the algorithmic modeling process

forms of opacity and obscuring based on a case study of the Boston Housing dataset

Authors

Abstract

Datasets are the raw material of the machine learning process. Data scientists use toy datasets to test algorithms, and students use them to learn how those algorithms work. The Boston Housing dataset is one such toy dataset. One of its attributes, called "B", caught the attention of the researcher Michael Carlisle (2019): it encodes the proportion of Black residents in each neighborhood. The attribute contains neither absolute numbers nor percentages, but the result of a non-invertible function that produces a "ghetto effect", in which certain levels of racial segregation have a positive effect on property values. This article presents a systematic literature review of a relevant sample of recent publications that cite this dataset, in order to identify whether the dataset was properly identified by the authors, what use they made of it, which models were developed, and whether the variable "B" influenced the results. These questions aim to contribute to research on algorithmic or coded biases. Such biases remain hidden, since mathematical models are often black boxes, and their investigation is usually carried out indirectly, through their results. By identifying the presence and role of the "B" attribute in publications, it becomes possible to estimate the invisibility of the dataset used to develop or propose models. The fact that the attribute received relatively little attention until recently shows how an explicitly racist feature can go unnoticed even while it is included in the calculations. Investigating it may help indicate ways to identify other biased datasets.
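The abstract does not reproduce the formula behind "B"; a minimal sketch, assuming the encoding documented for the dataset (B = 1000(Bk − 0.63)², where Bk is the proportion of Black residents by town, from Harrison and Rubinfeld, 1978, as discussed by Carlisle, 2019), shows why the function is non-invertible and how the "ghetto effect" arises:

```python
def encode_b(bk: float) -> float:
    """Documented encoding of the 'B' attribute:
    B = 1000 * (Bk - 0.63)^2, where Bk is the proportion of
    Black residents by town (Harrison & Rubinfeld, 1978).
    """
    return 1000.0 * (bk - 0.63) ** 2

# The parabola is non-invertible: proportions on either side of
# the 0.63 vertex map to the same encoded value, so a given B
# cannot be traced back to a single demographic reality.
# E.g., Bk = 0.26 and Bk = 1.00 both yield B = 136.9, and B only
# grows as Bk moves away from 0.63 in either direction.
for bk in (0.26, 0.63, 1.00):
    print(f"Bk = {bk:.2f} -> B = {encode_b(bk):.1f}")
```

Because a model trained on "B" treats larger values uniformly, segregation levels far from the 0.63 vertex are associated with the same encoded signal regardless of direction, which is the opacity the article investigates.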

Published

2023-02-28