With the expanding use of social media platforms such as Twitter and the growing amount of text data generated online, hate speech and toxic language have been shown to negatively affect individuals in general, and marginalized communities in particular. To improve the online moderation process, there is an increasing need for accurate detection tools that do not simply flag bad words but help filter out toxic content in a more nuanced fashion. A problem of central importance is therefore to acquire better-quality data on which to train toxic content detection models. However, the absence of a universal definition of hate speech makes the collection process hard and leaves the training corpora sparse, imbalanced, and challenging for current machine learning techniques. In this thesis, we address the problem of automatic toxic content detection along three main axes: (1) the construction of resources that robust toxic language and hate speech detection systems currently lack, (2) the study of bias in hate speech and toxic language classifiers, and (3) the assessment of the inherent toxicity and harmful biases of NLP systems by probing the Large Pre-trained Language Models (PTLMs) at their core.
In order to train a multi-cultural, fine-grained hate speech and toxic content detection system,
we have built a new multi-aspect hate speech dataset in English, French, and Arabic. We also
provide a detailed annotation scheme, which indicates (a) whether a tweet is direct or indirect;
(b) whether it is offensive, disrespectful, hateful, fearful out of ignorance, abusive, or normal; (c)
the attribute based on which it discriminates against an individual or a group of people; (d) the
name of this group; and (e) how annotators feel about the tweet, on a range of negative to neutral sentiments. We define a classification task for each labeled aspect and use multi-task learning to investigate how such a paradigm can improve the detection process.
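As an illustration of this multi-task setup, the following minimal PyTorch sketch shares a single transformer encoder across several aspect-specific classification heads. The aspect names, label counts, and encoder choice are illustrative assumptions, not the exact configuration used in the thesis.

```python
# A minimal multi-task sketch: one shared encoder, one classification head per
# annotated aspect. The aspect names, label counts, and encoder below are
# illustrative placeholders, not the exact configuration used in the thesis.
import torch.nn as nn
from transformers import AutoModel

ASPECTS = {                      # hypothetical label counts per aspect
    "directness": 2,             # direct vs. indirect
    "hostility": 6,              # offensive, disrespectful, hateful, fearful, abusive, normal
    "target_attribute": 5,       # e.g. origin, gender, religion, ...
    "target_group": 10,
    "annotator_sentiment": 4,    # negative-to-neutral sentiment range
}

class MultiAspectClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # one linear head per aspect, all sharing the same encoder parameters
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, n) for name, n in ASPECTS.items()}
        )

    def forward(self, input_ids, attention_mask):
        output = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = output.last_hidden_state[:, 0]        # [CLS] representation
        return {name: head(cls) for name, head in self.heads.items()}

# Training would sum the per-aspect cross-entropy losses over a shared batch,
# e.g. loss = sum(F.cross_entropy(logits[a], labels[a]) for a in ASPECTS).
```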
Unsurprisingly, when testing the detection system, the imbalanced data, along with implicit toxic content and misleading instances, results in both false positives and false negatives. We examine misclassified instances caused by the frequently neglected yet deep-rooted selection bias introduced by the data collection process. In contrast to work on bias, which typically focuses on classification performance, we investigate another source of bias and present two language- and label-agnostic evaluation metrics based on topic models and semantic similarity measures to evaluate the extent of this problem across various datasets. Furthermore, since research in this area generally focuses on English and overlooks other languages, we observe a gap in content moderation across languages and cultures, especially in low-resource settings. Hence, we leverage the observed differences and correlations across languages, datasets, and annotation schemes to carry out a study of multilingual toxic language data and of how people react to it.
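To make the idea of a language- and label-agnostic comparison concrete, here is a small sketch that measures how semantically close two corpora are by comparing sentence-embedding centroids. The embedding model and the centroid-cosine measure are illustrative stand-ins, not the thesis's actual topic-model and similarity metrics.

```python
# A rough sketch of a label-agnostic comparison between two corpora: embed each
# text, average the embeddings per corpus, and compare the centroids. The model
# name and the centroid-cosine measure are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def corpus_centroid(texts):
    """Embed each text and return the mean (centroid) embedding."""
    embeddings = model.encode(texts, convert_to_numpy=True)
    return embeddings.mean(axis=0)

def corpus_similarity(corpus_a, corpus_b):
    """Cosine similarity between corpus centroids (close to 1.0 = very similar content)."""
    a, b = corpus_centroid(corpus_a), corpus_centroid(corpus_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Usage: an unexpectedly high similarity between two independently collected
# hate speech datasets can hint at shared keyword-based sampling, i.e. a common
# selection bias, without ever looking at the labels.
# print(corpus_similarity(tweets_dataset_1, tweets_dataset_2))
```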
Finally, social media posts are part of the training data of Large Pre-trained Language Models (PTLMs), which are at the center of all major NLP systems nowadays. Despite their incontestable usefulness and effectiveness, PTLMs have been shown to carry and reproduce harmful biases due, among other reasons, to the sources of their training data. We propose a methodology to probe the potentially toxic content they convey with respect to a set of templates, and we report how often they enable toxicity towards specific communities in English, French, and Arabic.
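The template-based probing idea can be sketched with a masked language model: fill a slot in a template that mentions a community and count how often the model's top completions fall in a toxic lexicon. The templates, the toy lexicon, and the English-only model below are placeholder assumptions for illustration only.

```python
# A hedged sketch of template-based toxicity probing of a masked PTLM.
# Templates, the tiny toxic-word list, and the model are illustrative only.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

TEMPLATES = [
    "Why are {group} so [MASK]?",
    "All {group} are [MASK].",
]
TOXIC_WORDS = {"stupid", "lazy", "dangerous", "violent"}  # toy lexicon

def probe(group, top_k=10):
    """Return the fraction of top-k completions that land in the toxic lexicon."""
    hits, total = 0, 0
    for template in TEMPLATES:
        predictions = fill_mask(template.format(group=group), top_k=top_k)
        hits += sum(p["token_str"].strip().lower() in TOXIC_WORDS for p in predictions)
        total += top_k
    return hits / total

# print(probe("immigrants"))
# For French and Arabic, the same procedure would use translated templates and
# language-specific or multilingual masked language models.
```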
The results presented in this thesis show that, despite the complexity of such tasks, there are promising paths to explore in order to improve the automatic detection, evaluation, and, eventually, mitigation of toxic content in NLP.