Catching cyberbullies with neural networks

Digital harassment is a problem in many corners of the internet, like internet forums, comment sections and game chat. In this article you can play with techniques to automatically detect users that misbehave, preferably as early in the conversation as possible. What you will see is that while neural networks do a better job than simple lists of words, they are also black boxes; one of our goals is to help show how these networks come to their decisions. Also, we apologize in advance for all of the swear words :).

According to a 2016 report, 47% of internet users have experienced online harassment or abuse [1], and 27% of all American internet users self-censor what they say online because they are afraid of being harassed. On a similar note, a survey by The Wikimedia Foundation (the organization behind Wikipedia) showed that 38% of the editors had encountered harassment, and over half them said this lowered their motivation to contribute in the future [2]; a 2018 study found 81% of American respondents wanted companies to address this problem [3]. If we want safe and productive online platforms where users do not chase each other away, something needs to be done.

One solution to this problem might be to use human moderators that read everything and take action if somebody crosses a boundary, but this is not always feasible (nor safe for the mental health of the moderators); popular online games can have the equivalent population of a large city playing at any one time, with hundreds of thousands of conversations taking place simultaneously. And much like a city, these players can be very diverse. At the same time, certain online games are notorious for their toxic communities. According to a survey by League of Legends player Celianna in 2020, 98% of League of Legend players have been ‘flamed’ (been part of an online argument with personal attacks) during a match, and 79% have been harassed afterwards [4]. The following is a conversation that is sadly not untypical for the game:

Z: fukin bot n this team…. so cluelesss gdam
V: u cunt
Z: wow ….u jus let them kill me
V: ARE YOU RETARDED
V: U ULTED INTO 4 PEOPLE
Z: this game is like playign with noobs lol….complete clueless lewl
L: ur shyt noob

For this article, we therefore use a dataset of conversations from this game and show different techniques to separate ‘toxic’ players from ‘normal’ players automatically. To keep things simple, we selected 10 real conversations between 10 players that contained about 200 utterances: utterance 1 is the first chat message in the game, utterance 200 one of the last (most of the conversations were a few messages longer than 200, we truncated them to keep the conversations uniform). In each of the 10 conversations, exactly 1 of the 10 persons misbehaves. The goal is to build a system that can pinpoint this 1 player, preferably quickly and early in the conversation; if we find the toxic player by utterance 200, the damage is already done.

Can’t we just use a list of bad words?

A first approach for an automated detector might be to use a simple list of swear words and insults like ‘fuck’, ‘suck’, ‘noob’ and ‘fag’, and label a player as toxic if they use a word from the list more often than a particular threshold. Below, you can slide through ten example conversations simultaneously. Normal players are represented by green faces, toxic players by red faces. When our simple system marks a player as toxic, it gets a yellow toxic symbol. These are all the possible options:

	Normal players	Toxic players
System says nothing (yet)	Normal situation	Missed toxic player
System says: toxic	False alarm	Detected toxic player

You can choose between detectors with thresholds of 1, 2, 3 and 5 bad words, to see what they do where in the conversation.

Slide through example conversations

Game 1
Game 2
Game 3
Game 4
Game 5
Game 6
Game 7
Game 8
Game 9
Game 10

First message

Last message

Select a detector

Word list detector

A player is toxic above this number of bad words

Word list detector

A player is toxic above this number of bad words

Word list detector

A player is toxic above this number of bad words

See how the detectors did

Toxic players correctly detected	False alarms
10 0

1st message

message 200

1st message

message 200

As you can see, the detector with the low threshold detects all toxic players early in the game, but has lots of false alarms (false positives). On the other hand, the detector with the high threshold does not have this problem, but misses a lot of toxic players (false negatives). This tension between false positives and false negatives is a problem any approach will have; our goal is to find an approach where this balance is somehow optimal.Teaching language to machinesA better solution might be to use machine learning: we give thousands of examples of conversations with toxic players to a training algorithm and ask it to figure out how to recognize harassment by itself. Of course, such an algorithm will learn that swear words and insults are good predictors for toxicity, but it can also pick up more subtle word combinations and other phenomena. For example, if you look at how often the green and red faces open their mouths in the visualization above, you’ll see that the average toxic player is speaking a lot more than the other players.We haved used 5000 other conversations to train a network that consists of an embedding layer (300 units), two bidirectional GRU layers (16 units), a pooling layer and two dense layers (256 units). The output layer is a single sigmoid unit indicating the network’s conﬁdence that the input text is toxic.The first layer, the embedding layer, is the most low level and contains information on what individual words often appear in similar contexts. The idea is that words that are similar in meaning appear in similar contexts, and thus get similar weights in the network. This means that if we visualize the weights in a 3D space with the T-SNE algorithm, words with a similar meaning should appear closer together:

As you explore the 3D space, you will find many interesting clusters of words that indeed are related in meaning. For example, there is a cluster of words related to time (highlight), a number cluster (highlight), a cluster of adjectives to rate something (highlight), but also (and more useful to the current task) a cluster of insults (highlight) and a cluster of variants of the word fuck (highlight).

However, just knowing the rough meaning of relevant words is not enough – we need to know when to act. Higher layers typically pick up increasingly more abstract tasks, like monitoring the temperature of the conversation as a whole. In the interactive visualization below, you can see which neurons in which layers respond positively (green words) or negatively (red words) to different parts of the conversations [5].

In the first layer, we see that example neuron 1 has developed an interest in several abbreviations like ‘gj’ (good job), ‘gw’ (good work), ‘ty’ (thank you) and to a lesser extent ‘kk’ (okay, or an indication of laughter in the Korean community) and brb ‘be right back’. Example neuron 12 focuses on a number of unfriendly words, activating on ‘stupid’, ‘dumb’, ‘faggot’ and ‘piece of shit’, and also somewhat for ‘dirty cunt’. Note that its results are swapped compared to neuron 1 (red for good predictors of toxicity and green for good predictors of collaborative players instead of vice versa), which will be corrected by a neuron in a later layer. Neuron 16 activates on ‘mia’ (missing in action), which is typically used to notify your team mates of possible danger, and thus a sign that this person is collaborative and probably not toxic.

The neurons in the second layer monitor the conversation on a higher level. In contrast to the abrupt changes in the first layer, the colors in the second layer are fading more smoothly. Neuron 17 is a good example, where we see that the conversation is green in the beginning and slowly goes from yellow to orange, and then later back to green again. Several neurons, like neuron 6 in the second layer, find the repetitive use of ‘go’ suspicious since they are activated more with each repetition.

But does it work?

The big question is whether a harassment detector using a neural network instead of a word list performs better. Below you can compare the previous word list-based approach against three neural networks with different thresholds [5]. The threshold now is not the number of bad words, but the network’s confidence: a number between 0 and 1 indicating how sure the network is that a particular player is toxic. Here you see the results for a neural network with three different confidence thresholds, compared to a word list-based detector with a threshold of 2 bad words: