Auroral classification ergonomics and the implications for machine learning

The machine-learning research community has focused heavily on bias in algorithms and has identified its different manifestations. Bias in training samples is recognised as a potential source of prejudice in machine learning, and it can be introduced by the human experts who define the training sets. As machine-learning techniques are applied to auroral classification, it is important to identify and address potential sources of expert-injected bias. In an ongoing study, 13 947 auroral images were manually classified, with significant differences between the classifications. This large dataset allowed some of these biases to be identified, especially those originating from the ergonomics of the classification process. These findings are presented in this paper to serve as a checklist for improving training data integrity, not just for expert classifications but also for crowd-sourced citizen-science projects. As the application of machine-learning techniques to auroral research is relatively new, it is important that biases are identified and addressed before they become endemic in the corpus of training data.

the image was suitable for algorithm training, the experts agreed on 95% of the labels. By only using the images with agreeing auroral labels and by excluding the images with ambiguous auroral forms, unwanted features and disagreeing labels, a clean training data set was produced at the price of excluding approximately 73% of the 13 947 images in the initial data set.

Figure 2. The original right-hand key set for classification. The grey-shaded keys were used for the classification.
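The agreement-based filtering described above can be sketched as follows; the record structure and field names are hypothetical, and only the selection logic mirrors the text:

```python
# Hypothetical image records: one label per expert, plus flags for ambiguous
# auroral forms and unwanted features (field names are illustrative).
images = [
    {"id": 1, "label_a": "arc",    "label_b": "arc",    "ambiguous": False, "unwanted": False},
    {"id": 2, "label_a": "patchy", "label_b": "arc",    "ambiguous": False, "unwanted": False},
    {"id": 3, "label_a": "arc",    "label_b": "arc",    "ambiguous": True,  "unwanted": False},
]

def clean_training_set(images):
    """Keep only images where the expert labels agree and neither
    exclusion flag (ambiguous form, unwanted feature) is set."""
    return [im for im in images
            if im["label_a"] == im["label_b"]
            and not im["ambiguous"]
            and not im["unwanted"]]

clean = clean_training_set(images)  # here only image 1 survives
```

Applied to the full data set, this kind of conjunction of exclusion criteria is what removed roughly 73% of the images.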

Ergonomic categories
The comparison of the classifications for both the trials and the main classification run allowed the identification of emerging biases based on the approach each researcher took to identify the aurorae in the images. These biases are a result of the levels of comfort (physical and cognitive) that exist during the classification process, leading to the term "classification ergonomics".
Those identified as part of this study are shown in Table 2 and are discussed in the subsequent sections.

Physical comfort bias
The classification of the aurorae in the main study was a 10-class system. Given the designations, the number keys were the obvious choice, and the classification software used these, either on the main keyboard (0-9) or the numeric keypad (KP0-KP9). In case of a mistake, it was possible to go back to the previous image, and the backspace key was used to accomplish this. This key configuration is shown in Figure 2.
The first bias that was noted was the inconvenience of the backspace for making corrections. This required moving the right hand completely away from the rest position where the fingers are hovering over the KP4, KP5 and KP6 keys on the keypad.
https://doi.org/10.5194/gi-2019-41 — Preprint. Discussion started: 28 January 2020. © Author(s) 2020. CC BY 4.0 License.

As this was awkward, there was a perceptible reluctance to make corrections. Thus the KP-DECIMAL key (to the right of the KP0 key) was used as an alias.
After several hundred classifications, discomfort was experienced, even with the keyboard rotated 10-20 degrees anticlockwise to make the keys suit the angle of the right hand. As a result, some testing was done with more comfortable key arrangements. This resulted in a basic WASD configuration being used. WASD refers to the directional (move forward/backward/left/right) keys used in FPS (first-person shooter) computer games.

This configuration is shown in Figure 3, where the coloured circles show the at-rest position of the fingers (keys A, S and D), with the arrows showing easy-reach positions. The left thumb rests on the spacebar. The little finger can typically reach the shift and control keys (as modifiers; in FPS games these might be, e.g., run and crouch), but these were not used here. The actual keys that were used for the classification are shaded in grey.
Additionally, the keyboard was rotated 10-15 degrees clockwise to match the natural angle of the left wrist and hand, as shown in Figure 4. This arrangement was used for most of the classification work and no discomfort was experienced.
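In software terms, both layouts amount to a remappable key-to-class table plus an undo alias. The sketch below is illustrative only: the actual bindings of the classification software, and which WASD-area keys carried which class, are not reproduced here.

```python
# Illustrative key maps for a 10-class scheme plus an undo action.
# Right-hand numeric-keypad layout, with KP_DECIMAL aliasing undo
# so corrections do not require reaching for backspace.
NUMPAD_LAYOUT = {f"KP{i}": i for i in range(10)}
NUMPAD_LAYOUT["KP_DECIMAL"] = "undo"

# Hypothetical WASD-area layout: classes under the resting left hand,
# undo on the spacebar under the thumb (assignments are invented).
WASD_LAYOUT = {
    "q": 0, "w": 1, "e": 2, "r": 3,
    "a": 4, "s": 5, "d": 6, "f": 7,
    "z": 8, "x": 9,
    "space": "undo",
}

def handle_key(layout, key):
    """Return the class number for a key, 'undo', or None if unbound."""
    return layout.get(key)
```

Making the layout a plain lookup table is what allows the ergonomics to be changed (keypad to WASD, backspace to a nearer alias) without touching the rest of the classification loop.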

Data contrast bias
If the classifier has just seen a faint, patchy aurora, then a following faint, patchy aurora is likely to be classified the same.
If the preceding image was a bright break-up, then it is more likely for the faint, patchy aurora to be classified as blank. In the initial parts of the study, attempts were made to mitigate this by normalising the image scale of all images. This was not readily achieved with colour images and thus not pursued. Instead, the two experts classified the images in random and in chronological order, respectively.
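A reproducibly randomised presentation order, of the kind one expert used to break up runs of similar consecutive images, might be sketched as follows (the function name and seed are illustrative):

```python
import random

def presentation_order(image_files, randomised=True, seed=42):
    """Return the order in which images are shown: chronological
    (assuming filenames sort by capture time), or reproducibly
    shuffled so that runs of similar consecutive images are broken up."""
    order = sorted(image_files)
    if randomised:
        random.Random(seed).shuffle(order)  # fixed seed: repeatable runs
    return order
```

A fixed seed keeps the shuffled order repeatable, so a classification session can be interrupted and resumed without re-randomising the sequence.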

Environment contrast bias
Humans naturally retain perceptual constancy. This allows visual features to be discerned against a noisy or changing background: a trait that is useful to all animals in a hunter-prey scenario, for instance. However, this trait of retaining perceptual constancy also results in optical illusions. Colour constancy and brightness constancy cause an illusion of colour or contrast difference when the luminosity or colour of the area surrounding an object is changed. The eye achieves this partly by compensating for the overall lighting (a change in the iris aperture), but the brain also compensates for subtle changes within the field of view. An example of this is shown in Figure 5.
In this study, the original figure background made it difficult to discern the difference between features which were faint, but still recognisable, and those which were subthreshold for visual identification. Hence, the figure background was changed to black. This made it easier to discern the borderline cases.
The environmental conditions beyond the computer screen were also significant, with differences in ambient lighting and room brightness being an issue. This was noted, and consistent arrangements were sought for the process.

Repetition bias
It is more comfortable to press the same button twice than to press two different buttons. Additionally, if a mistake is made, it is extra effort to go back and correct it. This "laziness" accumulates during the classification process, making long sessions problematic.
For example, if there are 10 similar images in a row, the chance of classifying number 11 in the same way is higher than if there were 10 random images first. In this study, the single-repetition bias was 27%, rising to 40% for the double-repetition bias.
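One way to quantify such repetition rates from a finished label sequence is sketched below. The reading of "single" and "double" repetition here (the overall rate of repeating the previous label, and the rate of repeating again when the previous two labels already matched) is an assumption, not necessarily the paper's exact definition:

```python
def repetition_rates(labels):
    """Overall rate of repeating the previous label, and the rate of
    repeating again when the previous two labels already matched."""
    pairs = list(zip(labels, labels[1:]))
    single = sum(a == b for a, b in pairs) / len(pairs)
    # Keep only positions where a repeat has just occurred (a == b),
    # then ask how often the next label continues the run.
    runs = [(b, c) for a, b, c in zip(labels, labels[1:], labels[2:]) if a == b]
    double = sum(b == c for b, c in runs) / len(runs) if runs else 0.0
    return single, double
```

For independently random labels over 10 classes, both rates sit near 0.10, so values such as the observed 27% and 40% lie well above the unbiased baseline.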

Learning bias
If there are many categories, the classifier may not hold all of them in mind. Thus some "sectors" of the classification may have a higher activation energy than others. For example, after classifying hundreds of arcs and patchy aurorae and then encountering a curled case, the classifier may subconsciously think: it is not patchy, so it must be an arc, inadvertently omitting the thought of a different class. This is a recency effect (where a new classification is biased toward the set of most recently used labels), which has been reported in the biological sciences (Culverhouse, 2007).

Feature bias
The classifier is more likely to classify a prominent feature correctly than a faint or diffuse feature. This leads to a form of confusion bias; e.g. what to do with a bright curl aurora (Class 2) on a background of diffuse patchy aurora (Class 5).