Relationship Between Facial Areas With the Greatest Increase in Non-local Contrast and Gaze Fixations in Recognizing Emotional Expressions

The aim of our study was to analyze gaze fixations during the recognition of facial emotional expressions in comparison with the spatial distribution of the areas showing the greatest increase in total (nonlocal) luminance contrast. It is hypothesized that the most informative areas of the image, which receive more of the observer's attention, are the areas with the greatest increase in nonlocal contrast. The study involved 100 university students aged 19-21 with normal vision. 490 full-face photographs were used as stimuli. The images displayed faces expressing the 6 basic emotions (Ekman's Big Six) as well as neutral (emotionless) expressions. Observers' eye movements were recorded while they recognized the expressions of the shown faces. Then, using software we developed, the areas with the highest (max), lowest (min), and intermediate (med) increases in total contrast relative to the surroundings were identified in the stimulus images at different spatial frequencies. Comparative analysis of the gaze maps with the maps of the areas with min, med, and max increases in total contrast showed that gaze fixations in facial emotion classification tasks significantly coincide with the areas characterized by the greatest increase in nonlocal contrast. The obtained results indicate that facial image areas with the greatest increase in total contrast, which are preattentively detected by second-order visual mechanisms, can be the prime targets of attention.


Introduction
The ability to recognize a facial expression is considered a component of emotional intelligence and plays an important part in human communication, including educational communication (Kosonogov et al., 2019; Belousova and Belousova, 2020; Budanova, 2021). In recent years, symptoms of a disrupted ability to perceive facial expressions have often become a special subject of therapeutic interventions. Contemporary research also acknowledges the genetic influence on functioning of

The stimulus exposure time was 700 ms. Verbal labels of all possible facial expressions appeared following each faded stimulus. The subjects responded by clicking a mouse button to indicate which emotion they thought was shown. Prior to the experiment, all subjects underwent training that helped them understand the task and procedure and refreshed the names of the emotional expressions. Since the differentiation of emotions is a common task for an adult, prolonged training was not required. First, subjects viewed, in free-viewing mode, photographs of men and women showing different facial expressions. Each image was accompanied by a caption indicating the displayed emotion. Then, in order to familiarize the subjects with the procedure and make sure that they understood the task correctly, several training trials were carried out. The images used in the training were not used in the main experiment.
The duration of the main experiment did not exceed 20 minutes, and the experimental task was not tiring. Moreover, since we recorded not only eye movements but also the subjects' responses, we were able to monitor the development of fatigue during the experiment. Comparing the percentage of correct answers in the first and last thirds of the experiment, we found no significant decrease in performance.

Eye-tracking
Eye movements were recorded using the SMI RED-m tracker. The standard calibration procedure for the device was carried out prior to each experiment. Eye position was sampled at 60 Hz, with a gaze localization accuracy of 30 arc minutes. For each stimulus, a fixation density map (FDM) was constructed by averaging over all subjects.
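A fixation density map of the kind described above can be sketched as follows. This is a minimal illustration, not the authors' software: the function name and the Gaussian smoothing width `sigma_px` are assumptions, not parameters reported in the study.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_density_map(fixations, shape, sigma_px=30):
    """Build a fixation density map (FDM): place each fixation on an
    empty grid, smooth with a Gaussian kernel, and normalize so the
    map sums to 1. `fixations` is an iterable of (row, col) gaze
    coordinates pooled over all subjects."""
    fdm = np.zeros(shape, dtype=float)
    for r, c in fixations:
        if 0 <= r < shape[0] and 0 <= c < shape[1]:
            fdm[r, c] += 1.0
    fdm = gaussian_filter(fdm, sigma=sigma_px)
    total = fdm.sum()
    return fdm / total if total > 0 else fdm
```

Normalizing each map to a probability distribution also makes the later metric comparisons (correlation, EMD) independent of the number of recorded fixations per stimulus.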

Digital image processing
Using software we developed, which compares the total luminance contrast in the central operator window with the total contrast in the surrounding area, we identified the face image areas with the highest (max), lowest (min), and intermediate (med) increases in total contrast. The med areas were located on a notional straight line connecting the nearest min and max regions, and the degree of contrast increase in a med area was intermediate between those of the min and max areas.
For digital image processing, we used a concentric operator consisting of a central area (the central window of the operator) and a surrounding ring (the peripheral part of the operator). The width of the peripheral ring was equal to the diameter of the central window. First, in the central area of the concentric operator, we calculated the total energy of the image filtered at a frequency of 4 cycles per diameter of this central area. This filtering frequency was set based on the optimal ratio of carrier to envelope frequencies for human perception of contrast modulations (Babenko, Ermakov and Bozhinskaya, 2010; Sun and Schofield, 2011; Li et al., 2014). In the peripheral part of the operator, the spectral power over the entire range of spatial frequencies perceived by humans was calculated and averaged per octave. The contrast modulation amplitude was the difference between the spectral power in the central and peripheral regions of the operator.
Changing the diameter of the operator's window while maintaining the filtering frequency (4 cycles per window diameter) made it possible to identify these areas in 5 different ranges of spatial frequencies, each 1 octave wide (with center frequencies of 4, 8, 16, 32 and 64 cycles per image). The relationship between the operator's diameter and the filtering frequency (the smaller the diameter, the higher the frequency) reflects the well-known property of second-order visual mechanisms that ensures their scale invariance (Sutter, Sperling and Chubb, 1995; Kingdom and Keeble, 1999; Dakin and Mareschal, 2000; Landy and Oruç, 2002).
Using the largest operator, whose central area equaled the size of the image, we marked one area each with the highest, lowest, and intermediate modulation of total contrast in every stimulus. Then, by repeatedly halving the operator's diameter, 2, 4, 8 and 16 areas were marked for each contrast modulation amplitude (min, med and max). The total diameter of the areas identified at each spatial frequency was equal to the diameter of the notional circle into which the original image was inscribed. For each stimulus, 3 maps of the distribution of areas with min, med and max contrast modulation were constructed. These maps were a superposition of Gaussians.
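The center-versus-surround computation of the concentric operator can be sketched as follows. This is a simplified illustration, not the authors' software: the 1-octave band-pass filter is approximated by a difference of Gaussians, and the sigma-to-frequency mapping is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def bandpass(img, cycles, size):
    """Approximate a roughly 1-octave band-pass filter as a difference
    of Gaussians tuned near `cycles` per `size` pixels (an assumed
    filter design; the original software's filter is not specified)."""
    sigma = size / (2 * np.pi * cycles)  # rough center-frequency match
    return gaussian_filter(img, sigma) - gaussian_filter(img, 2 * sigma)

def contrast_modulation(img, center, diameter):
    """Difference between band-pass energy inside the central window
    and in the surrounding ring whose width equals the window
    diameter, i.e. a nonlocal contrast increase at this scale."""
    energy = bandpass(img.astype(float), cycles=4, size=diameter) ** 2
    yy, xx = np.ogrid[:img.shape[0], :img.shape[1]]
    dist = np.hypot(yy - center[0], xx - center[1])
    inner = energy[dist <= diameter / 2]
    ring = energy[(dist > diameter / 2) & (dist <= 1.5 * diameter)]
    return inner.mean() - ring.mean()
```

Evaluating `contrast_modulation` over a grid of centers at each operator diameter would yield candidate min, med and max locations per spatial frequency range, mirroring the halving procedure described above.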

Statistical data analysis
The empirical maps (FDMs) were compared with calculated theoretical maps resulting from the digital processing of the stimuli. To assess the similarity of the maps, two distribution-based metrics were used: Pearson's linear correlation coefficient (Cc), which shows whether there is a linear relationship between two variables, and the EMD (Earth mover's distance, or Wasserstein distance), a spatially robust measure that, unlike other similar metrics, takes into account the spatial differences between theoretical and empirical results (Bylinskii et al., 2018). To calculate the distance matrix, we used a Python implementation of the similarity metric (Pele and Werman, 2009).
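The two metrics can be sketched with `scipy` as follows. Note the simplification: the paper used the dedicated 2-D EMD implementation of Pele and Werman (2009) with a full distance matrix, whereas here the 1-D Wasserstein distance on the row and column marginals serves as a lightweight stand-in.

```python
import numpy as np
from scipy.stats import pearsonr, wasserstein_distance

def map_similarity(fdm, model):
    """Compare an empirical FDM with a theoretical map using the two
    metrics from the text: Pearson's Cc on the flattened maps, and an
    approximate EMD computed on each axis's marginal distribution."""
    cc = pearsonr(fdm.ravel(), model.ravel())[0]
    rows = np.arange(fdm.shape[0])
    cols = np.arange(fdm.shape[1])
    emd = (wasserstein_distance(rows, rows, fdm.sum(axis=1), model.sum(axis=1))
           + wasserstein_distance(cols, cols, fdm.sum(axis=0), model.sum(axis=0)))
    return cc, emd
```

A higher Cc and a smaller EMD both indicate greater similarity between the empirical and theoretical maps.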

Results
First, we compared the empirical maps for each of the 490 stimuli with the distribution maps of min, med, and max regions constructed from image areas identified in all five spatial frequency ranges. Due to the non-normal distribution of the data and the heterogeneity of variances, we used a nonparametric test. The median correlation coefficient for min was -0.109; for med and max it was 0.323 and 0.459, respectively. Comparing these scores using the Kruskal-Wallis rank sum test (df = 2, n = 1470) showed that the similarity of theoretical and empirical maps significantly increases with an increase in the contrast modulation amplitude of the selected areas (p < 0.001). The median EMD scores for min, med, and max were 5.266, 3.371, and 3.266, respectively. It should be noted that a smaller EMD indicates greater similarity between theoretical and empirical maps. The Kruskal-Wallis rank sum test showed that this similarity significantly increases with the increase in the contrast of the selected areas (p < 0.001). Then we performed a similar analysis separately for each of the spatial frequency ranges. At this stage, the empirical maps remained the same, while the theoretical maps were built from areas identified in a narrow range (1 octave) of spatial frequencies with a central frequency of 4, 8, 16, 32 or 64 cycles per image. The correlation analysis results are presented in Table 1; the higher the Kruskal-Wallis chi-squared score, the more pronounced the differences between the compared values (in this case, the correlation coefficients).
Statistical comparison of the obtained scores using the Kruskal-Wallis rank sum test showed that the similarity of theoretical and empirical maps significantly increases with an increase in the contrast modulation amplitude of the selected areas.
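The group comparison can be sketched with `scipy.stats.kruskal`. The scores below are simulated around the reported medians purely for illustration; they are not the study's data, and the assumed spread is arbitrary.

```python
import numpy as np
from scipy.stats import kruskal

# Hypothetical per-stimulus correlation scores for the three map
# types (490 stimuli per group, so n = 1470 and df = 2 as reported).
rng = np.random.default_rng(1)
cc_min = rng.normal(-0.109, 0.2, 490)
cc_med = rng.normal(0.323, 0.2, 490)
cc_max = rng.normal(0.459, 0.2, 490)

# Rank-based test: no normality or equal-variance assumptions needed.
h_stat, p_value = kruskal(cc_min, cc_med, cc_max)
```

A significant result here only says the three groups differ somewhere; ranking which pairs differ requires the post-hoc pairwise comparisons reported below.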

Table 2. EMD scores in different spatial frequency ranges and the effect of increasing the contrast modulation amplitude
The results of the EMD analysis (shown in Table 2) were consistent with the previous analysis. These results also support the conclusion that the higher the increase in the total contrast of the selected areas, the more the calculated maps coincide with the empirical FDMs.
To clarify the results obtained at various spatial frequencies, we conducted a post-hoc pairwise comparison of the values obtained for the min, med and max areas using the Conover test (Tables 3 and 4).

Table 4. Post-hoc analysis results for the EMD
The post-hoc analysis showed that the relationship between facial areas with the greatest increase in nonlocal contrast and gaze fixations breaks down at high spatial frequencies (32 and 64 cpi). This suggests that low and medium spatial frequencies (4, 8 and 16 cpi) are more important for attention control when viewing time is limited. Higher spatial frequencies also appear able to direct the observer's attention, but only with longer exposure.

Discussion
The main goal of our study was to test the hypothesis that the most informative facial regions may be the regions with the greatest increase in nonlocal contrast. The obtained results showed that, in recognizing emotions on faces, the distribution of gaze fixations significantly coincides with the layout of areas with the greatest increase in nonlocal contrast at low and medium spatial frequencies. The similarity of theoretical and empirical maps significantly decreases with a decrease in the amplitude of contrast modulation in the selected areas. This effect was observed comparing the maps using both the correlation coefficient and the EMD, and it applies both to maps that combine the selected areas from all five octaves and to maps constructed from 1-octave ranges of spatial frequencies.
As stated previously, image areas with contrast modulation activate second-order visual mechanisms in human vision. But how can the functioning of these mechanisms be related to the organization of eye movements? Based on the fact that image areas that differ from their surroundings in physical characteristics are more informative (Itti, Koch and Niebur, 1998; Einhauser and Konig, 2003; Honey, Kirchner and VanRullen, 2008; Fuchs et al., 2011), it is logical to assume that the targets of focal attention are the areas with the greatest increase in nonlocal contrast. Spatially overlapping second-order visual mechanisms are able to find these areas in the image automatically, at different levels of resolution. The activation of such a mechanism is proportional to the contrast modulation within the receptive field of the second-order filter. We assume that the more a filter is activated, the higher its ability to draw attention to the corresponding part of the visual field. As a result, the most activated second-order visual mechanisms become "windows" for attention, through which the higher levels of processing receive information from the preattentive stage.
We believe that the perception of a face goes through certain stages. When a new object, a face in particular, appears in the observer's field of view, perception begins with separating this object from the background. Since second-order visual mechanisms have receptive fields of different sizes (Sutter, Beck and Graham, 1995; Kingdom and Keeble, 1999; Dakin and Mareschal, 2000; Landy and Oruç, 2002), it is always possible to find among them one whose field best matches the size of the appeared face. As a result, this mechanism is centered on the appeared face. It is tuned to a lower spatial frequency than the other, smaller second-order visual mechanisms also involved in facial processing, and therefore has an advantage in initiating the saccade. This conclusion is based on the fact that ultra-rapid saccades to faces are initiated precisely by low spatial frequencies (Guyader et al., 2017). Thus, because the low-frequency second-order visual mechanism is centered relative to the face, the initial saccade will, with high probability, be directed towards the center of the face. This may explain the previously reported tendency of first saccades to be directed to the geometric center of the presented image (Tatler, 2007; Burton, 2009, 2010; Atkinson and Smithson, 2020). Attention directed to the center of the face allows the observer to obtain general (low-frequency) information about the configuration of the appeared object and classify it as a face (Meinhardt-Injac, Persike and Meinhardt, 2010; Cauchoix et al., 2014; Comfort and Zana, 2015). As shown in Figure 1, the averaged FDM has a peak in the center of the face (between the nose bridge and the mouth). Moreover, the statistical analysis (Tables 1 and 2) confirms that the empirical map of gaze fixations most closely matches the calculated max map obtained at the lowest spatial frequency.
However, prior research, both behavioral (Leder and Bruce, 1998; Cabeza and Kato, 2000; Collishaw and Hole, 2000; Schwaninger, Lobmaier and Collishaw, 2002; Bombari, Mast and Lobmaier, 2009) and neuroimaging (Rossion et al., 2000; Harris and Aguirre, 2008; Lobmaier et al., 2008; Betts and Wilson, 2009; Liu, Harris and Kanwisher, 2010), indicates the contribution of not only configural but also featural processing to face recognition. A detailed (featural) description of faces can be performed by second-order visual mechanisms tuned to higher spatial frequencies. As their frequency tuning increases, these filters highlight smaller and smaller parts of the face. It is generally agreed that the most valuable frequency range for face recognition is from 8 to 32 cycles per face (Nasanen, 1999; Ruiz-Soler and Beltran, 2006; Willenbockel et al., 2010; Collin et al., 2014). As shown in Figure 1 (lower right corner), the areas with the greatest increase in contrast in the frequency range from 11 to 22 cpi (central frequency 16 cpi) are located around the eyes and mouth, the areas most informative for the perception of faces (Butler et al., 2010; Peterson and Eckstein, 2012; Smith, Volna and Ewing, 2016; Royer et al., 2018). Therefore, the smaller the image areas highlighted by second-order visual mechanisms, the more detailed the information available for analysis at higher processing levels.

Conclusions
The results of the study allow us to conclude that, in recognizing emotional facial expressions, the higher the luminance contrast of a facial area, the higher the probability that this area will become the object of the observer's attention. Gaze fixations were shown to correlate best with the regions of maximum modulation of nonlocal contrast carrying information from the lower half of the frequency spectrum. Perhaps this can be explained by the fact that in our experiments the viewing time was limited to 700 ms per image. This amount of time is enough to make a decision about the emotional expression, but during it the observer can perform only 2-4 saccades, initiated by low-frequency information. Increasing the exposure time would allow the observer to attend to the details of the perceived image and could strengthen the connection between gaze fixations and high-frequency information.
In our opinion, spatial modulation of contrast in an image is extracted by second-order visual mechanisms. The more the contrast is modulated within their receptive fields, the higher their activation, and the higher the activation, the higher the probability of drawing attention to that area of the visual field. The more activated mechanisms can alternately attract visual attention and initiate saccades towards the areas with the greatest increase in nonlocal contrast, starting with lower spatial frequencies.
The obtained results open perspectives for new studies that could determine whether modulations of nonlocal contrast play a universal role in the perception of not only faces but also other objects, and could examine the role of other spatial modulations of luminance gradients (modulations of orientation or spatial frequency) in bottom-up visual attention control.
The accumulation of experimental data in this field contributes to the development of image segmentation algorithms and to solving the salience problem. New knowledge about the regularities and mechanisms of determining "regions of interest" will help optimize the preliminary processing of input information in artificial vision systems and can be useful in the development of image classification systems based on deep learning networks.