Auditory Feedback Research

EVALUATING AUDITORY FEEDBACK WITHIN VIRTUAL PERSONAL ASSISTANT INTERACTIONS 2014
Team: Heather Bales, Brett Burnside, Ann Huang
Tools: Self Assessment Manikin, Likert Scales, Questionnaire, ANOVA
Role: Concept Development, Interaction Designer, User Researcher, Co-author
The prevalence of modern mobile devices inspired our group to explore the potential of virtual personal assistants (VPAs) such as Siri, Google Now, and Cortana. While users can choose their preferred voice, the auditory feedback sounds used to acknowledge user input and precede VPA voice feedback are usually not customizable. We reviewed several academic papers on sound preferences and multimodal feedback and designed a study to test our hypothesis that differing auditory feedback would affect overall user satisfaction with VPA interactions.
 
BACKGROUND
Previous research has identified three different terms for describing auditory feedback.
 
Auditory icons - Gaver (1986) defined these as "caricatures of naturally occurring sounds" with a real-world association. An example is the sound of a closing door to represent exiting a program.
 
Earcons - Blattner (1989) defined earcons as "audio messages used in the user-computer interface to provide information and feedback to the user". An example would be the dissonant beeps heard when errors occur or the short song that plays when an operating system is starting up.
 
Spearcons - Walker (2006) introduced this more recent variant, in which audio recordings of spoken phrases are sped up until they function as earcons. An example is the use of the word "Wait" at crosswalks.
 
The evolution of feedback noises from real-world-mapped auditory icons to human-composed earcons to speech-based spearcons shows an increasing focus on creating sounds that mean something to users. Research suggests some rules for the auditory tones most users prefer: a certain range of pitches is found to be ideal, musical earcons are typically preferred over beep-based sounds, and spearcons are useful for conveying recognizable complex data.
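For illustration, here is a minimal sketch of spearcon generation, assuming the librosa and soundfile packages; spearcons are typically produced with pitch-preserving time compression, which librosa's phase-vocoder time stretch approximates. The file names and speed-up factor are hypothetical.

```python
# Sketch: compressing a spoken phrase into a spearcon.
# Assumes librosa + soundfile; "phrase.wav" is a hypothetical recording.
import librosa
import soundfile as sf

# Load the recorded phrase at its native sample rate.
y, sr = librosa.load("phrase.wav", sr=None)

# Speed the speech up (here 2.5x) while preserving pitch, so the result
# plays like an earcon yet retains traces of the original phrase.
spearcon = librosa.effects.time_stretch(y, rate=2.5)

sf.write("spearcon.wav", spearcon, sr)
```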
 
We also investigated ways to measure user satisfaction beyond a simple Likert scale. Seebode et al.'s "Affective Quality of Audio Feedback in Different Contexts" introduced the use of Bradley's Self-Assessment Manikin (SAM) to measure users' emotional responses to sounds. This scale offers an easily administered, non-verbal method for assessing pleasure and arousal. Seebode et al. used the SAM in combination with a Likert scale to get an overall measure of user satisfaction.
 
HYPOTHESIS
Non-verbal auditory feedback is a secondary aspect of VPA interactions, meaning that users subconsciously process the sounds without being distracted from their primary task. However, our research on the topic of audio feedback showed that users hold preferences between different types of audio feedback. In response, we offer the following hypothesis:
 
We hypothesize that using different types of acknowledgement sounds will change overall user satisfaction with VPA interactions. More specifically, we predict that users will prefer the following acknowledgement sounds in decreasing order of overall interaction satisfaction: spearcon, multi-tone earcon, single tone earcon, and no sound.
 
STUDY
We designed a study to measure the effects of changing the type of auditory feedback within a VPA interaction. Our group selected a set of unambiguous VPA questions and recorded the corresponding answers, as well as an error message, using a neutral computer-generated voice. To test the different types of feedback, we generated three types of audio feedback to preface each VPA response: a single tone earcon, a multi-tone (musical) earcon, and a spearcon. Together with a no-sound control, these formed our four experimental conditions.
We recruited a convenience sample of twenty participants and asked each to evaluate a prototype VPA with a set of preset questions. Each participant was given a sheet with our five preset questions and was asked to read the questions into a dummy microphone. Unbeknownst to them, participants had been sequentially assigned to one of the four conditions above, and each heard answers with their condition's audio treatment. A second researcher acted as a Wizard of Oz, remotely triggering the VPA responses from a laptop connected to our Bluetooth speaker.
Participants were asked to wait for a response to each of their five questions and, afterward, to fill out a questionnaire about the experience as well as a short demographic survey. For our questionnaire, we used a paper version of the SAM and two Likert scales measuring VPA correctness and VPA interaction satisfaction. The correctness question served as a distractor while we collected the responses for pleasure, arousal, and satisfaction. Our basic demographic survey asked about other potential sources of bias noted in our review of academic papers; these questions covered gender, age, hearing ability, musical training, and VPA familiarity.
 
ANALYSIS
After converting our responses into scores between one and five, we used a one-way between-groups analysis of variance (ANOVA) to analyze our data. The left-most choice for each measure (positive, excited, and strongly agree) was assigned five points, and the right-most choice one point.
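As a concrete illustration of this recoding (a minimal sketch; the function name is ours):

```python
# Recode a 1-based choice position (1 = left-most) on a five-point
# scale so that the left-most choice scores 5 and the right-most 1.
def recode(position: int) -> int:
    return 6 - position

# Left-to-right positions 1..5 map to scores 5..1.
assert [recode(p) for p in range(1, 6)] == [5, 4, 3, 2, 1]
```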
 
Each treatment group was composed of five participants with either a two male/three female or three male/two female split. When rating the pleasure of the interaction, the participants in the single tone group gave the highest rating (M=4.80, SD=0.447), followed by a tie between the no sound group (M=4.40, SD=0.548) and the musical earcon group (M=4.40, SD=0.548), and lastly the spearcon group (M=4.00, SD=1.225). The highest rating for interaction arousal came from the no sound group (M=4.00, SD=0.707), the second highest from the musical earcon group (M=3.80, SD=0.837), and the lowest from the single tone group (M=3.20, SD=0.837) and the spearcon group (M=3.20, SD=0.837). The highest rating for interaction satisfaction came from the no sound group (M=4.60, SD=0.548), the second highest from the spearcon group (M=4.40, SD=0.548), the third highest from the musical earcon group (M=4.20, SD=0.447), and the lowest from the single tone group (M=4.00, SD=0.707).
As part of our statistical analysis we ran Levene's Test of Homogeneity of Variances to determine whether the group variances were equal. We failed to reject the null hypothesis at the 0.05 significance level based on our high p-values for each measure: pleasure (0.411), arousal (0.411), and satisfaction (0.828). We found no evidence that the samples come from populations with unequal variances and therefore concluded that the assumption of homogeneity of variances was not violated.
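For reference, the test itself is a one-liner in scipy; a sketch with placeholder score lists (illustrative only, not our actual responses):

```python
# Levene's test for homogeneity of variances across the four
# conditions. Assumes scipy; the score lists are placeholders.
from scipy import stats

no_sound = [5, 4, 5, 4, 5]
single_tone = [4, 4, 3, 5, 4]
musical_earcon = [4, 5, 4, 4, 4]
spearcon = [5, 4, 4, 5, 4]

w, p = stats.levene(no_sound, single_tone, musical_earcon, spearcon)

# p > .05 means we cannot reject equal variances, so the ANOVA's
# homogeneity assumption is not violated.
print(f"Levene W = {w:.3f}, p = {p:.3f}")
```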
 
 
The results of our one-way between-groups ANOVA show no statistically significant difference between condition groups on any measure: pleasure, F(3,16) = 0.928, p = .450; arousal, F(3,16) = 1.691, p = .224; satisfaction, F(3,16) = 1.026, p = .408. For every measure, F_obtained was less than F_critical(3,16) = 3.24.
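A sketch of how this ANOVA and the critical value can be computed, assuming scipy and the same placeholder scores as above:

```python
# One-way between-groups ANOVA over the four conditions.
# Assumes scipy; score lists are illustrative placeholders.
from scipy import stats

groups = [
    [5, 4, 5, 4, 5],  # no sound
    [4, 4, 3, 5, 4],  # single tone
    [4, 5, 4, 4, 4],  # musical earcon
    [5, 4, 4, 5, 4],  # spearcon
]

f_obtained, p = stats.f_oneway(*groups)

# Critical F at alpha = .05 with df = (4 - 1, 20 - 4) = (3, 16),
# which is approximately 3.24.
f_critical = stats.f.ppf(0.95, dfn=3, dfd=16)

print(f"F(3,16) = {f_obtained:.3f}, p = {p:.3f}; F_critical = {f_critical:.2f}")
```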
 
Effect sizes for the groups, calculated using eta-squared, were: pleasure η² = .15, arousal η² = .23, and satisfaction η² = .24. This means our treatments explain 15% of the variation in the sample's pleasure ratings, 23% of the variation in arousal ratings, and 24% of the variation in satisfaction ratings. By Cohen's conventions these are medium-to-large effects, so their failure to reach statistical significance is best read alongside our small sample: the study was likely underpowered to detect differences of this size, a limitation we return to below.
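For reference, eta-squared can be recovered directly from an F statistic and its degrees of freedom; a minimal sketch using the pleasure result above:

```python
# eta^2 = (F * df_between) / (F * df_between + df_within)
def eta_squared(f: float, df_between: int, df_within: int) -> float:
    return (f * df_between) / (f * df_between + df_within)

# Pleasure: F(3,16) = 0.928 yields an eta-squared of about .15.
print(round(eta_squared(0.928, 3, 16), 2))  # 0.15
```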
 
At the 0.05 significance level, we fail to reject the null hypothesis. We cannot conclude that using spearcons and musical earcons as acknowledgement sounds results in a more positive user perception of a VPA interaction than a single tone or no feedback sound, much less that users prefer the acknowledgement sounds in our predicted decreasing order.
 
LIMITATIONS & RECOMMENDATIONS
While our statistical findings may seem grim, we believe much of this can be explained by experimental limitations. Our research should serve as a pilot study for the development of future experimental designs in this domain. Specifically, we suggest running a similar study, with a larger sample size, over a period of weeks in order to dilute the novelty effect. Participants in such a study could interact with a VPA in the context of daily life, perhaps by using a prototype installed on their personal mobile device. These participants would then fill out a more personalized questionnaire about their experience with the VPA. Addressing these and the other issues covered in the limitations section below can help future researchers conduct more focused studies.
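One way to size such a follow-up is a prospective power analysis. Below is a sketch assuming the statsmodels package, treating the pleasure effect size observed above purely as a planning value; the actual target would depend on the smallest effect a future team considers meaningful.

```python
# Estimate the total sample needed for 80% power at alpha = .05
# across four conditions. Assumes statsmodels; the eta-squared value
# is the pleasure effect observed above, used only for planning.
from math import sqrt
from statsmodels.stats.power import FTestAnovaPower

eta_sq = 0.15
cohens_f = sqrt(eta_sq / (1 - eta_sq))  # convert eta^2 to Cohen's f

n_total = FTestAnovaPower().solve_power(
    effect_size=cohens_f, alpha=0.05, power=0.8, k_groups=4
)
print(f"Approximately {n_total:.0f} participants in total")
```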
 
Limitations
Considering the limitations of our experiment is essential to interpreting the results and suggesting future research opportunities. Some of our limitations stem from experimental design: because the experiment was conducted in a public library on a university campus, we had difficulty reserving a consistent study room, which led to changes in our study environment. For instance, one of the three rooms we used barely fit three people, while another was noisy due to sound from neighboring rooms. In future research, it is important to standardize the lab setting to ensure a consistent experience among participants.
 
Additionally, the Wizard of Oz technique that we used may have distracted participants. Given more time and resources, it would have been preferable to design the prototype so that Wizard of Oz techniques were not required, or to distance the participant from the second researcher more than our environmental constraints allowed. While the Bluetooth speaker allowed for better Wizard of Oz manipulation, it also softened each audio file's playback and cut off the first part of each sound. Audio testing with different technology could ensure that the full auditory feedback plays clearly.
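One simple mitigation for the clipping, assuming the pydub library and hypothetical file names, is to pad each feedback file with a short silent lead-in so the speaker's wake-up delay swallows the padding rather than the sound:

```python
# Prepend 300 ms of silence to a feedback file so a Bluetooth
# speaker's wake-up delay does not clip the start of the sound.
# Assumes pydub; file names are hypothetical.
from pydub import AudioSegment

sound = AudioSegment.from_wav("feedback.wav")
padded = AudioSegment.silent(duration=300) + sound
padded.export("feedback_padded.wav", format="wav")
```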
 
Our sample was another source of limitations. One serious limitation was the small sample size, due to the limited time we had to recruit for and conduct the study. We used convenience sampling and recruited our twenty participants in the library by offering snacks as compensation. The sample was mostly college students, introducing selection bias and limiting generalizability. A few participants did not speak English as a first language, potentially causing distraction in the form of task anxiety and requiring deeper concentration on the questions and answers rather than on the overall VPA experience. Our demographic survey did not include a question about first language, so we do not know whether this affected the user experience. Future researchers should recruit a larger sample, perhaps through online means or improved compensation, and control for language familiarity.
 
Recruiting in the library may have biased our sample toward helpfully inclined people, since we framed participation as a request for help with our study. Our researchers personally recruited each participant, talked with them before the experiment, and were present during the entire experiment and survey process. Their kindness in volunteering, combined with our proximity throughout the study, may have translated into higher satisfaction scores because participants wanted to give us positive feedback about our VPA prototype. One participant, for example, noticeably grimaced in response to our auditory feedback yet strongly agreed that the interaction was satisfying; after leaving the room, this participant told another researcher that the auditory tone was unpleasant. In a similar future experiment, it would help to make clear to participants that the researchers are testing the interaction and have no personal stake in the prototype. We also suggest recruiting in a more anonymized manner to minimize personal interaction with researchers.
 
Our assignment of participants to conditions was also not randomized. Though we assigned participants based on order of arrival to ensure that each group had the same number of participants, with a larger sample it would be better to randomize assignment and remove any unexpected biases in this area, as sketched below.
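A minimal sketch of such balanced random assignment, using only Python's standard library:

```python
# Build a balanced schedule (five slots per condition), then shuffle
# so arrival order no longer determines a participant's condition.
import random

conditions = ["no sound", "single tone", "musical earcon", "spearcon"]
schedule = conditions * 5
random.shuffle(schedule)

for participant, condition in enumerate(schedule, start=1):
    print(f"P{participant:02d}: {condition}")
```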
 
We saw some novelty effects that might have introduced bias into our results. Some participants were very impressed with the basic fact that the prototype was “working”. Perhaps if they experienced this interaction over a longer period of time, the auditory feedback would subconsciously have a greater impact on the emotions and opinions associated with the interaction. Further research could explore whether a sound is perceived as more or less pleasant after repetition. We also recommend a slightly longer study in which participants first become familiar with the technology before the evaluation portion, potentially overcoming the initial excitement of using a new tool.
 
Lastly, our methods of measurement require refinement for use in future research. We suspect there are ways to obtain a more sensitive measurement from participants. Though we took the data from the SAM and Likert scale survey at face value, we recognize that more specific, more personally phrased questions could have given us more precise data. For instance, instead of our statement “The virtual personal assistant interaction was a satisfying interaction”, a more personally framed statement that indirectly assesses satisfaction with the entire experience, such as “I would install this virtual personal assistant on my smartphone”, might yield more useful results. Future researchers have the opportunity to fine-tune the post-interaction questions to better understand each participant's experience.