Practical test of voice assistants


This article describes a practical test of voice assistants and its results. Four voice assistants established on the market were tested with identical test utterances, and the results are shown below.
Voice assistants have their strengths in everyday question-and-answer situations. Their performance decreases when complex issues are uttered.

Test concept of a smartphone test of voice assistants

A basic principle in software testing says "no test without requirements" (Fünf goldene Regeln für erfolgreiches Testmanagement, de; Five golden rules for successful test management, en; heise.de).
We came up against this principle as soon as we started the practical test. In fact, we could only guess which requirements underlie the voice assistants.

Generally, existing tests of voice assistants refer to smart speakers. In addition, they focus on benefits and are based on user ratings from sales platforms. It is not apparent to what degree the existing test approaches are comparable.

The focus of Speech & Phone is speech recognition, so it suggests itself to evaluate voice assistants by their speech recognition. That is why the following sentence became our credo for this project.

With the practical test of voice assistants, we examine the test candidates carefully and leave the explorative test approach behind.
Diethelm Dahms, Speech & Phone

It is quite clear that voice assistants bring good things into everyday life. This does not need to be verified. For this reason, we focused on the linguistic competence of voice assistants, with the following aspects:

Speech input as the transformation of acoustic signals into text
Speech output as the transformation of written language into acoustic signals
Dialogue ability as the interaction in a communication between human and machine

Explorative vs. deterministic test approach

Explorative test approaches for voice assistants start from the subjective perception of the people using them. This approach is an excellent way to determine the abilities of a system. However, it results in a perception bias (Wahrnehmungsverzerrung, de) that is hard to resolve, combined with the wasps' nest principle (Wespennestprinzip, de), that is to say the confirmation bias. People who favour a system will find easy explanations for discrepancies. Conversely, people who find discrepancies will use them as reasons for rejection. Software quality is overrated in the one case and underrated in the other.

Test method in practical test of voice assistants

Test analysis in practical test of voice assistants

Following a deterministic test approach, we first captured the functions of voice assistants. In doing so, we found the following functions:

Everyday organisation
Music and media
Travel
Weather
Knowledge questions

When looking at smartphone voice assistants, the first glance is biased because voice assistants are identified with smart speakers. The main function of smart speakers is the playback of media. That is why all other functions are perceived as a bonus. Since the area of application of the voice assistants in the practical test is smartphone and desktop, the focus is refined to the point that voice assistants should not only be able to understand speech but also express themselves clearly.

Starting from the functions mentioned above, we determined the test parameters during the analysis phase of the test process. The following parameters were included in the test case design as significant criteria:

wording type
use of adverbial determinations
utilisation of idioms
use of imagery in the speech
express necessities
ambiguous expressions

Test preparation of practical test of voice assistants

✒ Information
The practical test of voice assistants was executed in German only.
The following results refer to the findings in the German language.
The translated sentences only illustrate what was found during the test.

Concrete test cases were derived from the abstract test cases according to the defined test parameters. Within the function knowledge questions, multiple concrete test cases were created from each abstract test case. The first examples refer to the time of day. We used the word Rock for ambiguous expressions (translator's comment: in German, Rock means skirt, rock music, and, in outdated usage, coat). By the way, what happens in this case is surprising. The article "fehlende Erkennung von Sprachen" (de, missing recognition of language) was written about it.

Test parameters from the practical test of voice assistants

Parameter	Testing utterance
wording type	de Wie spät ist es?
	en What’s the time?
	de Sag mir, wie spät es ist.
	en Tell me what time it is.
	de Kannst du mir sagen, wie spät es ist?
	en Can you tell me what time it is?
adverbial determinations	de Wie spät ist es in Tokio?
	en What’s the time in Tokyo?
use of idioms	de Was hat die Uhr geschlagen?
	en What did the clock strike?
imagery in speech	de Was zeigt die Uhr?
	en What is the clock showing?
necessities	de Ich habe die Zeit vergessen.
	en I forgot the time.
ambiguous expressions	de Zeige Rock Bilder.
	en Show rock images. (see explanation above)
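
The parameter table above can be held as plain data, so that each concrete test case stays linked to its abstract test parameter. The following is a minimal sketch in Python, not the tooling used in the original test. The names `UtteranceCase`, `UTTERANCES`, and `build_test_cases` are hypothetical, and only the German utterances from the table are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UtteranceCase:
    parameter: str  # abstract test parameter from the table
    utterance: str  # concrete German test utterance


# Utterances copied from the table above; the data layout itself is an assumption.
UTTERANCES = {
    "wording type": [
        "Wie spät ist es?",
        "Sag mir, wie spät es ist.",
        "Kannst du mir sagen, wie spät es ist?",
    ],
    "adverbial determinations": ["Wie spät ist es in Tokio?"],
    "use of idioms": ["Was hat die Uhr geschlagen?"],
    "imagery in speech": ["Was zeigt die Uhr?"],
    "necessities": ["Ich habe die Zeit vergessen."],
    "ambiguous expressions": ["Zeige Rock Bilder."],
}


def build_test_cases(utterances):
    """Expand the parameter -> utterances mapping into a flat list of cases."""
    return [
        UtteranceCase(parameter, utterance)
        for parameter, items in utterances.items()
        for utterance in items
    ]


cases = build_test_cases(UTTERANCES)
print(len(cases))  # 8 concrete cases from 6 parameters
```

Keeping the parameter next to each utterance makes it possible to report results per parameter later, not only per function.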

Test execution of practical tests of voice assistants

From this point, we recorded 83 test cases derived from the five functional areas and executed them for four voice assistants. The test field included the following voice assistants:

Voice Assistant	Platform	Operating System
Alexa	Smartphone	Android 10
Cortana	Desktop	Windows 10
Google-Assistant	Smartphone	Android 10
Siri	Smartphone	iOS X

The distribution of test cases according to the functions is shown in the following table and diagram.

pie title Distribution of utterances
    "Everyday"   : 24
    "Media"     : 14
    "Travel"    : 18
    "Weather"   : 13
    "Knowledge" : 31

Function	Share of sentences
Everyday	24%
Media	14%
Travel	18%
Weather	13%
Knowledge	31%

Test results of the practical test of voice assistants

Summary of test results

To summarize: voice assistants are usable as a speech entry system for web search forms. This is quite suitable and sufficient for most use cases. At the same time, it is troublesome when the spoken answer merely refers to the result of a web search. This behaviour is usable in multimodal environments, but not in barrier-free applications. Graphical interfaces are therefore not superseded, yet they are not planned for consistently either. This is remarkable because voice user interfaces are better at speech input than at speech output. This design keeps the hands free, but not the eyes.

For complex requests, the assistants are nothing more than the speech input field of a search engine.
Amos Dahms, Analyst at Speech & Phone

Usable linguistic basic skills of voice assistants

At the same time, we observed that voice assistants master basic skills and refer to a fee-based package when asked for special functions. In this case, they state the missing function in their answer, and it seems they are employed as sales representatives for the provider's fee-based upgrades. Voice assistants are a freemium offer (Freemium-Angebot, de).

Missing distinct linguistic competence

The linguistic competence (Sprachkompetenz, de) and the linguistic performance (Sprachleistung, de) of voice assistants are only weakly developed, both for speech input and for response behaviour.

Example
The answer to the question "What is the temperature in the lake?" (Wie ist die Temperatur im See?) contains details on the weather at the current location, the weather in a city named "See" (translator's note: See is the German word for lake), the anomaly of water in lakes, and the biology of lakes.

Missing dialogue ability

Especially in situations like this one, the limited dialogue ability of voice assistants is painfully noticeable. In the example above, it would have been better to check back which weather or which location was meant. By waiving this check back, voice assistants damage their reputation of being perceived as assistants. In my opinion, they only come across as know-it-alls.
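
The check back described above can be illustrated with a small sketch. It is not how any of the tested assistants work. It assumes a hypothetical word list, `AMBIGUOUS_TERMS`, filled only with the two ambiguous words mentioned in this article, and a hypothetical function `check_back`. A real assistant would need context and proper language understanding instead of a lookup table.

```python
# Hypothetical ambiguity lexicon; both entries come from examples in this article.
AMBIGUOUS_TERMS = {
    "see": ["lake", "city named See"],
    "rock": ["skirt", "rock music", "coat (outdated)"],
}


def check_back(utterance):
    """Return a clarifying question if an ambiguous term occurs, else None."""
    # Very simple tokenisation: split on spaces and strip punctuation.
    words = [word.strip(".,?!").lower() for word in utterance.split()]
    for word in words:
        readings = AMBIGUOUS_TERMS.get(word)
        if readings:
            return "Did you mean " + " or ".join(readings) + "?"
    return None


print(check_back("Wie ist die Temperatur im See?"))
print(check_back("Wie spät ist es?"))
```

The point of the sketch is the control flow: when more than one reading exists, the system asks instead of answering all readings at once.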

Comparison of passed and failed test cases

The following table shows the share of passed and failed test cases per voice assistant.

Voice Assistant	Passed	Failed
Alexa	60 %	40 %
Google-Assistant	63 %	37 %
Cortana	38 %	62 %
Siri	66 %	34 %
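
For readers who want to work with these figures, the table can be checked and ranked with a few lines of Python. This is only a sketch over the percentages printed above; the name `RESULTS` is an assumption, and no raw test data is used.

```python
# Passed/failed shares in percent, copied from the table above.
RESULTS = {
    "Alexa": (60, 40),
    "Google-Assistant": (63, 37),
    "Cortana": (38, 62),
    "Siri": (66, 34),
}

# Sanity check: passed and failed shares should add up to 100 % per assistant.
for name, (passed, failed) in RESULTS.items():
    assert passed + failed == 100, name

# Rank the assistants by their share of passed test cases, best first.
ranking = sorted(RESULTS, key=lambda name: RESULTS[name][0], reverse=True)
print(ranking)
```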

Differences per function

The share of differences, that is, failed test cases, varies to a different extent across the functions Everyday, Media, Travel, Weather, and Knowledge.

Function	Variation
Everyday	18 % – 24 %
Media	13 % – 27 %
Travel	11 % – 27 %
Weather	18 % – 27 %
Knowledge	12 % – 27 %

Differences in error classes of results in the practical test of voice assistants

Taking a closer look at the root causes of the differences, we find the following categories:

Input: The linguistic utterance is not interpreted or is interpreted incorrectly.
Output: Deviations from standard language were encountered in the output.
Dialogue: Check backs are missing when the input utterance is ambiguous or unclear.
Backend: The best dialogue system cannot be better than the data the backend delivers.

Share of error classes in the practical test of voice assistants

The table shows the share of each error class across all voice assistants in total.

Error class	Share of differences
Input	71 %
Output	1 %
Dialogue	13 %
Backend	15 %
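
The four error classes and their shares can be expressed as a small data model. The sketch below assumes the hypothetical names `ErrorClass` and `SHARES` and only uses the percentages from the table above. It checks that the shares are complete and identifies the dominant class.

```python
from enum import Enum


class ErrorClass(Enum):
    INPUT = "Input"
    OUTPUT = "Output"
    DIALOGUE = "Dialogue"
    BACKEND = "Backend"


# Share of differences per error class in percent, copied from the table above.
SHARES = {
    ErrorClass.INPUT: 71,
    ErrorClass.OUTPUT: 1,
    ErrorClass.DIALOGUE: 13,
    ErrorClass.BACKEND: 15,
}

total = sum(SHARES.values())             # should cover all differences
dominant = max(SHARES, key=SHARES.get)   # error class with the largest share
print(total, dominant.value)
```

In this data, the input class dominates, which matches the observation that most differences arise when the utterance is interpreted.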

Conclusion

Finally, the inglorious task remains for testers to put their finger in the wound. The hunting fever awakens as soon as the first difference, the first stumble, is encountered. At the same time, everyone involved in the test knows:

A diamond only shines, if it has been cut.
Zakhar Bron

Voice assistants are established on the market and in the mainstream. At the same time, the breadth of the market creates additional requirements. As long as voice assistants are intended for specific functions such as banking or telephone exchange, users can overlook the missing linguistic performance or linguistic competence. This is no longer possible as market penetration increases. Dialogue ability, the capability to prepare text for spoken answers in standard language, and the ability to read web pages aloud then come into play.

Presentation of the results from the practical test of voice assistants

The slide collection with a catchy summary can be requested by
mail to contact2019@speech-and-phone.de.


Image sources

Display Dummy: geralt | Pixabay