gms | German Medical Science

GMS Zeitschrift für Audiologie — Audiological Acoustics

Deutsche Gesellschaft für Audiologie (DGA)

ISSN 2628-9083

Modeling and verifying the test-retest reliability of the Freiburg monosyllabic speech test in quiet with the Poisson binomial distribution

Research Article

  • corresponding author Inga Holube - Institute of Hearing Technology and Audiology, Jade University of Applied Sciences, Oldenburg, Germany; Cluster of Excellence “Hearing4All”, Oldenburg, Germany
  • Alexandra Winkler - Institute of Hearing Technology and Audiology, Jade University of Applied Sciences, Oldenburg, Germany; Cluster of Excellence “Hearing4All”, Oldenburg, Germany
  • Ralph Nolte-Holube - Institute of Hearing Technology and Audiology, Jade University of Applied Sciences, Oldenburg, Germany

GMS Z Audiol (Audiol Acoust) 2020;2:Doc03

doi: 10.3205/zaud000007, urn:nbn:de:0183-zaud0000070

This is the English version of the article.
The German version can be found at: http://www.egms.de/de/journals/zaud/2020-2/zaud000007.shtml

Published: March 27, 2020

© 2020 Holube et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Abstract

The test-retest reliability of the Freiburg monosyllabic speech test was modeled using different methods. The results were compared to measurements from listeners with and without hearing impairment. The methods are based on the models of Thornton and Raffin as well as Altman et al. Both models were extended to account for differences in word recognition within the test lists by applying the Poisson binomial distribution, and to include the variance of the test-list results. The methods allow calculating the bounds of the 90% and 95% confidence intervals when using test lists with 20 words and double lists with 40 words. The data in the current report confirm these bounds. The confidence intervals are broadest for speech recognition scores of 50%. At this score and for test lists with 20 words, the 90% confidence interval has a width of ±20%, corresponding to ±6.0 dB, and the 95% confidence interval has a width of ±25%, corresponding to ±7.4 dB. Thus, when evaluating hearing-aid fittings, only differences exceeding this range can be regarded as significant.

Keywords: Freiburg monosyllabic test, speech intelligibility, binomial distribution, test-retest reliability, confidence


Introduction

In issue 1/2018, modeling of the reliability of the Freiburg monosyllabic test (FBE) [1] in quiet with the Poisson binomial distribution was presented [2], [3]. This distribution makes it possible to account for differences in word recognition within a test list. The result is a smaller confidence interval for the measurement results than with the simple binomial distribution, which assumes the same probability of recognition for each word in a list. The variance of the Poisson binomial distribution for 20-word test lists could be approximated by the variance of a simple binomial distribution based on 29-word test lists with the same degree of word recognition.

The studies in Holube et al. [2], [3] were limited to the calculation of the 95% confidence interval for the deviation from the true value of the measured value for a test list and, alternatively, for the deviation of the true value from the measured value for a test list. However, the published confidence intervals are not applicable for estimating test-retest reliability or to studies with two test lists used to compare two measurement conditions. Exactly this case exists when verifying hearing aids or other therapeutic treatments. The results of two measurements (e.g., with and without hearing aids), i.e. two scores, are compared, and the success of the treatment is derived from the difference between the two scores. The guideline for assistive devices (Hilfsmittelrichtlinie in German) [4] requires, e.g., for the FBE in quiet, an improvement in speech recognition of at least 20 percentage points with hearing aids as compared to the condition without hearing aids.

Thornton and Raffin [5] calculated the 95% confidence interval for the difference between two measurements by transforming the scores into a scale with homogeneous variance for all test results and then adding the variances of the two test results. Carney and Schlauch [6] essentially confirmed the results of this method using a different approach. They calculated the variance of the difference between two scores assuming binomially distributed scores. For each value for the score from the first measurement, they considered all possible score values for the second measurement. Results using the method of Thornton and Raffin [5], which requires the same recognition probability for all 20 words of a test list, were given by Winkler and Holube [7] based on Steffens [8] and compared with results of repeated measurements.

On the one hand, Dillon [9] argued that if test lists are equally recognizable and the listeners always behave similarly, and assuming the same recognition probability for each word, the width of the 95% confidence interval for the test-retest condition is overestimated when using the method of Thornton and Raffin [5]. This assumption is supported by the analysis in Winkler and Holube [7], since only 3.2% of the measurement data, i.e. less than the expected 5%, were outside the confidence interval according to Thornton and Raffin [5]. On the other hand, Dillon [9] pointed out that Thornton and Raffin’s [5] method can nevertheless be used to estimate the 95% confidence interval, since two effects cancel each other out. When different word recognition is considered by applying the Poisson binomial distribution according to Hagerman [10], the 95% confidence intervals become narrower. Intra-individual variability (e.g., fluctuations in attention) makes them wider again, especially with a longer time interval between the measurements. As an additional source of variance, Dillon [9] pointed out possible differences among test lists. In speech audiometry, in contrast to Winkler and Holube [7], repeated measurements are generally not made with the same test lists. The 95% confidence interval for the test-retest reliability widens when different test lists are used, because the mean scores of the test lists differ.

For the present analysis, measurements from Baljic et al. [11] and Holube et al. [2], [3], which for each subject included the results of five test lists at each of four levels, were interpreted in terms of a test-retest experiment, and the test-retest reliability was evaluated. All measurements were performed within one session. Therefore, only the short-term test-retest reliability was investigated, but not the test-retest reliability over a longer period of time that according to Dillon [9], would probably result in broader confidence intervals. For comparison with the measurement data, the bounds for the 95% and the 90% confidence intervals were modeled using different methods. The methods are based on the Poisson binomial distribution used in Holube et al. [2], [3]. Additionally, the variability of the test lists was modeled. Intra-individual variances of the participants were neglected due to the short time intervals between the measurements.


Methods

Experimental data

The measurement methods are summarized here only briefly. For a detailed description refer to Holube et al. [2], [3].

In 80 young participants having normal hearing abilities (hereinafter named normal-hearing participants), speech recognition was determined as the percentage score for the Freiburg monosyllables in quiet and at four levels (17.5, 23.5, 29.5, and 35.5 dB SPL), with each of five test lists comprising 20 words (n=20). In 40 older participants with hearing impairment (hereinafter named hearing-impaired participants), the levels 65, 80, 90, and 95 dB SPL were used in the same procedure. However, only 65 and 80 dB SPL were included in the analysis, because at the two higher levels, many scores achieved 100%. All measurements of a given participant were performed within one session.

The scores of the five test lists presented to each participant at a fixed level were interpreted in pairs as test-retest combinations. Each pair consisted of one test list and another list presented later, i.e. (1; 2), (1; 3), (1; 4), (1; 5), (2; 3), (2; 4), (2; 5), (3; 4), (3; 5), (4; 5). This resulted in 3,200 test-retest pairs for the normal-hearing and 800 test-retest pairs for the hearing-impaired participants. The number of test-retest pairs decreased when the conspicuous test lists of Baljic et al. [11] were excluded (see Table 1 [Tab. 1]). In another variant, pairs of test lists were combined into double lists of n=40 words. For the analysis of test-retest reliability, the double lists were combined into test-retest pairs such that no single list appeared twice within a pair, i.e. (1+2; 3+4), (1+2; 3+5), (1+2; 4+5), (1+3; 2+4), (1+3; 2+5), (1+3; 4+5), (1+4; 2+3), (1+4; 2+5), (1+4; 3+5), (1+5; 2+3), (1+5; 2+4), (1+5; 3+4), (2+3; 4+5), (2+4; 3+5), and (2+5; 3+4). This resulted in 4,800 test-retest pairs for the normal-hearing and 1,200 test-retest pairs for the hearing-impaired participants when all 20 test lists were used. As a variant, the conspicuous test lists of [11] were also excluded for these double lists (see Table 1 [Tab. 1]).
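The pairing scheme and the resulting pair counts can be reproduced with a short sketch. The variable and function names are illustrative only; the participant and level counts are those stated above (80 normal-hearing participants at four levels, 40 hearing-impaired participants at the two analyzed levels).

```python
from itertools import combinations

LISTS = range(1, 6)  # the five test lists presented at one level

# Single lists (n=20): every list paired with each list presented later.
single_pairs = list(combinations(LISTS, 2))

# Double lists (n=40): two double lists that share no single list,
# each unordered combination counted once.
doubles = list(combinations(LISTS, 2))
double_pairs = [(a, b) for a, b in combinations(doubles, 2) if not set(a) & set(b)]

def total_pairs(participants, levels, pairs_per_level):
    """Number of test-retest pairs over all participants and levels."""
    return participants * levels * pairs_per_level

print(len(single_pairs), len(double_pairs))
print(total_pairs(80, 4, len(single_pairs)), total_pairs(40, 2, len(single_pairs)))
print(total_pairs(80, 4, len(double_pairs)), total_pairs(40, 2, len(double_pairs)))
```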

Calculation methods

For a given percentage score pmess1 (test), the question was in which critical range the retest percentage score pmess2 must lie so that the difference pmess1 − pmess2 is not significantly different from zero in a two-sided test at the α=5% level. A two-sided test means that the retest score may be less than or greater than the first score. Thus, 2.5% of the retest scores are expected below and 2.5% above the 95% confidence interval around the first score. Different methods exist in the literature for calculating the 95% confidence interval, two of which (Thornton and Raffin [5] and Altman et al. [12]) are considered in the current contribution. Both methods were first reproduced and then applied to the available measurement data with n=20 and n=40 words per test list (i.e. single test lists and double lists). Afterwards, modifications of these methods are presented that take into account the variability of single-word recognition as well as the variability of the mean recognition of different test lists.

Method 1: Critical differences according to Thornton and Raffin

Thornton and Raffin [5] proposed calculating a 95% confidence interval for the assessment of test-retest reliability by the following method: The number X of correct responses for n presented words in a test list is considered to be a random variable. It is assumed to be binomially distributed with B(n,p,X=k). Here p is the probability that one word in the list will be correctly recognized. Here and below, probabilities are given in percent. The expected value of X is thus E(X) = n·p. Speech recognition in percent (score) is the random variable pmess = X/n. Its expected value is E(pmess)=p and its variance is Var(pmess) = p(1−p)/n. This variance reaches its maximum at p=50%. At the borders p=0 and p=100%, the variance is zero.

For the test-retest reliability, a confidence interval for the difference pmess1 − pmess2 of two scores is of interest. For this purpose, the random variables X1 and X2 are first transformed (according to Equation 3 in [5]) using Equation 1 to an angle θ(X,n).

Equation 1

$$\theta(X,n) = \arcsin\sqrt{\frac{X}{n+1}} + \arcsin\sqrt{\frac{X+1}{n+1}}$$

The random variable θ thus defined has a variance Var(θ) that is approximately independent of p. Thornton and Raffin [5] chose the approximations Var(θ) ≈ 1/(n+0.5) for n≥50 and Var(θ) ≈ 1/(n+1) for 10<n<50. The two random variables θ1=θ(X1,n) and θ2=θ(X2,n) have the same variance Var(θ) within this approximation. Assuming that θ1 and θ2 are statistically independent, the variance of the random variable Δθ = θ1 − θ2 is the sum of the variances, i.e. Var(Δθ)=2Var(θ). For Δθ, a normal distribution with the variance 2Var(θ) is assumed. The 95% confidence interval for θ2 at the score pmess1 is thus θ1 ± 1.96·√(2Var(θ)). The θ2 bounds of the 95% confidence interval calculated in this way are transformed back to X2 bounds to obtain the pmess2 bounds. Thus, if pmess1 = X1/n and pmess2 = X2/n indicate the scores in the test and in the retest measurement, respectively, this method can be summarized as follows:

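Equation 2

$$p_{\mathrm{mess2,u}} = \frac{1}{n}\,\theta^{-1}\!\left(\theta(X_1,n) - z\sqrt{2\,\mathrm{Var}(\theta)},\; n\right)$$

Equation 3

$$p_{\mathrm{mess2,o}} = \frac{1}{n}\,\theta^{-1}\!\left(\theta(X_1,n) + z\sqrt{2\,\mathrm{Var}(\theta)},\; n\right)$$

with z=1.96 and

Equation 4

$$\mathrm{Var}(\theta) = \frac{1}{n+1}$$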

These bounds were calculated for all scores pmess1 of interest between 0 and 100%. The inverse function X = θ⁻¹(θ,n) of Equation 1 was calculated numerically.
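The procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the transform is taken to be the double arcsine form θ(X,n) = arcsin√(X/(n+1)) + arcsin√((X+1)/(n+1)), the variance is taken as 1/(n+1) (the approximation for 10<n<50), and the inverse is found by bisection with X treated as continuous. The optional parameter `n_eff` is hypothetical; it only swaps the list length in the variance term for an effective list length, as the modified methods described later do.

```python
import math

def theta(x, n):
    """Angular transform of x correct responses out of n words (assumed form)."""
    return math.asin(math.sqrt(x / (n + 1))) + math.asin(math.sqrt((x + 1) / (n + 1)))

def theta_inverse(t, n, tol=1e-10):
    """Invert theta by bisection; x is treated as continuous on [0, n]."""
    if t <= theta(0, n):
        return 0.0
    if t >= theta(n, n):
        return float(n)
    lo, hi = 0.0, float(n)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if theta(mid, n) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def retest_bounds(x1, n, z=1.96, n_eff=None):
    """Lower and upper retest score (as fractions) for x1 correct out of n."""
    n_var = n if n_eff is None else n_eff
    # Assumption: Var(theta) = 1/(n_var + 1); the difference of two scores doubles it.
    half_width = z * math.sqrt(2.0 / (n_var + 1))
    t1 = theta(x1, n)
    return theta_inverse(t1 - half_width, n) / n, theta_inverse(t1 + half_width, n) / n

lower, upper = retest_bounds(10, 20)
print(f"first score 50%: retest between {100 * lower:.1f}% and {100 * upper:.1f}%")
```

The returned bounds are unrounded; scores from a 20-word list can only take multiples of 5%, so the values still have to be rounded before they are compared with measured scores.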

Method 2: Critical differences according to Thornton and Raffin, with variable word recognition

If the individual words of a test list are recognized differently, the same recognition probability p for each word is no longer sufficient for the description. Each word has its own recognition probability, and the binomial distribution is replaced by the Poisson binomial distribution [10]. To account for the narrowing of the distribution of X in the Poisson binomial distribution relative to the simple binomial distribution, the variance of θ is now set to 1/(n'+1) instead of 1/(n+1). The value for n' is taken from [2], [3]; hence, n'=29 for n=20 and n'=58 for n=40.

Thus, method 2 is described by Equation 2 and Equation 3 together with

Equation 5

$$\mathrm{Var}(\theta) = \frac{1}{n'+1}$$

instead of Equation 4.
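The idea of an effective list length n' can be illustrated with a small sketch. For word-specific recognition probabilities p_i, the score X/n has the Poisson binomial variance Σ p_i(1−p_i)/n², which never exceeds the binomial variance p̄(1−p̄)/n at the same mean p̄. Equating it with p̄(1−p̄)/n' yields n'. The word probabilities below are hypothetical and are not FBE data, and the function names are illustrative.

```python
def poisson_binomial_pmf(probs):
    """Distribution of the number of correct words for word-specific probabilities."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # word missed
            nxt[k + 1] += mass * p       # word recognized
        pmf = nxt
    return pmf

def effective_list_length(probs):
    """List length n' of a simple binomial with the same mean score and variance.

    Not defined if every probability is exactly 0 or 1 (zero variance).
    """
    n = len(probs)
    p_mean = sum(probs) / n
    var_score = sum(p * (1.0 - p) for p in probs) / n**2  # variance of X/n
    return p_mean * (1.0 - p_mean) / var_score

# Hypothetical recognition probabilities for a 20-word list (not FBE data).
words = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50,
         0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
pmf = poisson_binomial_pmf(words)
print(round(effective_list_length(words), 1))
```

If all words share the same probability, the function returns the actual list length; the more the word probabilities differ, the larger n' becomes.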

Method 3: Critical differences according to Altman et al., with variable word recognition

Altman et al. [12] recommended an approach that corresponds to method 10 of [13]. This method will initially be presented unchanged. Then it is modified to take into account the variability of word recognition within a test list.

If a percentage score pmess for a single test list was measured, the 95% confidence interval for the true value p is in question. Wilson [14] made the following approach:

Equation 6

$$\left(p_{\mathrm{mess}} - p\right)^2 = z^2\,\frac{p\,(1-p)}{n}$$

with z=1.96. This is a quadratic equation for p. Its solutions u and o specify the lower and upper bounds, respectively, of the required confidence interval (see [2], [3]). For two scores pmess1 and pmess2, this yields the associated lower bounds u1 and u2 and the upper bounds o1 and o2. According to [12], the significance of the difference pmess1 − pmess2 is assessed as follows: If the first score pmess1 is greater than the second score pmess2, then to be significantly different at the 5% level, the difference pmess1 − pmess2 of the two scores must be larger than

Equation 7

$$\delta_u = \sqrt{\left(p_{\mathrm{mess1}} - u_1\right)^2 + \left(o_2 - p_{\mathrm{mess2}}\right)^2}\,.$$

To calculate the 95% confidence interval for the difference between the two scores, the variance for the higher score is added downwards and the variance for the lower score is added upwards. For the other case, namely that the second score is larger than the first score, the difference pmess2 − pmess1 must be correspondingly larger than

Equation 8

$$\delta_o = \sqrt{\left(o_1 - p_{\mathrm{mess1}}\right)^2 + \left(p_{\mathrm{mess2}} - u_2\right)^2}\,.$$

For each of the values of pmess1 of interest between 0 and 100%, this method provides a 95% confidence interval for the difference pmess2 − pmess1. For a given pmess1 (test), the score pmess2 (retest) lies with a probability of 95% between pmess1 − δu and pmess1 + δo. The six equations, i.e. the equations for u1, u2, o1, and o2 together with Equation 7 and Equation 8, must be solved for a given pmess1. There is no closed-form solution. Therefore, the equations were solved numerically by fixed-point iteration.

The calculation method described so far assumes the same recognition probability for every word within a test list. The variability of the single-word recognition leads to a reduction of the variance p(1−p)/n on the right side of Equation 6, as already described for method 2. This is now taken into account by replacing n by n'. The value for n' is again taken from [2], [3], i.e. n'=29 instead of n=20 and n'=58 instead of n=40.
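The fixed-point iteration can be sketched as follows. The sketch assumes the Wilson condition (pmess − p)² = z²·p(1−p)/n for the bounds u and o, and the critical differences δu = √((pmess1 − u1)² + (o2 − pmess2)²) and δo = √((o1 − pmess1)² + (pmess2 − u2)²), where pmess2 is the retest score at the respective bound. Function names and the fixed number of iterations are illustrative choices, not taken from the paper. Scores are passed as fractions, and n may be the effective length n'.

```python
import math

def wilson_bounds(p_meas, n, z=1.96):
    """Wilson score interval (u, o) for the true value behind a measured score."""
    denom = 1.0 + z * z / n
    centre = (p_meas + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_meas * (1.0 - p_meas) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half

def critical_differences(p1, n, z=1.96, iterations=200):
    """Fixed-point iteration for the retest bounds p1 - delta_u and p1 + delta_o."""
    u1, o1 = wilson_bounds(p1, n, z)
    delta_u = delta_o = 0.0
    for _ in range(iterations):
        p2_low = max(0.0, p1 - delta_u)      # candidate lower retest score
        _, o2 = wilson_bounds(p2_low, n, z)
        delta_u = math.sqrt((p1 - u1) ** 2 + (o2 - p2_low) ** 2)
        p2_high = min(1.0, p1 + delta_o)     # candidate upper retest score
        u2, _ = wilson_bounds(p2_high, n, z)
        delta_o = math.sqrt((o1 - p1) ** 2 + (p2_high - u2) ** 2)
    return delta_u, delta_o

d_u, d_o = critical_differences(0.5, 29)     # n' = 29 for 20-word lists
print(f"retest between {100 * (0.5 - d_u):.1f}% and {100 * (0.5 + d_o):.1f}%")
```

Calling the function with n=20 instead of n'=29 gives the unmodified method, which produces wider bounds.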

Method 4: Critical differences according to Altman et al., with variable word recognition and variable test list recognition

In addition to variable single-word recognition, the mean scores of the test lists of a speech test vary under otherwise identical conditions because the lists are composed of different words. If the test-list mean value for each test list could be exactly determined under given measurement conditions, this mean value would have a variance fn². This depends on the number n of words per test list and on the true value p. This variance contributes to the uncertainty of the true value of p in Equation 6. Thus, taking into account both variable single-word recognition (replacing n by n') and variable test-list recognition (adding fn² to the variance of p), Equation 6 becomes:

Equation 9

$$\left(p_{\mathrm{mess}} - p\right)^2 = z^2\left(\frac{p\,(1-p)}{n'} + f_n^2\right)$$

with z=1.96. If the variance fn² is known, the further steps of the method according to [12], as described for method 3, can be carried out.

To determine fn², the sample variance of the measured test list mean values is calculated. nL test lists of n words are considered with the single-word recognition pji, i=1…n, j=1…nL. The percentage score of the test list j is thus the mean value pj = (1/n)·Σi pji. With the scores averaged over all words in all test lists

Equation 10

$$\bar{p} = \frac{1}{n_L\,n}\sum_{j=1}^{n_L}\sum_{i=1}^{n} p_{ji},$$

the sample variance fn² of the test list means is:

Equation 11

$$f_n^2 = \frac{1}{n_L-1}\sum_{j=1}^{n_L}\left(p_j - \bar{p}\right)^2.$$


The variance of single-word recognition is

Equation 12

$$f_1^2 = \frac{1}{n_L\,n-1}\sum_{j=1}^{n_L}\sum_{i=1}^{n}\left(p_{ji} - \bar{p}\right)^2.$$

The relationship between the variance of the recognition of a single word and the variance of the mean value of n words randomly assembled into test lists is

Equation 13

$$f_n^2 = \frac{f_1^2}{n}.$$
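The relationship between the single-word variance and the variance of list means can be checked with a small simulation. The sketch assumes that both quantities are sample variances (denominator: number of values minus one) and that the variance of the list means equals the single-word variance divided by n on average. The word scores are hypothetical random numbers, not FBE data, and the number of realizations is kept small for speed.

```python
import random
from statistics import mean, variance

def list_mean_variance(word_scores, n):
    """Sample variance of the mean scores of consecutive lists of n words."""
    lists = [word_scores[i:i + n] for i in range(0, len(word_scores), n)]
    return variance([sum(lst) / len(lst) for lst in lists])

rng = random.Random(1)
# Hypothetical single-word recognition rates for 400 words (not FBE data).
words = [rng.random() for _ in range(400)]
f1_sq = variance(words)  # single-word variance, denominator N - 1

n = 20
realizations = []
for _ in range(2000):
    shuffled = words[:]
    rng.shuffle(shuffled)  # random assignment of the words to 20 lists
    realizations.append(list_mean_variance(shuffled, n))

print(mean(realizations), f1_sq / n)
```

A single realization can deviate considerably from the average, which is in line with the observation that a specific fixed set of lists need not follow the average of random word combinations.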

Figure 1 [Fig. 1] shows that this relationship is satisfied for randomly composed test lists with n=1, 20, 40 from the words of the FBE. The variances shown were averaged over 10⁶ realizations of randomly assembled test lists. However, the variances of the specific test lists of the FBE deviate significantly from the average result of a random combination of words. In addition, Figure 1 [Fig. 1] shows, as expected, that fn² is smaller in the vicinity of p=0 (almost no word is understood) and p=100% (almost all words are understood) than it is in the middle range around p=50%. The exact dependence of fn² on p is unknown. The approach chosen here is a parabola

Equation 14

$$f_n^2 = c^2\,p\,(1-p),$$

with a parameter c² to be determined. Thus Equation 9 can be written as

Equation 15

$$\left(p_{\mathrm{mess}} - p\right)^2 = z^2\,\frac{p\,(1-p)}{\tilde{n}}$$

with

Equation 16

$$\frac{1}{\tilde{n}} = \frac{1}{n'} + c^2.$$

If, in method 3, Equation 6 is replaced by Equation 15, then both the variability of the single-word recognition and the variability of the test-list mean values are taken into account.

The parameter c² was calculated from the measured single-word recognition pji as follows. For each of the four levels used, the average p̄ of single-word recognition was calculated according to Equation 10 and the variance fn² according to Equation 11. The values of the four pairs (p̄, fn²) depend on the selection of the test lists, on the word composition of the test lists, and on the test-list length n. A parabola fn² = c²·p̄(1−p̄) was fitted to the four pairs of values (p̄, fn²) by the method of least squares. This yielded the value for c². Three of the resulting parabolas are shown in Figure 1 [Fig. 1]. With the now-known value of c², the effective list length ñ was calculated using Equation 16. Table 2 [Tab. 2] shows the results for n=20 and for n=40. Since the FBE has 20 words per list, for calculations with n=40, all combinations of pairs of different lists were considered.
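The fit and the resulting effective list length can be sketched as follows. The sketch assumes a one-parameter parabola through the origin, fn² = c²·p(1−p), whose least-squares solution is c² = Σ fn²·x / Σ x² with x = p(1−p), and the relation 1/ñ = 1/n' + c². The four data pairs are hypothetical and only illustrate the calculation; they are not the measured values.

```python
def fit_c_squared(pairs):
    """Least-squares c^2 for f_n^2 = c^2 * p * (1 - p), given (p_mean, f_n^2) pairs."""
    num = sum(f * p * (1.0 - p) for p, f in pairs)
    den = sum((p * (1.0 - p)) ** 2 for p, _ in pairs)
    return num / den

def effective_length(n_prime, c_sq):
    """Effective list length from 1/n_eff = 1/n' + c^2."""
    return 1.0 / (1.0 / n_prime + c_sq)

# Hypothetical (mean score, variance of list means) pairs for four levels.
pairs = [(0.15, 0.0016), (0.40, 0.0029), (0.65, 0.0028), (0.85, 0.0015)]
c_sq = fit_c_squared(pairs)
print(c_sq, effective_length(29, c_sq))
```

Because c² is non-negative, the effective list length never exceeds n'; a larger spread of the list means shortens it.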

Method 5: Critical differences according to Thornton and Raffin, with variable word recognition and variable test-list recognition

To incorporate single-word variability, the variance in Equation 4 was reduced to that in Equation 5. Analogously, the variability of test-list recognition is now included by replacing Equation 5 with

Equation 17

$$\mathrm{Var}(\theta) = \frac{1}{\tilde{n}+1}.$$

Critical differences in a one-sided test

So far, the 95% confidence interval has been considered for two-sided tests. However, when using the FBE in hearing-aid fitting, it is assumed that hearing aids improve speech recognition, i.e. that a higher score is achieved in the second measurement (with hearing aids) than in the first measurement (without hearing aids). The statistical test for determining a significant difference between the two scores would then examine whether the error probability for the hypothesis that the second score is larger than the first score is less than 5%. This corresponds to the bounds of the 90% confidence interval and can be calculated using the same five methods by replacing z=1.96 with z=1.645. Although the problem is one-sided, for the sake of completeness, the limits of the 90% confidence interval for the second score are given symmetrically around the first score.
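The two z values follow from the quantiles of the standard normal distribution, which the following sketch illustrates with Python's standard library. The function names are illustrative.

```python
from statistics import NormalDist

def z_two_sided(confidence):
    """Quantile that leaves (1 - confidence)/2 in each tail."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)

def z_one_sided(alpha):
    """Quantile that leaves alpha in the upper tail only."""
    return NormalDist().inv_cdf(1.0 - alpha)

# Prints 1.96 1.645 1.645
print(round(z_two_sided(0.95), 3), round(z_two_sided(0.90), 3), round(z_one_sided(0.05), 3))
```

A one-sided test at the 5% level therefore uses the same quantile as the bounds of a two-sided 90% confidence interval.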

Critical differences in the level domain

The FBE determines speech recognition for a given speech level. Its accuracy is provided by the corresponding confidence interval for percentage scores. In contrast, adaptive methods such as the Oldenburg sentence test (OLSA, [15]) or the Göttingen sentence test [16] determine the signal-to-noise ratio or speech level for a given speech recognition score of (mostly) 50%, or even 80% (Speech Recognition Threshold, SRT). The accuracy of the sentence tests in the SRT is given as approx. ±1 dB ([17], [18]). For comparison, the confidence intervals for the percentage score p obtained from method 5 were converted into confidence intervals for the speech level L. For this purpose, the discrimination function given in [18] was solved for the speech level:

Equation 18

$$L = L_{50} + \frac{1}{4\,s_{50}}\,\ln\frac{p}{1-p}.$$

For the level L50 at p=50% and the slope s50 at this point, the median values L50=24.7 dB and s50=0.045/dB given in [11] were used.
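The conversion from score to level can be sketched as follows. The sketch assumes a logistic discrimination function p(L) = 1/(1 + exp(−4·s50·(L − L50))), which has the slope s50 at p=50%, and inverts it. The constants are the median values quoted above; the function names are illustrative.

```python
import math

L50 = 24.7   # dB SPL, median level at p = 50% quoted in the text
S50 = 0.045  # per dB, median slope at p = 50% quoted in the text

def score_from_level(level, l50=L50, s50=S50):
    """Logistic discrimination function (assumed form) with slope s50 at p = 0.5."""
    return 1.0 / (1.0 + math.exp(-4.0 * s50 * (level - l50)))

def level_from_score(p, l50=L50, s50=S50):
    """Inverse of the logistic function; undefined for p = 0 or p = 1."""
    return l50 + math.log(p / (1.0 - p)) / (4.0 * s50)

for p in (0.25, 0.50, 0.75):
    print(p, round(level_from_score(p) - L50, 1))
```

Because the function flattens towards 0 and 100%, the same change in percentage score corresponds to a larger change in level away from 50%, and bounds at exactly 0 or 100% have no finite level equivalent.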


Results

Comparison of calculation methods

To compare the results from methods 1–5, Figure 2 [Fig. 2] shows the 95% confidence interval of the second percentage score pmess2, given the value for the first percentage score pmess1. The bounds from method 1, which assumes the same recognition probability for each word in a test list, are farthest out, indicating the widest 95% confidence interval. When variable word recognition is included in methods 2 and 3, the 95% confidence intervals become narrower and the curves lie closest to the center. In methods 4 and 5, the variability of the test lists was taken into account. Thus, the 95% confidence intervals are again farther outside and almost coincide with those of method 1. There are only minor differences between the results of [5] and [12]. This is reflected in Figure 2 [Fig. 2], in which the bounds from methods 2 and 3, and from methods 4 and 5, lie close together.

Percentage scores of the FBE with 20 words per test list are possible only at intervals of 5%. Therefore, it is useful to conservatively round the bounds of the calculated 95% confidence intervals to multiples of 5%. These bounds for n=20 are given in Table 3 [Tab. 3]. Tab. A. 1 in Attachment 1 [Attach. 1] contains the corresponding bounds for n=40. The rounding partially increases the differences between the methods. However, the differences are at most 5% for both n=20 and n=40. The only exception is the difference between methods 1 and 3 at p=75% for the lower bound and p=25% for the upper bound at n=20, where the difference is 10%.

For methods 4 and 5, two variants are given in Table 3 [Tab. 3] and Tab. A. 1 in the Attachment 1 [Attach. 1]. When including all 20 test lists (designations 4 and 5), ñ=21.4 was used (see Table 2 [Tab. 2]). By omitting lists 5, 11, 12, and 15, i.e. with only 16 test lists, the effective list length increases to ñ=24.4. The corresponding bounds are given in columns 4/16 and 5/16. Omitting these four test lists reduces the variance of the test lists, and, consequently, the 95% confidence intervals become somewhat narrower.

Comparison with measurement data

The percentages of the measurements outside the 95% confidence intervals are given in Table 1 [Tab. 1]. The goal that 5% of the measurement data should be outside the confidence interval is closely approached by method 1 for both normal hearing (NH) and hearing-impaired (HI) participants and when using 20 or 40 words per test list. However, method 1 does not take into account the differences in word recognition, nor those among test lists, and thus tends to overestimate the width of the confidence interval. For methods 2 and 3, which account for differences in word recognition, approximately 9% of the measurements are outside the 95% confidence interval. The specified bounds are therefore too narrow. Methods 4 and 5, in contrast to methods 2 and 3, take the variability of the test lists into account and achieve the 5% target in the various measurement data variants for all 20 test lists up to a maximum deviation of 0.5%, and for the 16 test lists up to a maximum deviation of 1.1% for the hearing-impaired participants with double test lists.

Figure 3 [Fig. 3] shows the measurement data, together with the critical differences according to method 5. For a percentage score of 50%, the 95% confidence interval lies between 25% and 75% (see Table 3 [Tab. 3], columns "5"). When double test lists are used (n=40), the 95% confidence interval narrows to the range between 30% and 70% (see Tab. A. 1, columns "5" in the Attachment 1 [Attach. 1]).

One-sided test

In the Attachment 1 [Attach. 1], Tab. A. 2 and Tab. A. 3 show the rounded 90% confidence intervals for n=20 and n=40 for all methods. The percentage of data outside of these confidence intervals for NH and HI for all variants is shown in Table 4 [Tab. 4]. The criterion for the quality of the calculation method is that 10% of the data lies outside the calculated confidence interval. The results are qualitatively similar to those in Table 1 [Tab. 1]. While the bounds according to method 1 tend to be too wide, leading to less than 10% of the data outside the 90% confidence interval, methods 2 and 3 make the interval too narrow. The measurement results for normal hearing and hearing impaired participants can be better approximated using the methods 4 and 5 than when using the methods 2 and 3.

Figure 4 [Fig. 4] shows the measured data together with the 90% confidence interval for method 5. According to method 5 and for n=20, the 90% confidence interval at a hit rate of 50% covers the range between 30% and 70% (see Tab. A. 2 in the Attachment 1 [Attach. 1]). When using double test lists (n=40), the 90% confidence interval at this point is reduced and ranges between 35% and 65% (see Tab. A. 3 in the Attachment 1 [Attach. 1]).

Critical differences in the level domain

For comparison with the accuracy of sentence tests, Table 5 [Tab. 5] shows the limits of the confidence intervals transformed to the level domain with a speech recognition score of 50% and of 80% for single test lists (n=20) and for double test lists (n=40). The confidence intervals are narrower for n=40 compared to n=20, and narrower for the 90% confidence interval compared to the 95% confidence interval. With 80% speech-recognition rate, confidence intervals are wider than for a speech recognition rate of 50%. The width of the confidence intervals ranges from ±4.0 dB for n=40 with a speech recognition rate of 50% (90% confidence interval) to ±11.3 dB for n=20 with a speech recognition rate of 80% (95% confidence interval).


Discussion

Modeling speech recognition as a Bernoulli experiment, with different word recognition scores within the test lists, the Poisson binomial distribution was used to calculate the 90% and 95% confidence intervals using different methods. The methods of Thornton and Raffin [5] and Altman et al. [12] led to similar results. These two methods were extended by additional consideration of the test-list variance. With this approach, the methods met the criteria that approximately 5% and 10% of the measured data are outside the limits of the calculated confidence intervals.

Depending on the variant (single or double test lists, 90% or 95% confidence interval, all 20 or only 16 selected test lists), the confidence intervals at a percentage score pmess1=50% for the first measurement have a width of ±15% to ±25%. The guideline for assistive devices [4] requires an improvement of at least 20 percentage points for a hearing-aid fitting compared to the unaided measurement. At a percentage score of pmess1=50% for the first measurement, an improvement of 20 percentage points in the second measurement is only statistically significant if double test lists are used. When using 20 words per test list, an increase of the percentage score of 20 percentage points by hearing aids is not statistically significant, because the error probability for the decision that the hearing aids improve speech recognition is more than 5%. For a significant improvement to be inferred from a difference of 20 percentage points, both the unaided and the aided condition would have to be determined using double test lists. When using single test lists, an improvement of 20 percentage points in the second measurement can only be regarded as significantly different for a percentage score of 75% or above for the first measurement.

To narrow the confidence bounds, the four test lists that were conspicuous in Baljic et al. [11] may be omitted. Thus, the test-list variance would be reduced. However, there is no guarantee that for HI, in other German-speaking regions, or in other measurement configurations (e.g., in background noise), the same four test lists would still be outliers. An indication for deviations in conspicuous test lists could be that the confidence-interval bounds determined from the measurement data of the 16 selected test lists for the group of HI tended to be too broad. Thus, slightly less than the targeted 5% or 10% of the measurement data lay outside the confidence intervals. Even if all 20 test lists were used, the test list variance obtained from NH measurement data for modeling may be different for different groups of listeners or measurement conditions, leading to narrower or wider confidence intervals. For the measurement data of HI, however, the statement of Dillon [9], that HI have the same test-retest reliability as NH, was confirmed.

A comparison of the measurement results with the modeled confidence bounds also confirmed the conclusion of Dillon [9] that the bounds of Thornton and Raffin [5] according to method 1, i.e. when using the simple binomial distribution, can mimic the measured test-retest reliability relatively well. These bounds had already been specified for the FBE by Winkler and Holube [7] for n=20. By using the Poisson binomial distribution in methods 2 and 3, however, the confidence intervals became narrower. After considering test-list variance in methods 4 and 5, the intervals became wider again, so that the limits of method 1 are approached. It should be noted, however, that the variability between the participants discussed by Dillon [9] was not incorporated in the present study. A possible reason for the negligible variability of the participants could be that two measurements within the same session were compared. Therefore, only the short-term reliability for test and retest was examined. The probably small intra-individual variance of participants within one session may be below test-list variance and might have been negligible here. Reliability has not been studied over an extended period, i.e., over several sessions, so that changes due to the variables “physical and mental state” of the participants were not measured. Another explanation for the negligible variability between the participants could be that individual differences were not sufficiently considered [9]: To apply the Poisson binomial distribution, only the mean speech-recognition values of each single word were used. In individual participants, speech recognition of the words may have differed even more clearly, and methods 2 and 3 would have led to even narrower confidence intervals. In that case, an additional source of variance, e.g., the intra-individual variance, would be necessary to model the confidence intervals matching the measurement data.

For comparison with the sentence tests, the confidence intervals obtained from method 5 for percentage scores of 50% and 80% were transformed to confidence intervals for speech level. Using single test lists (n=20) with a speech recognition score of 50%, the 90% confidence interval has a width of ±6 dB. The confidence interval for the FBE is thus considerably wider than the confidence intervals for the adaptive sentence tests of about ±1 dB ([17], [18]). Hearing aids would need to improve the speech level by more than 6 dB at a speech recognition score of 50% in order to achieve a significant effect. For example, if the hearing aid only improved the speech level by 3 dB, then the sentence tests would show a significant difference, and thus a difference in efficacy, but the FBE would not. The goal of an improvement of more than 6 dB appears to be easily achievable for speech recognition tests in quiet. However, whether this requirement can be transferred to an improvement by 6 dB in signal-to-noise ratio for FBE in noise is still unclear. Even if the same lists with n=20 and n=40 words were used in noise, the variance in word recognition and in test-list recognition may differ from the FBE in quiet, and, therefore, deviating confidence bounds may result.
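The transformation from score bounds to level bounds can be sketched by inverting a discrimination function. The following example assumes a logistic function. The function names, the speech recognition threshold, and the slope are hypothetical illustration values, not the discrimination function or the parameters used in this study, so the printed numbers are not expected to reproduce the ±6 dB reported above.

```python
import math


def score_from_level(level_db, srt_db, slope_per_db):
    """Logistic discrimination function: proportion correct at a speech level.
    slope_per_db is the slope at the 50% point (proportion per dB)."""
    return 1.0 / (1.0 + math.exp(-4.0 * slope_per_db * (level_db - srt_db)))


def level_from_score(score, srt_db, slope_per_db):
    """Inverse of the logistic function: level at which `score` is reached."""
    return srt_db + math.log(score / (1.0 - score)) / (4.0 * slope_per_db)


def level_interval(score_low, score_high, srt_db, slope_per_db):
    """Map the bounds of a score confidence interval to speech levels."""
    return (level_from_score(score_low, srt_db, slope_per_db),
            level_from_score(score_high, srt_db, slope_per_db))


# Hypothetical parameters (assumptions for illustration only).
SRT_DB = 30.0         # level at which 50% of the words are recognized
SLOPE_PER_DB = 0.04   # 4 percentage points per dB at the 50% point

# Score bounds of 50% ± 20%, i.e. the 90% interval for n=20 stated in the text.
low_db, high_db = level_interval(0.30, 0.70, SRT_DB, SLOPE_PER_DB)
print(low_db - SRT_DB, high_db - SRT_DB)
```

Because the logistic function is symmetric about the 50% point, the level interval at a score of 50% is symmetric as well. At 80% the same score interval maps to an asymmetric level interval.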


Conclusions

  • Critical differences can be estimated relatively well solely from the number of measurement items, using method 1 proposed by Thornton and Raffin.
  • With further knowledge about the speech test, i.e., the distribution of recognition of single items and the variance of test lists, methods 4 and 5 provide a more accurate model of the test-retest reliability.
  • Confidence intervals should always be stated when publishing speech test results. It should also be noted whether a one-sided or a two-sided test was considered.

Notes

Competing interests

The authors declare that they have no competing interests.

Acknowledgement

This analysis was funded by the doctoral program Jade2Pro of Jade University of Applied Sciences. Additional funds were provided by the European Regional Development Fund (ERDF-Project Innovation network for integrated, binaural hearing system technology [VIBHear]), together with funds from the State of Lower Saxony.
Manuscript language services were provided by http://stels-ol.de/.


References

1.
Hahlbrock KH. Über Sprachaudiometrie und neue Wörterteste [Speech audiometry and new word-tests]. Arch Ohren Nasen Kehlkopfheilkd. 1953;162(5):394-431. DOI: 10.1007/BF02105664
2.
Holube I, Winkler A, Nolte-Holube R. Modellierung der Reliabilität des Freiburger Einsilbertests in Ruhe mit der verallgemeinerten Binomialverteilung. Z Audiol. 2018;57(1):6-17.
3.
Holube I, Winkler A, Nolte-Holube R. Modeling the reliability of the Freiburg monosyllabic speech test in quiet with the Poisson binomial distribution. Does the Freiburg monosyllabic speech test contain 29 words per list? GMS Z Audiol (Audiol Acoust). 2020;2:Doc01. DOI: 10.3205/zaud000005
4.
Gemeinsamer Bundesausschuss. Richtlinie des gemeinsamen Bundesausschusses über die Verordnung von Hilfsmitteln in der vertragsärztlichen Versorgung. Hilfsmittelrichtlinie. 2018 [accessed 13 December 2018]. Available from https://www.g-ba.de/downloads/62-492-1666/HilfsM-RL_2018-07-19_iK-2018-10-03.pdf
5.
Thornton AR, Raffin MJ. Speech-discrimination scores modeled as a binomial variable. J Speech Hear Res. 1978 Sep;21(3):507-18. DOI: 10.1044/jshr.2103.507
6.
Carney E, Schlauch RS. Critical difference table for word recognition testing derived using computer simulation. J Speech Lang Hear Res. 2007 Oct;50(5):1203-9. DOI: 10.1044/1092-4388(2007/084)
7.
Winkler A, Holube I. Test-Retest-Reliabilität des Freiburger Einsilbertests [Test-retest reliability of the Freiburg monosyllabic speech test]. HNO. 2016 Aug;64(8):564-71. DOI: 10.1007/s00106-016-0166-2
8.
Steffens T. Test-Retest-Differenz der Regensburger Variante des OLKI-Reimtests im sprachsimulierenden Störgeräusch bei Kindern mit Hörgeräten. Z Audiol. 2006;45(3):88-99.
9.
Dillon H. A quantitative examination of the sources of speech discrimination test score variability. Ear Hear. 1982 Mar-Apr;3(2):51-8. DOI: 10.1097/00003446-198203000-00001
10.
Hagerman B. Reliability in the determination of speech discrimination. Scand Audiol. 1976;5:219-28. DOI: 10.3109/01050397609044991
11.
Baljić I, Winkler A, Schmidt T, Holube I. Untersuchungen zur perzeptiven Äquivalenz der Testlisten im Freiburger Einsilbertest [Evaluation of the perceptual equivalence of test lists in the Freiburg monosyllabic speech test]. HNO. 2016 Aug;64(8):572-83. DOI: 10.1007/s00106-016-0192-0
12.
Newcombe RG, Altman DG. Proportions and Their Differences. In: Altman DG, Machin D, Bryant TN, Gardner MJ, editors. Statistics with Confidence: Confidence Intervals and Statistical Guidelines. 2nd Edition. London: British Medical Journal Books; 2000. p. 45-56.
13.
Newcombe RG. Interval estimation for the difference between independent proportions: comparison of eleven methods. Stat Med. 1998 Apr;17(8):873-90. DOI: 10.1002/(sici)1097-0258(19980430)17:8<873::aid-sim779>3.0.co;2-i
14.
Wilson EB. Probable Inference, the Law of Succession, and Statistical Inference. J Am Stat Assoc. 1927;22(158):209-12. DOI: 10.1080/01621459.1927.10502953
15.
Wagener KC, Kühnel V, Kollmeier B. Entwicklung und Evaluation eines Satztests für die deutsche Sprache I: Design des Oldenburger Satztests. Z Audiol. 1999a;38:4-15.
16.
Kollmeier B, Wesselkamp M. Development and evaluation of a German sentence test for objective and subjective speech intelligibility assessment. J Acoust Soc Am. 1997 Oct;102(4):2412-21. DOI: 10.1121/1.419624
17.
Wagener KC, Brand T. Sentence intelligibility in noise for listeners with normal hearing and hearing impairment: influence of measurement procedure and masking parameters. Int J Audiol. 2005 Mar;44(3):144-56. DOI: 10.1080/14992020500057517
18.
Brand T, Kollmeier B. Efficient adaptive procedures for threshold and concurrent slope estimates for psychophysics and speech intelligibility tests. J Acoust Soc Am. 2002 Jun;111(6):2801-10. DOI: 10.1121/1.1479152