A significance test of interaction in 2 × K designs with proportions

When investigating the deficits of a single patient, psychologists usually compare his or her performance on one or more tests to that of a control group. This can be done for any kind of variable, provided that (i) the design does not require the investigation of interactions between two or more factors, (ii) no comparison between two or more individuals is desired, and (iii) control data can be collected. Yet researchers are frequently interested in assessing interactions in the performance of an individual, and in comparing two or more individuals in order to investigate double dissociations, the efficiency of different methods of therapy, and so on. They may also wish to investigate cases where only extremely simple and easy tasks can be performed, so that ceiling effects are observed in the performance of the controls and the case-controls comparison is impossible. The available statistical tools for the analysis of intra-individual or inter-individual performance (mainly with proportions) do not make it possible to assess interactions, and they are not appropriate when some cells contain proportions of 0 or 1 or when the sample size is small. Here, we present the Q' test, which may be used to test the hypothesis of equal proportions and of equal proportion differences in 2 × K designs, thereby allowing researchers to investigate main effects and interactions. This test can be used with any sample size, even when the data contain extreme proportions. Finally, a multiple-comparison procedure described in this paper may be used to locate statistically significant sources of variance and differences.

Damaged brains and disordered minds have been studied by psychologists mostly through single-case studies, where the performance of a patient is compared to the performance of a normative sample. In recent years, some statistical tests have been adapted to the single-case design with one, two, and K tests (Crawford & Garthwaite, 2002). However, the use of these otherwise remarkable tests is limited when investigating interactions between factors, when comparing two or more individuals in order to assess double dissociations (a patient exhibits a deficit in task X but not in task Y, and a second patient exhibits exactly the opposite pattern of performance) or the efficiency of different methods of therapy or reeducation, and, most interestingly, when investigating the performance of patients presenting with massive cognitive impairments. These patients can perform only tasks so simple and easy that no usable data can be collected from control groups, because controls perform 100% correctly. The comparison to normative data is thus impossible because of the absence of variance in the control group.
I would like to thank Louis Laurencelle for his useful comments and his commitment to helping make this a valuable contribution to the literature.
Cases where the data consist of proportions and where intra-individual, inter-individual, or pooled group analyses are possible may offer scientists exceptional opportunities to study human cognition and its breakdown, and also to study phenomena in other domains of fundamental and applied research. To this end, researchers frequently use the classical chi-square tests. Yet these tests can be used with confidence only if the number of trials per condition is large (N > 40) and no low scores (r < 5) are observed. Tests that assess the main effects and the interaction between at least two factors with proportions are rarely accessible to social, cognitive and other behavioral scientists. Marascuilo (1970) presented a test for the comparison of K independent sensitivity indexes, the d-primes of signal detection theory (Green & Swets, 1966), and suggested its use for the analysis of either individual or pooled group data. This test is of particular interest for our purpose: since a d-prime is the difference between two normalized proportions, the difference between K d-primes corresponds to the interaction in a 2 × K design.
The direct transposition of this method to the analysis of non-normalized proportions is, however, not recommended. In fact, Marascuilo (1970) uses a variance (Gourevitch & Galanter, 1967) based on the Wald variance of a proportion. This variance is well known and can be computed as follows,

$$\mathrm{Var}_{\mathrm{Wald}} = \frac{\hat{p}\,(1-\hat{p})}{N} \tag{1}$$

where $\hat{p}$ is the proportion and $N$ the sample size. This variance seemingly leads to several kinds of anomalies (Newcombe, 1998), and its use is not recommended (Newcombe & Altman, 2000). The Q' test of the 2 × K interaction presented in this paper is a modified version of the Marascuilo (1970) test in which a different variance is introduced. This variance can be computed as follows,

$$\mathrm{Var}_{\mathrm{Wilson}} = \frac{\dfrac{\hat{p}\,(1-\hat{p})}{N} + \dfrac{z^{2}}{4N^{2}}}{\left(1 + \dfrac{z^{2}}{N}\right)^{2}} \tag{2}$$

where $z$ is the $z_{1-\alpha/2}$ quantile of the standard Normal distribution. This equation is extracted from the equation of the Wilson confidence interval of proportions (Newcombe & Altman, 2000; Brown et al., 2001):

$$\frac{\hat{p} + \dfrac{z^{2}}{2N} \pm z\sqrt{\dfrac{\hat{p}\,(1-\hat{p})}{N} + \dfrac{z^{2}}{4N^{2}}}}{1 + \dfrac{z^{2}}{N}} \tag{3}$$

The Wald and the Wilson methods are both approximate, but they differ in some important respects:
The Wald variance performs quite well when the sample size is large (N > 40), whilst it is inflated when the sample size is small. By contrast, the Wilson variance is applicable to all samples and, interestingly, it performs just as well as the Wald variance for large samples.
The Wald variance should not be used for very low or very high observed proportions, and its value is 0 when the proportion is 0 or 1. No such restrictions exist for the Wilson variance, which can be applied in all cases and whose value is not 0 for extreme proportions.
Finally, it is strongly recommended that the Wald method be used only when neither r nor N − r is less than 5 (Newcombe & Altman, 2000). The Wilson method can be applied in all cases.
These differences are visible in Figure 1, where the Wald and the Wilson variances are compared for proportions from 0 to 1, for a small (N = 10) and a large (N = 60) sample. The two methods perform equally well for the large sample, which illustrates their direct link and equivalence. However, the Wilson variances are less inflated for the small sample than the Wald variances, and this allows a better interpretation of the results. As pointed out by Newcombe and Altman (2000), the Wald method leads to "too extreme an interpretation of the data, and sometimes do not make sense" (p. 46). Thus, unless the sample size is large and no extreme proportions are present (conditions that are rarely met in some research domains), the Wilson variance seems more appropriate for the analysis of proportions and differences in proportions. Because it applies to all sample sizes and to any proportion, the Wilson variance (Newcombe, 1998; Agresti & Coull, 1998) is the one used by the Q' test presented here.
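The contrast between the two variances can be checked numerically. The following Python sketch assumes the Wald variance p(1 − p)/N and the Wilson variance [p(1 − p)/N + z²/(4N²)] / (1 + z²/N)², with z = 1.96. The function names are illustrative and are not part of the original paper.

```python
Z = 1.96  # z quantile for a two-sided 95% level (assumed default)


def wald_variance(p, n):
    """Wald variance of a proportion p observed over n trials."""
    return p * (1.0 - p) / n


def wilson_variance(p, n, z=Z):
    """Wilson variance of a proportion p observed over n trials."""
    return (p * (1.0 - p) / n + z ** 2 / (4.0 * n ** 2)) / (1.0 + z ** 2 / n) ** 2


# Compare both variances for a small and a large sample, as in Figure 1.
for n in (10, 60):
    for p in (0.0, 0.5, 1.0):
        print(f"N = {n:2d}, p = {p:.1f}: "
              f"Wald = {wald_variance(p, n):.4f}, "
              f"Wilson = {wilson_variance(p, n):.4f}")
```

With N = 10 the Wald variance is 0 at the extreme proportions while the Wilson variance stays positive, and at p = 0.5 the Wilson variance is the smaller of the two. With N = 60 the two values are close.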
In the 2 × K design, the magnitude of the difference between 2 proportions (labelled 1 and 2) is compared across K different conditions (1, 2, …, K), where a condition may represent a group of observers (pooled group data), a single observer, or even a single test from individual data. The resulting test statistic has a χ² distribution with ν = K − 1 degrees of freedom. For each of the K conditions, let the estimate of the difference between two proportions be denoted by

$$\hat{d}_{k} = \hat{p}_{k1} - \hat{p}_{k2} \tag{4}$$

where $\hat{p}_{k1}$ and $\hat{p}_{k2}$ denote the proportions to compare, and where k = 1, 2, …, K. Let the variances of the two proportions be denoted by $\mathrm{Var}_{k1}$ and $\mathrm{Var}_{k2}$, respectively (see equation 2), where $N_{k1}$ and $N_{k2}$ represent the sample sizes in the k1th and k2th conditions, and where $z$ is the $z_{1-\alpha/2}$ quantile of the standard Normal distribution (i.e., 1.96). The variance of the proportion difference is

$$\mathrm{Var}(\hat{d}_{k}) = \mathrm{Var}_{k1} + \mathrm{Var}_{k2} \tag{5}$$

and the Q' is

$$Q' = \sum_{k=1}^{K} \frac{(\hat{d}_{k} - \hat{d}_{0})^{2}}{\mathrm{Var}(\hat{d}_{k})} \tag{6}$$

where

$$\hat{d}_{0} = \frac{\sum_{k=1}^{K} \hat{d}_{k} / \mathrm{Var}(\hat{d}_{k})}{\sum_{k=1}^{K} 1 / \mathrm{Var}(\hat{d}_{k})} \tag{7}$$

As mentioned above, the Q' test statistic has a χ² distribution with ν = K − 1 degrees of freedom, and the corresponding critical value, above which a difference among the tested conditions is significant, can be read in a χ² table.
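The definition above can be sketched as a single function. This minimal Python version assumes that Q' is the sum of squared deviations of each difference from an inverse-variance weighted mean difference, each divided by the variance of that difference, and that each variance is the Wilson variance with z = 1.96. The function names and the example scores are hypothetical.

```python
def wilson_variance(p, n, z=1.96):
    """Wilson variance of a proportion p observed over n trials."""
    return (p * (1.0 - p) / n + z ** 2 / (4.0 * n ** 2)) / (1.0 + z ** 2 / n) ** 2


def q_prime_interaction(r1, n1, r2, n2, z=1.96):
    """Return (Q', df) for a 2 x K design.

    r1, r2: scores for levels 1 and 2 of the 2-level factor, one per condition.
    n1, n2: corresponding numbers of trials.
    """
    diffs, variances = [], []
    for ra, na, rb, nb in zip(r1, n1, r2, n2):
        pa, pb = ra / na, rb / nb
        diffs.append(pa - pb)
        variances.append(wilson_variance(pa, na, z) + wilson_variance(pb, nb, z))
    # Inverse-variance weighted mean of the K differences.
    d0 = (sum(d / v for d, v in zip(diffs, variances))
          / sum(1.0 / v for v in variances))
    q = sum((d - d0) ** 2 / v for d, v in zip(diffs, variances))
    return q, len(diffs) - 1


# Hypothetical data: the difference is 0.25 in both conditions, so Q' is ~0.
q_example, df_example = q_prime_interaction([30, 20], [40, 40], [20, 10], [40, 40])
print(f"Q'({df_example}) = {q_example:.4f}")
```

Because the function takes scores and trial counts separately for every cell, unequal sample sizes across conditions need no special handling.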

A worked example with practical steps
The data must first be arranged in a 2 × K table. Let us consider the following data, taken from a study where patient RR, suffering from progressive agnosia due to posterior cortical atrophy, was required to name pictures of everyday objects. The null hypothesis is that patient RR's naming performance does not vary as a function of the picture color and the semantic category of the objects. The pictures could be colored or grey-scaled (i.e., factor 1: color), and could represent (a) vegetables, (b) animals, or (c) tools (i.e., factor 2: semantic category). Thirty-six pictures per condition (N = 36) were presented. N need not be the same for all conditions. The results are presented in Table 1 and depicted graphically in Figure 2.

The 2 × K interaction
Seven computational steps are needed to assess the color × semantic category interaction; they are represented in Table 1.
Step 1: compute the proportion for each condition by dividing the observed score by the corresponding number of trials, for example:

$$\hat{p}_{a1} = \frac{r_{a1}}{N_{a1}} = \frac{33}{36} = 0.917$$

where $r_{a1}$ is the score in the a1th cell, $N_{a1}$ is the sample size for the a1th condition, and $\hat{p}_{a1}$ is the proportion of the a1th condition, and so forth.
Step 2: compute the Wilson variance for each proportion, for example:

$$\mathrm{Var}_{a1} = \frac{\dfrac{0.917\,(1 - 0.917)}{36} + \dfrac{1.96^{2}}{4 \times 36^{2}}}{\left(1 + \dfrac{1.96^{2}}{36}\right)^{2}} = 0.0023$$

Step 3: for each condition of factor 2 (a, b, …, k), compute the proportion difference between the two conditions of factor 1 (1, 2), for example:

$$\hat{d}_{a} = \hat{p}_{a1} - \hat{p}_{a2} = 0.917 - 0.278 = 0.639$$

Steps 4 to 7 then derive the variance of each difference (the sum of the two variances), weight each difference by the inverse of that variance to obtain d0, and sum the weighted squared deviations from d0 to obtain the Q'.
If the Q' is equal to or greater than the value of χ² read in a statistical table, then the null hypothesis is rejected and the tested interaction is significant at the corresponding P level.
For df = 3 − 1 = 2, the corresponding χ² value for P = 0.05 is 5.99. The observed Q' (15.82, as reported in Figure 2) exceeds this value, so we can reject the null hypothesis. The interaction is significant, suggesting that patient RR's performance varies as a function of the color of the pictures and the semantic category of the objects.
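The seven steps can be traced on patient RR's data with a short Python sketch. The scores below are not quoted in the text; they are recovered by multiplying the Table 1 proportions by N = 36, and they agree with the collapsed scores given for the main effect. Steps 4 and 5 are assumed to be the variance of each difference and the difference divided by that variance, which is what the definition of d0 implies. The P value uses the closed form exp(−Q'/2), which holds only for df = 2.

```python
import math

Z = 1.96
N = 36
scores_colored = {"a": 33, "b": 25, "c": 12}  # level 1 (recovered, see above)
scores_grey = {"a": 10, "b": 11, "c": 8}      # level 2 (recovered, see above)


def wilson_variance(p, n, z=Z):
    """Wilson variance of a proportion p observed over n trials."""
    return (p * (1.0 - p) / n + z ** 2 / (4.0 * n ** 2)) / (1.0 + z ** 2 / n) ** 2


conditions = list(scores_colored)
# Step 1: proportions.
p1 = {k: scores_colored[k] / N for k in conditions}
p2 = {k: scores_grey[k] / N for k in conditions}
# Step 2: Wilson variance of each proportion.
v1 = {k: wilson_variance(p1[k], N) for k in conditions}
v2 = {k: wilson_variance(p2[k], N) for k in conditions}
# Step 3: proportion differences.
d = {k: p1[k] - p2[k] for k in conditions}
# Step 4: variance of each difference.
var_d = {k: v1[k] + v2[k] for k in conditions}
# Step 5: each difference divided by its variance.
step5 = {k: d[k] / var_d[k] for k in conditions}
# Step 6: 1 divided by the variance of each difference.
step6 = {k: 1.0 / var_d[k] for k in conditions}
d0 = sum(step5.values()) / sum(step6.values())
# Step 7: contribution of each difference; their sum is Q'.
step7 = {k: (d[k] - d0) ** 2 / var_d[k] for k in conditions}
q_prime = sum(step7.values())
df = len(conditions) - 1
p_value = math.exp(-q_prime / 2.0)  # chi-square upper tail, valid for df = 2 only

print(f"Q'({df}) = {q_prime:.2f}, P = {p_value:.4f}")
```

The result should agree, up to rounding, with the Q'(2) = 15.82 and P = 0.0004 reported in the caption of Figure 2.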

The main effects
Table 1. Scores (correct responses) obtained by patient RR in a naming task with colored (1) and grey-scaled (2) pictures of vegetables (a), animals (b) and tools (c), and the seven steps necessary for the analysis of the color × semantic category interaction with proportions. Thirty-six pictures were presented in each tested condition.

                                a         b         c
Step 1  Proportion (p̂)    1    0.917     0.694     0.333
                           2    0.278     0.306     0.222
Step 2  Variance (Var)     1    0.0023    0.0054    0.0056
                           2    0.0052    0.0054    0.0045

One may use the Marascuilo procedure (Marascuilo, 1966) in order to derive a χ² for the main effect of each factor. However, the use of this method requires large samples and scores greater than 5. The method we describe below is very similar to the one used for the assessment of the interaction, and it can be used with any data. Let us consider the steps for the main effect of factor 2, in which 3 conditions were tested. The data are collapsed across the non-tested factor.
Step 1: The collapsed scores for a, b, and c are ra = 43, rb = 36 and rc = 20, respectively, and the sample size for each condition is 72.
Step 3: For each condition, the difference from the pooled proportion is derived. As before, the Q' follows the χ² distribution with df = K − 1 (here, df = 3 − 1 = 2). The critical χ² value for P = 0.05 when df = 2 is 5.99: the null hypothesis can be rejected. The main effect of semantic category is significant, suggesting that patient RR's naming performance differs as a function of the semantic category of the objects.
The same procedure also applies to the other main effect, even though other tests exist, such as the z-score for the difference between two proportions. Using the Q' test for the main effect of the 2-condition factor is of interest because the two collapsed proportions are compared to a baseline represented by the overall pooled proportion, which renders the test more conservative.
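The collapsing and differencing steps for the main effect of semantic category can be sketched as follows. The code uses the collapsed scores given in Step 1 and assumes that the pooled proportion is the total score divided by the total number of trials. It stops at the differences: the variance and weighting steps are presumed to follow the interaction procedure and are not reproduced here. All variable names are illustrative.

```python
N_COLLAPSED = 72                           # trials per condition after collapsing
collapsed = {"a": 43, "b": 36, "c": 20}    # collapsed scores from Step 1

# Step 1: collapsed proportions.
proportions = {k: r / N_COLLAPSED for k, r in collapsed.items()}

# Pooled proportion over all conditions (assumed: total score / total trials).
pooled = sum(collapsed.values()) / (N_COLLAPSED * len(collapsed))

# Step 3: difference between each collapsed proportion and the pooled proportion.
differences = {k: p - pooled for k, p in proportions.items()}

for k in collapsed:
    print(f"{k}: p = {proportions[k]:.3f}, difference = {differences[k]:+.3f}")
print(f"pooled proportion = {pooled:.3f}")
```

With equal sample sizes the differences sum to zero, which is a quick sanity check on the collapsing step.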

Multiple comparisons
The presence of a significant interaction or a significant main effect does not tell us why the null hypothesis was rejected. Marascuilo and McSweeney (1967) developed a method of multiple comparisons, which consists in comparing two proportions through the confidence interval of their difference. Once again, the problem is that these formulas use the Wald confidence interval of proportions, which makes the results difficult to interpret when N < 40, as well as when $\hat{p} = 0$ or $\hat{p} = 1$. Here we propose an alternative procedure which can be used with any data.
As in the Marascuilo and McSweeney (1967) procedure, a critical value should be calculated:

$$\text{critical value} = \sqrt{\chi^{2}_{\nu}}$$

For df = 3 − 1 = 2, the corresponding χ² value is 5.99, and the critical value is $\sqrt{5.99} = 2.45$. A value, ψ, is then computed for each desired comparison:

$$\psi_{ij} = \frac{|\hat{d}_{ij}|}{\sqrt{\hat{D}_{ij}}}$$

where $\hat{d}_{ij}$ is the difference between the proportions to compare, i and j, and $\hat{D}_{ij}$ is the variance of their difference, the equations for which were given in (eq. 2) and (eq. 5). Suppose that, following a significant interaction, we wish to compare the conditions a1 and a2. If ψij is equal to or greater than the critical value, then the difference is significant at P = 0.05. In our example, the condition a1 differs from the condition a2 at P = 0.05. This procedure can be used for any other comparison, including the comparisons between the conditions of a single factor when its main effect is significant.
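The comparison of a1 and a2 can be sketched in Python. Two assumptions are made here: ψ is taken to be the absolute difference divided by the square root of the variance of the difference, and the scores 33 and 10 are recovered from the Table 1 proportions (0.917 and 0.278) multiplied by N = 36. The function names are illustrative.

```python
import math

Z = 1.96


def wilson_variance(p, n, z=Z):
    """Wilson variance of a proportion p observed over n trials."""
    return (p * (1.0 - p) / n + z ** 2 / (4.0 * n ** 2)) / (1.0 + z ** 2 / n) ** 2


def psi(r_i, n_i, r_j, n_j):
    """Assumed form: |difference| / sqrt(variance of the difference)."""
    p_i, p_j = r_i / n_i, r_j / n_j
    variance_of_difference = wilson_variance(p_i, n_i) + wilson_variance(p_j, n_j)
    return abs(p_i - p_j) / math.sqrt(variance_of_difference)


CHI2_CRITICAL = 5.99                   # chi-square for P = 0.05, df = 2 (from the text)
critical = math.sqrt(CHI2_CRITICAL)    # about 2.45

psi_a = psi(33, 36, 10, 36)            # a1 (colored) versus a2 (grey-scaled)
print(f"psi = {psi_a:.2f}, critical = {critical:.2f}, "
      f"significant = {psi_a >= critical}")
```

Under these assumptions ψ for a1 versus a2 is well above the critical value, in line with the conclusion stated in the text.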
Other tests: two conditions, K conditions, and 2 × 2 design
The Q' test described above has the advantage of being flexible and useful with three other designs. Even though other tests exist for the comparison of 2 proportions, K proportions and 2 × 2 designs, there is much to be gained from using the same procedure and deriving the same values for these tests, as this would render scientific studies directly comparable. The reader will have noted that the computation of the main effects in the 2 × K design corresponds to two distinct tests:
The main effect of the 2-condition factor is actually the comparison of two proportions derived from collapsed data. One can compare two single proportions in the same way. As is the case for the main effect of factor 1, df = 1.
The main effect of the K-condition factor is in fact the comparison of K proportions. The computations of this main effect can be used in any study in which the difference between K conditions is investigated.
Finally, even though the Q' test was primarily developed to test 2 × K designs, it can also be used in its present form with 2 × 2 designs. In this case, df = 1 for both main effects, as well as for the interaction.
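For the 2 × 2 case, the interaction statistic takes a particularly simple form. The derivation below is a sketch that assumes Q' is the sum over conditions of the squared deviation of each difference from an inverse-variance weighted mean difference, divided by the variance of that difference; $V_k$ is shorthand, introduced here, for the variance of the kth difference.

```latex
% Q' for K = 2, with V_k denoting the variance of the k-th difference
\begin{align*}
\hat{d}_0 &= \frac{\hat{d}_1 / V_1 + \hat{d}_2 / V_2}{1 / V_1 + 1 / V_2}
           = \frac{V_2\,\hat{d}_1 + V_1\,\hat{d}_2}{V_1 + V_2}, \\
\hat{d}_1 - \hat{d}_0 &= \frac{V_1\,(\hat{d}_1 - \hat{d}_2)}{V_1 + V_2}, \qquad
\hat{d}_2 - \hat{d}_0 = -\frac{V_2\,(\hat{d}_1 - \hat{d}_2)}{V_1 + V_2}, \\
Q' &= \frac{(\hat{d}_1 - \hat{d}_0)^2}{V_1} + \frac{(\hat{d}_2 - \hat{d}_0)^2}{V_2}
    = \frac{(\hat{d}_1 - \hat{d}_2)^2}{V_1 + V_2}.
\end{align*}
```

With K = 2 the statistic is therefore the squared difference between the two differences divided by the sum of their variances, which is the square of a z-type ratio and is consistent with df = 1.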

Use and abuse of the Q' tests
The original test by Marascuilo (1970) was designed for the assessment of differences among K d-primes. He recommended the use of the test for pooled group data and in single-case studies. The Q' test, which derives directly from the test of Marascuilo (1970), can thus be used in both cases, but it does not have the disadvantages of more familiar tests. This makes the family of Q' tests most useful in single-case and small-N studies. Here are some recommendations for the use of the family of Q' tests, aimed at avoiding any possible abuse:
The analysis should be carried out on original scores, never on percentage-transformed data. Even though the computations are carried out on proportions, the variance obtained from scores differs from the one obtained from percentages because it depends on N. Thus, if you have presented the patient with N = 40 trials in your original study, the percentage transformation would increase this number to N = 100. This would make Type I errors more frequent, and the plausibility of the results would be strongly questioned.
The Q' tests can be used in a variety of situations and research domains, such as education and marketing, but they are also of great importance in neuropsychological studies involving single cases, especially when it is difficult to obtain any normative data from a control group because of ceiling effects. In this last case, the use of the Q' tests should be justified mostly by the impossibility of collecting data from a control group. Whenever the collection of control group data is possible, the use of case-controls tests (Mycroft et al., 2002; Crawford & Garthwaite, 2002) should be preferred to the Q' tests, unless the comparison of at least two cases is desired (see below), or the design involves a 2 × 2 or 2 × K interaction.
The key notion in cognitive neuropsychology is double dissociation. The existence of two independent cognitive systems or processes can be assumed when a patient exhibits a deficit in task X but not in task Y, and when a second patient exhibits exactly the opposite pattern of performance. Thus, the direct comparison of two cases is necessary for a double dissociation to be assessed. Yet the tests adapted to the single-case design do not allow the direct comparison of two patients. The Q' tests allow this kind of comparison. More precisely, the 2 × K test allows the comparison of the performance of 2 individuals in K tests or conditions, or the comparison of the performance of K individuals in 2 tests or conditions. These possibilities make the Q' tests particularly valuable, at least in neuropsychology.
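The first recommendation above, analysing original scores rather than percentages, can be illustrated numerically. The Python sketch below uses a hypothetical score of 30 correct out of N = 40 and assumes the Wilson variance with z = 1.96. Treating the same proportion as if it came from N = 100 shrinks the variance, which is what would inflate the test statistic.

```python
def wilson_variance(p, n, z=1.96):
    """Wilson variance of a proportion p observed over n trials."""
    return (p * (1.0 - p) / n + z ** 2 / (4.0 * n ** 2)) / (1.0 + z ** 2 / n) ** 2


score, n_trials = 30, 40          # hypothetical original data
p = score / n_trials              # 0.75, i.e. 75%

var_scores = wilson_variance(p, n_trials)   # correct: based on the real N
var_percent = wilson_variance(p, 100)       # wrong: percentage treated as N = 100

print(f"variance with N = 40:  {var_scores:.4f}")
print(f"variance with N = 100: {var_percent:.4f}")
```

The variance based on the fictitious N = 100 is less than half the one based on the actual 40 trials, so any statistic that divides by it would be correspondingly too large.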

Implementing the Q' tests
The equations presented in this article can be copied into a spreadsheet in any commonly used software (e.g., Excel) and kept for further use. A free Excel file allowing the computation of main effects, interaction and multiple comparisons is available on the journal's web site.

Conclusions
The family of Q' tests constitutes an interesting and useful tool for the analysis of 2 × K designs with proportions, and its use can improve statistical inference in a variety of research domains and clinical contexts, such as education, marketing, neuropsychology and cognitive psychopathology. The analysis of more complex designs (e.g., 2 × 2 × K) should be possible in the future by extending the Q' tests, and this exciting possibility should allow a more sophisticated, plausible and adequate analysis of data from intra-individual, inter-individual, and group studies.

Figure 1: A direct comparison between the Wald and the Wilson variances for different proportions and for sample sizes of N = 10 and N = 60. The two variances are very similar for the large sample size. However, for the small sample size, there are important differences between the two variances. The greatest differences are observed for the extreme (0 and 1) and the intermediate proportions. The Wilson variance is thus applicable under the same conditions as the Wald variance, and it performs better when the sample size is not large and when extreme proportions are observed. For these reasons, the Wilson method should be preferred to the Wald method.

Figure 2: Graphic representation of the performance (proportion of correct responses) of patient RR in a picture naming task. The pictures were either colored or grey-scaled, and represented everyday objects from three semantic categories: vegetables, animals and tools. Error bars represent 95% Wilson confidence intervals for proportions (eq. 3). The visual analysis of this graphic suggests that picture color and object semantic category interact, and this is confirmed by the Q' test (Q'(2) = 15.82, P = 0.0004).

Step 6: divide 1 by the variance of the proportion difference. The d0 can be obtained by dividing the sum of the values obtained in Step 5 by the sum of the values obtained in Step 6.
Step 7: for each proportion, compute the contribution of each difference to the main effect. The sum of the values obtained in Step 7 is the Q' value of the 3