Assessing parameter invariance in item response theory's logistic two item parameter model: A Monte Carlo investigation

Statistical properties of the ability level estimate (θ̂) in item response theory (IRT) were investigated through a Monte Carlo study based on data generated under a four-case multifactor design. Dichotomous items and the two-parameter logistic IRT model in a one-dimensional setting were chosen. Item parameters were estimated by marginalized Bayesian estimation and θ by EAP estimation. The property of invariance is discussed. Results show that the estimation of θ is intrinsically biased, is constrained by the number of items, and performs better as the number of items and the number of examinees increase. Furthermore, IRT parameters do not seem to perform better, nor to give more information, than those used in classical test theory.

M. G.: Centre de réadaptation en déficience intellectuelle et troubles envahissants du développement Mauricie/Centre-du-Québec - Institut universitaire, 3090, rue Foucher, Trois-Rivières (Qc) G8Z1M3; Email: marlene_galdin@ssss.gouv.qc.ca; L. L.: Département des sciences de l'activité physique, Université du Québec à Trois-Rivières. We wish to thank the reviewers for their careful revision and the suggestions they made. They greatly enhanced the quality of this paper.

Classical test theory (Gulliksen, 1950; Lord & Novick, 1968; Laveault & Grégoire, 2002) proposes an algebraic-conceptual framework to explore the connection between an observed score, measured by a test which evaluates a skill, knowledge or psychological aptitude, and the person's unknown true score or ability level. Item response theory (IRT) (Hambleton, Swaminathan & Rogers, 1991; Bertrand & Blais, 2004), on the other hand, tackles the same problem on a molecular basis, i.e. item-wise, by trying to model the interaction between the respondent's ability level and the operational characteristics of each item. An attractive feature of IRT is its parametric setting, usually represented with a k-item parameter logistic probability model (k = 1, 2, 3), and the property of invariance associated with it (McKinley, 1989).
The full 3-item parameter logistic model serves to illustrate the role and interpretation of each component. It describes the examinee's probability of giving the correct response to an item:

$$P_j(\theta_r) = c_j + (1 - c_j)\,\frac{1}{1 + e^{-a_j(\theta_r - b_j)}} \qquad (1)$$

In equation (1), θr denotes the r-th examinee's ability level, bj is the item's difficulty level, aj its coefficient of discrimination at the inflexion point, and cj the index of pseudo-guessing. Values of θ and bj typically range from -3 to 3, aj is usually a small positive value, and cj, varying from 0 to 1, is used mostly for multiple-choice items, where chance supplies a minimum probability of guessing the correct answer, i.e. the probability that low-ability respondents obtain the correct answer. The 2-parameter model does away with the cj parameter (i.e. cj ≡ 0) and the 1-parameter model (e.g. Rasch model) uses only the bj coefficient (e.g. setting aj ≡ 1).
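As an illustration of how the three item parameters act on the response probability, the following minimal sketch evaluates the logistic model in Python. It assumes the usual logistic form without the scaling constant D = 1.7; the function name `p_correct` and its argument names are ours, not part of any IRT package.

```python
import math

def p_correct(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3-parameter logistic model.

    theta: ability level; a: discrimination; b: difficulty; c: pseudo-guessing.
    Assumption: plain logistic form, no D = 1.7 scaling constant.
    Setting c = 0 gives the 2-parameter model; c = 0 and a = 1 the 1-parameter model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# When ability equals difficulty, the probability lies halfway between c and 1.
print(p_correct(0.0, a=0.5, b=0.0, c=0.2))
```

With c = 0, the curve passes through 0.5 at θ = b, which is why b is read as the item's difficulty on the ability scale.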
Item response theory enhances and in some way supplants classical test theory (CTT) by implementing new concepts and a new vocabulary to describe tests (item characteristic curve, item/test information function, optimal testing, etc.) and by putting the focus on the estimation of items' operational characteristics (e.g. assessment of test dimensionality, estimation of the a, b, c parameters, item bias and differential item functioning), although these issues are also addressed in CTT. Moreover, proponents of IRT put forward the property of invariance possessed by parameter estimates, advocating that such estimates, that of θ for instance, are obtained free of context and can be deemed truly characteristic of their object, as opposed to the context-bound estimates in CTT. "Invariance" often means that values of IRT item parameters ought to be identical for separate groups of examinees and across different measurement conditions (Rupp & Zumbo, 2006).
What is invariance?
Like most authors on the same topic, Hambleton et al. (1991) stress the importance of this concept as a distinctive asset of IRT: "The importance of the property of invariance of item and ability parameters cannot be overstated. This property is the cornerstone of item response theory and makes possible such important applications as equating, item banking, investigation of item bias, and adaptive testing." (p. 25) On the one hand, "invariance" means equality: "If invariance holds, the parameters obtained should be identical" (Hambleton et al., 1991, p. 20; Rupp & Zumbo, 2006, p. 64). On the other hand, a less stringent form of correspondence, e.g. linear equivalence, is admitted as a demonstration of invariance: two sets of parameters are said to be mutually "invariant" if they may be linearly transformed one into the other (Hambleton et al., 1991; Rupp & Zumbo, 2006; Stocking & Lord, 1983)¹. This second meaning of "invariance", also named "congruence", is akin to the notion of (linear) correlation, to the point that values of Pearson's correlation coefficients are taken as conclusive indications of invariance (Fan, 1998; Frenette et al., 2007), with a threshold value of r = 0.90 being proposed.
From another standpoint, that of estimation theory in mathematical statistics (Kendall & Stuart, 1977; Freund, 1992), the concept of invariance must be translated into related concepts, notably the concept of "bias". An estimating function based on a random sample of a population is said to be unbiased if its expectation (across samples) is equal to the target parametric value. For the ability parameter of respondent "r", this simply means:

$$E\{\hat{\theta}_r\} = \theta_r \qquad (2)$$

concurring with the "identical" definition of invariance in Hambleton et al. (1991), the bias being measured by the difference between E{θ̂r} and θr, here E{θ̂r} - θr = 0.

As McKinley (1989) pointed out, the first step before using an IRT model is to estimate its parameters; usually none of them are known a priori. Baker and Kim (2004) provide a broad coverage of the methods and procedures for estimating the parameters of test items and examinees' ability levels. Pragmatically, the LOGIST™ program (Barton & Lord, 1982) was popular for a while; the method implemented in that program was called joint maximum likelihood estimation (JMLE) and was formulated by Birnbaum (1968): the θ and item parameters were simultaneously estimated.
Other popular programs such as BILOG-MG 3™ (Zimowski, Muraki, Mislevy & Bock, 2003) or MULTILOG V7™ (Thissen, Chen & Bock, 2003) use (optionally) marginal maximum likelihood estimation (MMLE) and an expectation-maximization (EM) algorithm. This technique estimates the item parameters and the θ parameters in consecutive steps. The advantage is that convergence can be reached with a fixed number of items without calling upon an arbitrary prior ability distribution. Baker and Kim (2004) recommend using marginalized Bayesian item parameter estimation (BME). This estimation is quite similar to MMLE, except that a prior distribution is added on the discrimination parameter (a). BME ensures that the procedure can be completed even in limit cases (e.g. when all items have been answered correctly or all incorrectly). Once the item parameters are "calibrated", i.e. estimated, the θ parameters are obtained by the widely used Bayes Expected A Posteriori (EAP) estimation procedure proposed by Bock and Mislevy (1982).
The lack of invariance, or so-called item parameter drift, has been studied by others (e.g. Frenette et al., 2007; Rupp & Zumbo, 2006; Si & Schumacher, 2004; Wainer & Thissen, 1987; Wells, Subkoviak & Serlin, 2002). Results show that there might be a slight lack of invariance or item parameter drift under particular conditions (e.g. test length, number of examinees, presence of other latent traits), but findings are not unequivocal as to the specific conditions which worsen those variations. As pointed out earlier, accuracy of measurement is important in order to help users choose the model which best fits their reality or need.
The main purpose of this study was to shed more light on the invariance of the estimation of the θ, b and a parameters, in the context of a widely used two-phase estimation procedure. This paper presents a Monte Carlo investigation based on a four-case design which reproduces conditions that might be found in different testing contexts. Indeed, test reliability, test length, number of examinees and values of item parameters vary widely from one context to another, so we set up three main cases divided into four sub-cases to cover a large array of possibilities reflecting realistic conditions. To complete our investigation, a fourth case was designed based on a pioneering idea: what might happen to one's ability estimate if some other group of individuals with very different ability levels is introduced in the estimation process? Does one's ability estimate remain invariant whatever group of individuals it is embedded in, and if not, what are the effects of such a relocation? The salient questions that we wished answered were the following: How do the estimated abilities (θ̂) match their corresponding generated values (θ)? What factors influence the consistency, reliability and other indicators of θ̂'s invariance? Do the estimated abilities (θ̂) and the more classical X scores (the sum of answers) behave equivalently across conditions, and what distinguishes them? In order to provide complete information to the reader, other questions and other answers will also be examined with regard to parameters a and b and their corresponding CTT indices. The parametric organization and details of the experiment, together with the modalities of the Monte Carlo implementation, are laid out in the next section.

STUDY DESIGN
This study is a Monte Carlo investigation. Dichotomous (0/1) items and the two-parameter logistic IRT model (a and b) in a one-dimensional setting were chosen. First, item parameters a and b are generated, followed by θ values; then, from these, random item response patterns for each examinee are generated twice, once at "pre-test" and again at "post-test". Numbers of items and examinees are given in the Cases section below, together with their parametric conditions. The whole procedure is iterated 30 times within each condition. Means, standard deviations and Pearson correlation coefficients are computed for varying outcome indices across the 30 iterations: the relatively low noise of our simulated data, coupled with the high efficacy (cf. F tests and ω²) of our independent variables, obviated the need to slow down our experimentation unduly with more replications. Data generation and compilation as well as the handling of experimental conditions were programmed in Borland's Delphi 5™ language and run on a PC platform.
The main difference between cases 1, 2 and 3 is the reliability (ρXX) condition between the pre- and post-test X scores. There are reciprocal relations between the test-retest reliability (ρXX) of X scores, the number of items (k) and the item discrimination coefficients (a) in the two-parameter logistic model, i.e. a set of k items having Gamma-generated a coefficients with mean μa will result in a specific mean value of ρXX.² Cases 1, 2 and 3 are explained below. In Case 4, we introduce "witness protocols", i.e. sets of responses from a few respondents that are transferred unchanged from pre-test to post-test and are then mingled with freshly generated data, the purpose being to measure the robustness of estimates when the estimation environment changes.
Case 2: Low reliability (ρXX ≈ 0.40)
A common low value of test-retest reliability was imposed for all k through a lowering of μa, i.e. μa ≈ 0.564, 0.308, 0.236 and 0.165 for k = 10, 30, 50 and 100 respectively. The same combinations of μb, μθ and N were applied as in case 1.

Case 3: High reliability (ρXX ≈ 0.80)
A common high value of test-retest reliability was imposed for all k through a variation of μa, i.e. μa ≈ 1.843, 0.858, 0.628 and 0.423 for k = 10, 30, 50 and 100. The same combinations of μb, μθ and N were applied as in cases 1 and 2.
Case 4: Witness response protocols
Ability (θ) estimation and the associated invariance principle of IRT being the main concerns of this study, we contrived a way of ascertaining the reliability and stability of θ̂ estimates by manipulating the sampling conditions of estimation. Thus, a "witness response protocol" is a protocol generated at pre-test which is identically reproduced at post-test, while "companion protocols" are allowed to vary, i.e. are generated afresh at post-test from their original θ parameter. In all sub-cases of case 4, only conditions with N = 500 examinees and k = 30 items were studied, together with μa = 0.5, μb = 0 and (except case 4d) μθ = 0; note that the μa = 0.5, k = 30 couple entails a moderate reliability of ρXX ≈ 0.618. As for every combination of conditions in cases 1, 2 and 3, each sub-case of case 4 was iterated 30 times. At each iteration, ten (10) Monte Carlo runs were carried out, and the outcomes and estimates for the first 10 examinees (the "witnesses") were extracted and stored, so that 100 (= 10 × 10) witness sets of data were produced per iteration and submitted to analysis. Specific conditions for each sub-case are described below.
Sub-case 4a. For this sub-case, used as a standard for comparison, the 10 witness protocols of a Monte Carlo run are in fact generated at each of pre- and post-test times. Explicitly, 500 θ-values (with μθ = 0) are generated once, and response protocols are generated anew at pre-test and then at post-test for all examinees (control condition).
Sub-case 4b. In each run, 500 θ-values (with μθ = 0) are generated, along with 500 response protocols at pre-test. At post-test, the first 10 protocols of pre-test are reproduced identically as witnesses, and the remaining 490 (= 500 - 10) protocols are generated afresh ("same companions" condition).
Sub-case 4c. In each run, 500 θ-values (with μθ = 0) are generated, together with the 500 response protocols for pre-test. At post-test, the first 10 protocols of pre-test are reproduced identically as witnesses; for the remaining part of the sample, 490 (= 500 - 10) new θ-values (still with μθ = 0) are generated along with their random response protocols ("equal new companions" condition).
Sub-case 4d. In each run, 500 θ-values (with μθ = 0) are generated, together with the 500 response protocols for pre-test. At post-test, the first 10 protocols of pre-test are reproduced identically as witnesses; for the remaining part of the sample, 490 (= 500 - 10) new θ-values, under hyperparameter μθ = 1 and generally higher, are produced, and their corresponding random response protocols are obtained ("better new companions" condition).
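The four witness sub-cases can be summarised by a minimal Python sketch. This is not the original Delphi program: the function names `respond` and `witness_run`, the argument names and the use of Python's `random` module are our own assumptions, and item parameters are passed in as plain lists.

```python
import math
import random

def respond(rng, theta, a, b):
    """Draw one 0/1 response protocol under the 2-parameter logistic model."""
    return [1 if rng.random() < 1.0 / (1.0 + math.exp(-aj * (theta - bj))) else 0
            for aj, bj in zip(a, b)]

def witness_run(sub_case, a, b, n=500, n_witness=10, seed=None):
    """Return (pre, post) response protocols for sub-case '4a', '4b', '4c' or '4d'."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(n)]   # pre-test abilities, mean 0
    pre = [respond(rng, t, a, b) for t in theta]
    if sub_case == "4a":
        # Control: every examinee, witnesses included, answers anew.
        return pre, [respond(rng, t, a, b) for t in theta]
    # Witness protocols are copied unchanged from pre-test.
    post = [row[:] for row in pre[:n_witness]]
    if sub_case == "4b":      # same companions, fresh responses
        companions = theta[n_witness:]
    elif sub_case == "4c":    # new companions, same ability level (mean 0)
        companions = [rng.gauss(0.0, 1.0) for _ in range(n - n_witness)]
    elif sub_case == "4d":    # new companions, higher ability level (mean 1)
        companions = [rng.gauss(1.0, 1.0) for _ in range(n - n_witness)]
    else:
        raise ValueError("unknown sub-case: " + sub_case)
    post.extend(respond(rng, t, a, b) for t in companions)
    return pre, post
```

In every sub-case except 4a, the first `n_witness` rows of `post` equal those of `pre`, so any change in the witnesses' estimates can only come from the companions processed with them.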

Generation of θ
The one-dimensional ability level (θ) was generated as a random normal deviate, with μθ as specified (e.g. 0, 1 or 2) and σθ² = 1.

Generation of b
The parameter embodying item difficulty level was likewise generated as a random normal deviate, with μb as specified (e.g. 0, 1 or 2) and σb² = 1.
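The generation scheme (item parameters, then abilities, then two sets of random responses) may be sketched as follows in Python. This is only an illustration, not the original Delphi code: the function name `simulate_protocols` is ours, and the Gamma shape value used for the a coefficients (`gamma_shape`) is a hypothetical choice, since only the mean μa is specified here.

```python
import math
import random

def simulate_protocols(n_examinees, n_items, mu_theta=0.0, mu_a=0.5, mu_b=0.0,
                       gamma_shape=4.0, seed=None):
    """Generate abilities, item parameters and pre/post-test 0/1 responses (2PL)."""
    rng = random.Random(seed)
    # Abilities and difficulties: normal deviates with unit variance.
    theta = [rng.gauss(mu_theta, 1.0) for _ in range(n_examinees)]
    b = [rng.gauss(mu_b, 1.0) for _ in range(n_items)]
    # Discriminations: Gamma deviates with mean shape * scale = mu_a.
    # The shape value is an assumption made for this sketch.
    a = [rng.gammavariate(gamma_shape, mu_a / gamma_shape) for _ in range(n_items)]

    def draw():
        # One response matrix: rows are examinees, columns are items.
        return [[1 if rng.random() < 1.0 / (1.0 + math.exp(-a[j] * (t - b[j]))) else 0
                 for j in range(n_items)]
                for t in theta]

    pre, post = draw(), draw()   # same parameters, independent random responses
    return theta, a, b, pre, post

theta, a, b, pre, post = simulate_protocols(500, 30, seed=1)
raw_scores = [sum(row) for row in pre]   # classical X scores at pre-test
```

Because `pre` and `post` share the same θ, a and b values, the correlation between their raw scores is a direct Monte Carlo analogue of test-retest reliability.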

Item parameters and ability estimation
In BILOG-MG 3™ or MULTILOG™, the default estimation procedures are marginalized Bayesian item parameter estimation (Bayesian Modal Estimation, BME) via an EM algorithm (Dempster, Laird, & Rubin, 1977) for item parameters and Bayes Expected A Posteriori (EAP) estimation for θ. In this study, the same procedures were used through computer freeware called Libirt (Item Response Theory Library, version 0.8.4)³ (Germain, Valois, & Abdous, 2008). Although it was already validated, we submitted the Libirt procedure to independent checks, the EIRT estimates coinciding satisfactorily⁴ with those from BILOG-MG 3™. The default values and prior distributions needed for the a and b parameters, as well as the reference distribution for θ, are the same as in BILOG-MG 3™. After being generated as explained in the previous section, the response protocols were processed through the two-phase Libirt program in order to obtain item parameter and ability estimates.
For the BME/EM process, a normal prior distribution was used for the item difficulty parameter b, and a lognormal prior distribution was used for the discrimination parameter a (with μa = 1.70 and σ = 2.81). Considering the numbers of items and subjects under some conditions, we allowed the EM algorithm a maximum of 100 iterations, stopping earlier once the desired precision (i.e. 10⁻⁵) was achieved. In the context of the marginalization, the θ-values were assumed to follow a standard normal distribution.
In the EAP procedure, a non-iterative algorithm, each θ̂ is individually estimated as a weighted average across the θ domain (uniformly distributed from -4 to 4); the weighting factor is the joint probability for the k items using equation (1), with cj ≡ 0.
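The EAP computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the Libirt implementation: the function name `eap_estimate`, the number of grid points and the optional standard normal weighting (`normal_prior`, following the usual Bock and Mislevy formulation) are assumptions of ours.

```python
import math

def eap_estimate(responses, a, b, n_points=81, lo=-4.0, hi=4.0, normal_prior=True):
    """EAP ability estimate for one 0/1 response protocol under the 2PL model.

    The estimate is a weighted average of equally spaced theta points on [lo, hi];
    each weight is the joint probability (likelihood) of the k responses, here
    optionally multiplied by a standard normal density (assumption of this sketch).
    """
    num = den = 0.0
    for q in range(n_points):
        x = lo + (hi - lo) * q / (n_points - 1)
        like = 1.0
        for u, aj, bj in zip(responses, a, b):
            p = 1.0 / (1.0 + math.exp(-aj * (x - bj)))
            like *= p if u else (1.0 - p)
        w = like * (math.exp(-0.5 * x * x) if normal_prior else 1.0)
        num += x * w
        den += w
    return num / den

# A protocol with every item answered correctly still yields a finite estimate.
print(eap_estimate([1] * 30, [0.5] * 30, [0.0] * 30))
```

Since the estimate is an average over a bounded grid, it can never leave the interval [lo, hi], and with few or weakly discriminating items it stays well inside it, which bears on the limited spread of θ̂.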
Other considerations on the estimation procedures will be brought up later, in the discussion section.

RESULTS
In this section, we first identify the various quantities produced and recorded for this Monte Carlo investigation; statistical treatment methods are also outlined. Results pertaining to ability estimates are then examined, and finally complementary results about the estimation of item parameters are reviewed.

Statistical data and methods
Carried out under the experimental design, each Monte Carlo run handled several sets of variables:
X: classical raw score of examinee (= sum of items);
P: classical index of difficulty of item (= proportion of examinees giving the correct response);
θ, θ̂: examinee's ability, generated or estimated from IRT procedures;
T̂: examinee's estimated "true score", computed from other IRT estimates;
a, b, â, b̂: item's discrimination and difficulty indices respectively, generated or estimated from IRT procedures.
Parametric data generated by our programs to simulate the random response protocols, i.e. the "true" θ, a and b values used for each Monte Carlo run, were tested for consistency with our sets of corresponding hyperparameters (μ and σ²) and proved unbiased and adequate (with no significant departure from due values).
Statistical analyses presented hereafter use either crossed ANOVA designs, Student's t tests or Pearson's correlation coefficients. Unless stated otherwise, statistical significance is at the 0.01 level or better. Finally, in order to give the reader some appreciation of effect size, Hays' (1981) omega squared (ω²) index of experimental efficacy is occasionally reported; the index is derived in the usual way from ANOVA's expected mean square formula, under a fixed effects model.
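For readers who wish to recover ω² from a reported F ratio, a minimal Python sketch follows. It uses one common F-based form for a fixed-effects design, df(F - 1) / [df(F - 1) + n]; that this exact form was the one applied here is an assumption on our part, as is the total of n = 360 observations (error df of 348 plus 12 cells) used in the example. The function name `omega_squared` is ours.

```python
def omega_squared(f_ratio, df_effect, n_total):
    """F-based omega squared for one effect in a fixed-effects ANOVA.

    Assumed form: df * (F - 1) / (df * (F - 1) + n_total), floored at 0
    because F ratios below 1 would otherwise give a negative value.
    """
    gain = df_effect * (f_ratio - 1.0)
    return max(0.0, gain / (gain + n_total))

# Effect of the number of items on reliability: F = 1579.27 with df = 3, 348.
print(round(omega_squared(1579.27, 3, 360), 3))   # 0.929
```

Under these assumptions the formula reproduces the ω² values quoted with the F tests in the results, which gives some confidence that the same form was used.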

Statistical analysis
In a first section, we present results on the test-retest reliabilities of θ̂ and X, "pertinence measures" for θ, the spread of θ̂ distributions and their accuracy, in order to answer our main questions about the matching of the estimated abilities (θ̂) with their corresponding generated values (θ). We shall also examine the factors that influence the consistency, reliability and other statistical characteristics of θ̂, and the correspondence between θ̂ and the classical raw scores X across conditions. The second section will bear on the estimation of item parameters, namely the test-retest reliability, pertinence and accuracy of the â and b̂ estimates.

Around the ability level estimate (θ̂)
Test-retest reliability of θ̂ and X
Globally, across conditions, the levels of test-retest reliability for θ̂ and raw score X are equivalent (F < 1). Data in Table 1 were taken from Case 1 under standard (μθ = 0, μb = 0) conditions. Both reliabilities increase as a function of the number of items (k) (F = 1579.27 and 1756.36, df = 3, 348, ω² = 0.929 and 0.936, for θ̂ and X respectively), and they follow almost exactly the Spearman-Brown prediction formula (used to predict the reliability of a test whose number of items has been changed [Lord & Novick, 1968; Bertrand & Blais, 2004]), an expected result for the X score but one that comes somewhat as a surprise for the θ̂ estimate. Furthermore, the reliability of θ̂ estimates increases with the number of examinees (F = 12.85, df = 2, 348, ω² = 0.062), while this is not the case for scores X (F < 1). In fact, θ̂ estimates tend to stabilize, especially when N increases from 100 to 500 subjects, which is not the case for X.
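The Spearman-Brown prediction referred to above is easily computed; the short Python sketch below shows the standard formula. The function name `spearman_brown` is ours, and the starting value of 0.618 at k = 30 is the moderate reliability quoted in the design section for μa = 0.5; the projected values are illustrative, not results of the study.

```python
def spearman_brown(rho, length_ratio):
    """Predicted reliability when test length is multiplied by length_ratio."""
    return length_ratio * rho / (1.0 + (length_ratio - 1.0) * rho)

# Project a reliability of 0.618 observed with k = 30 items to other test lengths.
for k in (10, 30, 50, 100):
    print(k, round(spearman_brown(0.618, k / 30.0), 3))
```

The formula depends only on the ratio of test lengths, not on the number of examinees, which is why any dependence of reliability on N departs from the classical expectation.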
This last result is interesting, because one of the fundamental assumptions of CTT is that the reliability of scores X depends on k, the number of parallel items, via the accumulation of true variance, but does not depend on N. Now, in order to estimate examinees' θs, our IRT procedure first obtains estimates for item parameters a and b, estimates which appear to be stabilised by an increase of N, the number of protocols processed (see below). Thus, the observed gain in reliability of θ̂ may be a corollary of the increase in the reliability of item parameter estimates.
Data from cases 2 and 3 of the experimental design confirm the above results.In case 2, raw score reliability was "imposed" at ρ(X, X) ≈ 0.40 across protocols with diverse number of items (k) by varying hyper-parameter μa, and at ρ(X, X) ≈ 0.80 in case 3.Both the observed test-retest

Pertinence of measures for θ
In this Monte Carlo investigation, the examinee's true ability level was defined by his θ parameter, so that the pertinence of various estimating functions of ability can be directly assessed. Correlations of θ with diverse ability estimates (θ̂, X, T̂) will now be scrutinised.
Correlations r(θ, X) and r(θ, θ̂) behave similarly to the corresponding reliability coefficients r(X1, X2) and r(θ̂1, θ̂2), except that they are stronger and vary more slowly as a function of k. In fact, the ratio between the two sets of indices amounts to the ratio between r(T, X) and r(X, X) in test theory, i.e. r(T, X) = √r(X, X). These correlations also increase with k as the square root of the Spearman-Brown formula, and they similarly interact with N, the number of examinees. Averaging over all combinations of case 1, it seems worthwhile reporting that the means of r(θ, X) and r(θ, θ̂) are quasi equal, i.e. 0.785 and 0.786 respectively (F < 1), the more so if we consider that quantities θ and θ̂ are generated or estimated on a standardised continuous scale following the normal probability model, whereas score X is a crude binomial-like count with its well-known ceiling and floor effects. The assumed specificity and advantage of θ̂ and IRT scaling do not stand out here.
In order to delve into the inter-relations of θ, θ̂, X and T̂, we ran a special Monte Carlo experiment with fixed conditions N = 500, k = 30, μθ = 0, μa = 0.5 and μb = 0. First, we obtain equivalent pertinence coefficients for θ̂ (r(θ, θ̂) = 0.799) as for X (r(θ, X) = 0.792). Now, if we correlate each observed score-value (X') with the mean of all generated θ-values associated with it, we obtain r(θ, X') = 0.980, a quasi perfect match. Also, the raw r(θ̂, X) = 0.9746 jumps to r(θ̂, X') = 0.9974 when we regroup equal X scores and average the concomitant θ̂ values. Finally, considering r(θ̂1, θ̂2) = 0.633, the r(θ, θ̂) corrected for attenuation becomes 0.799 / √0.633 ≈ 1.00, similarly to r(X1, X2) = 0.625 and corrected r(θ, X1) = 0.792 / √0.625 ≈ 1.00: this result seems to indicate that all the information (or portion of "true variance") contained in the true θ values is equally present in the θ̂ and the X estimates, hence that these estimates are linearly equivalent.
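The correction for attenuation used above can be checked with a few lines of Python. The sketch applies the classical formula, dividing the observed correlation by the square root of the reliability of the fallible measure; the function name `disattenuate` is ours, and the criterion (the generated θ) is taken as error-free, hence a reliability of 1.

```python
import math

def disattenuate(r_xy, rel_y, rel_x=1.0):
    """Correlation corrected for attenuation, given the reliabilities of x and y."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Pertinence coefficients divided by the square root of the test-retest reliability.
print(round(disattenuate(0.799, 0.633), 2))   # ability estimate
print(round(disattenuate(0.792, 0.625), 2))   # raw score X
```

Both corrected values come out at about 1.00 (marginally above 1 because of rounding and sampling error in the inputs), in line with the linear equivalence of the two estimates.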
The preceding results suggest that the true variance available in the generated θ sample is transferred to the X score distribution as well as to the estimated θ̂. Moreover, considering the sharp increase in correlation from individual values, i.e. r(θ, X) = 0.792, to regrouped values, i.e. r(θ, X') = 0.980, the supernumerary θ̂ values produced by the IRT estimation procedure appear to convey no more information, except some noise. Computations done with estimated true scores (T̂) yield the same results and point to the same conclusions.

Spread of θ̂ distribution
Ability parameters ( ) in case 1, under conditions μθ = μb

As for estimated ability data (θ̂), their observed mean was consistently equal to 0, a result that is ascribable to the EAP estimation procedure and implementation. The spread of θ̂ values was assessed by two statistics, range and standard deviation; the two bearing similar results, we report only the former. Figure 1 depicts the evolution of the range as a function of k (F = 311.66, df = 3, 348, ω² = 0.721) and N (F = 692.53, df = 2, 348, ω² = 0.793), with an interaction effect (F = 35.31, df = 6, 348, ω² = 0.364), the increase of range getting slower as N goes from 1000 to 100.
In order to get a more thorough understanding of the above results and establish them firmly, we ran another series of Monte Carlo experiments, this time using 100 replications under standard conditions of case 1 (μθ = 0, μa = 0.5, μb = 0) and measuring the range of generated (θ) and estimated (θ̂) ability levels: results appear in Figure 2. Data for estimated θ̂ in Figure 2 (dotted lines) match closely those shown in Figure 1, re-enacting the varying influence of k on the spread. On the other hand, generated θ values seem to be unperturbed by k, and, for each level of N, they agree satisfactorily with their expected values under the normal distribution⁵. Since the spread of θ̂ increases as a function of k but only up to a certain maximum depending on N, two questions must be tackled: what mechanism can be invoked to explain the increase with k, and what blocks its effect at each N?

Accuracy of θ̂ estimates
We now look into the accuracy of the ability estimate θ̂ by examining different measures of the distance between the θ̂ value and its target θ. Data in Table 2 stem from case 1 under standard conditions μθ = 0 and μb = 0: one measure is the average of the absolute deviations between θ̂ and θ across the N pseudo respondents, the other is the maximum absolute deviation. All variations are significant beyond the 0.01 alpha level. Ave |θ̂ - θ| decreases more rapidly as a function of k (ω² = 0.937) than of N (ω² = 0.115); note that quantity ave |θ̂ - θ| follows closely √(2/π)·σe, which is the expected absolute deviation for a normal variate whose standard deviation σe is given by the standard error of θ̂. As for max |θ̂ - θ|, it decreases with k (ω² = 0.785) but increases with N (ω² = 0.216), as expected⁶. Considering that the true (i.e. generated) θ distribution has a standard deviation of 1, the reported differences between true and estimated θs appear somewhat large, i.e. near to 0.5 for the Ave measure and to 2.0 for the Max measure.
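The expected absolute deviation of a centered normal variate, to which the average deviation is compared above, equals its standard deviation times √(2/π). The following Python sketch merely evaluates that constant; the function name `expected_abs_deviation` and the symbol `sigma` for the standard error are ours.

```python
import math

def expected_abs_deviation(sigma):
    """E|Z| for a zero-mean normal variate Z with standard deviation sigma."""
    return sigma * math.sqrt(2.0 / math.pi)

# With a unit standard error, the mean absolute deviation is about 0.8.
print(round(expected_abs_deviation(1.0), 4))   # 0.7979
```

The relation is linear in `sigma`, so an observed average absolute deviation can be converted back into an approximate standard error by dividing it by √(2/π), provided the estimation errors are roughly normal.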
Now, what happens if we leave behind the reassuring environment of "standard conditions" and explore new parametric conditions with various μθ and μb? For the sake of simplicity, the following analyses were limited to sub-conditions N = 500 and k = 30; they are entirely representative of all our N and k combinations.
Table 3, below, reports the means of θ̂, X and b̂ in situations for which μb = 0 and μθ = 0, 1 and 2 respectively. Note that, here as everywhere, the mean values of the N estimated θ̂ are consistently 0, a likely corollary of the two-phase and EAP estimation procedure put to work. Raw scores increase with μθ (F = 295.67, df = 2, 87, ω² = 0.868), as expected. The b̂ estimates vary also (F = 696.86, df = 2, 87, ω² = 0.939), but they do so in the opposite direction and, in view of their values, compensate almost exactly for the variation of μθ.
The means in Table 4 represent situations with fixed μθ = 0 and with varying μb (= 0, 1, 2). Here again, next to a mean θ̂ of 0, which is expected though still astonishing⁷, we observe a decrease in X scores (F = 688.26, df = 2, 87, ω² = 0.939) as item difficulty (μb) increases, which is also reflected in the means of b̂ (F = 757.60, df = 2, 87, ω² = 0.944). Here is thus a quasi plausible parametric outcome, stemming from a standard, centered population (with μθ = 0).
Finally, Table 5 renders three situations wherein hyper-parameters μθ = μb are compared. Here, the interaction between the θ and b parameters occurs at generation time, i.e. the relocation of the distributions of θ and b is cancelled by subtraction in the exponent part of equation (1), "(θr - bj)", so that all our observed means are at center. This consequence, even though it is evident, pinpoints the essential and mutual indeterminacy of the θ and b scales, which affects all IRT models, from the 1-parameter model upward.
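The indeterminacy of the ability and difficulty scales can be verified numerically: adding the same constant to an examinee's ability and to an item's difficulty leaves the response probability unchanged. The Python sketch below illustrates this for the 2-parameter logistic model, assuming the plain logistic form; the function name `p2pl` and the numerical values are ours.

```python
import math

def p2pl(theta, a, b):
    """Response probability under the 2-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Shifting ability and difficulty by the same amount changes nothing,
# because only the difference (theta - b) enters the exponent.
base = p2pl(0.3, 0.5, -0.2)
for shift in (1.0, 2.0):
    print(shift, abs(p2pl(0.3 + shift, 0.5, -0.2 + shift) - base) < 1e-12)
```

Response data alone therefore cannot tell a population of higher ability apart from a set of easier items; some convention, such as centering the ability distribution at 0, is needed to fix the scale.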
Cases 4a-4d of our experimental design make use of witness pseudo respondents, i.e. respondents for whom the same pre-test protocol is used at post-test and whose ability level is then estimated among new companion protocols. Conditions for estimation were N = 500 respondents (among which 10 were retained as witnesses), k = 30 items, and item discrimination and difficulty hyper-parameters μa = 0.5 and μb = 0. As usual, 30 Monte Carlo replications were performed; for each replication, 10 estimation samples were taken in order to accumulate 100 witnesses for purposes of statistical analysis.
Test-retest reliabilities of witness data are reported in Table 6. Data from Case 4a were used to validate our sampling scheme, and they mimic the corresponding data from Case 1. As can be seen, the reliability coefficients per se are but slightly affected by their new estimation environment, whether the 98% (= 490 / 500) new protocols emanate from the same original companions (Case 4b), from new companions having the same ability level (Case 4c) or even from new and much more talented companions (Case 4d). For all cases, the two θ̂ values estimated from the same protocol are highly correlated. The near-to-perfect reliability coefficient means that individual θ̂s keep a linear relation one to the other, in the guise of θ̂2 = b1·θ̂1 + b0: as we shall see, the accuracy problem resides in the "b0" component.
To throw some light on the intricacies of our problem, we ran yet another series of estimation runs for Case 4d, again with 30 replications, thereby collecting new types of measurement. The salient results follow.
Firstly, the b parameter obtains a mean b̂ of 0.000 at pre-test and of -0.967 at post-test (t = 25.08 to 36.86⁸, df = 299). This negative shift of the difficulty parameter estimates (notwithstanding the constant μb = 0) reflects the positive shift of the population level (from μθ = 0 to μθ = 1), the two-phase estimation procedure banking on a 0-centered normal population.
Secondly, the θ̂ estimates glide from a mean θ̂ of -0.013 to -0.629 (t = 30.83 to 43.57, df = 99), a negative shift ascribable to the mixing of our witnesses (coming from a μθ = 0 population) with brighter companions (coming from a μθ = 1 population). Equivalently, the witness protocols at post-test are processed amidst higher-graded protocols, the set of which is to be matched with a 0-centered population. Consequently, the small batch of our 10 witnesses is downgraded along the ability axis. A phenomenon analogous to the above occurs for the percentile rank (PR) of witnesses, which we computed. At pre-test, we obtain a mean PR of 49.42, close to 50 as expected, and a mean PR of 28.79 at post-test (t = 20.18 to 27.74, df = 99), a negative shift imposed by the fact that ranks are computed for all 500 respondents, among which 490 now have enhanced response protocols.

Table 5. Means of parameter estimates under conditions N = 500, k = 30, and μθ - μb = 0 (30 replications)

Mean   μθ = 0, μb = 0   μθ = 1, μb = 1   μθ = 2, μb = 2
θ̂      0.000            0.000            0.000
X      14.99            15.00            15.01
b̂      0.012            -0.029           0.006
Thirdly, the true score estimates, computed at each time from the θ̂, â and b̂ estimates, change from a mean of 14.95 at pre-test (near the ½k = 15 target) to a mean of 16.03 at post-test (t = -10.21 to -9.71, df = 99), a small but consistent and highly significant positive shift. Though paradoxical, this result may tentatively be explained by a differential effect of the change in μθ from 0 to 1. On the one hand, because our 10 (out of 500) witnesses originate from all strata of the μθ = 0 population, some may be more gifted and compatible with μθ = 1 subjects, and consequently their mean estimated ability decreases from about 0 (mean θ̂ = -0.013) to only -0.629 instead of -1. On the other hand, because the procedure for item-parameter estimation is confronted with 98% high-grade μθ = 1 protocols, it forces the b̂ estimates at post-test down to a mean of -0.967, quite near to -1. Hence, even if the subjects' estimated ability levels have been lowered at post-test, they were administered items of an even lower estimated difficulty level, resulting in a small rise of their predicted true score.
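The direction of this shift can be checked with a small sketch of the true-score computation. Under the 2PL model, the expected number-correct score is the sum of the item success probabilities. The 30 item parameters below (constant a = 0.5, difficulties evenly spaced around their mean) are hypothetical stand-ins, and the reported group means are plugged in as if they described a single examinee, so the output illustrates the mechanism rather than reproducing the study's figures.

```python
import math


def icc_2pl(theta, a, b):
    """2PL probability of a correct response: 1 / (1 + exp(-a(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def expected_true_score(theta, a_params, b_params):
    """Expected number-correct score: sum of item success probabilities."""
    return sum(icc_2pl(theta, a, b) for a, b in zip(a_params, b_params))


K = 30
A_PARAMS = [0.5] * K                          # hypothetical constant discrimination
SPREAD = [-1.45 + 0.1 * j for j in range(K)]  # hypothetical offsets, symmetric about 0

# Pre-test: ability and mean difficulty both at 0.
score_pre = expected_true_score(0.0, A_PARAMS, SPREAD)

# Post-test: ability estimate lowered to -0.629, difficulties lowered by 0.967
# (the mean values reported in the text, applied here to a single examinee).
b_post = [b - 0.967 for b in SPREAD]
score_post = expected_true_score(-0.629, A_PARAMS, b_post)

print(f"pre-test true score:  {score_pre:.2f}")
print(f"post-test true score: {score_post:.2f}")
```

Because the difficulties fall further than the ability estimate, the difference θ - b becomes positive for the average item, and the predicted score rises above k/2.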
Lastly, again for the accuracy of the θ̂ estimates, we calculated the mean absolute difference (MAD) between true θ and estimated θ̂ at the two estimation times. At pre-test, we observed MAD1 = 0.492 (range 0.414 to 0.605), and at post-test, MAD2 = 1.090 (range 0.870 to 1.348), a significant increase (t = -27.12, df = 32), betraying an important loss of accuracy for our witness θ̂ estimates.
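A minimal sketch of the MAD index follows, using made-up values for five witnesses (not data from the study). It also computes the correlation between true and estimated values, to show how a uniform shift of the estimates can inflate the MAD while leaving the linear relation untouched. The function names are illustrative only.

```python
import math


def mean_absolute_difference(true_values, estimates):
    """Average of |true - estimate| over paired values."""
    if len(true_values) != len(estimates):
        raise ValueError("inputs must have the same length")
    return sum(abs(t - e) for t, e in zip(true_values, estimates)) / len(true_values)


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up values for five witnesses (hypothetical, for illustration).
true_theta = [-1.0, -0.5, 0.0, 0.5, 1.0]
theta_hat_pre = [-0.8, -0.6, 0.1, 0.3, 0.9]
theta_hat_post = [t - 0.6 for t in theta_hat_pre]  # same estimates, shifted down

mad_pre = mean_absolute_difference(true_theta, theta_hat_pre)
mad_post = mean_absolute_difference(true_theta, theta_hat_post)
r_pre = pearson(true_theta, theta_hat_pre)
r_post = pearson(true_theta, theta_hat_post)

print(f"MAD pre = {mad_pre:.3f}, MAD post = {mad_post:.3f}")
print(f"r pre = {r_pre:.3f}, r post = {r_post:.3f}")
```

In this toy case the correlation is identical at both times, whereas the MAD grows with the size of the additive shift.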

Reliability, pertinence and accuracy of estimates
The classical item difficulty estimate P, which (oddly) designates the proportion of correct responses for the item, is the analogue of IRT's b̂ parameter estimate, and their statistical behaviours can be securely compared. Such comparisons are effected in Table 7: data originate from Case 1, with μθ = 0, μb = 0 and μa = 0.5.
Item parameters a, b and P being estimated across respondents, it is to be expected that both reliability and pertinence coefficients benefit from an increase of their number N. Reliability of P is globally somewhat higher than that of b̂ (F = 738.69, df = 1, 348, ω² = 0.672), the difference diminishing as N increases (F = 79.07, df = 2, 348). For pertinence coefficients (right of Table 7), the initial advantage of P at N = 100 (F = 631.81, df = 1, 348) vanishes at N = 500 (F < 1) and is reversed at N = 1000 (F = 17.55).
As for the accuracy of b̂, we must recall first that, in the realm of our IRT procedures and programs, hyperparameters μb and μθ play a compensatory game by virtue of which the resultant μ(θ̂), the mean of the estimated θ̂ distribution, is set equal to 0. This means, for instance, that a μθ = δ distribution of true abilities will engender a μ(θ̂) = 0 distribution of estimated abilities and a concomitant μ(b̂) = -δ distribution of difficulty indices. Given this caveat, the b̂ means are unbiased (≈ -δ), with no variance effects of N or of k. As shown in Table 8, the relative accuracy of b̂ increases with k (F = 14.38, df = 3, 348, ω² = 0.100) and more so with N (F = 520.54, df = 2, 348, ω² = 0.743), these effects diminishing somewhat (F = 2.48, df = 6, 348, p < 0.05, ω² = 0.024) toward high values of N and k.
Among many others, the preferred index of "item discrimination" in CTT is probably the item-test correlation coefficient, r(yj, X), where yj is the subject's 0/1 response to item j, and X is the subject's number of correct responses across the k items (where X = Σ yj). We correlated this index with the âj parameter estimate: Table 10 presents the correlations obtained.
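The two classical indices can be sketched as follows for a 0/1 response matrix. This is an illustrative implementation, not the authors' program: the function names and the toy matrix are hypothetical, and the item-test correlation is left uncorrected (item j is included in X), in line with the definition X = Σ yj given above.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns NaN when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def ctt_item_indices(responses):
    """Return a (P, r(yj, X)) pair for each item of a 0/1 response matrix.

    responses: list of examinee rows, each a list of k item scores (0 or 1).
    """
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]      # raw scores X
    indices = []
    for j in range(k):
        column = [row[j] for row in responses]    # responses yj to item j
        p = sum(column) / n                       # proportion correct
        indices.append((p, pearson(column, totals)))
    return indices


# Hypothetical toy matrix: 4 examinees, 3 items of increasing difficulty.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
indices = ctt_item_indices(responses)
for j, (p, r) in enumerate(indices, start=1):
    print(f"item {j}: P = {p:.2f}, r(yj, X) = {r:.3f}")
```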

About the ability parameter estimate
Features of Monte Carlo studies such as this one include the cheap abundance of generated data and the fact that data models are explicitly defined and implemented. In the case of the present study, true ability and item parameters were put on stage, together with their statistical estimates, which derived from response protocols obtained through an explicit 2-parameter logistic IRT model. Thus, the IRT model's properties were assured by definition, and conclusions can be safely drawn from them.
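As a rough illustration of how such response protocols can be generated, the sketch below draws abilities and difficulties from unit-variance normal distributions and samples 0/1 responses from the 2PL model. The function name, the seed, and the use of a constant discrimination value are assumptions made for brevity; the study's actual generating distribution for a is not restated in this section.

```python
import math
import random


def simulate_2pl(n_examinees, n_items, mu_theta=0.0, mu_b=0.0, a_value=0.5, seed=0):
    """Generate true parameters and a 0/1 response matrix under the 2PL model.

    Assumptions of this sketch: theta ~ N(mu_theta, 1), b ~ N(mu_b, 1),
    and a constant discrimination a_value for every item.
    """
    rng = random.Random(seed)
    theta = [rng.gauss(mu_theta, 1.0) for _ in range(n_examinees)]
    b = [rng.gauss(mu_b, 1.0) for _ in range(n_items)]
    a = [a_value] * n_items
    responses = []
    for t in theta:
        row = []
        for a_j, b_j in zip(a, b):
            p = 1.0 / (1.0 + math.exp(-a_j * (t - b_j)))  # 2PL success probability
            row.append(1 if rng.random() < p else 0)
        responses.append(row)
    return theta, a, b, responses


theta, a, b, responses = simulate_2pl(n_examinees=500, n_items=30)
raw_scores = [sum(row) for row in responses]
mean_raw_score = sum(raw_scores) / len(raw_scores)
print(f"mean raw score X over {len(responses)} examinees: {mean_raw_score:.2f}")
```

Since the true θ, a and b values are returned alongside the responses, any estimate computed from the matrix can be compared directly with its parametric counterpart.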
Regarding the invariance of the ability estimate θ̂ in IRT, our data plainly demonstrate two things. Firstly, as Table 3 shows, the θ̂ estimate is generally biased, the bias coinciding accidentally with zero when the associated population location parameter (μθ) is zero. This bias pertains to the indeterminacy of the θ and b scales, in fact to the indeterminacy of the difference (θ - b) in the definition of the IRT model (see formula 1 on page 3). The currently applied two-phase estimation procedure, also used in this study, settles the indeterminacy by anchoring the θ̂ estimates in a finite distribution with mean 0, i.e. μ(θ̂) = 0, while allowing a shift of the b̂ distribution to accommodate examinees' response patterns. Another factor infringing invariance is k, the number of items in the estimation set, the spread (range and standard deviation) of estimated θ̂ correlating with k (see Fig. 1 and 2). Secondly, observed values of correlations between true θi and estimated θ̂i, our so-called pertinence coefficients, give paradoxical support to the "congruence" aspect of invariance. It is true to say that θ̂ estimates are linearly related to their parametric counterparts θi, the paradox being that (1) the correlation level between the two obeys standard theorems of CTT, theorems whose demonstration is based on the interplay of true and error variance, and (2) expectedly enough, the same correlation levels are observed between θi and raw score Xi. The threshold of "r ≥ 0.90" for declaring invariance appears quite irrelevant in this context. Moreover, data from our witness protocols show patently that the level (or bias) of the θ̂ estimate is adjusted to the context of the companion respondents with which it is processed, notwithstanding the fact that the adjusted θ̂ is based upon an unchanged response protocol.
Summarising the above discussion and the detailed evidence in our data, we assert that the concept of "invariance", given as a distinctive asset of item response theory, is over-defined and overrated. The θ̂ estimates are not invariant across a shift in the population location parameter (μθ), and their spread and positioning are influenced by the number of items. If θ̂ estimates appear to be invariant across sets of items, it is because the θ̂ distribution is formed anew using μ(θ̂) = 0, notwithstanding the actual difficulty levels (μb) of items. Thus, the alleged "invariance of the IRT ability estimate" should be changed to "linear equivalence, tainted with bias indeterminacy". Moreover, the observed linear, or "congruence", properties of θ̂ are entirely shared by the classical score X, apart from the fact that only X score levels vary coherently with true θ levels. Hence, in our opinion, "invariance" does not hold for the ability estimate θ̂ and, above all, its remaining "congruence" (or linear) properties do not constitute a distinctive asset, as they also characterize the classical X score measure. Finally, the apparent superiority of θ̂ estimates in terms of discriminating capacity⁹ is not corroborated by a superior reliability level or by a better efficiency in discriminating respondents, as was shown earlier.
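The CTT theorem alluded to in point (1) of the discussion can be stated compactly. The notation below (observed score X, true score T, error E, parallel form X') is the standard textbook one and is not taken from this paper; reading θ in place of T is only an approximation, since the true score is a monotone but nonlinear function of θ.

```latex
X = T + E, \qquad \operatorname{Cov}(T, E) = 0
\;\Longrightarrow\;
\sigma_X^2 = \sigma_T^2 + \sigma_E^2 ,
\qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} ,
\qquad
\rho_{XT} = \frac{\operatorname{Cov}(X, T)}{\sigma_X \, \sigma_T}
          = \frac{\sigma_T}{\sigma_X}
          = \sqrt{\rho_{XX'}} .
```

On this reading, a pertinence coefficient close to the square root of the reliability is what CTT would predict for any score that is approximately linear in the true score, be it θ̂ or X.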

About the item parameters' estimates
Incidentally, the present Monte Carlo study yielded some information on the properties of item parameter estimates, the â and b̂ of IRT as well as r(yj, X) and P in CTT. Due to the (θ, b) indeterminacy mentioned above, the b̂ estimate is intrinsically biased, its location parameter μ(b̂) compensating for the ability location, which is fixed at 0 in phase 1 of the two-phase estimation procedure. Apart from that, b̂ and P are linearly congruent with the true b parameter (see Table 7), the classical P index being somewhat more reliable. No crude bias effect was observed for the â estimate, apart from a slight positive bias depending upon hyperparameter μa. Our data contain no information specific to the "invariance" of the â estimate, as the experimental design of the study included only shifts in location (via hyperparameters μθ and μb) but not in range (all θ and b distributions were controlled with σ² = 1). Estimated â's were not perturbed by shifts of location in ability or difficulty parameters, and they correlated nicely with true a values (see Table 9).

CONCLUSION
Item response theory's estimates of ability (θ̂) are not invariant across a change of the estimation context, be it a shift in the ability level of co-examinees or in the global difficulty level of items. Wells et al. (2002) have already documented such variant effects under small changes in subgroups of items, but the real snag comes from the intrinsic indeterminacy of the (θ, b) pair in all IRT models, e.g. $P_j(\theta_r) = [1 + \exp(-a_j(\theta_r - b_j))]^{-1}$, the operating characteristic of $P_j(\theta_r)$ being the difference $(\theta_r - b_j)$, where each shift of θ can be compensated by an equal shift of b. IRT estimation procedures, like the two-phase procedure employed here, cannot overcome this indeterminacy, which results in the fact that θ̂ distributions are generally biased and arbitrarily centered on μ(θ̂) = 0.
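The indeterminacy can be verified numerically with a short sketch. The toy response matrix and parameter values below are hypothetical; the point is only that subtracting the same constant from every θ and every b leaves the 2PL likelihood of the data unchanged.

```python
import math


def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(responses, thetas, a_params, b_params):
    """Log-likelihood of a 0/1 response matrix under the 2PL model."""
    total = 0.0
    for row, theta in zip(responses, thetas):
        for y, a, b in zip(row, a_params, b_params):
            p = p_2pl(theta, a, b)
            total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


# Hypothetical toy data: 3 examinees, 4 items.
responses = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0]]
thetas = [0.3, -0.8, 1.1]
a_params = [0.5, 0.7, 0.4, 0.6]
b_params = [-0.2, 0.9, -1.0, 0.4]

delta = 1.0  # arbitrary common shift applied to both theta and b
ll_original = log_likelihood(responses, thetas, a_params, b_params)
ll_shifted = log_likelihood(
    responses,
    [t - delta for t in thetas],
    a_params,
    [b - delta for b in b_params],
)
print(f"original: {ll_original:.6f}  shifted: {ll_shifted:.6f}")
```

Since the data cannot distinguish the two parameter sets, the location of the scale has to be fixed by convention, which is what centering on 0 does.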
On the other hand, the θ̂ estimate displays nice linear, or "congruence" (Hambleton et al., 1991), properties: its reliability levels are comparable to those of classical raw score X, and its pertinence, i.e. the correlation of the estimate (θ̂i) with the true individual parameter (θi), is also quite good; in fact, it is again comparable to the correlation between θi and raw score Xi. These linear properties are influenced by the size of their estimation basis (the number of items) and are statistically consistent; indeed, they seem to reflect the proportional amount of true variance, a standard result in CTT and one that does not appear to be grounded in the algebraic framework of item response theory. As for the relative advantages of item parameter estimates of IRT versus CTT, IRT's b̂ estimate of item difficulty correlated highly with the classical P difficulty index, index P being somewhat more reliable and better correlated with the true parametric b value. The two discrimination indices also compared well, IRT's â coefficient being a little more reliable and better linked to the a parameter than the classical item-test correlation.
It may be pertinent here to restate that the generic IRT model is, up to now, the only conceptual apparatus that can claim to be a true "model" of what goes on between the respondent and the set of items that confronts him. It is a first-order (no interactions nor sequential processes are assumed), stimulus-response model, but, even then, it goes a long way beyond the crude axiomatic basis of CTT. When we turn to concrete applications, though, this good model turns out to be flawed by an intrinsic indeterminacy and by grave estimation problems. Consequently, on the practical side, the hoped-for theoretical merits of IRT estimates become tarnished and should be judged on a psychometric basis on a par with the classical estimates X, P and r(y, X), which behave as well if not better (Fan, 1998; Frenette et al., 2007).
Even if the procedural and parametric settings of this simulation study matched those of current IRT applications, their limitations are present and must be overcome. How do IRT estimates fare under estimation procedures different from the ones employed here and, above all, what are the essential changes entailed by a one-phase, joint estimation procedure (for both ability and item parameters)? How do test-retest reliability levels of ability estimates (θ̂, X) compare when item parameters are estimated only once, at test time? And, finally, are the concepts of "invariance" and "dimensionality" really cardinal in the epistemological definition of a response model (Blais, 1987; Frenette et al., 2007; Wells et al., 2002), and could they not be replaced by the more universal descriptors used in statistical estimation theory? The creative potential of the generic IRT model has not been exhausted by this or any other study, and much has yet to be harvested with its aid.

Figure 1. Range of the estimated θ̂ distribution as a function of k and N (Case 1)
† PR = percentile rank
* r(X1, X2) = 1 by definition for witness protocols.
The mechanism of this negative relocation of our witnesses is as follows. The better part of post-test protocols comes from a high-level (μθ = 1) true population: in order to maintain a centered (μ(θ̂) = 0) estimated population, the estimation procedure is forced to assume easier items, thus producing lower b̂ indices. Now, lower b̂ indices, and easier items, applied to unchanged witness protocols entail a concomitant reduction in the estimated ability levels.

Table 6. Test-retest reliability coefficients for 4 indices in Case 4

Table 8. Accuracy of b̂ as a function of k and N