The Application of Canonical Correlation to Two-dimensional Contingency Tables

This paper reintroduces and demonstrates the use of Mickey's (1970) canonical correlation method in analyzing large two-dimensional contingency tables. This method of analysis supplements the traditional analysis using the Pearson chi-square. Examples and a MATLAB source listing are provided.

The authors wish to thank the editor and an anonymous reviewer for clarifying the different variations that exist in canonical correlation analysis. They point out two major aspects of canonical correlation. The first is the nature of the input data and the second is the algorithm used to extract the canonical coefficients and correlations. The editor also wrote an SPSS program to create a dummy data set from a contingency table suitable for analysis using the Mickey method [available on the journal's web site].

Almost every elementary statistics textbook has some coverage of the chi-square test. In particular, the chi-square test is presented in the analysis of categorical data. Most of these textbooks will take the reader up to the contingency table that involves the cross-tabulation of two categorical variables. With contingency tables, there are two modes of analysis (Kennedy, 1983): (1) symmetric and (2) asymmetric. In the symmetric case, no distinction is made between the two variables as to which is the dependent variable and which is the independent variable. The primary interest is in whether the two variables are related. In the asymmetric case, one of the categorical variables is identified as the independent variable and the other categorical variable is the dependent variable. Here the interest is in whether a difference exists between the categories of the independent variable. In both cases, the test statistic is the Pearson chi-square statistic and it is computed using the same formula:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i},$$

where the $O_i$ are the observed frequencies for category i and the $E_i$ are the expected or theoretical frequencies for category i.
Additional information can be obtained about these two variables by computing indices of association such as the phi or Cramer's V coefficient. If the categorical variables have only two categories, the odds ratio can be computed to provide more information (Kerlinger & Lee, 2000). Other than these, only a few other statistics, such as kappa or the contingency coefficient, provide information about the two variables. In the case where a categorical variable has more than 2 categories, some have recommended additional tests using the chi-square statistic between pairs of categories. This is tantamount to the multiple comparison tests made in ANOVA with three or more levels of the independent variable. However, unlike ANOVA, research done on these post hoc tests in terms of the experimentwise error rate has been mixed (Garcia-Perez & Nunez-Anton, 2003; Macdonald & Gardner, 2000; Thompson, 1988). Hence such tests should be used and interpreted with caution.
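The Pearson chi-square and Cramer's V described above can be sketched in a few lines of Python. This is an illustration only, not the authors' MATLAB listing; the function names and the 2 × 2 counts are hypothetical, and expected frequencies are taken from the row and column totals in the usual way.

```python
import math

def pearson_chi_square(table):
    """Return (chi_square, degrees_of_freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

def cramers_v(table):
    """Cramer's V: sqrt(chi2 / (N * (min(rows, cols) - 1)))."""
    chi2, _ = pearson_chi_square(table)
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

# Hypothetical counts, chosen only to make the arithmetic easy to follow
example = [[10, 20],
           [20, 10]]
chi2, df = pearson_chi_square(example)
v = cramers_v(example)
print(f"chi-square = {chi2:.4f}, df = {df}, Cramer's V = {v:.4f}")
```

For a 2 × 2 table Cramer's V coincides with the absolute value of phi, which is why a single index is enough in this small case.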
Nearly 40 years ago, in a rather obscure technical report, Mickey (1970) put forth the notion that canonical correlation could be used to analyze large 2-way contingency tables and provide descriptive information beyond that commonly discussed in statistics textbooks.
The traditional approach to 2-dimensional contingency tables did not yield information about categorical variables in the same way that canonical correlation could (Mickey, 1970). Thirty years after Mickey's report, Dunlap, Brody and Greer (2000) published an innovative article demonstrating how one could analyze large 2-dimensional contingency tables through canonical correlation. The method proposed by Dunlap et al. (2000) was considerably more complicated than the one proposed and demonstrated by Mickey (1970). Dunlap et al. (2000) outlined an elaborate method to obtain the proper correlation tables suitable for analysis by canonical correlation. Their approach was to take a contingency table and transform it into a correlation matrix that is then submitted to a canned computer program such as SPSS or SAS for canonical analysis. One of Dunlap et al.'s (2000) goals was to show the interpretative advantages that canonical correlation analysis has over the more traditional approaches in describing relationships between categorical variables and sets of categorical variables.
However, canonical correlation has not enjoyed the widespread popularity of other multivariate statistical methods. With the IBM PC version of SPSS that appeared in 1984, canonical correlation was no longer listed in the index or table of contents of the user's manual (see Norusis, 1984). A PsycInfo search of peer-reviewed journal articles from 1998 to 2009 found only 286 reported studies using canonical correlation analysis. In contrast, for the same period of time and using the same search parameters, multiple regression returned 5,425 hits, factor analysis 11,709, structural equation modeling 17,534 and MANOVA 947. Cluster analysis had 2,367 hits, discriminant analysis had 961 and logistic regression had 9,628. The second lowest multivariate method was multidimensional scaling (MDS), which had 722 studies. Canonical correlation is covered in many multivariate statistics textbooks (e.g., Lattin, Carroll & Green, 2003; Tabachnick & Fidell, 2005; Kashigan, 1991), but its use in research studies has lagged. In fact, SPSS no longer makes it easily available as a subprogram in its latest packages. SPSS has relegated canonical correlation to a macro that the user executes through a series of syntax statements instead of a point-and-click menu. Garson (2008) reports that canonical analysis can be obtained through SPSS's MANOVA subprogram. However, it is available only through syntax and not from the SPSS menus.
Canonical correlation is considered to be the most general correlational method. It attempts to find the highest correlation between two sets of variables. In each set there are two or more variables. This is unlike multiple correlation, where the correlation is found between one variable (the dependent variable) and a linear combination of two or more variables (the independent or predictor variables). In canonical correlation there exist sets of linear combinations that are maximally correlated. The objective of canonical correlation can involve any one or all of the following:
a) Determining whether two sets of variables measured on each object (person) are linearly correlated.
b) Determining which variables in each of the two sets contribute the most to the relationship between the two sets of variables.
c) Predicting the combined linear score for an object (person) on one set of variables using the variables in the other set.
Canonical correlation is useful for descriptive research purposes because it does not require the data to be normally distributed. It is assumed that the data are drawn from a population with a common covariance (dispersion) matrix whose elements are finite and that the sets of variables are linearly related. This paper will examine the Mickey method of analyzing contingency table data using canonical correlation. It is much simpler than the method put forth by Dunlap et al. (2000). The Dunlap et al. (2000) method involves the creation of a correlation matrix and a factor analysis to determine the missing row and column correlations before the matrix is submitted to canonical correlation computations. The Mickey method only requires the creation of a dummy variable data set using information from the cross-tabulation of the two categorical variables and the computation of the total variance-covariance matrix (or total covariance matrix) unadjusted for the means of the variables. Essentially, the total covariance matrix is the sums of squares and cross-products matrix divided by the sample size.
Using BMD09M, BMDP6M or BMDX75 for the Mickey method is straightforward because these programs let the user choose which dispersion matrix the canonical correlation analysis uses. The Mickey method uses the option "covariance matrix about the origin." Unfortunately, public domain versions of the BMD programs are no longer available or are hard to find. However, BMDP6M is still available commercially through a company called Statistical Solutions (http://www.statsol.ie/index.php). The BMD canonical analysis program provides the user with different options in terms of the dispersion matrix to be used, e.g., correlation matrix or covariance matrix. SPSS, however, will only analyze correlation matrices. For researchers who are familiar with MATLAB, the algorithm for the Mickey method is not difficult and can be programmed in MATLAB. The appendix for this manuscript contains the MATLAB commands and syntax for canonical correlation and the data set used for each example. After considerable effort, the authors were able to locate a public domain version of BMDX75. The executable version of BMDX75 is also provided with this article. This program will execute in Windows XP, but it is not a Windows-based program and does not conform to the Windows graphical user interface. The command and data files for each example are also included, along with setup instructions similar to those found in the old BMD manuals. The authors have also written a very easy to use BASIC program for converting a 2-dimensional contingency table into a data set suitable for analysis by the Mickey method. This program will execute on most Microsoft BASIC language products such as the GWBASIC Interpreter-Compiler or QBASIC. As of this writing, a GWBASIC Interpreter-Compiler is available at the website http://www.thefreecountry.com/compilers/basic.shtml. A QBASIC Compiler is available at http://www.qbcafe.net/qbc/english/download/compiler/qbasic_compiler.shtml
The Mickey method is demonstrated on three contingency tables. The first is from the original Mickey study (1970) concerning kidney transplant outcome for 254 patients based on tissue matching. The second is taken from Dunlap, Brody and Greer (2000), who report the cross-classification of 1660 people according to mental health symptoms and parents' socioeconomic status. The third is from Lindeman, Merenda and Gold (1980), who report the cross-classification of 1889 arrestees across 6 cities in the United States by the level of heroin use and type of crime.

Creating the Dummy Variable Data Set for the Mickey method
To use the Mickey method, the data presented in a two-way contingency table must be transformed into a dummy variable data set. With a p × q contingency table, the dummy variable data set will contain p + q variables. Each data point (or person) has a "1" for one of the p variables ($X_i$) and another "1" for one of the q variables ($Y_j$), as dictated by the cross-tabulation in the contingency table. All other variables ($X_{i'}$ with $i' \neq i$, and $Y_{j'}$ with $j' \neq j$) have a "0" (zero).
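Symbolically, this would look like:

[Symbolic layout, example contingency table (political affiliation by opinion, with opinion categories including "Approve" and "No Opinion") and the resulting dummy variable data set not reproduced.]

One can see that the first nine lines in the dummy variable set correspond to the 9 Republicans who approved of some political issue. The next four are Republicans who did not approve of the issue, and so on.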
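The expansion of a contingency table into a dummy variable data set can be sketched as follows. This is a minimal illustration in Python rather than the authors' BASIC program, and the function name `dummy_rows` is hypothetical. Only the first two counts (9 Republicans who approve, 4 who do not) come from the text; the remaining counts, the second party row and the column order are assumptions made for the sake of a complete table.

```python
def dummy_rows(table):
    """Expand a p x q table of counts into N rows of p + q indicator values."""
    p, q = len(table), len(table[0])
    rows = []
    for i in range(p):
        for j in range(q):
            indicator = [0] * (p + q)
            indicator[i] = 1       # X_i: the row category of this cell
            indicator[p + j] = 1   # Y_j: the column category of this cell
            # One identical row per person counted in cell (i, j)
            rows.extend(list(indicator) for _ in range(table[i][j]))
    return rows

# Rows: Republican, other party (assumed).
# Columns: Approve, Do Not Approve, No Opinion (order assumed).
# Only the 9 and the 4 are taken from the text; the rest are made up.
table = [[9, 4, 2],
         [5, 8, 3]]
data = dummy_rows(table)

print(len(data))   # total number of people, N
print(data[0])     # a Republican who approves
print(data[9])     # a Republican who does not approve
```

Every row contains exactly two ones, one among the first p columns and one among the last q, which is the property the Mickey method relies on.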

Computing the Total Covariance Matrix
The variance-covariance matrix (sometimes called the covariance matrix) is usually computed with a correction of the sums of squares and cross-products for the means and a division by N − 1. The Mickey method, however, requires a covariance matrix that is unadjusted for the means and has a divisor of N. This covariance matrix is called the total covariance matrix. The computational formula for the total variance-covariance matrix using the Mickey method is

$$\mathrm{Cov} = \frac{1}{N} Z'Z,$$

where Z is the N × (p + q) matrix of dummy coded variables created for the Mickey method. There are alternative covariance matrices that can be used for the analysis. This paper stays with the original procedures used by Mickey (1970).
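The difference between the total covariance matrix and the usual mean-corrected one can be seen in a small sketch. It assumes NumPy is available; the 2 × 2 counts are hypothetical and the variable names are illustrative, not taken from the authors' listing.

```python
import numpy as np

# Hypothetical 2 x 2 table of counts (N = 10)
table = [[3, 1],
         [2, 4]]
p, q = len(table), len(table[0])

# Build the N x (p + q) dummy variable matrix Z, one row per person
rows = []
for i in range(p):
    for j in range(q):
        z = [0.0] * (p + q)
        z[i] = 1.0
        z[p + j] = 1.0
        rows.extend([z] * int(table[i][j]))
Z = np.array(rows)
N = Z.shape[0]

# Total covariance matrix: no mean correction, divisor N
total_cov = Z.T @ Z / N

# Usual covariance matrix for comparison: mean-corrected, divisor N - 1
usual_cov = np.cov(Z, rowvar=False)

print(total_cov)
```

With dummy coding the result has a simple structure: the diagonal blocks hold the row and column proportions on their diagonals, and the off-diagonal block is the table of cell proportions.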

Partitioning the Covariance Matrix
The covariance matrix computed for the p + q variables is partitioned into submatrices, where the first set, called X, is for the p variables and the second set, called Y, is for the q variables. There are two other submatrices that represent the cross between the X variables and the Y variables. The partitioned matrix is shown in Figure 1.
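In array terms the partition is plain slicing. The sketch below assumes NumPy and builds a total covariance matrix directly from hypothetical cell proportions (the same block structure that the dummy variable data set produces); the names `cov_xx`, `cov_xy`, `cov_yx` and `cov_yy` are illustrative stand-ins for Cov(XX), Cov(XY), Cov(YX) and Cov(YY).

```python
import numpy as np

p, q = 2, 3

# Hypothetical cell proportions for a 2 x 3 table with N = 31
prop = np.array([[9, 4, 2],
                 [5, 8, 3]]) / 31.0

# Total covariance matrix of the p + q dummy variables
total_cov = np.block([
    [np.diag(prop.sum(axis=1)), prop],
    [prop.T, np.diag(prop.sum(axis=0))],
])

# Partition into the four submatrices
cov_xx = total_cov[:p, :p]   # p x p, first set with itself
cov_xy = total_cov[:p, p:]   # p x q, cross between X and Y
cov_yx = total_cov[p:, :p]   # q x p, transpose of cov_xy
cov_yy = total_cov[p:, p:]   # q x q, second set with itself

print(cov_xx.shape, cov_xy.shape, cov_yx.shape, cov_yy.shape)
```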

Using the Partitioned Matrix and Submatrices
Once the partitioned matrix has been created, the usual analysis (Tabachnick & Fidell, 2005) calls for creating a square matrix V (of size p × p) using the following formula:

$$V = \mathrm{Cov}(XX)^{-1}\,\mathrm{Cov}(XY)\,\mathrm{Cov}(YY)^{-1}\,\mathrm{Cov}(YX)$$

Next, the characteristic roots and vectors, or eigenvalues ($\lambda_i$) and eigenvectors, of V are computed.
The eigenvectors are the canonical function coefficients. The canonical correlations are found by taking the square root of the eigenvalues.
Next, the same computations are done for the second set. Compute

$$U = \mathrm{Cov}(YY)^{-1}\,\mathrm{Cov}(YX)\,\mathrm{Cov}(XX)^{-1}\,\mathrm{Cov}(XY)$$

Next, find the eigenvalues and eigenvectors of U. The eigenvectors for this set provide information on how the variables in the second set are related.
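A minimal sketch of the eigenvalue route for the first set, assuming NumPy and a hypothetical 2 × 2 table. It takes V to be the standard product of the partitioned submatrices, with inverses of Cov(XX) and Cov(YY); this is the textbook form and is assumed here rather than copied from the authors' listing.

```python
import numpy as np

# Hypothetical 2 x 2 table of counts
table = np.array([[10.0, 20.0],
                  [20.0, 10.0]])
n = table.sum()

# Submatrices of the total covariance matrix for dummy coded data
cov_xx = np.diag(table.sum(axis=1) / n)
cov_yy = np.diag(table.sum(axis=0) / n)
cov_xy = table / n
cov_yx = cov_xy.T

# V is p x p; its eigenvalues are the squared canonical correlations
v = np.linalg.inv(cov_xx) @ cov_xy @ np.linalg.inv(cov_yy) @ cov_yx
eigvals, eigvecs = np.linalg.eig(v)

# Sort from largest to smallest and take square roots
order = np.argsort(eigvals.real)[::-1]
canonical_corrs = np.sqrt(np.clip(eigvals.real[order], 0.0, None))
x_coefficients = eigvecs.real[:, order]

print(canonical_corrs)
```

The leading value equals 1 for any table analyzed this way; the remaining values carry the association between the two categorical variables.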
This procedure, however, is less robust than other methods. As pointed out by a reviewer, it will not work if the Cov(YY) matrix is not positive definite. He suggested using the method that utilizes the Cholesky decomposition procedure. This procedure involves using the Cholesky algorithm to decompose two matrices, Cov(XX) and Cov(YY). If the decomposed (upper-triangular) matrices for Cov(XX) and Cov(YY) are designated as $r_1$ and $r_2$, respectively, so that $r_1' r_1 = \mathrm{Cov}(XX)$ and $r_2' r_2 = \mathrm{Cov}(YY)$, then compute the following matrix:

$$w = (r_1')^{-1}\,\mathrm{Cov}(XY)\,r_2^{-1}$$

By putting the w matrix through singular value decomposition, the first and second sets of canonical coefficients and the canonical correlations are obtained. This is the method used in this article. If XS is used to represent the first set of canonical coefficients and YS is used to represent the second set of coefficients, then the unstandardized canonical coefficients for the first set are obtained by $A = r_1^{-1}\,XS$. Likewise for the second set, the unstandardized coefficients are found by computing $B = r_2^{-1}\,YS$. Standardized coefficients are found for each canonical variable by computing the square root of the sum of squares of its coefficients and dividing each unstandardized coefficient by this square root value. If $a_1$ represents the vector of unstandardized coefficients for variable 1, the standardized coefficients for variable 1 can be computed by $a_1 / \sqrt{a_1' a_1}$.
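The Cholesky and singular value decomposition route can be sketched as below. This is an illustration assuming NumPy, not the authors' MATLAB program; the function name `mickey_cca` and the example counts are hypothetical. Note that `np.linalg.cholesky` returns a lower-triangular factor, so it is transposed here to obtain upper-triangular factors playing the role of r1 and r2.

```python
import numpy as np

def mickey_cca(table):
    """Canonical correlations and coefficients for a two-way table of counts."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Submatrices of the total covariance matrix for dummy coded data
    cov_xx = np.diag(table.sum(axis=1) / n)
    cov_yy = np.diag(table.sum(axis=0) / n)
    cov_xy = table / n
    # Upper-triangular factors: r1.T @ r1 = cov_xx, r2.T @ r2 = cov_yy
    r1 = np.linalg.cholesky(cov_xx).T
    r2 = np.linalg.cholesky(cov_yy).T
    w = np.linalg.inv(r1.T) @ cov_xy @ np.linalg.inv(r2)
    # Singular values of w are the canonical correlations
    xs, corrs, ys_t = np.linalg.svd(w)
    # Unstandardized coefficients, one column per canonical variable
    a = np.linalg.inv(r1) @ xs
    b = np.linalg.inv(r2) @ ys_t.T
    # Standardized coefficients: each column scaled to unit length
    a_std = a / np.sqrt((a ** 2).sum(axis=0))
    b_std = b / np.sqrt((b ** 2).sum(axis=0))
    return corrs, a, b, a_std, b_std

# Hypothetical 2 x 2 table of counts
corrs, a, b, a_std, b_std = mickey_cca([[10, 20], [20, 10]])
print(corrs)
```

A row or column whose total is zero would make the corresponding matrix singular and the Cholesky step would fail, so such categories would need to be dropped first; empty interior cells are not a problem. In this sketch the squared canonical correlations after the leading one sum to the Pearson chi-square divided by N.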

Significance Tests
Significance tests are used to determine if the remaining canonical correlations are statistically different from zero. A transformed Wilks' Lambda, Λ, is usually used for this purpose. There are many transformed statistics (Lattin, Carroll & Green, 2003). One is by Bartlett and it is computed using the steps given below.
1. Compute Wilks' Lambda: $\Lambda_k = \prod_{i=k+1}^{s} (1 - \lambda_i)$
2. Compute the Bartlett chi-square approximation to Wilks' Lambda: $\chi^2 = -\left[N - 1 - \tfrac{1}{2}(p + q + 1)\right] \ln \Lambda_k$ with (p − k) × (q − k) degrees of freedom, where N = total frequencies, p is the number of X's, q is the number of Y's, $\lambda_i$ is the i-th eigenvalue, s is the number of canonical correlations, and k is the number of canonical variates already removed.
This method is the one used by the authors of this paper when writing the computer program in MATLAB. Each eigenvalue or canonical correlation is tested by the same test statistic but with an important modification. It is a sequential process where the contribution from the previous canonical variate is removed before the $\chi^2$ statistic is calculated. Also, with the previous variate removed, the degrees of freedom are also reduced by a factor of 2.
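The sequential test can be sketched as follows. The multiplier used here is the common textbook form of Bartlett's approximation and is an assumption, since the authors' exact expression is not shown; the function name, the input values and the way p and q are counted once the artifactual unit correlation is discarded are likewise illustrative.

```python
import math

def bartlett_sequential(squared_corrs, n, p, q):
    """Sequential Bartlett tests.

    squared_corrs: squared canonical correlations (eigenvalues), largest first.
    Returns a list of (k, wilks_lambda, chi_square, df), where k is the
    number of canonical variates removed before the test.
    """
    results = []
    for k in range(len(squared_corrs)):
        # Wilks' Lambda built from the correlations not yet removed
        wilks = math.prod(1.0 - lam for lam in squared_corrs[k:])
        chi_square = -(n - 1 - (p + q + 1) / 2.0) * math.log(wilks)
        df = (p - k) * (q - k)
        results.append((k, wilks, chi_square, df))
    return results

# Hypothetical eigenvalues and sample size, for illustration only
tests = bartlett_sequential([0.5, 0.1], n=100, p=3, q=3)
for k, wilks, chi_square, df in tests:
    print(f"k = {k}: Lambda = {wilks:.4f}, chi-square = {chi_square:.3f}, df = {df}")
```

Each chi-square value would then be referred to a chi-square distribution with the listed degrees of freedom.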
The BMD program (BMDX75) uses a different computational algorithm. It computes the chi-square statistic using the algorithm specified in Veldman (1967). The chi-square values differ from those produced by the MATLAB program, and the degrees of freedom used to evaluate the chi-square statistic are also different.
The difference can be seen in the two outputs.

Example from Mickey (1970). (N = 254)
Mickey's (1970) example dealt with data collected from a kidney transplant center. The data were from 254 parent-to-child transplantations. The two categorical variables were (1) the compatibility match between the kidney and the patient and (2) the outcome of the transplant. Both variables contain ordered categories. Compatibility had 4 categories, where the best match was assigned to category "A." The outcome of the transplant fell into 5 ordered categories, where those patients with the best outcome were assigned to category "A." Canonical correlation results showed the number of statistically significant canonical correlations and the canonical coefficients related to each categorical dimension.
In using the Mickey method of canonical correlation analysis, the first canonical correlation will be equal to 1.0 and its associated eigenvector coefficients will be 1.0. Mickey (1970) states that this eigenvalue and eigenvector are an artifact of his method and that both should be discarded and ignored. With the exception of the analysis performed on the Mickey data, the output presented in all MATLAB examples will omit the eigenvalue of 1.00 and the eigenvector coefficients of 1.00 in order to save space. Likewise, the unstandardized coefficients produced by MATLAB will be presented for the first example only. The researcher should consider the other correlation values.
Given above are two outputs. One is from MATLAB and the other is from BMDX75/BMD09M. In Mickey's example the first canonical correlation, once the artifactual value of 1.0 is set aside, is 0.2872. It does not appear very large, but it is the only correlation that is statistically significant (see Table 2a). The MATLAB program computes and outputs both unstandardized and standardized canonical coefficients. Generally, the standardized coefficients are used in interpreting the results of the analysis (Green & Tull, 1970). The first set of standardized canonical coefficients in […]
The results of the canonical analysis indicate a relationship between transplantation outcome and compatibility of tissue matching. The primary association is match versus mismatch. The results of the ordering lend statistical support to the view that an A match is in general superior to B and that C is superior to D.
MATLAB gives both unstandardized and standardized coefficients, while the older BMD programs give only unstandardized coefficients (see Tables 3a, b, and c). MATLAB and BMD generate the same unstandardized values. The unstandardized coefficients reveal the same relation found with the standardized coefficients. Another glaring difference between the MATLAB output and the BMD output is the number of sets of canonical coefficients displayed for the Y-side. MATLAB shows every set of coefficients on the Y-side, while BMD only shows the same number of coefficient sets as the X-side.
Note that zero or empty frequencies in the contingency table do not prevent the analysis from continuing.

Example from Dunlap, Brody & Greer (2000). (N = 1660)
Table 4 presents the contingency table found in Dunlap, Brody and Greer (2000). The analysis involves two categorical variables: (1) mental health status and (2) parents' socioeconomic status (SES). Mental health status has four categories: Well, Mild, Moderate and Impaired. Parents' SES has six categories: A, B, C, D, E and F, where parents in the "A" category are of high SES and those in the "F" category are of low SES. This example is of special interest since it presents a direct comparison between the Mickey method and the Dunlap method. This table is one of three that Dunlap et al. (2000) used in the application of their method of canonical analysis of a contingency table. The Mickey method and the Dunlap method produced very similar results. The Mickey method (see Table 5a) found the following canonical correlations: .1613, .0371, and .0173. The Dunlap method (as reported in Dunlap et al., 2000) found the following correlations: .1607, .0371 and .0168, respectively. The second canonical correlation is identical and the other two are quite close. Both methods found only one statistically significant correlation.
The Dunlap method produces factor loadings instead of canonical coefficients. When comparing the loadings and coefficients from the two methods, the values are not the same. However, since we are using canonical correlation analysis in a descriptive sense, we need only check whether the pattern of relationships within the factor loadings and within the canonical coefficients appears to be the same. In this case, the pattern shown in the first canonical function follows the same pattern given in Dunlap's factor loadings. In Table 5b, the factor loadings found by the Dunlap method are presented in parentheses next to the X-side and Y-side canonical coefficients produced by the Mickey method. Here, the same pattern emerges. For the Mental Health categories, Well and Mild appear with the same sign and the same ranking. Likewise, Moderate and Impaired emerge with the opposite sign and the same ranking. Similarly, for Parents' SES, A, B and C all appear with the same sign and ranking. D, E, and F all appear with the opposite sign from A, B, and C and with the same rankings.
The canonical analysis of this data set shows that parents with higher SES tend to have fewer children with severe mental problems than those with low SES. The relationship between parents' SES and mental health status was not a strong one, since the statistically significant canonical correlation was only .1613.

Example from Lindeman, Merenda & Gold (1980). (N = 1889)
Lindeman, Merenda and Gold (1980) present a study involving two categorical variables: (1) heroin use and (2) criminal offense. Table 7 is a reproduction of their table. Lindeman, Merenda and Gold (1980) report a statistically significant chi-square ($\chi^2$ = 121.90, df = 12, p < .001) between the dimensions of amount of heroin use and type of crime. This chi-square test indicates that there is a relationship between heroin use and type of crime. It does not yield any more information than that. Lindeman et al. (1980) do proceed to show the contribution of each cross-classified category by using the observed frequency and the expected frequency for each cell (e.g., for "Current user" by "Serious Crime Against Persons," $\chi^2$ = 25.50). Table 7 shows the greatest difference in the category of crimes against persons. The arrested non-drug users committed 35.5% of their crimes in these categories, while only 9.5% of the heroin users committed these crimes.
The canonical analysis adds more information to supplement the traditional chi-square test. The canonical correlation analysis produced one statistically significant canonical correlation (see Table 8a). In examining the first set of canonical coefficients (see Table 8b), which corresponds to the largest canonical correlation, we find Current User, Past User and Other Drug User to have the same sign (0.6764, 0.4499, and 0.0311, respectively). Non Drug Users received a value with the opposite sign (-0.5823). The values indicate a ranking of the users, with Current Users receiving the highest coefficient. The magnitude of the coefficients indicates that Current and Past heroin users are closer together than the other two. Other drug users are separate from heroin users and separate from non-drug users. In examining crime type, the second set of canonical coefficients corresponding to the largest canonical correlation shows a grouping of Serious (0.6434) and Less Serious (0.6583) Crimes against Persons. The other crimes formed the other grouping, where Property Crimes (-0.1692) and All Others (-0.1790) have closer coefficients than Robbery (-0.3032). These coefficients indicate that current and past heroin users tend to commit more robbery and property crimes while other drug users and non-drug users commit more serious crimes against people. Thus the canonical correlation analysis reveals a much more subtle relationship between any history of drug use and crime type that the chi-square analysis did not reveal.

Discussion
This paper reintroduces the Mickey method (Mickey, 1970) of using canonical correlation analysis for large two-dimensional contingency tables. Unlike simple 2 × 2 or 2 × 3 contingency tables, larger ones pose a difficult problem in interpretation. Canonical analysis gives the researcher a way to interpret the relationship between the column categories and the row categories in addition to a test of significance. This article provides the researcher with an alternative or additional analysis method for large 2-dimensional contingency tables.
For some reason unknown to the authors, canonical correlation is not used more. It is disappointing that one of the most popular statistical packages, SPSS, no longer includes it among its easily accessible, point-and-click procedures. Other packages, with the exception of BMDP, do not provide the necessary option that allows the computation of a total variance-covariance matrix unadjusted for the means. Hopefully, this article will contribute modestly to a revival of canonical correlation analysis in research papers. Canonical correlation is straightforward to use and provides the researcher with additional information beyond the simple Pearson chi-square test found in elementary statistics books. The Dunlap method (Dunlap, Brody & Greer, 2000) is an alternative approach to the Mickey method. It provides essentially the same information, but it is a bit more difficult for novice researchers. Example 2 in the paper contrasts the results found by Dunlap et al. (2000) with those of the Mickey method. The Dunlap method requires the additional understanding of factor analysis. It also requires some level of sophistication in transforming raw data to phi (correlation) coefficients and the additional step of estimating missing correlation values using factor analysis. Dunlap et al. (2000) have also mentioned the similarities between canonical correlation analysis of contingency table data and the method of correspondence analysis.
The Mickey method requires a specific data setup. This paper, however, includes a simple BASIC program for taking a contingency table and converting it to a data set suitable for the Mickey method. This paper also includes the program statements used to perform the Mickey method using MATLAB. For those who do not have MATLAB, a compiled FORTRAN program following the setup of the old BMDX75 computer program is included with this paper. These steps, however, can be transferred easily by those who have BMDP6M. The BASIC program and the executable FORTRAN program will run on Windows XP; however, they do not have the Windows graphical user interface.
A Google search reveals the existence of MATLAB clones. These MATLAB clones are free but not 100% compatible with MATLAB. However, with some modifications as specified within each of the clone programs, MATLAB source code can be made to work on the clone software. For those interested in trying MATLAB clones to perform the statistical analysis presented in this paper, a description and availability of these MATLAB clones are available at:

Figure
Figure 1. Partitioned Covariance Matrix used in Canonical Correlation Analysis.
Cov(XX) Cov(XY)
Cov(YX) Cov(YY)
Example: Let's say we are given the following contingency table with the two categorical variables political affiliation and opinion: [table not reproduced]

Table 2a.
MATLAB Canonical Correlations and Significance Tests of Mickey's Data

Table 2c
[…] -0.8186). Even though outcomes B and C have opposite signs, they are closer to one another in absolute magnitude than they are to the other outcomes. This indicates that B and C outcomes are very similar.

Table 4.
Cross-classifications of 1660 Individuals on Mental Health Status and Parents' SES.

Table 5a.
MATLAB Canonical Correlations and Significance Tests of Dunlap, Brody & Greer Data.

Table 7.
Cross-classifications of 1990 Arrestees by Level of Heroin Use and Type of Crime

Table 8a.
MATLAB Canonical Correlations and Significance Tests of the Lindeman, Merenda & Gold Data.

Table 8b.
Canonical Coefficients of the Lindeman, Merenda & Gold Data.

Table 9b.
Canonical Correlations and Significance Tests from BMD09M/BMDX75 of Lindeman, Merenda & Gold Data. (Output header: THE COVARIANCE MATRIX ABOUT THE ORIGIN IS USED IN THE FOLLOWING CALCULATION)