Gayle, V. (1996) 'Modelling Tabular Data with an Ordered Outcome', Sociological Research Online, vol. 1, no. 3, <http://www.socresonline.org.uk/1/3/4.html>

Copyright Sociological Research Online, 1996

 

Modelling Tabular Data with an Ordered Outcome

by Vernon Gayle
Department of Applied Social Science, University of Stirling

Received: 19/01/96      Accepted: 23/9/96      Published: 2/10/96

Abstract

A large amount of the data considered in sociological studies consists of categorical variables that lend themselves to tabular analysis. In the sociological analysis of data regarding social class and educational attainment, for example, the variables of interest can often plausibly be considered as having a substantively interesting order. Standard log-linear models do not take ordinality into account and may thereby disregard potentially useful information.

Analyzing tables where the response variable has ordered categories through model building has been problematic in software packages such as GLIM (Aitkin et al., 1989). Recent developments in statistical modelling have offered new possibilities, and this paper explores one option, namely the continuation ratio model, which was initially reported by Fienberg and Mason (1979). Fitting this model to data in tabular form is possible in GLIM, although it is not trivial, and by and large the approach has not been employed in sociological research. In this paper I outline the continuation ratio model and comment upon how it can be fitted to data by sociologists using the GLIM software. In addition I present a short description of the relative merits of such an approach.

Presenting this paper in an electronic format facilitates the possibility of replicating the analysis. The data is appended to the paper in the appropriate format along with a copy of the GLIM transcript. A dumped GLIM4 file is also attached.


Keywords:
Ordered Categorical Data; Continuation Ratio Model

Introduction

1.1
A large amount of data within sociological enquiries consists of categorical variables that lend themselves to tabular analysis. In many areas of sociological research, for example the analysis of data regarding social class, educational attainment or attitudinal measures, the dependent or outcome variables can often plausibly be considered as having a substantively interesting order. In these circumstances we wish to understand the effect of one or more explanatory variables upon an ordered categorical dependent variable. The problem is easily stated. How do we apprehend relationships in data where the response variable is categorical and ordered in some substantively meaningful fashion?

1.2
Log-linear models are now routinely employed in sociological research to analyze tabular data (Whiteley, 1983). This is a reasonable solution to the problem of modelling tables with ordered categorical responses, but standard log-linear models do not take ordinality into account, thereby disregarding potentially useful information (Berridge, 1992). Analyzing tables where the response variable has ordered categories through model building has been problematic in software packages such as GLIM (Aitkin et al., 1989). Recent developments in statistical modelling have provided new possibilities using the GLIM software. Berridge (1992) provides a macro for fitting the continuation ratio model and Wolfe (1996) provides a macro for fitting the proportional odds model.

1.3
The motivation for the proportional odds model is provided by an appeal to the existence of an underlying continuous and perhaps unobservable random variable. The continuation ratio model is a useful alternative to the proportional odds model. It is particularly suited to cases where the categories of the response variable really are discrete: the response variable is a series of ordered categories, and movement from one category to another denotes a shift or change from one state to another. In such cases the response categories cannot be thought of as coarse groupings of some finer scale (McCullagh, 1980). The continuation ratio model is also suited to cases where the movement of the behaviour under examination plausibly has only one direction. Expressed technically, the model is not palindromically invariant: reversing the order of the categories of the response variable gives a different model.

1.4
A clear sociological example of where the model is appropriate is the data treated in Berridge (1992). The data was collected by the Social Change and Economic Life Initiative (SCELI) (ESRC, 1991). Respondents were asked to compare their current job with what they were doing five years ago. Their responses were classified according to whether they considered that the responsibility involved in their job had significantly increased, remained the same, or decreased during the last five years. This response is clearly ordered and categorical. The order marks a shift from one state to another.

1.5
In the case of educational research one can imagine a plethora of response variables measuring educational attainment that are ordered and categorical. Often these variables will embody an obvious direction of measurement. For example, attainment might be measured by the highest educational qualification that a respondent has achieved. The order of the categories of this variable denotes a shift from having one level of educational qualification to another, higher level of qualification. The categories are discrete and the behaviour has an obvious direction.

1.6
In the remainder of this paper I discuss fitting the continuation ratio model to data in tabular form with an ordered outcome variable. The paper will suggest that the continuation ratio model has some substantively useful applications within sociological research. Finally I allude to possible future developments in the application of the model to the analysis of data within sociology.

1.7
As I hope to demonstrate, it is possible to fit the continuation ratio model to tabular data directly using the GLIM software. This is desirable as fitting the model in SPSS is not practical. The data could be restructured so that a series of logistic regression models could be fitted in SPSS, or indeed in any other software that fits binary regression models. However, it seems to me that this would not be entirely sensible and that employing the GLIM software is a much better solution. I would also argue that fitting the model in GLIM is positively desirable as it falls within the general framework of generalized linear modelling, which provides an attractive and beneficial statistical modelling approach for social scientists (Gilchrist, 1985).

1.8
In addition, the approach that I am suggesting in GLIM is easily generalizable to fitting the model to data in record form using Berridge's macro (Berridge, 1992). This is particularly desirable as it makes it possible to fit models with covariates as explanatory variables. This is appealing and potentially provides a useful tool when building models with an increased number of explanatory variables. Arguably this is an appropriate methodological approach to the analysis of larger and more complex data sets. In the last resort, however, the merits and demerits of using SPSS can only be demonstrated by comparing the analyses, and I would encourage readers of this article to use SPSS in such an analysis and to submit a 'Comment' for the next issue of Sociological Research Online. My own view is that providing a run in SPSS would not provide any additional value, but it would be interesting and useful to read contrary views.

Data

2.1
The data that I have chosen by way of an example is extracted from the Youth Cohort Study of England and Wales and presented in tabular form (albeit for different analysis) in Against The Odds: The Education and Labour Market Experiences of Black Young People by Drew et al. (1992). The Youth Cohort Study is a major longitudinal study of young people's experiences as they complete their period of compulsory education and enter the world of work, training or further education. The data from two cohorts of the study are presented by Drew et al. (1992), which amount to over 28,000 young people, over 900 of whom were Asian and 500 Afro Caribbean. Table 1 (below) reports the fifth year examination results by ethnic origin and gender.


Table 1: Fifth Year Examination Results by Ethnic Origin and Gender (Column Percentages)

No. of Higher Passes*   White             Asian             Afro Caribbean
                        Males   Females   Males   Females   Males   Females
Zero                    50      44        50      49        62      58
One - Three             25      28        25      29        27      30
Four +                  25      28        25      22        11      12

* 'O' level and C.S.E. results.


2.2
In this example the outcome (educational attainment) is measured through the number of higher passes that the young person achieves. Information on the exact number of higher passes is not available and the young person's achievement is classified according to whether they gained no higher passes, one to three higher passes, or four or more higher passes. The response variable is ordered and categorical, and substantively we can expect movement in only one direction. The continuation ratio model is therefore appropriate to this analysis.

Fitting the model in GLIM

3.1
The continuation ratio model was posited by Fienberg and Mason (1979) and discussed by Fienberg (1980) and by McCullagh and Nelder (1983). The continuation ratio model is appropriate when the response variable is constructed as a series of ordered categories. Here I use the term ordered categorical to mean that each response can be categorized as C1, ..., Ck. In general a response of Ci is better in some substantive sense than a response of Cj (if i<j). In order to model the data, information on individuals I1, ..., In should be available, comprising for each individual a vector Zi of explanatory variables and a single ordered categorical response variable. This information is tabulated into a multiway table.

3.2
The objective of the modelling process is to facilitate a better comprehension of how the ordered outcome is related to the explanatory variables. This might alternatively be expressed as an attempt to explore whether or not individuals with combination Z1 of explanatory variables do better or worse, in some sense, than counterparts with combination Z2.

3.3
The continuation ratio model can be considered as a series of comparative logistic regression models (a technical definition of the model is provided in Berridge, 1992). In the usual fashion we are modelling the log odds of r successes out of n trials but in this case for a series of partitioned tables. In the case of the example under consideration the first partition is displayed in Table 2.


Table 2: First Partition of Fifth Year Examination Results by Ethnic Origin and Gender

                      No. of Higher Grade Passes
                      Zero (%)    Some Passes (%)
White
    Males             50          50
    Females           44          56
Asian
    Males             50          50
    Females           49          51
Afro Caribbean
    Males             62          38
    Females           58          42


3.4
The second partition is simply a logistic regression for the survivors of the first logistic regression model, that is, those young people who have some higher passes (see Table 3).


Table 3: Second Partition of Fifth Year Examination Results by Ethnic Origin and Gender

                      No. of Higher Grade Passes
                      One - Three (%)    Four + (%)
White
    Males             50                 50
    Females           50                 50
Asian
    Males             50                 50
    Females           58                 42
Afro Caribbean
    Males             71                 29
    Females           71                 29


3.5
In a substantive sense first we are modelling the log odds of having no higher passes, rather than some higher passes. Second, we are modelling the log odds of having between one and three higher passes rather than four or more higher passes given that a young person has some qualifications. In short, we are looking at the relationship between having and not having qualifications and, given that a respondent has some qualifications, the difference between the two educational levels.
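
As a hedged sketch only, the two partitions can be written as conditional logits. The symbols Y (the response), x (the explanatory factors), alpha_j and beta_j are assumptions introduced here for illustration and are not the paper's own notation at this point; Berridge (1992) gives the technical definition. A common effect across partitions corresponds to beta_j taking the same value for every j.

```latex
\[
H_j = \Pr\bigl(Y = C_j \mid Y \in \{C_j, \ldots, C_k\}\bigr), \qquad
\log\frac{H_j}{1 - H_j} = \alpha_j + x^{\top}\beta_j, \qquad j = 1, \ldots, k-1 .
\]
```

With k = 3, H_1 is the probability of zero higher passes and H_2 is the probability of one to three higher passes given some higher passes.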

3.6
The column percentages in Table 1 are shown as raw frequencies in Table 4 (below).


Table 4: Fifth Year Examination Results by Ethnic Origin and Gender (Frequencies)

No. of Higher Passes   White             Asian             Afro Caribbean
                       Males   Females   Males   Females   Males   Females
Zero                   6566    5697      262     188       141     148
One - Three            3283    3626      132     111        61      77
Four +                 3283    3626      132      85        25      31
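
The link between the frequencies in Table 4 and the column percentages in Table 1 can be checked with a few lines of Python. This is a minimal sketch and not part of the GLIM analysis; the names FREQUENCIES, column_percentages and PERCENTAGES are hypothetical.

```python
# Frequencies from Table 4, keyed by (ethnic origin, gender).
# Each list is [zero, one to three, four or more] higher passes.
FREQUENCIES = {
    ("White", "Male"): [6566, 3283, 3283],
    ("White", "Female"): [5697, 3626, 3626],
    ("Asian", "Male"): [262, 132, 132],
    ("Asian", "Female"): [188, 111, 85],
    ("Afro Caribbean", "Male"): [141, 61, 25],
    ("Afro Caribbean", "Female"): [148, 77, 31],
}


def column_percentages(counts):
    """Return each count as a whole-number percentage of the column total."""
    total = sum(counts)
    return [round(100 * c / total) for c in counts]


PERCENTAGES = {group: column_percentages(c) for group, c in FREQUENCIES.items()}

if __name__ == "__main__":
    for group, pct in PERCENTAGES.items():
        print(group, pct)
```

Each printed list holds the percentages for zero, one to three, and four or more higher passes, in that order.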

3.7
Once the data is rearranged, the continuation ratio model can be fitted straightforwardly in GLIM to the data displayed in Table 4. A new data matrix must be constructed. As noted above, the model can be considered as a series of comparative logistic regression models in which we model the log odds of r successes out of n trials for a series of partitioned tables. In the example under consideration, the first partition (see Table 2) compares zero higher passes with some higher passes. The next partition compares those who have between one and three higher passes with those who have four or more (see Table 3).

3.8
The new data matrix must include values for r successes out of n trials for the different factor combinations (i.e. the gender and the ethnicity factors). In addition, a separate factor that denotes the cut-points for the tables needs to be created. For a response with k categories this factor takes the values 1, ..., k-1. In this case the response variable has three categories and thus there are two cut-points.

3.9
The data is arranged into a data file in the format shown below (Table 5) where:
r is the number of successes
n is the number of trials
cutpt identifies the partition (cut-point), coded 1 or 2
the ethnicity variables (Asian and Afro Caribbean) are coded 1=no and 2=yes
gender is coded as Male=1 and Female=2.


Table 5 - Data

r n cutpt Asian Afro Caribbean gender
 
6566 13132 1 1 1 1
3283  6566 2 1 1 1
5697 12949 1 1 1 2
3626  7252 2 1 1 2
 262   526 1 2 1 1
 132   264 2 2 1 1
 188   384 1 2 1 2
 111   196 2 2 1 2
 141   227 1 1 2 1
  61    86 2 1 2 1
 148   256 1 1 2 2
  77   108 2 1 2 2


3.10
We simply treat the data as a table of data for logistic regression (see Table 6). The data is handled by the GLIM software in this format.


Table 6 - GLIM4 Input File

 [o] GLIM 4, update 8 for Sun SPARCstation / Solaris on 10 Jul 1996 at 17:58:16
 [o]                  (copyright) 1992 Royal Statistical Society, London
 [o] 
 [i] ? $units 12$
 [i] ? $data r n cutpt asian afro gender$
 [i] ? $dinput 12$
 [i] File name? data
 [i] ? $factor cutpt 2 asian 2 afro 2 gender 2$
 [i] ? $yvar r$
 [i] ? $err b n$
 [i] ? $fit cutpt-1$

The term 'afro caribbean' is reduced to 'afro' in the GLIM syntax.


3.11
Twelve units of data are read into GLIM (see Table 5). The cut-point, the ethnicity variables and the gender variable are declared as factors and the numerator r is declared as the response variable. A logistic regression model is posited (link g), with a binomial error declared (err b) and n as the denominator. The null continuation ratio model is fitted by fitting the cut-point alone (fit cutpt-1); the -1 removes the constant term, so that a separate parameter is estimated for each cut-point.
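
For readers without access to GLIM, the null continuation ratio model can be checked by hand, because with only the cut-point factor in the model the fitted probability for each partition is simply the pooled proportion of successes. The following Python sketch is an illustration and not the GLIM algorithm; the names DATA and fit_null_model are hypothetical, and n for each second partition is taken as the sum of the last two rows of Table 4.

```python
import math

# (r, n, cutpt) triples in the layout of Table 5, with n for each second
# partition computed as the sum of the last two rows of Table 4.
DATA = [
    (6566, 13132, 1), (3283, 6566, 2),
    (5697, 12949, 1), (3626, 7252, 2),
    (262, 526, 1), (132, 264, 2),
    (188, 384, 1), (111, 196, 2),
    (141, 227, 1), (61, 86, 2),
    (148, 256, 1), (77, 108, 2),
]


def fit_null_model(data):
    """Fit a separate constant logit for each cut-point.

    With only the cut-point factor in the model, the maximum likelihood
    estimate of each success probability is the pooled proportion r / n.
    Returns the fitted probabilities, the deviance and its degrees of freedom.
    """
    cutpoints = sorted({c for _, _, c in data})
    fitted = {}
    for c in cutpoints:
        total_r = sum(r for r, _, cc in data if cc == c)
        total_n = sum(n for _, n, cc in data if cc == c)
        fitted[c] = total_r / total_n

    deviance = 0.0
    for r, n, c in data:
        p = fitted[c]
        # Binomial deviance contribution; every cell here has 0 < r < n.
        deviance += 2 * (r * math.log(r / (n * p))
                         + (n - r) * math.log((n - r) / (n * (1 - p))))
    df = len(data) - len(cutpoints)
    return fitted, deviance, df


FITTED, DEVIANCE, DF = fit_null_model(DATA)

if __name__ == "__main__":
    print(FITTED, round(DEVIANCE, 1), DF)
```

Under these assumptions the deviance should come out close to the G2 reported for Model A in 4.3, on 10 degrees of freedom.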

Results

4.1
The results of fitting the continuation ratio model in GLIM to this table are intended to be purely illustrative and no sociological analysis will be undertaken.


Table 7: Results of the Model Fitting Process

Model  Variables                                                              Change in deviance  Change in d.f.
A      Cutpt                                                                  -                   -
B      Cutpt + asian                                                          2.850               1
C      Cutpt + afro caribbean                                                 61.550              1*
D      Cutpt + afro caribbean + gender                                        59.760              1*
E      Cutpt + afro caribbean + gender + afro caribbean.gender                0.038               1
F      Cutpt + afro caribbean + gender + cutpt.afro caribbean                 4.436               1*
G      Cutpt + afro caribbean + gender + cutpt.afro caribbean + cutpt.gender  33.960              1*

* Significant at 5% level.


4.2
The results of the modelling process are reported in Table 7. The null continuation ratio model is fitted, followed by the explanatory factors and then the interaction effects.

By restoring the dumped GLIM4 file 'newycs.dum' that is provided, the modelling process can be replicated. The file is supplied as newycs.zip or newycs.dum.sit.



4.3
Model A can be considered as the null continuation ratio model, and with a G2 of 166.9 with 10 degrees of freedom the model fits the data badly (p<0.001). The first two models (B and C) test for ethnicity effects. Model B includes the Asian factor which is not significant and Model C includes the Afro Caribbean factor which is significant. Model C is a significant improvement on the null continuation ratio model (Model A). Model D includes the gender effect which is significant although overall this model fits the data badly with a G2 of 45.6 with 8 degrees of freedom (p<0.001). So far the modelling process has detected an ethnicity effect (being Afro Caribbean) and a gender effect. Model E tests for an interaction effect between the Afro Caribbean and the gender effect, but this is not significant.

4.4
The underlying assumption is that the two factors that have been detected as being significant will operate across the two partitions. Model F checks whether or not the ethnicity effect (being Afro Caribbean) is standard across the two partitions. Since the interaction (cutpt.afro caribbean) is significant we can conclude that the ethnicity effect is not standard across the two partitions. Model G checks whether or not the gender effect is standard across the two partitions. Since the interaction effect (cutpt.gender) is significant we can conclude that this effect is also not standard across the two partitions. With a G2 of 7.204 with 6 degrees of freedom Model G fits the data quite well (p>0.2) and is the model of best fit.
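
The significance judgements above rest on comparing each change in deviance with a chi-squared distribution on one degree of freedom. A minimal Python sketch of that comparison, using the values in Table 7, is given below; the names DEVIANCE_CHANGES, p_value_one_df and SIGNIFICANT are hypothetical and the code is not part of the GLIM analysis.

```python
import math

# Changes in deviance from Table 7; each comparison uses one degree of freedom.
DEVIANCE_CHANGES = {
    "B": 2.850,
    "C": 61.550,
    "D": 59.760,
    "E": 0.038,
    "F": 4.436,
    "G": 33.960,
}


def p_value_one_df(change):
    """Upper-tail probability of a chi-squared variable on one degree of freedom."""
    # A chi-squared variable on 1 d.f. is a squared standard normal variable,
    # so P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(change / 2))


SIGNIFICANT = [m for m, d in DEVIANCE_CHANGES.items() if p_value_one_df(d) < 0.05]

if __name__ == "__main__":
    for model, change in DEVIANCE_CHANGES.items():
        print(model, change, round(p_value_one_df(change), 4))
```

The models collected in SIGNIFICANT are those marked with an asterisk in Table 7.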


Table 8: Parameter estimates for Model G

estimate     s.e.     parameter
-0.0007558   0.01704  CUTPT(1)
-6.942e-5    0.02413  CUTPT(2)
 0.5249      0.09377  AFRO CARIBBEAN(2)
-0.2343      0.02423  GENDER(2)
 0.3739      0.1849   CUTPT(2).AFRO CARIBBEAN(2)
 0.2401      0.04120  CUTPT(2).GENDER(2)


4.5
The parameter estimates that are reported in Table 8 do not lend themselves to easy interpretation. The estimates are on the log odds scale and are best understood once converted to odds. The fitted odds are derived in the usual fashion for logistic regression: for a given group they are the exponential (the inverse log) of the sum of the parameter estimates that apply to that group.
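
The conversion from parameter estimates to fitted odds can be sketched in Python. This is an illustration that assumes the usual corner-point parameterization, in which the first level of each factor is the reference category; the names ESTIMATES, fitted_odds and ODDS are hypothetical.

```python
import math

# Parameter estimates for Model G from Table 8 (log odds scale).
ESTIMATES = {
    "cutpt1": -0.0007558,
    "cutpt2": -6.942e-5,
    "afro": 0.5249,
    "gender": -0.2343,
    "cutpt2.afro": 0.3739,
    "cutpt2.gender": 0.2401,
}


def fitted_odds(cutpt, afro, female, est=ESTIMATES):
    """Exponentiate the sum of the estimates that apply to one group."""
    eta = est["cutpt1"] if cutpt == 1 else est["cutpt2"]
    if afro:
        eta += est["afro"]
        if cutpt == 2:
            eta += est["cutpt2.afro"]
    if female:
        eta += est["gender"]
        if cutpt == 2:
            eta += est["cutpt2.gender"]
    return math.exp(eta)


# Keys are (cut-point, Afro Caribbean?, female?).
ODDS = {
    (cutpt, afro, female): fitted_odds(cutpt, afro, female)
    for cutpt in (1, 2)
    for afro in (False, True)
    for female in (False, True)
}

if __name__ == "__main__":
    for key, value in ODDS.items():
        print(key, round(value, 2))
```

The entries for cut-point 1 correspond to Table 9 and those for cut-point 2 to Table 10, to within rounding.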


Table 9: The Fitted Odds of Having None Rather Than Some Higher Passes (Model G)

Whites and Asians
    Males 1
    Females 0.79
Afro Caribbean
    Males 1.69
    Females 1.34


4.6
The fitted odds that are reported in Table 9 are the odds of having no higher passes rather than some higher passes for various groups of young people. This is analogous to a logit of r successes out of n trials. In this case r (the number of successes) is C1 (zero higher passes) and n (the number of trials) is the sum of all three categories C1 + C2 + C3.

4.7
Table 9 reports that the fitted odds of a white or Asian male having no higher passes rather than some higher passes are one. There is therefore an even chance of a white or Asian male having some as opposed to no higher passes. The fitted odds of a white or Asian female having no higher passes rather than some higher passes are 0.79. The odds ratio of white and Asian males having no higher passes rather than some higher passes compared with white and Asian females is 1/0.79 = 1.27, and we can conclude that the educational performance of these females is better than that of their male colleagues.

4.8
Afro Caribbean males have larger odds of having no higher passes rather than some higher passes than their white and Asian counterparts. This is also the case for Afro Caribbean females, but it is not as extreme. The odds ratio of Afro Caribbean males having no higher passes rather than some higher passes compared with Afro Caribbean females is 1.69/1.34 = 1.26. We can conclude that the educational performance of Afro Caribbean females is better than that of their male colleagues.


Table 10: The Fitted Odds of Having One - Three Passes Rather Than Four + Given That a Respondent Has Some Higher Passes (Model G)

Whites and Asians
    Males 1
    Females 1
Afro Caribbean
    Males 2.46
    Females 2.47


4.9
The fitted odds that are reported in Table 10 are the odds of having between one and three higher passes, rather than four or more, given that a respondent has some higher passes. This can be considered as a model of the survivors from the first logistic regression model. It is analogous to a logit of r successes out of n trials where r is C2 and n is the sum C2 + C3.

4.10
The fitted odds reported in Table 10 indicate that, given that they have some higher passes, white and Asian males and females have an equal chance of having between one and three higher passes and of having four or more higher passes. Afro Caribbean males and females, given that they have some higher passes, have fitted odds that suggest they are about two and a half times more likely to have between one and three higher passes than four or more.


Summary

Ethnicity is important

Overall ethnicity is important but Asian young people are not systematically different to white young people in terms of their educational achievement. Afro Caribbeans are doing worse than their white and Asian colleagues, both in terms of having qualifications and in terms of their level of qualifications.

Gender is important

Females do better than males in terms of having some rather than no qualifications. However, when we consider those young people that have qualifications, females no longer have an advantage.

Model Criticism

5.1
Model criticism is an important aspect of the statistical modelling process (Everitt and Dunn, 1983; Dale and Davies, 1994). There is a problem with regard to using residuals such as Pearson's residuals for an ordinal response which has been modelled using the continuation ratio model. In the standard fashion a residual would be calculated for each of the c-1 partitions of the response (j=1, ..., c-1). These residuals do not provide a single summary statistic for the performance of the continuation ratio model for any particular combination of explanatory factors.

5.2
Berridge (1994) has treated model criticism for continuation ratio models. Whilst this is aimed at the analysis of residuals for modelling data in record rather than tabular form, the diagnostic that he suggests can be generalized to meet the requirements of data presented in tables. He proposes that the observed and expected probability mass functions be compared by way of a measure of distance.

[Equation]

5.3
It is assumed that category $C_j$ is assigned location $j$, $j = 1, \ldots, c$. The actual or observed category of an individual is denoted by $C_{j^*}$, with corresponding location $j^*$, $j^* = 1, \ldots, c$. The expected probability of individuals with a vector of explanatory variables $Z_i$ choosing their actual category $j^*$ is denoted as $p_{ij^*}$, $j^* = 1, \ldots, c$.

5.4
Considering the cut-point $\alpha_j$ ($j = 1, 2$), gender $\beta_i$ ($i = 1, 2$) and ethnicity (Afro Caribbean) $\gamma_k$ ($k = 1, 2$), it is possible to calculate fitted unconditional probabilities from the fitted conditional probabilities.

[Equation] and [Equation] are the fitted conditional probabilities for the two partitions (cut-points $\alpha_1$ and $\alpha_2$) for white and Asian ($\gamma_1$) males ($\beta_1$).

[Equation] and [Equation] are the fitted conditional probabilities for the two partitions (cut-points $\alpha_1$ and $\alpha_2$) for white and Asian ($\gamma_1$) females ($\beta_2$). The fitted conditional probabilities for the two cut-points for the other factor combinations can be readily computed in a similar fashion.

5.5
The fitted unconditional probabilities $P_1$, $P_2$ and $P_3$ of the categories $C_1$, $C_2$ and $C_3$ can be calculated for the various factor combinations from the fitted conditional probabilities $H_1$ and $H_2$ for the two partitions, using the formula

\[ P_1 = H_1, \qquad P_2 = (1 - H_1) H_2, \qquad P_3 = 1 - P_1 - P_2. \]

The unconditional fitted probabilities are reported in Table 11.




Table 11: Fitted Unconditional Probabilities for Fifth Year Examination Results by Ethnic Origin and Gender

                     Number of Higher Grade Passes
                     Zero    One - Three    Four +
White and Asian
    Males            0.50    0.25           0.25
    Females          0.44    0.28           0.28
Afro Caribbean
    Males            0.62    0.26           0.12
    Females          0.57    0.28           0.15
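
The step from conditional to unconditional probabilities in 5.5 can be sketched in Python. This is a minimal illustration, shown for the white and Asian groups only and taking as inputs the fitted odds as rounded in Tables 9 and 10; the function name unconditional_probabilities is hypothetical.

```python
def unconditional_probabilities(odds_first, odds_second):
    """Convert the fitted odds for the two partitions into P1, P2 and P3."""
    h1 = odds_first / (1 + odds_first)    # P(zero passes)
    h2 = odds_second / (1 + odds_second)  # P(one to three | some passes)
    p1 = h1
    p2 = (1 - h1) * h2
    p3 = 1 - p1 - p2
    return p1, p2, p3


# Fitted odds for white and Asian young people, as rounded in Tables 9 and 10.
WHITE_ASIAN = {
    "Males": unconditional_probabilities(1.0, 1.0),
    "Females": unconditional_probabilities(0.79, 1.0),
}

if __name__ == "__main__":
    for gender, probs in WHITE_ASIAN.items():
        print(gender, [round(p, 2) for p in probs])
```

The conversion h = odds / (1 + odds) is the inverse of the logit, so the same function applies to any factor combination once its two fitted odds are known.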



Table 12: Estimates of Regression Diagnostic for Particular Estimates of e.p.m.f. and Each Value of j*

                     e.p.m.f.                 Estimate of r when:
                     p_i1    p_i2    p_i3     j*=1     j*=2     j*=3
White and Asian
    Males            0.50    0.25    0.25     -0.75    0.25     1.25
    Females          0.44    0.28    0.28     -0.84    0.16     1.16
Afro Caribbean
    Males            0.62    0.26    0.12     -0.50    0.50     1.50
    Females          0.57    0.28    0.15     -0.58    0.42     1.42
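
The entries of Table 12 are consistent with a diagnostic equal to the observed location j* minus the expected location under the fitted probabilities. That form is an assumption made here for illustration and is not taken from Berridge (1994); the Python sketch below, with the hypothetical names EPMF, location_residuals and RESIDUALS, applies it to the probabilities in Table 11.

```python
# Fitted unconditional probabilities (p_i1, p_i2, p_i3) from Table 11.
EPMF = {
    ("White and Asian", "Males"): (0.50, 0.25, 0.25),
    ("White and Asian", "Females"): (0.44, 0.28, 0.28),
    ("Afro Caribbean", "Males"): (0.62, 0.26, 0.12),
    ("Afro Caribbean", "Females"): (0.57, 0.28, 0.15),
}


def location_residuals(probabilities):
    """Return j* minus the expected location, for each possible j*.

    Assumption: the distance measure is the observed location minus the
    mean location under the fitted probabilities.
    """
    expected = sum(j * p for j, p in enumerate(probabilities, start=1))
    return [round(j_star - expected, 2)
            for j_star in range(1, len(probabilities) + 1)]


RESIDUALS = {group: location_residuals(p) for group, p in EPMF.items()}

if __name__ == "__main__":
    for group, values in RESIDUALS.items():
        print(group, values)
```

Each list gives the value of the diagnostic for j* = 1, 2 and 3 in turn.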


5.6
The diagnostic has advantages over the use of Pearson's residual, as a single diagnostic for the performance of the model for each factor combination is now available. This diagnostic also takes ordinality into account and would extend to each individual respondent in the case of fitting the model in record form. In a descriptive sense the results reported in Table 12 suggest that the model fits the data reasonably well, as no residual is larger than 1.96 in absolute value. A longer term objective would be to derive a diagnostic which permits formal testing of the significance of its estimates (Berridge, 1994).

Conclusion

6.1
I hope that by way of using this example I have demonstrated that the continuation ratio model is appealing for the analysis of data in tabular form where the outcome variable has a substantive order and is categorical. And I hope that I have demonstrated that the model is advantageous when the movement in the behaviour under investigation is only plausible in one direction, such as when educational attainment is measured by qualifications. The results have shown that the model clearly allows us to assess whether individuals with various factor combinations do better or worse in some substantive sense than contemporaries with other combinations of explanatory factors. In addition we can also assess the magnitude of the effects of given combinations of explanatory factors with regard to the ordered outcomes.

6.2
Since a large amount of data within sociological research consists of ordered categorical variables that lend themselves to tabular analysis, I hope that this work goes some way to suggesting the possibilities of such an approach. The statistical modelling that has been outlined above will, I hope, have shown that the continuation ratio model has some properties that are desirable for sociological research. The potential loss of information that a standard log-linear approach might entail has been addressed by the use of the continuation ratio model. Another desirable property is that the model can be fitted using GLIM software and falls within the general framework of generalized linear modelling.

6.3
In the near future I intend to extend my explorations into modelling tables with ordered categorical response variables to other data, with a view to directing the model towards substantive research questions. Methodologically, one obvious line of development will be to compare the use of the continuation ratio model with other modelling approaches to ordered categorical data such as the proportional odds model. Presenting this work in an electronic format may stimulate such comparative work as the data and a full account of the modelling process are available. Another methodological development will be employing the GLIM macro provided by Berridge (1992), which facilitates the fitting of the continuation ratio model to data in record form, making it possible to fit models with covariates as explanatory variables. This is appealing and potentially provides a useful tool when building models with an increased number of explanatory variables. Arguably it is an appropriate methodological approach for more ambitious data analysis projects.

Acknowledgements

The author would like to thank Dr. Damon Berridge for his advice and support in this endeavour.

References

AITKIN, M., ANDERSON, D., FRANCIS, B. & HINDE, J. (1989) Statistical Modelling in GLIM. Oxford: Oxford University Press.

BERRIDGE, D. (1992) 'Fitting the continuation ratio model in GLIM4', in L. Fahrmeir, B. Francis, R. Gilchrist & G. Tutz (editors) Advances in GLIM and Statistical Modelling, Lecture Notes in Statistics, 78. London: Springer-Verlag.

BERRIDGE, D. (1994) 'Assessing the Goodness of Fit of Regression Models for Ordinal Categorical Data', 9th International Workshop on Statistical Modelling, Exeter University.

DALE, A. & DAVIES, R.B. (editors) (1994) Analyzing Social and Political Change. London: Sage.

DREW, D., GRAY, J. & SIME, N. (1992) Against the Odds: The Education and Labour Market Experiences of Black Young People, Employment Department Training Research and Development Series, no. 68 - Youth Cohort Series no.19.

ESRC (1991) The Social Change and Economic Life Initiative: An Evaluation. Swindon: ESRC.

EVERITT, B. & DUNN, G. (1983) Advanced Methods of Data Exploration and Modelling. London: Heinemann.

FIENBERG, S. & MASON, W. (1979) 'The Identification and estimation of age-period-cohort models in the analysis of discrete archival data', Sociological Methodology, pp.1 - 67.

FIENBERG, S. (1980) The Analysis of Cross-Classified Data. London: MIT Press.

GILCHRIST, R. (1985) Introduction: GLIM and Generalized Linear Models, Lecture Notes in Statistics 32. London: Springer-Verlag.

McCULLAGH, P. (1980) 'Regression Models for Ordinal Data (with discussion)', Journal of the Royal Statistical Society B, vol. 42, no. 2, pp.109 - 142.

McCULLAGH, P. & NELDER, J. (1983) Generalized Linear Models. London: Chapman and Hall.

WHITELEY, P. (1983) 'The Analysis of Contingency Tables', in D. McKay, N. Schofield & P. Whiteley (editors) Data Analysis and the Social Sciences. London: Frances Pinter.

WOLFE, R. (1996) 'General Purpose Macros to Fit Models to an Ordinal Response', GLIM Newsletter, #26.
