The LASSO on Latent Indices for Ordinal Predictors in Regression

Many applications of regression models involve ordinal categorical predictors. Our motivating example is the Household Income and Labour Dynamics in Australia (HILDA) survey, in which individuals give ordinal ratings in response to questionnaire items about their workplace; the aim is to study how workplace conditions, through main and possible interaction effects, affect overall mental wellbeing. A common approach to handling ordinal predictors is to treat each one as a factor variable. This can lead to a very high-dimensional problem, and has spurred much research into penalized likelihood methods for handling categorical predictors while respecting the marginality principle. However, the ordinal ratings are often regarded as manifestations of latent indices concerning different aspects of job quality, so a more sensible approach is to perform dimension reduction first and then enter the predicted indices into a regression model. In applied research this is often done as a two-stage procedure, which fails to use the response to better predict the latent indices themselves.
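To make the dimensionality concern concrete, the following minimal Python sketch counts regression coefficients (excluding the intercept) under factor coding and under latent indices, with main effects and all pairwise interactions. The values p = 10 and K = 5 are hypothetical and are not taken from HILDA, the function names are illustrative, and the latent-index count assumes one index per predictor; grouping predictors into shared indices would reduce it further.

```python
from math import comb


def n_coefficients_factor(p, k):
    """Coefficients when each of p ordinal predictors with k levels is dummy coded."""
    main = p * (k - 1)                        # k - 1 dummies per predictor
    interactions = comb(p, 2) * (k - 1) ** 2  # dummy-by-dummy products per pair
    return main + interactions


def n_coefficients_index(p):
    """Coefficients when each predictor is summarized by one continuous index (assumption)."""
    return p + comb(p, 2)                     # one slope per index, one per pair


p, k = 10, 5  # hypothetical sizes, for illustration only
print(n_coefficients_factor(p, k))  # 760
print(n_coefficients_index(p))      # 55
```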

In this talk, we propose the LASSO on Latent Indices (LoLI) for handling ordinal categorical predictors in regression. The LoLI model simultaneously constructs a continuous latent index for each ordinal predictor, or for each group of ordinal predictors, and models the response as a function of these indices (and other predictors if appropriate), including potential interactions. A composite LASSO-type penalty is added to perform selection on the main and interaction effects of the latent indices. As a single-stage approach, the LoLI model is able to borrow strength from the response to improve construction of the continuous latent indices, which in turn produces better estimates of the corresponding regression coefficients. Furthermore, because the latent indices are constructed first, the dimensionality of the problem is substantially reduced before any variable selection is performed. For estimation, we propose first estimating the cutoffs relating the observed ordinal predictors to the latent indices. Then, conditional on these cutoffs, we apply a penalized Expectation-Maximization (EM) algorithm via importance sampling to estimate the regression coefficients. A simulation study demonstrates the improved power of the LoLI model in detecting truly important ordinal predictors compared with both two-stage approaches and approaches using factor variables, as well as better predictive and estimation performance compared with the commonly used two-stage approach.
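As a rough illustration of what cutoffs relating an ordinal predictor to a latent index look like, the sketch below simulates ratings by discretizing a standard normal latent variable and then recovers the cutoffs from the cumulative category proportions. This is only an assumption for illustration: the standard normal latent distribution, the quantile-based estimator, the cutoff values, and all function names are hypothetical, and this is not claimed to be the estimator used for LoLI.

```python
import random
from bisect import bisect_right
from statistics import NormalDist


def simulate_ordinal(n, cutoffs, seed=0):
    """Draw latent z ~ N(0, 1) and map each draw to a category 1..K via sorted cutoffs."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # category = 1 + number of cutoffs that are <= z
    ratings = [bisect_right(cutoffs, z) + 1 for z in latent]
    return latent, ratings


def estimate_cutoffs(ratings, n_levels):
    """Estimate cutoffs as standard normal quantiles of cumulative category proportions."""
    n = len(ratings)
    std_normal = NormalDist()
    cutoffs = []
    cumulative = 0
    for level in range(1, n_levels):
        cumulative += sum(1 for r in ratings if r == level)
        # clip so that empty tail categories do not give infinite quantiles
        prop = min(max(cumulative / n, 0.5 / n), 1 - 0.5 / n)
        cutoffs.append(std_normal.inv_cdf(prop))
    return cutoffs


true_cutoffs = [-1.0, 0.0, 1.2]  # hypothetical values for a 4-level rating
_, ratings = simulate_ordinal(20000, true_cutoffs, seed=1)
est = estimate_cutoffs(ratings, n_levels=4)
print([round(c, 2) for c in est])
```

With the cutoffs held fixed, each observed rating restricts its latent index to an interval between adjacent cutoffs, which is the kind of constraint a subsequent sampling-based estimation step would condition on.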