Three essays on using data mining for covariate interactions in discrete choice models

Publication Type: Thesis
Issue Date: 2012
Files:
01Front.pdf (contents and abstract), 705.29 kB, Adobe PDF
02Whole.pdf (thesis), 10.06 MB, Adobe PDF
NO FULL TEXT AVAILABLE. Access is restricted indefinitely.

A main aim of market research is to obtain precise estimates of consumer preferences. The data gathered to estimate preferences can be either revealed or stated preference data. The advantage of the latter is that a stated preference survey can be designed so that the finite set of alternatives available to the respondent is known to the experimenter. This is known as a discrete choice study. A discrete choice survey is designed so that a single respondent answers multiple scenarios. Gathering multiple observations for each individual improves the precision of the estimates for that individual (Louviere & Woodworth 1983).

When analysing the factors that affect choice, we look at covariates that may explain choice behaviour. As part of the survey, covariates that may affect preferences, most commonly demographics, are gathered about the respondent. Early efforts at integrating covariates into a choice model proved unfruitful because demographics were entered into the model as main effects, which did not yield significant model improvement (Slama & Tashchian 1985). DeSarbo, Ramaswamy and Cohen (1995) suggest that covariate effects should be entered into a discrete choice model interactively rather than as additional main effects. To boost the accuracy and explanatory power of a model, interactions should be included in the analysis in addition to main effects. An interaction is present if the effect of one variable is moderated by the levels of another variable (McClelland & Judd 1993). Unless theoretical reasoning suggests otherwise, covariate interactions have traditionally been entered into choice models post hoc without a guiding heuristic.
Such interactions often take the simple form of multiplying one variable X1 by another variable X2, entering the product X1*X2 into the model, estimating the model, and checking whether the parameter on X1*X2 is statistically significant. Including interactions in this manner may capture some of the true interaction, but only by chance, unless the covariate interactions were designed into the choice experiment as a specific combination of variables in the choice questionnaire (McFadden 1986). A particularly tricky type of covariate interaction is the demographic interaction, which is usually not designed into the experiment but is collected post hoc and added to a choice model. In much of the existing literature, the addition of demographic covariates has not proven fruitful.

Dawes & Corrigan (1974) observe that as interactions enter a linear model, the percentage of variance explained by the added interactions decreases with the order of the interaction. This is a logical consequence of forcing the higher-order interaction parameters to act over the entire data space. It is unlikely that the interacted variables act globally over the entire data space (Munoz & Felicisimo 2004); instead, interactions are important only in localised regions of the data space (Breiman 2001). Such ‘localisation’ is a natural result of the non-linearity created when demographic characteristics measured on a mixture of data scales (e.g., nominal, ordinal, interval, and ratio) interact with each other. Complicating the issue further, localised interactions have smaller sample sizes, and small moderation effects are harder to detect than large ones (McClelland & Judd 1993).

The main aim of this thesis is to develop a method of interaction detection that combines the statistical strength of aggregated discrete choice models with the flexibility of a non-parametric statistical tool such as data mining.
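The contrast drawn above between a global product term and a localised interaction can be illustrated with a small Python sketch. It is not taken from the thesis: the toy data, the variable names (price, age), and the thresholds defining the local region are all hypothetical, chosen only to show how the two kinds of term would enter a design matrix.

```python
# Toy respondents as (price, age) pairs. All values are invented for illustration.
rows = [(1.0, 25), (2.0, 25), (1.0, 60), (2.0, 60), (3.0, 60)]

def global_interaction(x1, x2):
    """Conventional product term X1*X2; it acts over the whole data space."""
    return x1 * x2

def local_interaction(x1, x2, x1_min=2.0, x2_min=50):
    """Indicator equal to 1 only inside one region of the data space.

    The thresholds are hypothetical; a tree that splits on x1 and then
    on x2 would define a region of this kind.
    """
    return 1 if (x1 >= x1_min and x2 >= x2_min) else 0

def design_row(x1, x2):
    # Columns: main effect x1, main effect x2, global term, local term
    return [x1, x2, global_interaction(x1, x2), local_interaction(x1, x2)]

design = [design_row(price, age) for price, age in rows]

for row in design:
    print(row)
```

In this sketch the product column is nonzero for every respondent, while the local column switches on only for the respondents who fall inside the assumed region, which is the sense in which a localised interaction does not have to act over the entire data space.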
Such an approach allows one to examine the effects of localised interactions using a mixture of experimentally designed covariates and demographic data. The three essays that follow address issues of interaction detection and show how, through the use of data mining techniques, proper covariate interactions can boost model accuracy in the presence of heterogeneity and missing data. Data mining finds interactions in a way that is both statistically valid and easy for the marketing manager to interpret. The essays show different but related ways in which the accuracy of discrete choice models can be boosted when proper covariate interactions are integrated into the model using tree-based and spline-based data mining methods. This is in contrast to traditional interaction techniques, which over-parameterize the model in order to cover the entire data space. Obtaining the best covariate interactions is especially important in Essay 3, where the inclusion of improperly determined interactions would reduce the accuracy of imputation methods. Differences in demographics can be a source of heterogeneity; all three essays demonstrate that proper covariate interactions using demographics can account for observed heterogeneity while minimising unobserved heterogeneity. The capture of observed heterogeneity is explicitly modelled and addressed in the second essay.

Essay 1 introduces the concept of adding CART (tree-based data mining) interactions to a discrete choice model. This essay distinguishes itself from previous work by looking at both local areas of interactivity and covariate interactions in a discrete choice setting using stated preference data, as well as in a multinomial choice setting. The covariate interactions uncovered by CART are found to boost the accuracy of the discrete choice model.
Essay 2 uses the MARS data mining algorithm to find optimal interactions of experimentally designed data and demographics in a choice task given to Australian farmers. Using the interactions found by MARS, this essay investigates a MARS-Mixed Logit hybrid model as well as a MARS-Discrete Choice hybrid model and shows that, after observed heterogeneity is captured with the MARS-Discrete Choice hybrid interactions, the amount of residual unobserved heterogeneity is minimal. This finding implies that there is no reason to run a mixed logit random coefficients model if good interactive structure is found.

Essay 3 shows how CART interactions can boost the effectiveness of parametric missing data imputation models. This essay extends the data mining methods used in the previous two essays by allowing the interactions to predict not only the dependent variable but also the experimentally designed covariates and demographic variables as functions of themselves in the presence of missing data. The local interactive regions discovered by CART are hybridised into the multiple imputation (MI) and expectation maximization (EM) algorithms for missing data imputation. The added information provided by CART increases the accuracy of these models by supplying information about localised interactions to the parametric imputation engines.
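As a rough illustration of the spline-based interactions that MARS works with, the Python sketch below builds hinge basis functions and multiplies two of them. MARS in general uses hinges of the form max(0, x - t) and max(0, t - x) and forms interactions as products of hinges; the knot values and function names here are hypothetical and do not come from the thesis.

```python
def hinge(x, knot, direction=+1):
    """MARS-style hinge basis function.

    direction=+1 gives max(0, x - knot); direction=-1 gives max(0, knot - x).
    """
    return max(0.0, direction * (x - knot))

def hinge_interaction(x1, x2, knot1, knot2):
    """Product of two hinges: nonzero only where x1 > knot1 and x2 > knot2."""
    return hinge(x1, knot1) * hinge(x2, knot2)

# Hypothetical knots at 3 (for x1) and 8 (for x2).
inside = hinge_interaction(5, 10, knot1=3, knot2=8)   # both hinges active
outside = hinge_interaction(5, 7, knot1=3, knot2=8)   # second hinge is zero
print(inside, outside)
```

Because the product vanishes whenever either hinge is zero, a term of this kind is confined to one region of the covariate space, which is one way a spline-based method can represent a localised interaction before it is passed to a choice model.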