Using Machine Learning to Uncover Hidden Heterogeneities in Survey Data
Survey responses in public health surveys are heterogeneous. The quality of a respondent's answers depends on many factors, including cognitive abilities, interview context, and whether the interview is in person or self-administered. A largely unexplored issue is how the language used for public health survey interviews is associated with the survey response. We introduce a machine learning approach, Fuzzy Forests, which we use for model selection. We use the 2013 California Health Interview Survey (CHIS) as our training sample and the 2014 CHIS as the test sample. We found that non-English language survey responses differ substantially from English responses in reported health outcomes. We also found heterogeneity among the Asian languages, suggesting that caution should be used when interpreting results that compare across these languages. The Fuzzy Forests model trained on the 2013 data also correctly predicted 86% of good health outcomes in the 2014 test set. We show that the Fuzzy Forests methodology is potentially useful for screening for and understanding other types of survey response heterogeneity, especially in high-dimensional and complex surveys.
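The evaluation design described above (fit on one survey year, score on the next) can be illustrated with a minimal sketch. It does not implement Fuzzy Forests and is not the authors' code. It assumes that "correctly predicted 86% of good health outcomes" means the share of respondents who actually reported good health and were classified as such; the abstract does not define the metric. The field names `year` and `good_health`, both function names, and the toy records are hypothetical and are not CHIS values.

```python
from typing import List, Mapping, Sequence, Tuple

Record = Mapping[str, int]


def split_by_year(
    records: Sequence[Record], train_year: int, test_year: int
) -> Tuple[List[Record], List[Record]]:
    """Split records into a training year and a held-out test year."""
    train = [r for r in records if r["year"] == train_year]
    test = [r for r in records if r["year"] == test_year]
    return train, test


def share_of_positives_recovered(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> float:
    """Fraction of actual positive cases that were predicted positive."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Keep only the predictions made for cases that are truly positive.
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        raise ValueError("y_true contains no positive cases")
    hits = sum(1 for p in preds_for_positives if p == positive)
    return hits / len(preds_for_positives)


# Toy records for illustration only (hypothetical, not survey data).
toy_records = [
    {"year": 2013, "good_health": 1},
    {"year": 2013, "good_health": 0},
    {"year": 2014, "good_health": 1},
    {"year": 2014, "good_health": 1},
    {"year": 2014, "good_health": 0},
]
train_rows, test_rows = split_by_year(toy_records, 2013, 2014)

# Stand-in predictions for the three 2014 rows; a real workflow would
# obtain these from a model fitted on train_rows only.
toy_predictions = [1, 0, 1]
toy_truth = [r["good_health"] for r in test_rows]
toy_share = share_of_positives_recovered(toy_truth, toy_predictions)
```

Splitting by year rather than at random keeps the test set strictly out of sample in time, which matches the 2013/2014 design described in the abstract.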
Additional Information
© 2019 The Author(s). This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
Received 13 December 2018; Accepted 07 October 2019; Published 05 November 2019.
Data availability: The data are available at the CHIS website http://healthpolicy.ucla.edu/chis/data/Pages/GetCHISData.aspx. The code required for replicating the results reported in this paper is available at: https://github.com/OHDSI/FuzzyForest.
An earlier version of this research was presented as a poster at the 33rd Annual Meeting of the Society for Political Methodology, July 21–23, 2016, Rice University; we thank meeting participants for their comments and feedback. This work was partially funded by NSF IIS 1251151.
Author Contributions: C.M.R., M.A.A. and R.M.A. conceived the research. C.M.R. and R.M.A. obtained and recoded the data. C.M.R. undertook the analysis. C.M.R., M.A.A. and R.M.A. wrote the paper. All authors reviewed the manuscript.
The authors declare no competing interests.
Published - s41598-019-51862-x.pdf
Supplemental Material - 41598_2019_51862_MOESM1_ESM.pdf