Kristin Chen on Using Artificial Intelligence (AI) Techniques to Accelerate Data Processing
Kristin Chen, a Westat data scientist, uses subsets of artificial intelligence (AI), namely natural language processing, machine learning, and deep learning techniques, to build data products that automate data processing. She is experienced in developing data pipelines to train, evaluate, and put into production predictive machine learning models that facilitate survey operations. Further, her skills in exploratory data analysis, data mining, and statistical analysis allow her to discover and present data patterns to social science researchers. She recently co-authored an Issue Brief, Best Coding Practices to Ensure Reproducibility (PDF). Here, Ms. Chen explains how Westat uses AI today and how it might be used for future research.
Q: How does Westat use AI?
A: We use AI tools to automatically process surveys that generate large amounts of open-ended text. These tools, which include natural language processing, machine learning, and deep learning techniques, allow us to process in minutes data that previously took manual coders hours to complete.
Q: What are natural language processing, machine learning, and deep learning?
A: Both machine learning and deep learning allow us to build models that teach machines to learn from data, identify patterns, and make decisions automatically. Machine learning does this with more “classic” algorithms, sometimes called “flat algorithms,” for tasks such as classification, regression, and clustering, typically on tabular, or structured, data. Deep learning, on the other hand, uses a multilayered structure of algorithms called neural networks to extract deeper features from massive amounts of unstructured data such as text and images, which lets deep learning models solve problems beyond the reach of classic machine learning models. On top of this, natural language processing refers to techniques that process text and extract features that can be integrated into either machine learning or deep learning models.
Q: Specifically, how do these tools support survey administration?
A: With natural language processing, we build a model to classify sentences and label the topics of interviewers’ field comments. With machine learning, we build predictive models to, among other things, (1) estimate the likelihood of making appointments or getting respondents to complete the survey and (2) predict the weekly number of completed cases to quantify interviewers' productivity.
Q: Are you using natural language processing, machine learning, and deep learning in current projects?
A: Yes, we are using these tools for the Medical Expenditure Panel Survey-Household Component (MEPS-HC) and for the Residential Energy Consumption Survey, Energy Supplier Survey (RECS ESS).
MEPS-HC, which is funded by the Agency for Healthcare Research and Quality, provides the most complete source of data on the use and costs of health care for the nation’s non-military, non-institutionalized population. This survey, which we have conducted since its beginning in 1996, generates 10,000+ open-ended comments during each data collection period; interviewers enter them into the computer-assisted personal interviewing (CAPI) system to clarify respondents’ answers. Interviewers do not backtrack to respondents’ original answers and modify them on the spot because doing so is time-consuming and may introduce errors. To narrow down the topics and reduce redundancy, we designed a dropdown menu of 10 categories in CAPI, from which interviewers can make a selection. With machine learning, deep learning, and natural language processing, we built a classification model that predicts the category for each comment and suggests the top 3 categories, ranked by classification probability, to the human coders who must verify that the categories selected by the interviewers are correct. Before using the tool, our coders had to manually verify the categories from 10 options rather than 3, which increased labor costs. For the past 2 data collection periods in 2020, the data tool achieved 95% classification accuracy.
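The top-3 suggestion step described above can be illustrated with a minimal Python sketch. It is not the MEPS-HC tool itself: the category names, the probability values, and the function name top_k_categories are all hypothetical, and the sketch assumes that some trained classifier has already produced one probability per category for a comment.

```python
from typing import Dict, List, Tuple


def top_k_categories(
    probabilities: Dict[str, float], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k categories with the highest predicted probability."""
    # Sort (category, probability) pairs from most to least likely.
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical classifier output for a single interviewer comment.
# Real categories and probabilities would come from the trained model.
example_probs = {
    "medical event": 0.52,
    "insurance": 0.21,
    "employment": 0.02,
    "prescription": 0.18,
    "other": 0.07,
}

# The coder reviews only these three suggestions instead of every category.
suggestions = top_k_categories(example_probs, k=3)
```

Ranking by probability and truncating the list is what shrinks the coder's review from 10 options to 3; how well that works depends on how often the correct category falls within the top 3.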
We also use natural language processing for the U.S. Department of Energy’s 2020 RECS ESS, surveys we have performed for decades. The energy costs and usage data, collected from 20,000 U.S. households and approximately 1,000 energy suppliers, enable energy leaders to calculate future U.S. energy demands and plan for energy efficiency improvements. To match households with energy suppliers, we developed a tool in Python, a programming language, that automates the data cleaning and text matching needed to determine whether the energy suppliers reported by respondents exist in a reference database of suppliers. Natural language processing significantly accelerated data processing: the first batch of 2021 respondent inputs, 4,000+ entries, was processed in less than 30 minutes.
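As a rough illustration of this kind of cleaning and text matching, the sketch below uses only Python's standard library. It is an assumption-laden example rather than the RECS ESS tool: the supplier names, the 0.8 similarity cutoff, and the helper names normalize and match_supplier are all hypothetical.

```python
import difflib
import re
from typing import List, Optional


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    return " ".join(cleaned.split())


def match_supplier(
    raw_name: str, reference: List[str], cutoff: float = 0.8
) -> Optional[str]:
    """Return the closest reference supplier, or None if nothing is similar enough."""
    # Map each cleaned reference name back to its original spelling.
    lookup = {normalize(ref): ref for ref in reference}
    candidates = difflib.get_close_matches(
        normalize(raw_name), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[candidates[0]] if candidates else None


# Hypothetical reference list; a real one would come from the supplier database.
reference_suppliers = ["Pepco", "Dominion Energy", "Baltimore Gas and Electric"]

matched = match_supplier("  DOMINION ENERGY. ", reference_suppliers)
typo_matched = match_supplier("Dominon Energy", reference_suppliers)
unmatched = match_supplier("Acme Widgets", reference_suppliers)
```

Entries that return None would still need manual review, and the cutoff trades missed matches against false ones, so it would have to be tuned on real respondent inputs.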
Q: What other AI technologies do you foresee will be used in the future?
A: AI will continue playing a significant role in solving problems involving text, video, image, and audio data. Natural language processing will continue to support the need for named entity recognition, string matching, and sentence classification. Westat consistently embraces cutting-edge tools and software for AI and machine learning work. For instance, I foresee the adoption of automated machine learning using AutoML or Amazon SageMaker.
Q: How will Westat meet future client needs?
A: Westat’s ongoing investment in cutting-edge technologies and ability to adapt and apply modern and open-source techniques to enhance survey administration, information collection, and social science research demonstrate our inventiveness, forward thinking, and commitment to delivering the most effective, efficient data-driven solutions to our clients.
- Kristin Chen, Data Scientist, STATISTICS & EVALUATION SCIENCES