Multimodal Machine Learning for Human Behavior Analysis

Understanding human behavior is important for many healthcare-related applications, including ambient intelligence for hospitals, elderly living environments, clinician-patient interactions, and behavioral health assessment. Human behavior is multimodal and dynamically changing. Moreover, privacy is a critical consideration when estimating human behavior in healthcare settings.

Our prior work has focused on (a) how non-verbal visual human behavior (e.g., visual focus of attention, body and arm pose, facial activity) can be estimated from a variety of sensors, ranging from very sparse privacy-preserving overhead range sensors to frontal video cameras, (b) how complementary non-verbal speech information (speaking and interruption patterns, tone, prosody, etc.) can be integrated to supplement sparse visual information, and (c) how social science metrics such as perceived leadership, contribution, and personality traits can be predicted from automated human behavior estimates. We leveraged machine learning, computer vision, and image and signal processing algorithms for this multimodal group dynamics analysis.
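
To make the fusion step concrete, below is a minimal sketch of early (feature-level) fusion for this kind of analysis: per-participant visual and non-verbal speech features are concatenated and fed to a regressor that predicts a perceived-leadership score. The feature names, dimensions, and synthetic data are illustrative assumptions, not the features or models used in our publications.

```python
# Minimal sketch (not the published pipeline): early fusion of per-participant
# visual and non-verbal audio features to predict a perceived-leadership score.
# Feature names, dimensions, and the synthetic data are illustrative only.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_participants = 120  # hypothetical number of meeting participants

# Visual features per participant (e.g., fraction of attention received,
# gesture rate, head-pose variability) -- placeholders for automated estimates.
visual = rng.normal(size=(n_participants, 3))

# Non-verbal speech features (e.g., speaking-time fraction, interruptions
# initiated, prosodic variability) -- also placeholders.
audio = rng.normal(size=(n_participants, 3))

# Early fusion: concatenate the modalities into one feature vector.
X = np.hstack([visual, audio])

# Synthetic perceived-leadership ratings so the example runs end to end.
y = 0.6 * audio[:, 0] + 0.3 * visual[:, 0] + rng.normal(scale=0.2, size=n_participants)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.2f} +/- {scores.std():.2f}")
```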

Unobtrusive and Privacy-Preserving Multimodal-Sensor-Enabled Ambient Intelligence

The automated estimation of human interactions in group settings forms the foundation for understanding team processes, and many systems use video cameras and/or wearable sensors as the basis for this estimation. However, such sensors may inhibit natural human behavior and can be particularly limiting in privacy-critical spaces such as hospitals or assisted daily living environments. We investigated a different modality for studying human interaction: time-of-flight (ToF) sensors. These sensors preserve human privacy far better than video cameras, while still allowing fine-grained measurements that can effectively characterize individual and team behavior. We developed computer vision and machine learning methods that use ToF sensors to estimate human location, body, head, and arm pose, and visual focus of attention, and, when combined with non-verbal audio signals, speaking and interruption patterns. We then studied whether the automatically extracted features could predict perceptions of leadership, contribution, and group performance.
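
As one illustrative piece of such a pipeline, the sketch below localizes people in a single overhead ToF height map by finding head-like height maxima; location estimates like these can then feed downstream pose and attention estimators. The sensor geometry, thresholds, and synthetic frame are assumptions made for illustration, not the actual system parameters.

```python
# Minimal sketch of one step in an overhead ToF pipeline: localizing people
# in a height map (meters above floor) by finding head-like height maxima.
# Thresholds and the synthetic frame below are assumptions for illustration.
import numpy as np
from scipy import ndimage

def localize_people(height_map, min_person_height=1.2, min_blob_pixels=20):
    """Return (row, col, height) for each detected person in a height map."""
    # Keep only pixels tall enough to plausibly belong to a person.
    person_mask = height_map > min_person_height
    labels, n_blobs = ndimage.label(person_mask)
    detections = []
    for blob_id in range(1, n_blobs + 1):
        blob = labels == blob_id
        if blob.sum() < min_blob_pixels:
            continue  # discard small noise blobs
        # Approximate the head location as the highest point of the blob.
        masked = np.where(blob, height_map, -np.inf)
        r, c = np.unravel_index(np.argmax(masked), masked.shape)
        detections.append((r, c, height_map[r, c]))
    return detections

# Synthetic overhead frame: a flat floor with two head-shaped bumps.
frame = np.zeros((120, 160))
yy, xx = np.mgrid[0:120, 0:160]
for (cy, cx, h) in [(40, 50, 1.7), (80, 110, 1.6)]:
    frame += h * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * 6.0 ** 2))

print(localize_people(frame))
```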

Selected Publications:

Human Behavior Estimation Using Frontal Video Cameras

Video cameras enable a fine-grained analysis of eye gaze and facial expressions that is not possible with overhead privacy-preserving range sensors. We developed multimodal machine learning algorithms that fuse video information from frontal cameras with non-verbal speech information to predict perceived leadership, contribution, and personality traits.
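
A complementary strategy to the feature-level fusion sketched earlier is late fusion, where each modality gets its own predictor and the outputs are combined. The sketch below illustrates this under assumed inputs: one ridge regressor for frontal-video features and one for speech features, with predictions averaged; the features and data are synthetic placeholders rather than our published feature set.

```python
# Minimal late-fusion sketch under assumed inputs: one model per modality
# (frontal-video features vs. non-verbal speech features), predictions averaged.
# Feature definitions and the synthetic data are illustrative only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n = 200

# Per-participant frontal-video features (e.g., smile frequency,
# gaze-at-others fraction, facial action unit statistics) -- placeholders.
video_feats = rng.normal(size=(n, 4))
# Non-verbal speech features (e.g., speaking time, prosodic variability).
speech_feats = rng.normal(size=(n, 3))
# Synthetic target: a perceived-contribution score.
y = 0.5 * video_feats[:, 0] + 0.5 * speech_feats[:, 0] + rng.normal(scale=0.3, size=n)

idx_train, idx_test = train_test_split(np.arange(n), test_size=0.25, random_state=0)

video_model = Ridge().fit(video_feats[idx_train], y[idx_train])
speech_model = Ridge().fit(speech_feats[idx_train], y[idx_train])

# Late fusion: average the per-modality predictions on held-out participants.
fused_pred = 0.5 * (video_model.predict(video_feats[idx_test])
                    + speech_model.predict(speech_feats[idx_test]))
print(f"fused R^2: {r2_score(y[idx_test], fused_pred):.2f}")
```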

Selected Publications: