Classification of Infant Sleep–Wake States from Natural Overnight In-Crib Sleep Videos

Northeastern University and University of Maine
WACVW CV4Smalls 2025

Abstract

Infant sleep is critical for healthy development, and disruptions in sleep patterns can have profound implications for infant brain maturation and overall well-being. Traditional methods for monitoring infant sleep often rely on intrusive equipment or time-intensive manual annotations, which hinder their scalability in clinical and research applications. We present our dataset, SmallSleeps, which includes 152 hours of overnight recordings of 17 infants aged 4–11 months captured in real-world home environments. Using this dataset, we train a deep learning algorithm for classification of infant sleep–wake states from short 90 s video clips drawn from natural, overnight, in-crib baby monitor footage, based on a two-stream spatiotemporal model that integrates rich RGB frames with optical flow features. Our binary classification algorithm was trained and tested on "pure" state clips featuring a single state dominating the timeline (i.e., over 90% sleep or over 90% wake) and achieves over 80% precision and recall. We also perform a careful experimental study of the results of training and testing on "mixed" clips featuring specified levels of heterogeneity, with a view towards applications to infant sleep segmentation and sleep quality classification in longer, overnight videos, where local behavior is often mixed. This local-to-global approach allows for deep learning to be effectively deployed on the strength of tens of thousands of video clips, despite a relatively modest sample size of 17 infants.

Overall Pipeline

Infant Sleep-Wake Classification Overview

Architecture of our two-stream network for infant sleep–wake classification. The model processes 90-second video clips through parallel RGB and optical flow streams. Each stream consists of: (1) a Feature Extractor using a 3D ConvNet to extract spatiotemporal features from K sequential frames, and (2) a Classifier incorporating self-attention mechanisms for temporal dependency modeling, temporal averaging, and fully connected layers with rectified linear unit (ReLU) activation. A modality selection/fusion module allows flexible stream combination (RGB alone, flow alone, or RGB + flow) before final classification into Sleep/Wake states. The diagram highlights two main processing stages: feature extraction (green blocks) and classification (purple blocks).
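
For concreteness, the following is a minimal PyTorch sketch of a two-stream network in this style. The backbone depth, feature dimensions, attention configuration, and class names (StreamBranch, TwoStreamSleepWake) are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

class StreamBranch(nn.Module):
    """One stream (RGB or flow): 3D-conv feature extraction, then
    self-attention over time, temporal averaging, and an FC + ReLU head."""
    def __init__(self, in_channels, feat_dim=256):
        super().__init__()
        # Hypothetical lightweight 3D ConvNet backbone.
        self.extractor = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),   # pool space, keep time axis
        )
        self.proj = nn.Linear(64, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU())

    def forward(self, x):                     # x: (B, C, T, H, W)
        f = self.extractor(x)                 # (B, 64, T', 1, 1)
        f = f.flatten(2).transpose(1, 2)      # (B, T', 64)
        f = self.proj(f)                      # (B, T', feat_dim)
        f, _ = self.attn(f, f, f)             # self-attention over time
        f = f.mean(dim=1)                     # temporal averaging
        return self.head(f)                   # (B, 64)

class TwoStreamSleepWake(nn.Module):
    """Fuses the RGB and flow branches; a single-stream variant would
    simply drop one branch before the final classifier."""
    def __init__(self):
        super().__init__()
        self.rgb = StreamBranch(in_channels=3)    # RGB frames
        self.flow = StreamBranch(in_channels=2)   # (dx, dy) flow fields
        self.classifier = nn.Linear(64 * 2, 2)    # Sleep / Wake logits

    def forward(self, rgb_clip, flow_clip):
        z = torch.cat([self.rgb(rgb_clip), self.flow(flow_clip)], dim=1)
        return self.classifier(z)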

SmallSleeps Dataset Creation

The SmallSleeps dataset represents a significant contribution to infant sleep research, comprising 152 hours of overnight video recordings from 17 infants aged 4–11 months captured in real-world home environments. Unlike traditional sleep studies that rely on intrusive equipment, SmallSleeps uses non-invasive video monitoring installed in cribs by caregivers. The dataset features comprehensive behavioral coding performed by research assistants who annotated sleep-wake states based on visible behavioral cues, with a particular focus on arousal events characterized by eye opening and body movements. For our final dataset, the videos were segmented into 90-second clips with varying levels of "purity" (percentage of time in a single state), creating a robust foundation for developing and evaluating automated sleep classification algorithms despite the relatively modest sample size.
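
To make the segmentation step concrete, here is a minimal sketch that cuts a per-second sleep/wake annotation track into non-overlapping 90-second clips and computes each clip's purity. The one-second annotation resolution, the function name clip_purity, and the record format are assumptions for exposition.

import numpy as np

def clip_purity(labels_per_second, clip_len=90):
    """Split a per-second label track (1 = sleep, 0 = wake) into
    non-overlapping clips and report each clip's dominant state and
    purity (fraction of seconds spent in that state)."""
    labels = np.asarray(labels_per_second)
    clips = []
    for i in range(len(labels) // clip_len):
        window = labels[i * clip_len:(i + 1) * clip_len]
        sleep_frac = window.mean()
        clips.append({
            "start_s": i * clip_len,
            "state": "sleep" if sleep_frac >= 0.5 else "wake",
            "purity": max(sleep_frac, 1.0 - sleep_frac),  # always >= 0.5
        })
    return clips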

SmallSleeps Dataset Creation Process

Purity Bins: Capturing the Complexity of Infant Sleep Transitions

The creation of different "purity bins" in the SmallSleeps dataset addresses a fundamental challenge in infant sleep classification: the natural heterogeneity of sleep-wake transitions. Rather than simplistically categorizing video clips as either "sleep" or "wake," we recognized that real-world infant sleep patterns frequently involve mixed states and gradual transitions. By introducing five threshold bins (90–100%, 80–90%, 70–80%, 60–70%, and 50–60%), each representing the percentage of time an infant spends in a particular state within a 90-second clip, the dataset provides a more nuanced foundation for algorithm development.
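
As a minimal sketch of the bin assignment, assuming half-open bins with the top bin closed at 100% (the boundary convention is our assumption, and purity comes from the clip records sketched earlier):

PURITY_BINS = [(0.90, 1.00), (0.80, 0.90), (0.70, 0.80), (0.60, 0.70), (0.50, 0.60)]

def purity_bin(purity):
    """Return the bin index (0 = 90-100%, ..., 4 = 50-60%) for a clip."""
    for i, (lo, hi) in enumerate(PURITY_BINS):
        # Half-open [lo, hi); the top bin is closed so purity 1.0 maps to bin 0.
        if lo <= purity < hi or (hi == 1.00 and purity == 1.00):
            return i
    return None  # unreachable: purity is always >= 0.5 by construction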

This stratified approach serves multiple purposes. First, it allows us to train models on "pure" state clips (90–100% in one state) to establish clear baseline performance. Second, it enables systematic evaluation of how algorithm performance degrades when faced with increasingly mixed state clips—a critical consideration for real-world deployment. Finally, this methodology creates a pathway toward more sophisticated sleep segmentation in longer videos, where local behavior often contains mixed states. The paper specifically highlights how the "Pure-State Model" (trained on 90–100% purity clips) and "Mixed-State Model" (trained on 70–100% purity clips) demonstrate different capabilities in handling sleep-wake classification across varying levels of state heterogeneity.
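
Building on the sketches above, training-set selection for the two models could then be written as follows, where all_clips is a hypothetical list of clip records with precomputed purities and purity_bin is the bin-assignment helper from the previous sketch.

def training_clips(all_clips, bins):
    """Keep clips whose purity bin index is in the given set."""
    return [c for c in all_clips if purity_bin(c["purity"]) in bins]

# Usage (all_clips: clip_purity() output pooled over all videos):
pure_train  = training_clips(all_clips, bins={0})        # Pure-State Model: 90-100%
mixed_train = training_clips(all_clips, bins={0, 1, 2})  # Mixed-State Model: 70-100%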

Results and Analysis

Classification Results Analysis

In our analysis of feature performance, we found compelling evidence for the superiority of optical flow features over RGB features for infant sleep-wake classification. Our RGB-only models achieved moderate accuracy (71–72%), while models incorporating optical flow features demonstrated substantial improvements, reaching 78% accuracy. The most dramatic enhancement was in recall, which increased by 11–12 percentage points when optical flow was included, while maintaining or slightly improving precision.
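
The page does not name the optical flow estimator used, so the sketch below uses OpenCV's dense Farneback method purely as an illustrative stand-in for producing the two-channel (dx, dy) fields that a flow stream consumes.

import cv2
import numpy as np

def flow_frames(frames):
    """Compute dense optical flow between consecutive BGR frames.
    Returns an array of shape (T - 1, H, W, 2) of per-pixel (dx, dy)."""
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        nxt = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)
        prev = nxt
    return np.stack(flows)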

The benefits of optical flow were particularly pronounced in the challenging 70–80% purity bin, where recall improved by 18 points and precision by 3 points compared to RGB-only models. This suggests optical flow excels at capturing the subtle movements characteristic of wake states, especially in ambiguous mixed-state clips. Our t-SNE visualization further supported this finding, showing distinct subject-specific clusters for wake states in the optical flow feature space, while RGB features showed considerable overlap between sleep and wake states within each subject's cluster.

Feature Space Analysis

Our t-SNE visualization of the feature space reveals distinct patterns in how RGB and optical flow features capture sleep-wake states. The optical flow feature space shows clear separation between sleep and wake states, with subject-specific clusters that maintain their distinctiveness. This visualization helps explain the superior performance of optical flow features in our classification tasks, particularly in capturing the subtle movement patterns that distinguish wake states from sleep.
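
A visualization in this spirit can be produced with scikit-learn's t-SNE; the perplexity, plotting style, and variable names below are illustrative choices rather than the paper's settings.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Embed (N, D) clip features in 2-D and color by sleep/wake label."""
    labels = np.asarray(labels)
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    for state, name in [(0, "sleep"), (1, "wake")]:
        mask = labels == state
        plt.scatter(emb[mask, 0], emb[mask, 1], s=4, label=name)
    plt.legend()
    plt.title(title)
    plt.show()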

CV4Smalls 2025 Presentation

BibTeX

@InProceedings{Moezzi_2025_WACV,
    author    = {Moezzi, Shayda and Wan, Michael and Manne, Sai Kumar Reddy and Mathew, Amal and Zhu, Shaotong and Galoaa, Bishoy and Hatamimajoumerd, Elaheh and Grace, Emma and Rowan, Cassandra B. and Zimmerman, Emily and Taylor, Briana J. and Hayes, Marie J. and Ostadabbas, Sarah},
    title     = {Classification of Infant Sleep-Wake States from Natural Overnight In-Crib Sleep Videos},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {February},
    year      = {2025},
    pages     = {42-51}
}