AV-FOS: Transformer-Based Audio-Visual Multimodal Interaction Style Recognition for Children with Autism Using the Revised Family Observation Schedule 3rd Edition (FOS-R-III)

1. Background: Clinical Challenges and Technological Prospects in Behavior Monitoring of Children with Autism

Autism Spectrum Disorder (ASD, autism) is a lifelong neurodevelopmental disorder. In recent years, the prevalence of autism in the United States has risen rapidly, with current epidemiological data indicating that one in every 36 children is affected. The core manifestations are difficulties in communication and social interaction, restricted interests and activities, and repetitive stereotyped behaviors, which directly affect children's daily activities and social functioning at home, at school, and in the community. Furthermore, "Challenging Behaviors" (CBs) associated with autism, including self-injury, aggression, and disruptive behaviors, are of significant clinical concern: they not only deepen the children's social barriers but also pose serious health risks to the children themselves and to those around them.

Currently, behavior monitoring for children with autism relies primarily on periodic clinical evaluations conducted by professionals in hospitals or institutions. These traditional methods suffer from high cost, heavy labor demands, short observation windows, and an inability to support long-term continuous monitoring. Moreover, limited clinical observation settings often fail to capture behavioral changes in the child's actual family environment, so diagnostic conclusions may diverge from real-world behavior. The development of automated, intelligent behavior analysis tools has therefore become a pressing need in the autism field. A system that can automatically analyze interactions between autistic children and their caregivers in authentic home settings would greatly reduce caregiver burden and assist in diagnosis and intervention.

In the field of behavioral assessment for autism, the FOS-R-III (Revised Family Observation Schedule, 3rd edition) is a validated direct observation tool specifically designed to capture detailed interactions between children with autism and their parents across a range of contexts. Widely used in clinical and research settings, it provides a robust foundation for analyzing and intervening in challenging behaviors (CBs) and parent-child interaction styles. However, FOS-R-III coding currently relies on manual annotation, which is tedious and labor-intensive. Automating FOS-R-III coding with artificial intelligence techniques such as deep learning promises a major advance for the autism field.

2. Paper Origin and Author Introduction

This paper, “AV-FOS: Transformer-based Audio-Visual Multimodal Interaction Style Recognition for Children with Autism Using the Revised Family Observation Schedule 3rd Edition (FOS-R-III),” was published in the IEEE Journal of Biomedical and Health Informatics (September 2025 issue). The authors are Zhenhao Zhao, Eunsun Chung, Kyong-Mee Chung, and Chung Hyuk Park, who represent the Department of Biomedical Engineering at George Washington University and the Department of Psychology at Yonsei University. This interdisciplinary team integrates engineering and psychology expertise, providing solid theoretical and technical support for this research. The project is funded by the National Science Foundation (NSF) and focuses on “long-term human-robot interaction and intervention.”

3. Detailed Research Workflow

The research presented in this paper is an original scientific contribution to the autism field, aiming to build an automated, intelligent behavior recognition system for FOS-R-III coding and to address several clinical pain points in behavior analysis. The research workflow comprises the following main steps:

1. Dataset Development and Construction

Data Collection:

The research team collected 216 videos of family scenarios from 83 participants, totaling about 25 hours. Each video lasts 5 to 15 minutes and was filmed with handheld cameras in authentic home environments, capturing real and visually complex family dynamics. The children's average age was 9.72 years, with a male-to-female ratio of about 7:3. All children with autism were clinically diagnosed by practicing physicians; some children without a diagnosis were screened using the Social Communication Questionnaire (SCQ).

Task Design and Behavioral Assessment:

The participating children performed three types of tasks: playing with specific toys, following one of four versions of step-by-step instructions, and free play, to demonstrate various cognitive, motor, and social skills. Their behavioral performance was evaluated using the Problem Behavior Checklist, covering 14 typical issues (self-injury, aggression, repetitive motions, noncompliance, eating disorders, hyperactivity, etc.), rated on a 5-point Likert scale. The sample average score was 33, reflecting moderate levels of problematic behavior.

Data Annotation:

All videos were manually coded by five trained psychology graduate students (supervised by a certified psychologist and a BCBA) using the FOS-R-III scheme, which records 23 interaction-style (IS) codes in 10-second intervals and covers both child and parent behaviors (e.g., Praise, Affection, Non-compliance). Codes carry positive or negative markers indicating emotional valence (for instance, sa+ for positive social attention and sa- for negative social attention). The team conducted 20 hours of annotator training, and cross-coder reliability checks on 30% of the videos reached 90% agreement, well above the 80% standard commonly used in the field, providing a reliable data foundation for subsequent AI model training.
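For context, observational coding reliability of this kind is often reported as interval-by-interval percent agreement. The sketch below shows one way to compute it, assuming each coder assigns a set of FOS-R-III codes to every 10-second interval; the function and example labels are hypothetical, not taken from the paper.

```python
from typing import List, Set

def percent_agreement(coder_a: List[Set[str]], coder_b: List[Set[str]]) -> float:
    """Interval-by-interval percent agreement between two coders.

    Each element is the set of FOS-R-III codes one coder assigned to a
    10-second interval; an interval counts as an agreement only when both
    coders assigned exactly the same set of codes.
    """
    assert len(coder_a) == len(coder_b), "coders must rate the same intervals"
    agreements = sum(a == b for a, b in zip(coder_a, coder_b))
    return agreements / len(coder_a)

# Hypothetical example: three 10-second intervals
a = [{"sa+"}, {"pi+", "sa+"}, set()]
b = [{"sa+"}, {"pi+"}, set()]
print(f"agreement = {percent_agreement(a, b):.0%}")  # -> 67%
```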

2. Data Preprocessing and Feature Extraction

Video Processing:

Each original video was segmented into 10-second clips to match the FOS-R-III coding interval. Three visual sampling strategies were compared: (a) Middle Frame Spatial Attention, which uses the center frame of each clip and splits it into 196 spatial patches; (b) Cross-Frame Attention, which divides the clip into four segments, samples a key frame from each, and likewise yields 196 patches; and (c) Averaged Key Frame Attention, which averages the first, middle, and last frames pixel-wise before patching. Experiments showed that the third strategy, balancing spatial and temporal information, achieved the best results and was adopted as the default.
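As a concrete illustration, here is a minimal sketch of Averaged Key Frame Attention sampling, assuming frames are resized to 224x224 so that 16x16 patches yield the 196 patches described above; the resolution and frame rate are assumptions, not figures from the paper.

```python
import numpy as np

def averaged_keyframe_patches(frames: np.ndarray, patch: int = 16) -> np.ndarray:
    """Averaged Key Frame Attention sampling (sketch).

    frames: (T, H, W, C) array of RGB frames from one 10-second clip,
            assumed resized to 224x224 so a 16x16 grid yields 196 patches.
    Returns: (196, patch*patch*C) array of flattened patches.
    """
    first, middle, last = frames[0], frames[len(frames) // 2], frames[-1]
    avg = (first.astype(np.float32) + middle + last) / 3.0   # pixel-wise average
    h, w, c = avg.shape
    patches = (
        avg.reshape(h // patch, patch, w // patch, patch, c)
           .transpose(0, 2, 1, 3, 4)
           .reshape(-1, patch * patch * c)
    )
    return patches  # 196 x 768 for a 224x224 RGB input

clip = np.random.rand(250, 224, 224, 3)  # ~25 fps x 10 s, hypothetical
print(averaged_keyframe_patches(clip).shape)  # (196, 768)
```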

Audio Processing:

Audio was normalized (zero mean, unified amplitude) while preserving the original 16 kHz sample rate. Feature extraction used a Mel filter bank with a 25 ms window and a 10 ms frame shift, yielding 128-dimensional log-Mel features that were padded or truncated to a uniform 1024 frames. The resulting 1024x128 spectrogram was then split into 512 patches of 16x16 for model input.
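This pipeline matches the Kaldi-style filterbank features commonly used by audio spectrogram Transformers. A minimal sketch using torchaudio is shown below; the use of torchaudio and the zero-padding scheme are assumptions, not details from the paper.

```python
import torch
import torchaudio

def log_mel_patches(wav_path: str, target_frames: int = 1024, patch: int = 16) -> torch.Tensor:
    """Log-Mel features padded/truncated to 1024 frames, then cut into 16x16 patches (sketch)."""
    waveform, sr = torchaudio.load(wav_path)            # assume 16 kHz mono input
    waveform = waveform - waveform.mean()                # zero-mean normalization
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform, sample_frequency=sr,
        num_mel_bins=128, frame_length=25.0, frame_shift=10.0,
    )                                                    # (n_frames, 128)
    n = fbank.shape[0]
    if n < target_frames:                                # pad with zeros
        fbank = torch.nn.functional.pad(fbank, (0, 0, 0, target_frames - n))
    else:                                                # or truncate
        fbank = fbank[:target_frames]
    # 1024x128 spectrogram -> (1024/16)*(128/16) = 512 patches of 16x16
    patches = fbank.unfold(0, patch, patch).unfold(1, patch, patch).reshape(-1, patch * patch)
    return patches                                       # (512, 256)
```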

3. Model Architecture Design

Transformer-based Encoder and Decoder:

At the core is a Transformer that integrates the visual and audio modalities. Patches from both modalities are tokenized by a linear projection and augmented with positional and modality information (2D sinusoidal positional embeddings plus a modality embedding), producing 768-dimensional tokens. During pretraining, the encoder processes only the unmasked tokens, while the decoder receives all tokens, including the masked ones, for reconstruction and higher-level feature extraction.
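A minimal sketch of this tokenization step follows; the learned parameter standing in for the 2D sinusoidal positional encoding and the learned modality embedding are implementation assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Linear patch projection + positional + modality embeddings (sketch).

    Token dimensionality follows the paper's description (768-d); the learned
    positional parameter and modality embedding are assumptions.
    """
    def __init__(self, patch_dim: int, num_patches: int, dim: int = 768, num_modalities: int = 2):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                      # linear projection embedding
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))  # stands in for 2D sinusoidal PE
        self.modality = nn.Embedding(num_modalities, dim)          # 0 = video, 1 = audio

    def forward(self, patches: torch.Tensor, modality_id: int) -> torch.Tensor:
        tokens = self.proj(patches) + self.pos                     # (B, N, 768)
        tokens = tokens + self.modality(torch.tensor(modality_id, device=patches.device))
        return tokens

video_tok = PatchTokenizer(patch_dim=768, num_patches=196)   # 16x16x3 visual patches
audio_tok = PatchTokenizer(patch_dim=256, num_patches=512)   # 16x16 audio patches
v = video_tok(torch.randn(2, 196, 768), modality_id=0)
a = audio_tok(torch.randn(2, 512, 256), modality_id=1)
print(v.shape, a.shape)  # torch.Size([2, 196, 768]) torch.Size([2, 512, 768])
```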

Self-supervised Pretraining:

Pretraining follows the CAV-MAE (Contrastive Audio-Visual Masked Autoencoder) approach, combining a contrastive loss with a reconstruction loss to capture cross-modal relationships and contextual information. 75% of the tokens are masked and then reconstructed by the encoder-decoder pair. The contrastive loss pulls together audio and video features from the same clip and pushes apart features from different clips, while the reconstruction loss drives the model to learn latent structure within the data, making full use of unlabeled material.
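The sketch below shows the two loss terms in the spirit of CAV-MAE, assuming pooled clip-level embeddings for the contrastive term and MAE-style reconstruction restricted to masked tokens; the temperature value and loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def cav_mae_losses(audio_feat, video_feat, pred_patches, true_patches, mask, tau=0.05):
    """Contrastive + masked-reconstruction losses in the spirit of CAV-MAE (sketch).

    audio_feat, video_feat: (B, D) pooled clip-level embeddings of each modality.
    pred_patches, true_patches: (B, N, P) decoder outputs and original patches.
    mask: (B, N) boolean, True where a token was masked (75% of tokens).
    """
    a = F.normalize(audio_feat, dim=-1)
    v = F.normalize(video_feat, dim=-1)
    logits = a @ v.t() / tau                               # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric InfoNCE: matching audio/video pairs sit on the diagonal
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    # MSE reconstruction, computed only on the masked tokens (as in MAE)
    se = (pred_patches - true_patches).pow(2).mean(dim=-1)  # (B, N)
    reconstruction = (se * mask).sum() / mask.sum().clamp(min=1)
    return contrastive, reconstruction
```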

Supervised Learning for FOS-R-III Encoding:

For the downstream task, the pretrained model is streamlined by removing structures needed only during pretraining, and a multilabel classification head is added to recognize 13 FOS-R-III interaction styles. The encoder's output tokens are mean-pooled and fed to an MLP that outputs a probability for each style; a behavior is counted as present when its probability exceeds a threshold, and training uses a binary cross-entropy loss.
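A minimal sketch of such a head is shown below; the hidden width and the 0.5 decision threshold are assumptions rather than values reported in the paper.

```python
import torch
import torch.nn as nn

NUM_STYLES = 13  # FOS-R-III interaction styles recognized by the fine-tuned model

class FOSClassifierHead(nn.Module):
    """Mean pooling + MLP multilabel head (sketch; hidden size is an assumption)."""
    def __init__(self, dim: int = 768, hidden: int = 512, num_labels: int = NUM_STYLES):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.mlp(tokens.mean(dim=1))    # mean-pool tokens, then predict logits

head = FOSClassifierHead()
criterion = nn.BCEWithLogitsLoss()             # binary cross-entropy over the 13 labels
logits = head(torch.randn(4, 196 + 512, 768))  # fused audio-visual tokens (hypothetical)
labels = torch.randint(0, 2, (4, NUM_STYLES)).float()
loss = criterion(logits, labels)
present = torch.sigmoid(logits) > 0.5          # per-label threshold (0.5 assumed)
```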

Baseline and Comparison Model Setup:

The baseline uses GPT-4V (OpenAI's multimodal large model) combined with prompt engineering. Comparison models include SlowFast Networks (a CNN-based video understanding model pretrained on Kinetics-400) and the Vision Transformer (ViT, pretrained on ImageNet-21k), both fine-tuned on the dataset built for this study.

4. Experimental Design and Evaluation Methods

All model training and inference were conducted on a server with four NVIDIA A5000 GPUs, a hardware and software configuration realistic for clinical deployment. The dataset was split by subject so that no participant appears in both training and test sets, ensuring that reported results reflect generalization to unseen children. Evaluation metrics included multilabel accuracy, F1 score, strict accuracy, AUC, and mean average precision (mAP), together capturing overall classification performance and robustness to class imbalance. GPT-4V outputs were post-processed into a unified format before scoring.
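For reference, the sketch below shows how such multilabel metrics can be computed with scikit-learn; the macro averaging mode and the reading of strict accuracy as exact-match accuracy are assumptions, not details from the paper.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score, accuracy_score

def multilabel_metrics(y_true: np.ndarray, y_prob: np.ndarray, thr: float = 0.5) -> dict:
    """Multilabel metrics of the kind reported in the paper (sketch; averaging choices assumed)."""
    y_pred = (y_prob >= thr).astype(int)
    return {
        "multilabel_accuracy": (y_pred == y_true).mean(),        # per-label accuracy
        "strict_accuracy": accuracy_score(y_true, y_pred),       # exact match per clip (assumed)
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "auc": roc_auc_score(y_true, y_prob, average="macro"),
        "map": average_precision_score(y_true, y_prob, average="macro"),
    }

# Hypothetical example: 8 clips x 13 interaction styles, both classes present per label
y_true = np.tile(np.array([[0], [1]]), (4, 13))
y_prob = np.random.default_rng(0).random((8, 13))
print(multilabel_metrics(y_true, y_prob))
```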

4. Main Research Results

1. Industry-leading Performance

The AV-FOS model (an audio-visual fusion Transformer) outperformed the GPT-4V prompt-based baseline and the mainstream comparison models (SlowFast Networks, ViT) across multiple evaluation metrics. On previously unseen subjects it achieved over 85% accuracy, exceeding the 80% inter-rater standard (though slightly below the 90% agreement reached by the manual annotation in this study). On a highly imbalanced dataset, AV-FOS achieved AUC, mAP, and F1 scores of 0.88, 0.67, and 0.59 respectively, well ahead of the other models and demonstrating robustness in small-sample, class-imbalanced scenarios. Inference was near real-time, requiring as little as 0.0018 seconds to analyze a 10-second clip, far faster than GPT-4V, whose local deployment is constrained by hardware requirements and higher latency.

2. Class-wise Differences and Error Analysis

The AV-FOS model excels at recognizing individual interaction styles, particularly those that depend on audio information (such as Positive Vague Instruction or Positive Specific Instruction), demonstrating sensitivity to complex, clinically relevant behaviors. Purely visual models can infer some audio-related behaviors from cues such as lip movement or head gestures, but the fusion model remains superior. For rare classes (such as Complaint, Parent Affection, and Non-compliance), sample scarcity drives conservative predictions across all models, yet AV-FOS still performs best on these minority categories. The statistical significance of the performance differences was verified with the Wilcoxon signed-rank test.
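For reference, a paired Wilcoxon signed-rank test can be run with SciPy as sketched below; the choice of per-class F1 scores as the paired unit and all numbers shown are hypothetical, not taken from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-class F1 scores for AV-FOS and a comparison model (13 classes each)
rng = np.random.default_rng(1)
f1_avfos = rng.uniform(0.4, 0.8, size=13)
f1_baseline = f1_avfos - rng.uniform(0.0, 0.2, size=13)   # baseline assumed weaker

stat, p_value = wilcoxon(f1_avfos, f1_baseline)            # paired, non-parametric test
print(f"W = {stat:.2f}, p = {p_value:.4f}")
```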

3. Multimodal Fusion Advantages and Ablation Studies

Ablation experiments showed that the audio-only A-FOS model outperformed the video-only V-FOS model, particularly for instructional and social behaviors, and that fusing the two modalities brought further gains. Removing CAV-MAE pretraining reduced generalization accuracy by only about 2% but caused a pronounced drop in F1 and mAP, underscoring the value of self-supervised pretraining for handling class imbalance. Ablation of the visual sampling strategies confirmed that Averaged Key Frame Attention best balances spatial and temporal cues while remaining efficient enough for clinical application.

4. Inference Visualization and Model Interpretability

Visualizing the attention maps of the fusion layer revealed four prominent attention regions: visual-to-visual, visual-to-audio, audio-to-visual, and audio-to-audio, indicating genuine cross-modal integration rather than reliance on a single modality. This property also helps behavioral-medicine experts interpret the model's reasoning paths.
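As an illustration, the four regions can be read off a fused-layer attention matrix by slicing it into quadrants; the token ordering (video tokens first, audio tokens second) and the example matrix are assumptions in this sketch.

```python
import torch

def attention_quadrants(attn: torch.Tensor, n_video: int = 196, n_audio: int = 512) -> dict:
    """Split a joint self-attention map into the four cross-modal quadrants (sketch).

    attn: (n_video + n_audio, n_video + n_audio) attention weights from a fusion
    layer, averaged over heads; video-then-audio token ordering is assumed.
    """
    v, a = slice(0, n_video), slice(n_video, n_video + n_audio)
    return {
        "visual-to-visual": attn[v, v],
        "visual-to-audio": attn[v, a],
        "audio-to-visual": attn[a, v],
        "audio-to-audio": attn[a, a],
    }

attn = torch.softmax(torch.randn(708, 708), dim=-1)   # hypothetical fused-layer attention
for name, block in attention_quadrants(attn).items():
    print(f"{name}: mean attention {block.mean():.4f}")
```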

5. Conclusions and Research Value

This work presents, for the first time, a dataset built around the FOS-R-III scale together with the AV-FOS automatic coding model, addressing several long-standing problems in the autism field: difficult behavioral assessment, heavy manual annotation burden, scarce clinical data, and limited model interpretability. It offers a new paradigm and technical route for automating autism behavior analysis. The model not only fuses audio and visual modalities but also generalizes to complex, real clinical settings, showing practical value for diagnosis, risk assessment, and intervention support.

Scientifically, this research advances medicine-oriented behavioral analysis AI with international-level innovations: self-supervised pretraining, cross-modal attention mechanisms, and medical feature engineering. In terms of application, these outcomes are expected to be implemented in hospitals and rehabilitation centers, greatly improving diagnostic efficiency and reducing costs, providing autistic children’s families with more personalized and timely support.

6. Highlights and Significance

  1. Clinically-sourced Original Data: Data collection and annotation strictly followed ethical and disciplinary standards, ensuring high-quality training for AI models.
  2. Novel Audio-Visual Multimodal Deep Model: First implementation of FOS-R-III-based fine-grained automatic encoding, combining self-supervised and supervised learning for significant improvements in medical behavior recognition.
  3. Handling Imbalanced Data and Small Sample Challenges: By integrating universal pretraining and medical-specific feature engineering, the model achieves world-leading performance for minority classes.
  4. Industry-leading Real-Time Inference: Extremely fast inference meets the clinical demand for rapid and accurate diagnostics.
  5. Model Interpretability and Transparency: Attention visualization aids medical experts in understanding AI reasoning, enhancing trust and adoption in medical practice.

7. Additional Information and Outlook

Both the research dataset and model algorithms are planned for academic release, supporting global collaboration toward standardizing automated autism behavior analysis. The paper strictly adheres to IEEE and ethical committee guidelines, with rigorous data privacy protections to ensure human subject rights.

As the team continues to collect data and optimize the model, the system will further improve recognition of minority behaviors and expand to more application scenarios, with anticipated broader impact in autism diagnosis/intervention and affective disorder analysis. The future of AI-medical integration is being gradually shaped and realized by these innovative studies.