Tried Everything, Still Failing at CSLR with Transformer-Based Model
Overcoming Challenges in Transformer-Based Continuous Sign Language Recognition: Insights and Strategies
Implementing Transformer Architectures for CSLR: A Deep Dive
Developing effective models for Continuous Sign Language Recognition (CSLR) remains a complex challenge, especially when leveraging Transformer-based architectures. Many researchers have explored innovative approaches to improve performance, yet some obstacles remain frustratingly difficult to overcome. Here, we share a detailed account of the hurdles faced while building a dual-stream transformer model on the RWTH-PHOENIX-Weather 2014 dataset, in the hope of fostering constructive discussion and shared insights.
Model Foundations and Design Choices
The approach centers around a dual-stream architecture designed to capture comprehensive sign language cues:
- Separate processing of visual and keypoint data: one stream handles raw RGB video frames, while the other processes keypoint data extracted via Mediapipe, incorporating both appearance and pose information.
- Transformer encoding with ViViT: each stream is encoded by a ViViT model with a depth of 12 layers, enabling rich spatiotemporal feature extraction (a minimal sketch follows this list).
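To make the layout concrete, here is a minimal sketch of the two independent streams. It uses a plain nn.TransformerEncoder as a stand-in for ViViT, and the feature width, patch size, and keypoint count (D_MODEL, 16x16 patches, 75 Mediapipe landmarks) are illustrative assumptions rather than the actual configuration.

```python
import torch
import torch.nn as nn

D_MODEL, DEPTH, NUM_KEYPOINTS = 512, 12, 75   # assumed hyperparameters

class StreamEncoder(nn.Module):
    """One 12-layer encoder stream (stand-in for a ViViT encoder)."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, D_MODEL)            # patch / keypoint embedding
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=DEPTH)

    def forward(self, x):                                 # x: (B, T, in_dim)
        return self.encoder(self.proj(x))                 # -> (B, T, D_MODEL)

# Two independent streams: RGB appearance and Mediapipe pose.
rgb_encoder = StreamEncoder(in_dim=3 * 16 * 16)           # flattened RGB patches (assumed)
kp_encoder = StreamEncoder(in_dim=NUM_KEYPOINTS * 3)      # (x, y, z) per keypoint (assumed)
```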
Fusion Strategy and Interaction
The two streams are integrated through strategic cross-attention mechanisms placed after the 4th and 8th ViViT layers, fostering interaction between appearance and pose features. To facilitate balanced learning without overloading individual streams, adapter modules are inserted into the intermediate layers.
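A hedged sketch of how this fusion might be wired, again with standard PyTorch layers standing in for the ViViT blocks; the adapter bottleneck size, head count, and the decision to update both streams symmetrically are assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn

D_MODEL, DEPTH, FUSE_AFTER = 512, 12, {4, 8}              # fuse after the 4th and 8th layers

class Adapter(nn.Module):
    """Small bottleneck adapter with a residual connection."""
    def __init__(self, dim=D_MODEL, bottleneck=64):       # bottleneck size is an assumption
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

def make_layers():
    return nn.ModuleList(
        nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        for _ in range(DEPTH))

class DualStreamEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_layers, self.kp_layers = make_layers(), make_layers()
        self.rgb_adapters = nn.ModuleList(Adapter() for _ in range(DEPTH))
        self.kp_adapters = nn.ModuleList(Adapter() for _ in range(DEPTH))
        self.cross_rgb = nn.MultiheadAttention(D_MODEL, 8, batch_first=True)
        self.cross_kp = nn.MultiheadAttention(D_MODEL, 8, batch_first=True)

    def forward(self, rgb, kp):                           # both (B, T, D_MODEL)
        for i in range(DEPTH):
            rgb = self.rgb_adapters[i](self.rgb_layers[i](rgb))
            kp = self.kp_adapters[i](self.kp_layers[i](kp))
            if i + 1 in FUSE_AFTER:                       # cross-attention after layers 4 and 8
                rgb = rgb + self.cross_rgb(rgb, kp, kp)[0]   # appearance attends to pose
                kp = kp + self.cross_kp(kp, rgb, rgb)[0]     # pose attends to appearance
        return rgb, kp
```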
Decoding Approaches and Challenges
Various decoding strategies have been experimented with, but none have yielded consistent success:
- Utilizing T5 as a decoder: despite its strength in text generation, T5 integration ran into difficulties, possibly due to mismatched input-output modalities.
- PyTorch's TransformerDecoder: attempts included (one variant is sketched after this list):
  - Decoding each stream separately, then combining the outputs via cross-attention.
  - Fusing the encoded features through addition or concatenation before decoding with a shared decoder.
  - Using separate decoders for each stream, each with dedicated fully connected layers.
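For concreteness, here is a hedged sketch of the second TransformerDecoder variant (concatenate the two encoded streams, project back to the shared width, decode glosses with a single shared decoder). The gloss vocabulary size, decoder depth, and special-token handling are placeholders.

```python
import torch
import torch.nn as nn

D_MODEL, VOCAB = 512, 1300                                # gloss vocabulary size is a placeholder

class GlossDecoder(nn.Module):
    """Shared decoder over concatenated stream features (second variant above)."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Linear(2 * D_MODEL, D_MODEL)       # concat -> shared width
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)   # depth is an assumption
        self.out = nn.Linear(D_MODEL, VOCAB)

    def forward(self, rgb_mem, kp_mem, gloss_in):         # memories: (B, T, D_MODEL)
        memory = self.fuse(torch.cat([rgb_mem, kp_mem], dim=-1))
        tgt = self.embed(gloss_in)                        # (B, L, D_MODEL)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(hidden)                           # (B, L, VOCAB) gloss logits
```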
Pretraining and Its Limitations
Pretraining the ViViT encoder on 96-frame sequences proved beneficial in some contexts but failed to lead to significant improvements when integrated into the full end-to-end pipeline. Multiple variations in model depth, fusion points, and training schedules were explored without notable success.
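One thing worth double-checking in this situation is whether the pretrained weights actually survive the transfer into the end-to-end model. A minimal sketch, assuming the DualStreamEncoder from the earlier fusion sketch and a placeholder checkpoint path:

```python
import torch

# Build the full model, then pull in the 96-frame pretrained encoder weights.
model = DualStreamEncoder()                               # from the fusion sketch above
ckpt = torch.load("vivit_pretrain_96frames.pt", map_location="cpu")   # placeholder path
missing, unexpected = model.load_state_dict(ckpt, strict=False)       # tolerate new adapters/fusion
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")

# Optionally freeze the pretrained layers for a warm-up phase before full fine-tuning.
for p in model.rgb_layers.parameters():
    p.requires_grad = False
```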
Training Difficulties
Despite diligent experimentation with loss functions, optimizers (primarily Adam), learning rates, and scheduling techniques, the model struggles to converge. Validation metrics often plateau or fluctuate, suggesting potential issues in the overall setup rather than in any single hyperparameter.
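For reference, this is the kind of training step being run, shown as a minimal sketch (teacher-forced cross-entropy on the decoder output, Adam, plateau-based LR scheduling, gradient clipping). The padding index, learning rate, and clipping value are assumptions, and the encoder/decoder come from the earlier sketches.

```python
import torch
import torch.nn as nn

PAD_ID = 0                                                # assumed padding index
encoder, decoder = DualStreamEncoder(), GlossDecoder()    # from the sketches above

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)

def train_step(rgb, kp, gloss):                           # gloss: (B, L) token ids
    """One teacher-forced step; device handling and batching omitted for brevity."""
    optimizer.zero_grad()
    rgb_mem, kp_mem = encoder(rgb, kp)                    # dual-stream encoding
    logits = decoder(rgb_mem, kp_mem, gloss[:, :-1])      # shifted targets for teacher forcing
    loss = criterion(logits.reshape(-1, logits.size(-1)), gloss[:, 1:].reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(                       # guard against exploding gradients
        list(encoder.parameters()) + list(decoder.parameters()), 1.0)
    optimizer.step()
    return loss.item()
    # scheduler.step(validation_loss) would be called once per epoch, outside this step
```

Any pointers on where this setup goes wrong, or on diagnostics worth running when validation metrics plateau, would be greatly appreciated.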