VNL-STES: A Benchmark Dataset and Model for Spatiotemporal Event Spotting in Volleyball Analytics

Hoang Nguyen1,2, Ankhzaya Jamsrandorj2, Vanyi Chao1,2, Yin May Oo1,2, Muhammad Amrulloh Robbani1,2, Kyung-Ryoul Mun2, Jinwook Kim2
1 AI-Robotics, KIST School, University of Science and Technology, Seoul 02792, Republic of Korea
2 Korea Institute of Science and Technology, Seoul 02792, Republic of Korea

Demo video: the STES model detecting and localizing key volleyball events.

Abstract

Volleyball video analytics requires precisely detecting both the timing and location of key events. We introduce a novel task, Precise Spatiotemporal Event Spotting, which aims to determine exactly when and where important events occur within a video. To this end, we created the Volleyball Nations League (VNL) Dataset, comprising 8 full games, 1,028 rally videos, and 6,137 events annotated with both temporal and spatial localization. Our best model, the Spatiotemporal Event Spotter (STES), outperforms the current state of the art (SOTA) in temporal action spotting by 9.86 mean Temporal Average Precision (mTAP) and achieves a notable 80.21 mean Spatial Average Precision (mSAP), accurately pinpointing event locations within a 2-6 pixel tolerance.

To the best of our knowledge, this is the first work addressing Precise Spatiotemporal Event Spotting in volleyball, establishing a strong baseline for future research in this domain.

Figure 1: Example sequence for a serve event, highlighting the exact timing t with the spatial location annotated by the white circle and the event class indicated above it.

Volleyball Nations League (VNL) Dataset

Overview

The VNL Dataset consists of 8 full match videos from the 2022 and 2023 Volleyball Nations League seasons (sourced with permission from Volleyball World), segmented into 1,028 rally videos with 251,110 frames and 6,137 annotated events. Each rally is captured at 25 FPS, with full HD MP4 files available for visualization. The dataset features variations in player jerseys, court visuals, and audience presence.

Annotation Process

We employed a semi-automated annotation workflow using a custom interface:

  1. Video Download & Rally Extraction: Full matches were downloaded and segmented into rallies using court keypoint detection.
  2. Pseudolabeling: A pre-trained model generated initial event proposals.
  3. Manual Correction: Human annotators verified, corrected, added, or removed labels, precisely marking the temporal frame and spatial (x,y) coordinates for each event.

Six core actions (Serve, Receive, Set, Spike, Block, and Score) were selected for their fundamental roles in gameplay. Detailed definitions are available in the paper; a minimal sketch of the resulting annotation record follows.
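
To make the annotation format concrete, here is a minimal sketch of how one annotated event could be represented in code. The Event dataclass and its field names are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    SERVE = "serve"
    RECEIVE = "receive"
    SET = "set"
    SPIKE = "spike"
    BLOCK = "block"
    SCORE = "score"

@dataclass
class Event:
    """One annotated event: the exact frame plus pixel coordinates (hypothetical schema)."""
    rally_id: str   # which rally clip the event belongs to
    frame: int      # temporal label: frame index within the 25 FPS clip
    x: float        # spatial label: horizontal pixel coordinate
    y: float        # spatial label: vertical pixel coordinate
    action: Action  # one of the six core actions

# Example: a serve spotted at frame 112 of a rally clip (illustrative values).
serve = Event(rally_id="match3_rally017", frame=112, x=964.0, y=421.0, action=Action.SERVE)
```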

Figure 2: Data acquisition pipeline for the VNL dataset.

Dataset Statistics

Action    # Events   Percentage (%)
Serve     1,071      17.45
Receive   1,558      25.39
Set       1,393      22.70
Spike     1,321      21.53
Block     550        8.96
Score     244        3.97

Dataset Splits

The dataset is divided into training (811 rallies, 78.9%), validation (102 rallies, 9.9%), and testing (115 rallies, 11.2%) subsets. Splits are made at the match level, so no match appears in more than one split, ensuring models are evaluated on unseen games.

Split        # Clips (Rallies)   # Events   # Frames
Train        811                 4,901      202,909
Validation   102                 600        21,846
Test         115                 636        26,355
All          1,028               6,137      251,110

Proposed Method

Spatiotemporal Event Spotter (STES)

The Spatiotemporal Event Spotter (STES) is a multi-task deep learning model that simultaneously predicts the timing (precise frame) and spatial location (x, y coordinates) of key volleyball events. It comprises three components, sketched in code after the list:

  • Feature Extractor: Utilizes a RegNet-Y backbone with a Gate Shift Module (GSM) to capture rich spatiotemporal features from input frames.
  • Temporal Aggregator & Predictor: Employs a multi-layered bidirectional GRU (3 layers in our best model) to learn long-term dependencies and classify events frame-by-frame.
  • Spatial Predictor: Uses a lightweight MLP head operating on backbone features to regress the precise spatial coordinates of events.
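
Below is a minimal PyTorch-style sketch of this three-part design. It is illustrative only: a tiny convolutional stack stands in for the RegNet-Y backbone with the Gate Shift Module, and the layer sizes, module names, and the assumed extra background class are not the authors' exact implementation.

```python
import torch.nn as nn

NUM_CLASSES = 6 + 1  # six event classes plus an assumed background ("none") class

class STESSketch(nn.Module):
    """Illustrative skeleton of the three STES components."""

    def __init__(self, feat_dim=512, hidden=256, gru_layers=3):
        super().__init__()
        # 1) Feature extractor: per-frame backbone. The paper uses RegNet-Y
        #    with a Gate Shift Module (GSM); a tiny CNN stands in here.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # 2) Temporal aggregator & predictor: 3-layer bidirectional GRU
        #    with a per-frame event classifier on top.
        self.gru = nn.GRU(feat_dim, hidden, num_layers=gru_layers,
                          bidirectional=True, batch_first=True)
        self.cls_head = nn.Linear(2 * hidden, NUM_CLASSES)
        # 3) Spatial predictor: lightweight MLP regressing (x, y)
        #    directly from the backbone features.
        self.xy_head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, clip):  # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1)).view(B, T, -1)  # (B, T, feat_dim)
        temporal, _ = self.gru(feats)                             # (B, T, 2*hidden)
        return self.cls_head(temporal), self.xy_head(feats)      # (B,T,C), (B,T,2)
```

Note that, mirroring the description above, the spatial head reads the backbone features directly while the classifier reads the GRU output.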

The model is trained end-to-end with a multi-task loss that combines weighted cross-entropy for per-frame temporal classification with an L1 loss for spatial localization, applied only on event frames, balancing the two tasks.
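
A minimal sketch of such a multi-task objective, assuming the spatial L1 term is masked to ground-truth event frames and that the balancing factor lambda_xy and the per-class weights are free hyperparameters; the paper's exact weighting scheme may differ:

```python
import torch.nn.functional as F

def stes_loss(logits, coords, labels, gt_xy, class_weights, lambda_xy=1.0):
    """Multi-task loss sketch: weighted CE per frame + masked L1 on (x, y).

    logits: (B, T, C) per-frame class scores
    coords: (B, T, 2) predicted (x, y) per frame
    labels: (B, T)    ground-truth class per frame (0 = background)
    gt_xy:  (B, T, 2) ground-truth (x, y), defined only on event frames
    """
    # Temporal term: weighted cross-entropy over every frame.
    ce = F.cross_entropy(logits.flatten(0, 1), labels.flatten(),
                         weight=class_weights)
    # Spatial term: L1, applied only where an event actually occurs.
    event_mask = labels > 0
    if event_mask.any():
        l1 = F.l1_loss(coords[event_mask], gt_xy[event_mask])
    else:
        l1 = coords.sum() * 0.0  # keep the graph connected when a batch has no events
    return ce + lambda_xy * l1
```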

Figure 3: The Spatiotemporal Event Spotter (STES) framework consists of a Feature Extractor, Temporal Aggregator & Predictor, and Spatial Predictor.

Results

Performance Overview

Our STES model demonstrates superior performance on the VNL Dataset. It achieves a mean Temporal Average Precision (mTAP@0-4F) of 68.44, surpassing state-of-the-art temporal spotting models such as T-DEED (58.58 mTAP) by 9.86 mTAP. On the novel spatial localization task, STES attains a mean Spatial Average Precision (mSAP@2-6P) of 80.21, accurately localizing events within a 2-6 pixel tolerance and outperforming the baseline spatial model (E2E Spatial, 73.61 mSAP).
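
To illustrate how tolerance-based spotting metrics of this kind typically decide a true positive, here is a hedged sketch: a prediction matches an unmatched ground-truth event of the same class when its frame error (temporal) or pixel distance (spatial) falls within the tolerance. The greedy matching and function name are assumptions; the paper's evaluation protocol may differ in detail.

```python
def count_true_positives(preds, gts, tol, key):
    """Greedy one-to-one matching within a tolerance (sketch).

    preds: list of dicts with "action", "frame", "x", "y", sorted by confidence
    gts:   list of dicts with the same fields
    tol:   tolerance in frames (temporal, e.g. 0-4F) or pixels (spatial, e.g. 2-6P)
    key:   "temporal" or "spatial"
    """
    matched, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i in matched or g["action"] != p["action"]:
                continue
            if key == "temporal":
                err = abs(p["frame"] - g["frame"])
            else:  # spatial: Euclidean pixel distance
                err = ((p["x"] - g["x"]) ** 2 + (p["y"] - g["y"]) ** 2) ** 0.5
            if err <= tol:
                matched.add(i)
                tp += 1
                break
    return tp

# mTAP@0-4F and mSAP@2-6P then average the per-tolerance AP values,
# e.g. over tolerances {0, 1, 2, 4} frames or {2, 4, 6} pixels.
```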

Overall Temporal Event Spotting Performance (mTAP)

Model         mTAP@0F   mTAP@1F   mTAP@2F   mTAP@4F   mTAP@0-4F
E2E Spot      32.37     57.95     63.37     67.58     55.32
T-DEED        30.41     63.39     69.00     71.51     58.58
E2E Spatial   44.56     70.03     72.15     72.51     64.81
STES (Ours)   46.76     73.64     76.29     77.06     68.44

Event-Specific Temporal Performance (mTAP@0-4F)

Event     E2E Spot   T-DEED   E2E Spatial   STES (Ours)
Block     50.81      60.32    56.27         59.39
Receive   67.40      68.97    72.19         73.56
Score     18.22      10.28    25.55         43.05
Serve     44.14      58.68    78.71         79.73
Set       77.07      78.16    79.04         77.88
Spike     74.28      75.06    77.12         77.03

Spatial Event Spotting Performance (mSAP)

Metric      E2E Spatial   STES (Ours)
mSAP@2P     57.16         69.63
mSAP@4P     79.86         84.52
mSAP@6P     83.82         86.47
mSAP@2-6P   73.61         80.21

Confusion Matrix Analysis

Figure 4: Confusion matrix of the STES model on the test split (σ_f = 4 frames).

The confusion matrix highlights both strengths and areas for improvement. Most actions are recognized reliably, but spike and block are sometimes confused with each other because of their temporal proximity and visual similarity. Similarly, score events are occasionally confused with receive (e.g., a failed receive attempt) or missed entirely (predicted as none) when the ball is occluded as it touches the ground.

Conclusions and Future Work

We introduced Precise Spatiotemporal Event Spotting for volleyball and presented the VNL Dataset along with our STES model, which sets a new state-of-the-art baseline. Our model excels in temporal spotting and provides strong spatial localization, though challenges remain in distinguishing visually similar or closely timed events (e.g., spike vs. block, score vs. receive) and handling occlusions.

Limitations: The current VNL Dataset size, while substantial, may limit generalization. The semi-automated annotation process, though efficient, is susceptible to human error, and class imbalance (e.g., fewer block and score events) can bias model performance.

Future Work: We plan to expand the dataset diversity, employ techniques like weighted loss or resampling to mitigate class imbalance, and explore integrating automated ball detection to aid annotation. Incorporating multimodal data, such as audio cues, could further enhance detection robustness. These efforts aim to advance volleyball analytics by providing richer data and more sophisticated models.
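
As one concrete way to realize the weighted-loss idea, inverse-frequency class weights can be derived from the event counts in the dataset statistics table; the normalization used here (weights averaging to 1) is an assumption rather than the paper's prescription.

```python
import torch

# Event counts from the dataset statistics table.
counts = {"serve": 1071, "receive": 1558, "set": 1393,
          "spike": 1321, "block": 550, "score": 244}

# Inverse-frequency weights, normalized so they average to 1.
# Rare classes (block, score) get proportionally larger weights.
total = sum(counts.values())
weights = {k: total / (len(counts) * v) for k, v in counts.items()}

class_weights = torch.tensor(list(weights.values()))
print({k: round(w, 2) for k, w in weights.items()})
# e.g. block ≈ 1.86, score ≈ 4.19, receive ≈ 0.66 (approximate values)
```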

BibTeX

@inproceedings{nqh25vnlstes,
  title     = {VNL-STES: A Benchmark Dataset and Model for Spatiotemporal Event Spotting in Volleyball Analytics},
  author    = {Nguyen, Hoang and Jamsrandorj, Ankhzaya and Chao, Vanyi and Oo, Yin May and Robbani, Muhammad Amrulloh and Mun, Kyung-Ryoul and Kim, Jinwook},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2025},
}