Volleyball video analytics requires precisely detecting both the timing and location of key events. We introduce a novel task, Precise Spatiotemporal Event Spotting, which seeks to determine exactly when and where important events occur within a video. To this end, we created the Volleyball Nations League (VNL) Dataset, comprising 8 full games, 1,028 rally videos, and 6,137 events annotated with both temporal and spatial localization. Our best model, the Spatiotemporal Event Spotter (STES), outperforms the current state-of-the-art (SOTA) in temporal action spotting by 9.86 mean Temporal Average Precision (mTAP) and achieves a notable 80.21 mAP for spatial localization, accurately pinpointing event locations within a 2-6 pixel tolerance.
To the best of our knowledge, this is the first work addressing Precise Spatiotemporal Event Spotting in volleyball, establishing a strong baseline for future research in this domain.
The VNL Dataset consists of 8 full match videos from the 2022 and 2023 Volleyball Nations League seasons (sourced with permission from Volleyball World), segmented into 1,028 rally videos with 251,110 frames and 6,137 annotated events. Each rally is captured at 25 FPS, with full HD MP4 files available for visualization. The dataset features variations in player jerseys, court visuals, and audience presence.
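The reported totals also imply the average rally length, which follows from simple arithmetic on the figures above:

```python
# Derived from the reported dataset totals: 251,110 frames over 1,028 rallies at 25 FPS.
frames, rallies, fps = 251_110, 1_028, 25

avg_frames = frames / rallies   # average frames per rally
avg_seconds = avg_frames / fps  # average rally duration in seconds

print(f"{avg_frames:.1f} frames = {avg_seconds:.1f} s per rally")  # 244.3 frames = 9.8 s per rally
```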
We employed a semi-automated annotation workflow using a custom interface. The resulting distribution of annotated events across the six action classes is:
| Action | # Events | Percentage (%) |
|---|---|---|
| Serve | 1,071 | 17.45 |
| Receive | 1,558 | 25.39 |
| Set | 1,393 | 22.70 |
| Spike | 1,321 | 21.53 |
| Block | 550 | 8.96 |
| Score | 244 | 3.97 |
The dataset is divided into training (811 rallies, 78.9%), validation (102 rallies, 9.9%), and testing (115 rallies, 11.2%) subsets. Each split contains unique matches to ensure model generalization.
| Split | # Clips (Rallies) | # Events | # Frames |
|---|---|---|---|
| Train | 811 | 4,901 | 202,909 |
| Validation | 102 | 600 | 21,846 |
| Test | 115 | 636 | 26,355 |
| All | 1,028 | 6,137 | 251,110 |
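The split percentages quoted in the text can be recomputed directly from the rally counts in the table:

```python
# Rally counts per split, from the table above.
splits = {"train": 811, "validation": 102, "test": 115}
total = sum(splits.values())
assert total == 1028

shares = {name: round(100 * n / total, 1) for name, n in splits.items()}
print(shares)  # {'train': 78.9, 'validation': 9.9, 'test': 11.2}
```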
The Spatiotemporal Event Spotter (STES) is a multi-task deep learning model designed to simultaneously predict the timing (precise frame) and spatial location (x, y coordinates) of key volleyball events.
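As a rough illustration of this multi-task design (a minimal sketch only; the backbone, feature dimension, and head shapes below are hypothetical and not the actual STES architecture):

```python
import numpy as np

def multitask_heads(features, w_cls, w_xy):
    """Hypothetical two-head readout over per-frame features:
    one head classifies the event (or background) at each frame,
    the other regresses the event's (x, y) location."""
    logits = features @ w_cls  # (T, C) per-frame event logits
    xy = features @ w_xy       # (T, 2) per-frame (x, y) prediction
    return logits, xy

rng = np.random.default_rng(0)
T, D, C = 16, 64, 7  # 16 frames, 64-dim features, 6 event classes + background (assumed)
feats = rng.standard_normal((T, D))
logits, xy = multitask_heads(feats,
                             rng.standard_normal((D, C)),
                             rng.standard_normal((D, 2)))
print(logits.shape, xy.shape)  # (16, 7) (16, 2)
```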
Our STES model demonstrates superior performance on the VNL Dataset. It achieves a mean Temporal Average Precision (mTAP@0-4F) of 68.44, surpassing state-of-the-art temporal spotting models such as T-DEED (58.58 mTAP) by 9.86 mTAP. For the novel spatial localization task, STES attains a mean Spatial Average Precision (mSAP@2-6P) of 80.21, accurately localizing events within a 2-6 pixel tolerance and outperforming a baseline spatial model (E2E Spatial, 73.61 mSAP).
| Model | mTAP@0F | mTAP@1F | mTAP@2F | mTAP@4F | mTAP@0-4F |
|---|---|---|---|---|---|
| E2E Spot | 32.37 | 57.95 | 63.37 | 67.58 | 55.32 |
| T-DEED | 30.41 | 63.39 | 69.00 | 71.51 | 58.58 |
| E2E Spatial | 44.56 | 70.03 | 72.15 | 72.51 | 64.81 |
| STES (Ours) | 46.76 | 73.64 | 76.29 | 77.06 | 68.44 |
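The rightmost column is consistent with averaging the per-tolerance columns (tolerances of 0, 1, 2, and 4 frames):

```python
# Per-tolerance mTAP values from the table (tolerances 0, 1, 2, 4 frames).
per_tolerance = {
    "E2E Spot":    [32.37, 57.95, 63.37, 67.58],
    "T-DEED":      [30.41, 63.39, 69.00, 71.51],
    "E2E Spatial": [44.56, 70.03, 72.15, 72.51],
    "STES (Ours)": [46.76, 73.64, 76.29, 77.06],
}
reported = {"E2E Spot": 55.32, "T-DEED": 58.58,
            "E2E Spatial": 64.81, "STES (Ours)": 68.44}

for model, aps in per_tolerance.items():
    mean_ap = sum(aps) / len(aps)
    assert abs(mean_ap - reported[model]) < 0.01, model
```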
Per-event results (mTAP@0-4F):
| Event | E2E Spot | T-DEED | E2E Spatial | STES (Ours) |
|---|---|---|---|---|
| Block | 50.81 | 60.32 | 56.27 | 59.39 |
| Receive | 67.40 | 68.97 | 72.19 | 73.56 |
| Score | 18.22 | 10.28 | 25.55 | 43.05 |
| Serve | 44.14 | 58.68 | 78.71 | 79.73 |
| Set | 77.07 | 78.16 | 79.04 | 77.88 |
| Spike | 74.28 | 75.06 | 77.12 | 77.03 |
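As a quick consistency check, the unweighted mean over the six event classes in the STES column matches STES's overall mTAP@0-4F of 68.44:

```python
# STES per-event scores from the table above.
stes = {"Block": 59.39, "Receive": 73.56, "Score": 43.05,
        "Serve": 79.73, "Set": 77.88, "Spike": 77.03}
class_mean = sum(stes.values()) / len(stes)
print(round(class_mean, 2))  # 68.44
```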
| Metric | E2E Spatial | STES (Ours) |
|---|---|---|
| mSAP@2P | 57.16 | 69.63 |
| mSAP@4P | 79.86 | 84.52 |
| mSAP@6P | 83.82 | 86.47 |
| mSAP@2-6P | 73.61 | 80.21 |
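The aggregate mSAP@2-6P is the mean over the three pixel tolerances; a prediction presumably counts as correct when it falls within the tolerance of the ground-truth point (the Euclidean matching rule below is an assumption, and the paper's exact criterion may differ):

```python
import math

# Assumed matching rule: a predicted (x, y) matches the ground truth if their
# Euclidean distance is at most tol_px pixels (tolerances of 2, 4, 6 are evaluated).
def within_tolerance(pred, gt, tol_px):
    return math.dist(pred, gt) <= tol_px

print(within_tolerance((100.0, 50.0), (103.0, 54.0), 6))  # distance 5.0 -> True
print(within_tolerance((100.0, 50.0), (103.0, 54.0), 4))  # distance 5.0 -> False

# The aggregate column: mean of mSAP@{2,4,6}P for STES.
stes_msap = [69.63, 84.52, 86.47]
print(round(sum(stes_msap) / 3, 2))  # 80.21
```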
The confusion matrix highlights strengths and areas for improvement. While distinct actions such as spike and block are generally well recognized, the two are sometimes confused with each other due to their temporal proximity and visual similarity. Similarly, score events are sometimes confused with receive (e.g., a failed receive attempt) or missed entirely (predicted as none) when the ball is occluded at the moment it touches the ground.
We introduced Precise Spatiotemporal Event Spotting for volleyball and presented the VNL Dataset along with our STES model, which sets a new state-of-the-art baseline. Our model excels in temporal spotting and provides strong spatial localization, though challenges remain in distinguishing visually similar or closely timed events (e.g., spike vs. block, score vs. receive) and handling occlusions.
Limitations: The current VNL Dataset size, while significant, may limit generalization. The semi-automated annotation process, though efficient, is susceptible to human error, and class imbalance (e.g., fewer block and score events) can bias model performance.
Future Work: We plan to expand the dataset diversity, employ techniques like weighted loss or resampling to mitigate class imbalance, and explore integrating automated ball detection to aid annotation. Incorporating multimodal data, such as audio cues, could further enhance detection robustness. These efforts aim to advance volleyball analytics by providing richer data and more sophisticated models.
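One common weighted-loss recipe (a sketch assuming inverse-frequency weighting, not necessarily the scheme the authors will adopt) derives per-class weights from the event counts in the dataset table:

```python
# Event counts from the VNL Dataset distribution table.
counts = {"serve": 1071, "receive": 1558, "set": 1393,
          "spike": 1321, "block": 550, "score": 244}
total = sum(counts.values())  # 6137 annotated events

# Inverse-frequency weights, normalized so a perfectly balanced
# dataset would give every class a weight of 1.0.
weights = {c: total / (len(counts) * n) for c, n in counts.items()}

# Rare classes get the largest weights, countering the imbalance.
print(max(weights, key=weights.get))  # 'score'
```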
@inproceedings{nqh25vnlstes,
  title     = {VNL-STES: A Benchmark Dataset and Model for Spatiotemporal Event Spotting in Volleyball Analytics},
  author    = {Nguyen, Hoang and Jamsrandorj, Ankhzaya and Chao, Vanyi and Oo, Yin May and Robbani, Muhammad Amrulloh and Mun, Kyung-Ryoul and Kim, Jinwook},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2025}
}