Automatically Labeling Offensive Formations in American Football Film Using Deep Learning
DOI: https://doi.org/10.47611/jsrhs.v12i1.4278

Keywords: American Football, Sports Analytics, Computer Vision, Offensive Football Formations, Image Classification, Deep Learning, Convolutional Neural Network, Transformer

Abstract
Web services that let high-school athletic teams store, annotate, and share game video have become more prevalent in recent years. However, most services cannot tag film automatically, leaving coaches with many hours of manual annotation. For American football videos, coaches need to label formations, plays, and field positions in order to extract insights and create strategic game plans. This paper presents an end-to-end machine learning pipeline for automatically labeling American football offensive formations in videos. The pipeline includes pre-processing of videos, image classification, and a novel inference approach. The study compares a custom CNN model with pre-trained image classifiers adapted via transfer learning: CNN-based architectures (MobileNet, Inception, EfficientNet, etc.) and a transformer-based Vision Transformer (ViT). All models are trained on roughly 1,400 images, extracted from video clips of high-school football games, covering the three most common formation labels. Several models, including the custom CNN, achieved greater than 90% classification accuracy on the test dataset. Inference is performed by sampling multiple frames from a video clip, passing each through the trained image classifier, and taking a majority vote on the per-frame predictions to determine the final label. Sampling five frames at 0.5-second intervals, starting 1 second into the clip, yields the highest inference accuracy, 95.4%, with the custom CNN model. This system can assist football coaches at all levels in analyzing game footage and identifying formations.
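Because the abstract specifies both the sampling schedule and the voting rule, the inference step can be sketched directly. Below is a minimal Python illustration, assuming a trained Keras classifier and OpenCV for frame extraction; the label names, 224x224 input size, pixel scaling, and file names are assumptions for illustration, not details taken from the paper.

```python
# Sketch of the frame-sampling + majority-vote inference described in the
# abstract. Assumes a trained 3-class Keras model; labels are hypothetical.
from collections import Counter

import cv2
import numpy as np
import tensorflow as tf

FORMATIONS = ["gun", "singleback", "i-form"]  # assumed label names


def sample_frames(video_path, start_sec=1.0, interval_sec=0.5, num_frames=5):
    """Grab num_frames frames, interval_sec apart, starting at start_sec."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for i in range(num_frames):
        # Seek by timestamp (milliseconds) rather than frame index.
        cap.set(cv2.CAP_PROP_POS_MSEC, (start_sec + i * interval_sec) * 1000)
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, (224, 224)))  # assumed input size
    cap.release()
    return np.array(frames, dtype=np.float32) / 255.0  # assumed [0, 1] scaling


def classify_clip(model, video_path):
    """Classify each sampled frame, then majority-vote the final label."""
    batch = sample_frames(video_path)
    preds = model.predict(batch, verbose=0)  # shape: (num_frames, 3)
    votes = np.argmax(preds, axis=1)         # per-frame label index
    winner, _ = Counter(votes.tolist()).most_common(1)[0]
    return FORMATIONS[winner]


# Usage (hypothetical file names):
# model = tf.keras.models.load_model("formation_cnn.h5")
# print(classify_clip(model, "clip_001.mp4"))
```

Majority voting over five frames makes the prediction robust to a single blurry or mid-motion frame, which is presumably why it outperforms single-frame classification in the paper's reported results.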
References or Bibliography
Ajmeri, O., & Shah, A. (2018). Using Computer Vision and Machine Learning to Automatically Classify NFL Game Film and Develop a Player Tracking System. In MIT Sloan Sports Analytics Conference, Boston. Retrieved December 31, 2022, from https://www.sloansportsconference.com/research-papers/using-computer-vision-and-machine-learning-to-automatically-classify-nfl-game-film-and-develop-a-player-tracking-system.
Atmosukarto, I., Ghanem, B., Ahuja, S., Muthuswamy, K., & Ahuja, N. (2013, June). Automatic Recognition of Offensive Team Formation in American Football Plays. In 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 991-998). https://doi.org/10.1109/CVPRW.2013.144
Bertasius, G., Wang, H., & Torresani, L. (2021). Is Space-Time Attention All You Need for Video Understanding? https://doi.org/10.48550/ARXIV.2102.05095
Dickmanns, L. (2021). Pose Estimation and Analysis for American Football Videos (dissertation).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv. https://doi.org/10.48550/arxiv.2010.11929
Feichtenhofer, C. (2020). X3D: Expanding Architectures for Efficient Video Recognition. CoRR, abs/2004.04730. https://doi.org/10.48550/arXiv.2004.04730
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
Hess, R., Fern, A., & Mortensen, E. (2007). Mixture-of-parts pictorial structures for objects with variable part sets. In Proceedings of the International Conference on Computer Vision (ICCV).
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., ... & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. https://doi.org/10.48550/arXiv.1704.04861
Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D Convolutional Neural Networks for Human Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221-231. https://doi.org/10.1109/TPAMI.2012.59
Li, Z., Liu, F., Yang, W., Peng, S., & Zhou, J. (2022). A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Transactions on Neural Networks and Learning Systems, 33(12), 6999-7019. https://doi.org/10.1109/TNNLS.2021.3084827
Newman, J. D. (2022). Automated Pre-Play Analysis of American Football Formations Using Deep Learning. Theses and Dissertations, 9623.
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollár, P. (2020). Designing Network Design Spaces. arXiv. https://doi.org/10.48550/arxiv.2003.13678
Ribani, R., & Marengoni, M. (2019, August). A Survey of Transfer Learning for Convolutional Neural Networks. In 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images Tutorials (SIBGRAPI-T) (pp. 47-57). https://doi.org/10.1109/SIBGRAPI-T.2019.00010
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 27 (pp. 568-576).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2818-2826).
Tan, M., & Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. https://doi.org/10.48550/arXiv.1905.11946
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016). Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. arXiv. https://doi.org/10.48550/ARXIV.1608.00859
Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Zhang, Z., Lin, H., Sun, Y., He, T., Mueller, J., Manmatha, R., Li, M., & Smola, A. (2022). ResNeSt: Split-Attention Networks. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 2735-2745).
Copyright (c) 2023 Kyle Zhou; Jason Galbraith
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright holder(s) granted JSR a perpetual, non-exclusive license to distribute & display this article.