Real-Time Semantic Segmentation for Human-Labeled Data: A Comparative Study Between CNN and Transformer
DOI: https://doi.org/10.47611/jsrhs.v13i1.6329
Keywords: Computer Vision, Semantic Segmentation
Abstract
Convolutional Neural Networks (CNNs) have rapidly advanced the field of computer vision, notably in image classification, semantic segmentation, and object detection. These networks extract image features efficiently through local receptive fields and shared weights. Despite their effectiveness in many applications, CNNs face limitations, such as difficulty managing large-scale parameters and a tendency to overfit, especially in complex scenarios that require contextual understanding. Transformer-based models, originally designed for natural language processing, have recently gained prominence in computer vision. They are particularly adept at capturing long-range dependencies, a critical aspect of interpreting complex visual scenes, and their scalability and adaptability open new avenues for innovation. However, these models also come with drawbacks, including a need for extensive training data and higher computational costs, and their complex structures make optimization particularly challenging in resource-constrained environments. In this paper, we focus on a real-time comparison between CNN and Transformer architectures, representing CNNs with the STDC model and Transformer-based models with SegFormer. Our analysis reveals that the STDC model significantly outperforms SegFormer in inference speed, achieving around 97 frames per second (fps) versus 50 fps for the SegFormer B0. In accuracy, however, the SegFormer B0 is superior, reaching a mean Intersection over Union (mIoU) of 86.78 compared to 82.9 for the STDC. This study underscores the efficiency-accuracy trade-off between the two architectures and highlights the strengths of both CNNs and Transformer-based models in real-time applications.
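The two quantities compared above, inference speed in frames per second and mean Intersection over Union, can be measured with standard routines. The sketch below is a minimal, illustrative example using PyTorch and NumPy, not the authors' evaluation code; the model, input resolution, and number of classes are placeholders to be filled in with the actual STDC or SegFormer B0 setup.

import time
import numpy as np
import torch

def mean_iou(pred, target, num_classes, ignore_index=255):
    # pred and target: integer class maps as NumPy arrays of the same shape.
    mask = target != ignore_index
    pred = pred[mask].astype(np.int64)
    target = target[mask].astype(np.int64)
    # Confusion matrix via bincount over (target, pred) class pairs.
    conf = np.bincount(num_classes * target + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = intersection / union          # classes absent everywhere become NaN
    return float(np.nanmean(iou))           # NaNs are skipped in the mean

@torch.no_grad()
def benchmark_fps(model, input_size=(1, 3, 512, 1024), warmup=10, iters=100):
    # Average single-image throughput over repeated forward passes.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.eval().to(device)
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()            # wait for queued GPU work to finish
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

In a full evaluation, the confusion matrix would be accumulated over the entire validation set before computing the mean, and both models would be benchmarked at the same input resolution so the fps figures are directly comparable.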
References or Bibliography
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431-3440).
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2881-2890).
Chen, L. C., Zhu, Y., Papandreou, G., Schroff, F., & Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV) (pp. 801-818).
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 10012-10022).
Wang, W., Xie, E., Li, X., Fan, D. P., Song, K., Liang, D., ... & Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 568-578).
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., ... & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4510-4520).
Howard, A., Sandler, M., Chu, G., Chen, L. C., Chen, B., Tan, M., ... & Adam, H. (2019). Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1314-1324).
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., & Luo, P. (2021). SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34, 12077-12090.
Fan, M., Lai, S., Huang, J., Wei, X., Chai, Z., Luo, J., & Wei, X. (2021). Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9716-9725).
Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., & Zhang, L. (2021). Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 22-31).
Xu, W., Xu, Y., Chang, T., & Tu, Z. (2021). Co-scale conv-attentional image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9981-9990).
Li, H., Xiong, P., Fan, H., & Sun, J. (2019). Dfanet: Deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9522-9531).
Zhao, H., Qi, X., Shen, X., Shi, J., & Jia, J. (2018). Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European conference on computer vision (ECCV) (pp. 405-420).
Copyright (c) 2024 Connie Chen
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright holder(s) granted JSR a perpetual, non-exclusive license to distribute & display this article.