Transformers only look once with nonlinear combination for real-time object detection
Journal article
Xia, R., Li, G., Huang, Z., Pang, Y. and Qi, M. 2022. Transformers only look once with nonlinear combination for real-time object detection. Neural Computing and Applications. https://doi.org/10.1007/s00521-022-07333-y
Authors | Xia, R., Li, G., Huang, Z., Pang, Y. and Qi, M. |
---|---|
Abstract | In this article, a novel real-time object detector called Transformers Only Look Once (TOLO) is proposed to resolve two problems. The first is the inefficiency with which many modern real-time object detectors build long-distance dependencies among local features. The second is the lack of inductive biases in vision Transformer networks with heavy computational cost. TOLO is composed of a Convolutional Neural Network (CNN) backbone, a Feature Fusion Neck (FFN), and different Lite Transformer Heads (LTHs), which respectively transfer the inductive biases, supply the extracted features with high-resolution and high-semantic properties, and efficiently mine multiple long-distance dependencies with less memory overhead for detection. Moreover, to find the many potentially correct boxes during prediction, we propose a simple and efficient nonlinear combination method between the object confidence and the classification score. Experiments on the PASCAL VOC 2007, 2012, and MS COCO 2017 datasets demonstrate that TOLO significantly outperforms other state-of-the-art methods with a small input size. In addition, the proposed nonlinear combination method further improves the detection performance of TOLO by boosting the results of potentially correct predicted boxes without adding to the training process or the model parameters. |
Keywords | Real-time object detector; TOLO; Vision Transformer networks; Non-linear combination |
Year | 2022 |
Journal | Neural Computing and Applications |
Publisher | Springer Nature |
ISSN | 0941-0643, 1433-3058 |
Digital Object Identifier (DOI) | https://doi.org/10.1007/s00521-022-07333-y |
Official URL | https://link.springer.com/article/10.1007/s00521-022-07333-y |
Publication dates | |
Online | 21 May 2022 |
Publication process dates | |
Accepted | 19 Apr 2022 |
Deposited | 30 Jun 2022 |
Accepted author manuscript | File access level: Open |
Output status | Published |
https://repository.canterbury.ac.uk/item/915w2/transformers-only-look-once-with-nonlinear-combination-for-real-time-object-detection
Related outputs
Default clearing and ex-ante contagion in financial systems with a two-layer network structure
Qi, M., Ding, Y., Chun, Y., Liu, W. and Liu, J. 2025. Default clearing and ex-ante contagion in financial systems with a two-layer network structure. Communications in Nonlinear Science and Numerical Simulation. 142 (1), p. 108515. https://doi.org/10.1016/j.cnsns.2024.108515
Two-stage trading mechanism in enabling the design and optimization of flexible resources interaction in smart grid
Qi, M., Chen, Y., Li, H., Liu, Y., Wu, S. and Li, X. 2024. Two-stage trading mechanism in enabling the design and optimization of flexible resources interaction in smart grid. Computing and Informatics. 43 (6), pp. 1483–1515. https://doi.org/10.31577/cai_2024_6_1483
Corporate fraud detection based on improved BP neural network
Qi, M., Liu, W., Liu, M., Yan, C. and Zhang, L. 2024. Corporate fraud detection based on improved BP neural network. Computing and Informatics. 43 (3), pp. 611–632. https://doi.org/10.31577/cai_2024_3_611
Normalizing flow based uncertainty estimation for deep regression analysis
Qi, M., Zhang, B., Sui, W., Li, M. and Huang, Z. 2024. Normalizing flow based uncertainty estimation for deep regression analysis. Neurocomputing. 585 (6), p. 127645. https://doi.org/10.1016/j.neucom.2024.127645
Focusing on the golden skills of effective communication and collaboration to further enhance graduate employability attributes
O'Leary, S., McGee, M., Qi, M., Millns, S. and Eyden, A. 2024. Focusing on the golden skills of effective communication and collaboration to further enhance graduate employability attributes. Graduate College Working Papers.
Feed-forward ANN with Random Forest Technique for Identifying malicious Internet of Things network intrusion
Obarafor, Victor, Qi, Mandi and Zhang, Leishi 2024. Feed-forward ANN with Random Forest Technique for Identifying malicious Internet of Things network intrusion. in: 2024 2nd International Conference on Cyber Resilience (ICCR) IEEE.
The role of use cases when adopting augmented reality into higher education pedagogy
Ward, G., Turner, S., Pitt, C., Qi, M., Richmond-Fuller, A. and Jackson, T. 2024. The role of use cases when adopting augmented reality into higher education pedagogy.
A review of privacy-preserving federated learning, deep learning, and machine learning IIoT and IoTs solutions
Obarafor, Victor, Qi, Man and Zhang, L. 2023. A review of privacy-preserving federated learning, deep learning, and machine learning IIoT and IoTs solutions. in: 2023 8th IEEE International Conference on Signal and Image Processing (ICSIP) Wuxi, China IEEE. pp. 1074-1078.
Pre-processing of social media remarks for forensics
Gao, Xuhao and Qi, Man 2023. Pre-processing of social media remarks for forensics. IEEE. https://doi.org/10.1109/icnc-fskd59587.2023.10280980
Alignment-based conformance checking of hierarchical process models
Wang, L., Han, X., Qi, M., Wang, K. and Lu, P. 2023. Alignment-based conformance checking of hierarchical process models. Computing and Informatics. 43 (1), pp. 149-180. https://doi.org/10.31577/cai_2024_1_149
Research on UBI auto insurance pricing model based on parameter adaptive SAPSO optimal fuzzy controller
Wang, X., Liu, W., Ou, Z., Yan, C., Qi, M. and Jia, W. 2022. Research on UBI auto insurance pricing model based on parameter adaptive SAPSO optimal fuzzy controller. Computing and Informatics. 41 (4), pp. 1078-1113. https://doi.org/10.31577/cai_2022_4_1078
Novel architecture for human re-identification with a two-stream neural network and attention mechanism
Rahi, B. and Qi, M. 2022. Novel architecture for human re-identification with a two-stream neural network and attention mechanism. Computing and Informatics. 41 (4), pp. 905-930. https://doi.org/10.31577/cai_2022_4_905
A novel data analytic model for mining user insurance demands from microblogs
Chun Yan, Lu Liu, Wei Liu and Man Qi 2022. A novel data analytic model for mining user insurance demands from microblogs. Computing and Informatics. 41 (3). https://doi.org/10.31577/cai_2022_3_689
Deep learning based real-time facial mask detection and crowd monitoring
Chan-Yun Yang, Hooman Samani, Nana Ji, Chunxu Li, Ding-Bang Chen and Man Qi 2022. Deep learning based real-time facial mask detection and crowd monitoring. Computing and Informatics. 40 (6), pp. 1263-1294. https://doi.org/10.31577/cai_2021_6_1263
Repairing process models with non-free-choice constructs based on token replay
Bai, E., Qi, M., Luan, W., Li, P. and Du, Y. 2022. Repairing process models with non-free-choice constructs based on token replay. Computing and Informatics. 41 (4), pp. 1054-1077. https://doi.org/10.31577/cai_2022_4_1054
Deviation detection in clinical pathways based on business alignment
Tian, Y., Li, X., Qi, Man, Han, D. and Du, Yuyue 2022. Deviation detection in clinical pathways based on business alignment. Scientific Programming. 2022, pp. 1-13. https://doi.org/10.1155/2022/6993449
Security vulnerabilities of popular smart home appliances
Qi, M., Induruwa, A. and Hussain, F. 2021. Security vulnerabilities of popular smart home appliances. in: Proceeding of The Twentieth International Conference on Networks April 18, 2021 to April 22, 2021 - Porto, Portugal
Improved adaptive genetic algorithm for the vehicle insurance fraud identification model based on a BP neural network
Yan, C., Li, M., Liu, W. and Qi, M. 2020. Improved adaptive genetic algorithm for the vehicle insurance fraud identification model based on a BP neural network. Theoretical Computer Science. 817, pp. 12-23. https://doi.org/10.1016/j.tcs.2019.06.025
Hybrid Intrusion Detection System for Smart Home Applications
Hussain, Fida, Induruwa, Abhaya and Qi, Man 2020. Hybrid Intrusion Detection System for Smart Home Applications. in: Mahmood, Z. (ed.) Developing and Monitoring Smart Environments for Intelligent Cities IGI Global. pp. 300-322.
Payments per claim model of outstanding claims reserve based on fuzzy linear regression
Yan, C., Liu, Q., Liu, J., Liu, W., Li, M. and Qi, M. 2019. Payments per claim model of outstanding claims reserve based on fuzzy linear regression. International Journal of Fuzzy Systems. 21, pp. 1950-1960. https://doi.org/10.1007/s40815-019-00617-x
Fuzzy interacting multiple model H∞ particle filter algorithm based on current statistical model
Wang, Q., Chen, X., Zhang, L., Li, J., Zhao, C. and Qi, M. 2019. Fuzzy interacting multiple model H∞ particle filter algorithm based on current statistical model. International Journal of Fuzzy Systems. 21, pp. 1894-1905. https://doi.org/10.1007/s40815-019-00678-y
Temporal sparse feature auto-combination deep network for video action recognition
Wang, Q., Gong, D., Qi, M., Shen, Y. and Lei, Y. 2018. Temporal sparse feature auto-combination deep network for video action recognition. Concurrency and Computation: Practice and Experience. https://doi.org/10.1002/cpe.4487
Soundness analytics of composed logical workflow nets
Liu, W., Wang, L., Feng, X., Qi, M., Yan, C. and Li, M. 2017. Soundness analytics of composed logical workflow nets. International Journal of Parallel Programming. https://doi.org/10.1007/s10766-017-0536-8
A sliding window-based dynamic load balancing for heterogeneous Hadoop clusters
Liu, Y., Jing, W., Liu, Y., Lv, L., Qi, M. and Xiang, Y. 2016. A sliding window-based dynamic load balancing for heterogeneous Hadoop clusters. Concurrency and Computation: Practice and Experience. 29 (3). https://doi.org/10.1002/cpe.3763
Gaussian-Gamma collaborative filtering: a hierarchical Bayesian model for recommender systems
Luo, C., Zhang, B., Xiang, Y. and Qi, M. 2017. Gaussian-Gamma collaborative filtering: a hierarchical Bayesian model for recommender systems. Journal of Computer and System Sciences. https://doi.org/10.1016/j.jcss.2017.03.007
Facilitating visual surveillance with motion detections
Qi, M. 2017. Facilitating visual surveillance with motion detections. Concurrency and Computation: Practice and Experience. 29 (3). https://doi.org/10.1002/cpe.3770
Data security of android applications
Obiri-Yeboah, J. and Qi, M. 2016. Data security of android applications. in: 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery : ICNC-FSKD 2016 : 13-15 August, Changsha, China IEEE Xplore.
AL-DDCNN : a distributed crossing semantic gap learning for person re-identification
Cheng, K., Zhan, Y. and Qi, M. 2017. AL-DDCNN : a distributed crossing semantic gap learning for person re-identification. Concurrency and Computation: Practice and Experience. 29 (3). https://doi.org/10.1002/cpe.3766