2025 Programme: Transformers and Ensembles for Object Detection in Sidescan Sonar Images
- Day: June 17, Tuesday
Location / Time: D. CHLOE at 11:00-11:20
- Last-minute changes: -
- Session: 18. Towards Automatic Target Recognition. Detection, Classification and Modelling
Organiser(s): Johannes Groen, Yan Pailhas, Roy Edgar Hansen, Narada Warakagoda
Chairperson(s): Johannes Groen, Yan Pailhas
- Lecture: Transformers and Ensembles for Object Detection in Sidescan Sonar Images [Invited]
Paper ID: 2106
Author(s): Yannik Steiniger
Presenter: Yannik Steiniger
Abstract: Deep learning based computer vision models can surpass traditional methods not only in tasks on large optical benchmark datasets but also in the automatic analysis of sonar images. Convolutional neural networks (CNNs) have been developed and trained to classify or to detect objects in sonar images. Recently, Vision Transformers have emerged as the state-of-the-art in many computer vision tasks, and their application to the classification of sonar images has been investigated as well. However, transformer-based models for detecting objects in sonar images have not been analysed yet.

In this work, we compare CNN and Transformer models for the detection of objects in sidescan sonar images. As representatives of the CNN methods we select both one-stage and two-stage detectors. The transformer-based methods are Retina-SWIN and Deformable DETR, which use the attention mechanism in the network backbone or detection head, respectively. In addition, we run all detectors in two-step pipelines, in which the predictions are further processed by another CNN to reduce the number of false alarms. Furthermore, we study the combination of multiple different object detectors into an ensemble. For classification, such ensembles of CNNs have been shown to improve the overall performance.

Our results show that the transformer-based method Retina-SWIN can reach a true-positive rate of 100%, but at the cost of a high false alarm rate. Filtering detections with a CNN in the two-step pipeline significantly reduces the number of false alarms per image but also the maximum true-positive rate. The best trade-off is achieved with a Retina-SWIN model trained for localisation followed by a CNN for classification, reaching a maximum true-positive rate of nearly 95% at slightly over 30 false alarms per image. An ensemble of different object detectors can increase the detection performance, but at the cost of an increased number of false alarms.
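To make the described approach concrete, below is a minimal Python/NumPy sketch of how the two ideas from the abstract could be composed: pooling the boxes of several detectors into an ensemble, merging duplicates with greedy non-maximum suppression, and letting a second-stage CNN re-score each detection crop to reject false alarms. All names, thresholds, the fusion rule and the stand-in classifier are hypothetical illustrations, not the authors' actual models or code.

```python
"""Minimal sketch of a detector ensemble followed by a second-stage
false-alarm filter, as described in the abstract. Everything here
(function names, thresholds, dummy data) is a hypothetical stand-in."""

import numpy as np


def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)


def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) < iou_thr]
    return np.array(keep, dtype=int)


def ensemble_and_filter(detections, classify_crop, image, score_thr=0.5):
    """Pool boxes from several detectors, merge duplicates with NMS, then let
    a second-stage classifier re-score each crop to reject false alarms.

    detections    : list of (boxes [N, 4], scores [N]) tuples, one per detector
    classify_crop : callable mapping an image crop to a target probability
    """
    boxes = np.concatenate([b for b, _ in detections], axis=0)
    scores = np.concatenate([s for _, s in detections], axis=0)
    keep = nms(boxes, scores)

    kept = []
    for box, score in zip(boxes[keep], scores[keep]):
        x1, y1, x2, y2 = box.astype(int)
        crop = image[y1:y2, x1:x2]
        if classify_crop(crop) >= score_thr:  # second-stage CNN verdict
            kept.append((box, score))
    return kept


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((256, 256))  # stand-in for a sidescan sonar snippet

    # Fake outputs of two detectors; the second produces a near-duplicate box.
    det_a = (np.array([[10.0, 10, 50, 50], [100, 100, 150, 150]]),
             np.array([0.9, 0.6]))
    det_b = (np.array([[12.0, 11, 52, 49]]), np.array([0.8]))

    def dummy_cnn(crop):
        return float(crop.mean())  # stand-in for a trained CNN classifier

    print(ensemble_and_filter([det_a, det_b], dummy_cnn, image))
```

The greedy NMS fusion shown here is only one common way to combine detectors; the paper's actual ensembling and filtering strategy may differ.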
This paper is a candidate for the "Prof. John Papadakis award for the best paper presented by a young acoustician (under 40)"
- Corresponding author: Dr Yannik Steiniger
Affiliation: German Aerospace Center (DLR)
Country: Germany