gms | German Medical Science

67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e. V. (TMF)

21.08. - 25.08.2022, online

A Comparison of Feature Extraction Models for Medical Image Captioning

Meeting Abstract

  • Sebastian Germer - Institute of Medical Informatics, University of Lübeck, Lübeck, Germany
  • Hristina Uzunova - German Research Center for Artificial Intelligence (DFKI), Lübeck, Germany
  • Jan Ehrhardt - Institute of Medical Informatics, University of Lübeck, Lübeck, Germany
  • Nils Feldhus - German Research Center for Artificial Intelligence (DFKI), Berlin, Germany
  • Philippe Thomas - German Research Center for Artificial Intelligence (DFKI), Berlin, Germany
  • Heinz Handels - Institute of Medical Informatics, University of Lübeck, Lübeck, Germany; German Research Center for Artificial Intelligence (DFKI), Lübeck, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 67. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 13. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 21.-25.08.2022. Düsseldorf: German Medical Science GMS Publishing House; 2022. DocAbstr. 41

doi: 10.3205/22gmds036, urn:nbn:de:0183-22gmds0362

Published: August 19, 2022

© 2022 Germer et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. See license information at http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: In recent years, there has been significant progress in the area of image captioning using a combination of convolutional neural networks for feature extraction and recurrent neural networks for language generation [1], [2]. Inspired by this, our work addresses the automatic generation of medical image descriptions. However, findings from general-domain image captioning typically cannot be transferred one-to-one, owing to the specifics of medical images.

Many published works in this domain focus on improving the language generation component while relying on established image recognition networks as feature extraction models [3], [4], [5]. In contrast, our work aims to determine which features, and therefore which extraction models, are suitable for the task of medical image captioning.

Methods: For our study, we consider three feature extraction models: DenseNet-121 [6] serves as a baseline; in addition, we develop a classifier with a reduced number of layers as well as an autoencoder architecture. For language generation, we use an architecture similar to the Show-and-Tell approach [7]: an LSTM determines the next word of a sentence based on the image features received from the feature extractor and the words generated so far. We inspect two publicly available datasets in our study, the Open-I Indiana University Chest X-Ray dataset (IU-XRAY) [11] and the chest X-ray dataset of the National Institutes of Health (NIH-XRAY) [2]. Based on the textual findings of the IU-XRAY, we derived shorter, more streamlined captions for each image that indicate the presence or absence of each of 15 disease categories in both datasets.
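
For illustration, the following minimal PyTorch sketch shows the basic structure of such a decoder. It is a sketch under our own assumptions: class and parameter names, layer sizes, and the single-layer LSTM are illustrative and not taken from our actual implementation.

    import torch
    import torch.nn as nn

    class CaptionDecoder(nn.Module):
        # Show-and-Tell-style decoder: the projected image features are fed
        # to the LSTM as a first pseudo-token; the LSTM then predicts each
        # next word from the words generated so far.
        def __init__(self, feature_dim, embed_dim, hidden_dim, vocab_size):
            super().__init__()
            self.feature_proj = nn.Linear(feature_dim, embed_dim)
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, vocab_size)

        def forward(self, image_features, previous_words):
            img_token = self.feature_proj(image_features).unsqueeze(1)  # (B, 1, E)
            word_embeds = self.embedding(previous_words)                # (B, T, E)
            inputs = torch.cat([img_token, word_embeds], dim=1)         # (B, T+1, E)
            hidden_states, _ = self.lstm(inputs)
            return self.classifier(hidden_states)                       # next-word logits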

Results: The sentences generated by the language decoder are evaluated quantitatively for the different datasets and feature extraction architectures using several scoring methods (BLEU [8], METEOR [9], BERTScore [10]). Two major conclusions can be drawn from this evaluation: First, the results are comparable for all of the features used for text generation. Second, the text evaluation metrics appear to be correlated. This is especially interesting since n-gram-based metrics such as BLEU and METEOR are intuitively less suited to this task than an embedding-based metric such as BERTScore. One possible reason is that the BERT embedding was not designed for the medical domain.
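
All three metrics are available in public implementations; the following is a minimal sketch of how the scores can be computed, using NLTK for BLEU and METEOR and the bert-score package for BERTScore. The two example sentences are invented for illustration.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from nltk.translate.meteor_score import meteor_score
    from bert_score import score as bert_score

    # Invented example: a reference finding and a generated caption.
    reference = "the heart size is within normal limits".split()
    candidate = "heart size is normal".split()

    # n-gram-based metrics; smoothing avoids zero BLEU on short sentences.
    bleu = sentence_bleu([reference], candidate,
                         smoothing_function=SmoothingFunction().method1)
    meteor = meteor_score([reference], candidate)

    # Embedding-based metric: compares contextual BERT embeddings.
    precision, recall, f1 = bert_score([" ".join(candidate)],
                                       [" ".join(reference)], lang="en")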

Discussion: The quantitative results achieved by the different architectures are comparable to each other. On the one hand, this is interesting because the classifier we developed has significantly fewer parameters than the DenseNet, indicating that simpler architectures yield similar results while requiring fewer computational resources. On the other hand, the use of an autoencoder as a feature extractor for image captioning has hardly been mentioned in the literature so far. Because autoencoders do not explicitly learn according to a given class distribution, they could remedy the problem of unevenly distributed classes, which is especially common in the medical field. We argue that this promising direction should be investigated in future work.
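
A minimal sketch of such an autoencoder-based feature extractor, under our own assumptions (channel sizes and depth are illustrative): the network is trained purely with a reconstruction loss, so no class labels are required, and only the bottleneck output is passed to the caption decoder.

    import torch.nn as nn

    class ConvAutoencoder(nn.Module):
        # Trained to reconstruct the input X-ray (e.g. with an MSE loss);
        # the bottleneck z serves as the image feature for captioning.
        def __init__(self):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            z = self.encoder(x)           # bottleneck features
            return self.decoder(z), z     # reconstruction and features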

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.


References

1.
Kougia V, Pavlopoulos J, Androutsopoulos I. A Survey on Biomedical Image Captioning. In: Proceedings of the Second Workshop on Shortcomings in Vision and Language. 2019. DOI: 10.48550/arXiv.1905.13302
2.
Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21-26; Honolulu, HI, USA. p. 3462–3471. DOI: 10.1109/CVPR.2017.369
3.
Boag W, Hsu TMH, McDermott M, Berner G, Alsentzer E, Szolovits P. Baselines for Chest X-Ray Report Generation. In: Dalca AV, McDermott MB, Alsentzer E, Finlayson SG, Oberst M, Falck F, Beaulieu-Jones B, editors. Proceedings of the Machine Learning for Health NeurIPS Workshop 2019. (Proceedings of Machine Learning Research; vol. 116). 2020. p. 126–140. Available from: https://proceedings.mlr.press/v116/boag20a.html
4.
Jing B, Xie P, Xing E. On the Automatic Generation of Medical Imaging Reports. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. p. 2577–2586. DOI: 10.18653/v1/P18-1240
5.
Shin HC, Roberts K, Lu L, Demner-Fushman D, Yao J, Summers RM. Learning to Read Chest X-Rays: Recurrent Neural Cascade Model for Automated Image Annotation. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27-30; Las Vegas, NV, USA. p. 2497–2506. DOI: 10.1109/CVPR.2016.274
6.
Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21-26; Honolulu, HI, USA. p. 2261–2269. DOI: 10.1109/CVPR.2017.243
7.
Vinyals O, Toshev A, Bengio S, Erhan D. Show and Tell: A Neural Image Caption Generator. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015 Jun 7-12; Boston, MA, USA. p. 3156–3164. DOI: 10.1109/CVPR.2015.7298935
8.
Papineni K, Roukos S, Ward T, Zhu WJ. Bleu: A Method for Automatic Evaluation of Machine Translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; 2002 Jul 7-12; Philadelphia, PA, USA. p. 311–318. DOI: 10.3115/1073083.1073135
9.
Banerjee S, Lavie A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In: Association for Computational Linguistics, editor. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; 2005 Jun. p. 65–72. Available from: https://aclanthology.org/W05-0909
10.
Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y. BERTScore: Evaluating Text Generation with BERT. In: International Conference on Learning Representations; 2020 Apr 30; Addis Ababa, Ethiopia. Available from: https://openreview.net/forum?id=SkeHuCVFDr
11.
Demner-Fushman D, Kohli MD, Rosenman MB, Shooshan SE, Rodriguez L, Antani S, Thoma GR, McDonald CJ. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association. 2016;23(2):304–310. DOI: 10.1093/jamia/ocv080