A Comparison of Feature Extraction Models for Medical Image Captioning
Published: August 19, 2022
Introduction: In recent years, there has been significant progress in image captioning using a combination of convolutional neural networks for feature extraction and recurrent neural networks for language generation [1], [2]. Inspired by this, our work addresses the automatic generation of medical image descriptions. However, findings from general-domain image captioning typically cannot be transferred one-to-one because of the specifics of medical images.
Many of the published works in this domain focus on improving the language generation component while relying on well-known image recognition networks as feature extraction models [3], [4], [5]. In contrast, our work aims to determine which features, and therefore which extraction models, are suitable for the task of medical image captioning.
Methods: For our study, we consider three feature extraction models: DenseNet-121 [6] serves as a baseline, and we additionally develop a classifier with a reduced number of layers as well as an autoencoder architecture. For language generation, we use an architecture similar to the Show-and-Tell approach [7]: an LSTM determines the next word in a sentence based on the image features received from the feature extractor and the words generated so far. We inspect two publicly available datasets, the Open-I Indiana University chest X-ray dataset (IU-XRAY) [11] and the chest X-ray dataset of the National Institutes of Health (NIH-XRAY) [2]. Based on the textual findings of the IU-XRAY dataset, we derive shorter, more streamlined captions for each image that encode the presence or absence of each of 15 disease categories in both datasets.
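To make the decoding setup concrete, the following is a minimal PyTorch sketch of a Show-and-Tell style decoder, in which the projected image features are fed to the LSTM as an initial input step before the word embeddings. All module names, dimensions, and the forward pass are illustrative assumptions, not the implementation used in this study.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Show-and-Tell style decoder: image features act as the first LSTM input step."""

    def __init__(self, feature_dim=1024, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Project image features into the word-embedding space so the LSTM
        # can consume them like a "zeroth word", as in Show-and-Tell.
        self.feature_proj = nn.Linear(feature_dim, embed_dim)
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):
        # features: (batch, feature_dim) from the feature extractor
        # captions: (batch, seq_len) token ids of the reference caption
        img_step = self.feature_proj(features).unsqueeze(1)   # (batch, 1, embed_dim)
        word_steps = self.embedding(captions)                 # (batch, seq_len, embed_dim)
        inputs = torch.cat([img_step, word_steps], dim=1)
        hidden_states, _ = self.lstm(inputs)
        return self.output(hidden_states)                     # per-step vocabulary logits
```

At inference time, such a decoder would be unrolled step by step, feeding each predicted word back into the LSTM until an end-of-sentence token is produced.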
Results: The sentences generated by the language decoder are evaluated quantitatively for the different datasets and feature extraction architectures using several scoring methods (BLEU [8], METEOR [9], BERTScore [10]). Two major conclusions can be drawn from this evaluation. First, the results are comparable for all of the feature types used for text generation. Second, the text evaluation metrics appear to be correlated. This is especially interesting since n-gram-based metrics like BLEU and METEOR are intuitively less suited to this task than an embedding-based metric like BERTScore. One possible reason is that the BERT embedding is not designed for the medical domain.
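As a brief illustration of how these three metrics can be computed with common open-source libraries (nltk and bert-score), consider the sketch below; the example sentences are invented, and this is not the authors' evaluation pipeline.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from bert_score import score as bert_score

reference = "the lungs are clear without focal consolidation"
candidate = "lungs are clear no focal consolidation is seen"

# BLEU measures n-gram overlap; smoothing avoids zero scores on short texts.
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)

# METEOR additionally matches stems and synonyms (requires nltk's wordnet data).
meteor = meteor_score([reference.split()], candidate.split())

# BERTScore compares contextual token embeddings instead of surface n-grams.
precision, recall, f1 = bert_score([candidate], [reference], lang="en")

print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}  BERTScore-F1: {f1.item():.3f}")
```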
Discussion: The quantitative results achieved by the different architectures are comparable to each other. On the one hand, this is interesting because the developed classifier has significantly fewer parameters than the DenseNet, indicating that simpler architectures yield similar results while requiring fewer computational resources. On the other hand, the use of an autoencoder as a feature extractor for image captioning has hardly been mentioned in the literature so far. Because autoencoders do not explicitly learn according to a given class distribution, they could remedy the problem of unevenly distributed classes, which is especially common in the medical field. We argue that this promising direction should be investigated in future work.
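To illustrate this direction, the following is a minimal sketch of a convolutional autoencoder whose bottleneck vector could serve as the image features for a caption decoder; the layer configuration and the single-channel 256×256 input size are illustrative assumptions, not the architecture evaluated in this work.

```python
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Autoencoder whose bottleneck doubles as an image-feature vector."""

    def __init__(self, feature_dim=256):
        super().__init__()
        # Encoder: trained purely on reconstruction, so it needs no class
        # labels and is not biased toward a skewed label distribution.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),    # 1x256x256 -> 32x128x128
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),   # -> 64x64x64
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # -> 128x32x32
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feature_dim),                             # bottleneck = features
        )
        # Decoder: mirrors the encoder to reconstruct the input image.
        self.decoder = nn.Sequential(
            nn.Linear(feature_dim, 128 * 32 * 32), nn.ReLU(),
            nn.Unflatten(1, (128, 32, 32)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)           # features that could feed the caption decoder
        return self.decoder(z), z     # reconstruction drives unsupervised training
```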
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Kougia V, Pavlopoulos J, Androutsopoulos I. A Survey on Biomedical Image Captioning. In: Proceedings of the Second Workshop on Shortcomings in Vision and Language. 2019. DOI: 10.48550/arXiv.1905.13302
2. Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21-26; Honolulu, HI, USA. p. 3462–3471. DOI: 10.1109/CVPR.2017.369
3. Boag W, Hsu TMH, McDermott M, Berner G, Alsentzer E, Szolovits P. Baselines for Chest X-Ray Report Generation. In: Dalca AV, McDermott MB, Alsentzer E, Finlayson SG, Oberst M, Falck F, Beaulieu-Jones B, editors. Proceedings of the Machine Learning for Health NeurIPS Workshop 2019. (Proceedings of Machine Learning Research; vol. 116). 2020. p. 126–140. Available from: https://proceedings.mlr.press/v116/boag20a.html
4. Jing B, Xie P, Xing E. On the Automatic Generation of Medical Imaging Reports. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. p. 2577–2586. DOI: 10.18653/v1/P18-1240
5. Shin HC, Roberts K, Lu L, Demner-Fushman D, Yao J, Summers RM. Learning to Read Chest X-Rays: Recurrent Neural Cascade Model for Automated Image Annotation. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27-30; Las Vegas, NV, USA. p. 2497–2506. DOI: 10.1109/CVPR.2016.274
6. Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21-26; Honolulu, HI, USA. p. 2261–2269. DOI: 10.1109/CVPR.2017.243
7. Vinyals O, Toshev A, Bengio S, Erhan D. Show and Tell: A Neural Image Caption Generator. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2015 Jun 7-12; Boston, MA, USA. p. 3156–3164. DOI: 10.1109/CVPR.2015.7298935
8. Papineni K, Roukos S, Ward T, Zhu WJ. Bleu: A Method for Automatic Evaluation of Machine Translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; 2002 Jul 7-12; Philadelphia, PA, USA. p. 311–318. DOI: 10.3115/1073083.1073135
9. Banerjee S, Lavie A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; 2005 Jun; Ann Arbor, MI, USA. p. 65–72. Available from: https://aclanthology.org/W05-0909
10. Zhang T, Kishore V, Wu F, Weinberger KQ, Artzi Y. BERTScore: Evaluating Text Generation with BERT. In: International Conference on Learning Representations; 2020 Apr 30; Addis Ababa, Ethiopia. Available from: https://openreview.net/forum?id=SkeHuCVFDr
11. Demner-Fushman D, Kohli MD, Rosenman MB, Shooshan SE, Rodriguez L, Antani S, Thoma GR, McDonald CJ. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association. 2016;23(2):304–310. DOI: 10.1093/jamia/ocv080