Article
Against all Odds – How a Gene Analysis spelled my Name
Published: September 15, 2023
Introduction: Sometimes coincidence plays a strange game. As already described in the infinite monkey theorem, everything imaginable will happen at some point [1]. Such an occasion occurred when my colleague compared results from a published gene analysis with her current research data. The aim of this article is to record this unbelievable coincidence.
Methods: From the publication of Chatsirisupachai et al., the 200 most significantly over-expressed and the 200 most significantly under-expressed genes with respect to cellular senescence were exported for comparison [2]. The original list of genes can be found in the supplemental material spreadsheet "Data S4.xlsx". The gene symbols were exported manually. Both over- and under-expressed genes were ordered by significance and stored in separate sub-sheets.
Results: Because of several header lines, gene 200 was located in line 213. Special attention was therefore paid to this line during the export in order to copy exactly 200 genes. When exporting the under-expressed genes, the 200th gene symbol was "BRIX1". Without the 1 at the end, this matches my surname perfectly. As if that were not strange enough, when exporting the over-expressed genes, the 200th gene symbol was "TOB1", which, when the 1 is read as a capital "I", is exactly a common abbreviation of my first name.
Discussion: What is the probability of finding gene symbols that represent my name at exactly this position? We simplify the calculation by assuming that all considered variables are independent and uniformly distributed. According to the guidelines of the HUGO Gene Nomenclature Committee, gene symbols may only consist of Latin letters and Arabic numerals [3]. A gene symbol must start with a letter. A 4-character symbol therefore has 26 × 36³ = 1,213,056 possible forms, so the probability P(ExistT) that the symbol "TOB1" exists is 1 in 1,213,056. Similarly, for the 5-character symbol "BRIX1", there are 26 × 36⁴ possible forms, so the probability P(ExistB) is 1 in 43,670,016. At the time of writing, there are 45,453 genes in the human gene database. The conditional probability P(ChooseT|ExistT) of choosing the symbol "TOB1", given that the symbol exists, is therefore 1 in 45,453. The same is true for P(ChooseB|ExistB). Finally, line 200 must be considered, because this combination of symbols would probably have been overlooked in any other place. At most, the combination could have appeared in 526 places in the file, which is the number of listed over-expressed genes. Thus, the probability P(Line) is 1 in 526. The probability we are looking for is therefore P(ExistT AND ChooseT) * P(ExistB AND ChooseB) * P(Line), which can be rewritten as P(ChooseT|ExistT) * P(ExistT) * P(ChooseB|ExistB) * P(ExistB) * P(Line). Multiplying all values gives odds of about 1 in 5.76 × 10^25, or a probability of 1 in 57.6 septillion.
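The arithmetic in the Discussion can be retraced with a short script. This is only a sketch under the simplifying assumptions stated there (independence, uniform distribution, 26 letters for the first character and 36 letters or digits for each remaining character). The function and constant names are illustrative and not part of the original analysis.

```python
# Sketch of the probability estimate from the Discussion.
# All names below are hypothetical helpers chosen for this illustration.

LETTERS = 26        # Latin letters allowed as the first character
ALNUM = 26 + 10     # letters plus Arabic numerals for the remaining characters
N_GENES = 45_453    # genes in the human gene database, as stated in the text
N_LINES = 526       # number of listed over-expressed genes, as stated in the text


def symbol_space(length: int) -> int:
    """Number of possible symbols of the given length that start with a letter."""
    return LETTERS * ALNUM ** (length - 1)


def combined_odds() -> int:
    """Return N such that the overall probability is 1 in N."""
    exist_t = symbol_space(4)   # "TOB1"
    exist_b = symbol_space(5)   # "BRIX1"
    # P(ChooseT|ExistT) * P(ExistT) * P(ChooseB|ExistB) * P(ExistB) * P(Line)
    return N_GENES * exist_t * N_GENES * exist_b * N_LINES


if __name__ == "__main__":
    print(f"P(ExistT) = 1 in {symbol_space(4):,}")
    print(f"P(ExistB) = 1 in {symbol_space(5):,}")
    print(f"Overall   = 1 in {combined_odds():.2e}")
```

Working with integer denominators instead of floating-point probabilities keeps every intermediate value exact. Only the final print rounds the result.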
Conclusion: That there are two gene symbols that represent my first and last name, respectively, is a great coincidence. That both symbols appear in an analysis in a predefined line and together form my complete name will probably never happen again. Perhaps this is a sign that I, as a medical informatics scientist, should switch to the field of genetic analysis.
The authors declare that they have no competing interests.
The authors declare that an ethics committee vote is not required.
References
1. Borges JL. Selected non-fictions. Edited by Eliot Weinberger; translated by Esther Allen, Suzanne Jill Levine & Eliot Weinberger. New York: Viking; 1999.
2. Chatsirisupachai K, Palmer D, Ferreira S, de Magalhães JP. A human tissue-specific transcriptomic analysis reveals a complex relationship between aging, cancer, and cellular senescence. Aging Cell. 2019 Dec;18(6):e13041. DOI: 10.1111/acel.13041
3. Bruford EA, Braschi B, Denny P, Jones TEM, Seal RL, Tweedie S. Guidelines for Human Gene Nomenclature. Nature Genetics. 2020;52(8):754–8.