gms | German Medical Science

66th Annual Meeting of the German Association for Medical Informatics, Biometry and Epidemiology (GMDS), 12th Annual Congress of the Technology and Methods Platform for Networked Medical Research (TMF)

September 26-30, 2021, online

A systematic review and evaluation of methods for group variable selection

Meeting Abstract

  • Gregor Buch - Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany; Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  • Andreas Schulz - Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  • Irene Schmidtmann - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  • Konstantin Strauch - Institute of Medical Biostatistics, Epidemiology and Informatics (IMBEI), University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany
  • Philipp Wild - Preventive Cardiology and Preventive Medicine, Department of Cardiology, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany; German Center for Cardiovascular Research (DZHK), partner site Rhine Main, Mainz, Germany; Clinical Epidemiology and Systems Medicine, Center for Thrombosis and Hemostasis, University Medical Center of the Johannes Gutenberg University Mainz, Mainz, Germany

Deutsche Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie. 66. Jahrestagung der Deutschen Gesellschaft für Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS), 12. Jahreskongress der Technologie- und Methodenplattform für die vernetzte medizinische Forschung e.V. (TMF). sine loco [digital], 26.-30.09.2021. Düsseldorf: German Medical Science GMS Publishing House; 2021. DocAbstr. 198

doi: 10.3205/21gmds097, urn:nbn:de:0183-21gmds0976

Published: September 24, 2021

© 2021 Buch et al.
This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License. For license information, see http://creativecommons.org/licenses/by/4.0/.


Text

Introduction: Many datasets have a natural group structure arising from high correlations or contextual similarities among variables, as in proteomics. Group variable selection methods can account for this structure during selection and thereby identify variables that are related to each other and share a common, traceable relationship with the response variable.

To date, only selective comparisons of group variable selection methods are available. A review that systematically identifies and evaluates the wide range of existing approaches is still needed.

Methods: A structured literature search following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) recommendations was conducted to identify group variable selection methods that had an adequate software implementation and were suitable for Gaussian, binomial, or time-to-event data.

The selection performance of the identified methods was evaluated in simulation studies, based on the correlation between the true models and the models generated by each method. The simulation scenarios varied the number of variables associated with a Gaussian-distributed response variable.
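The abstract does not state how the correlation between true and generated models was computed. One plausible reading, offered here purely as an assumption, is the Pearson correlation between binary inclusion indicators (the phi or Matthews correlation coefficient). The following minimal Python sketch illustrates that reading; the function name and the toy data are hypothetical.

```python
import math


def selection_correlation(true_support, selected_support):
    """Pearson correlation between two 0/1 inclusion vectors (phi coefficient).

    Assumption: this is one plausible selection metric, not necessarily
    the one used in the abstract.
    """
    if len(true_support) != len(selected_support):
        raise ValueError("indicator vectors must have equal length")
    pairs = list(zip(true_support, selected_support))
    tp = sum(1 for t, s in pairs if t == 1 and s == 1)  # correctly selected
    tn = sum(1 for t, s in pairs if t == 0 and s == 0)  # correctly excluded
    fp = sum(1 for t, s in pairs if t == 0 and s == 1)  # falsely selected
    fn = sum(1 for t, s in pairs if t == 1 and s == 0)  # missed
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when one of the vectors is constant
    return (tp * tn - fp * fn) / denom


# Toy example: six variables in two groups of three;
# only the first group is truly associated with the response.
truth = [1, 1, 1, 0, 0, 0]
perfect = [1, 1, 1, 0, 0, 0]   # selects the whole true group
partial = [1, 1, 0, 0, 0, 0]   # misses one variable of the true group

print(selection_correlation(truth, perfect))
print(selection_correlation(truth, partial))
```

Under this reading, a method that recovers the true support exactly scores 1, and every missed or falsely selected variable lowers the score.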

Results: The systematic literature review identified 14 methods for selecting variable groups, which can be classified into knowledge-driven and data-driven approaches. The first category comprises group-level and bi-level selection methods, which rely on pre-defined groups. The second category comprises two-step and collinearity-tolerant approaches, which use the correlation structure of the data to select related variables. Group-level and two-step approaches select all or none of the variables in a group, whereas bi-level and collinearity-tolerant methods allow sparsity within groups of variables.
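The difference between all-or-none selection and within-group sparsity can be illustrated with two standard shrinkage operators. The sketch below is an illustration only: it uses the group lasso and lasso proximal operators, which the abstract does not name, and the function names and coefficient values are hypothetical.

```python
import math


def group_soft_threshold(coefs, lam):
    """Group lasso proximal operator: shrinks the whole group jointly.

    The group is either set to zero entirely or kept entirely
    (all-or-none behaviour, as in group-level selection).
    """
    norm = math.sqrt(sum(c * c for c in coefs))
    if norm <= lam:
        return [0.0] * len(coefs)
    scale = 1.0 - lam / norm
    return [scale * c for c in coefs]


def soft_threshold(coefs, lam):
    """Lasso proximal operator: shrinks each coefficient separately,
    so individual variables inside a group can be set to zero."""
    return [
        0.0 if abs(c) <= lam else math.copysign(abs(c) - lam, c)
        for c in coefs
    ]


strong_group = [3.0, 0.1, -4.0]   # hypothetical coefficients of one group
weak_group = [0.3, -0.2, 0.1]     # hypothetical coefficients of another group

print(group_soft_threshold(strong_group, 1.0))  # whole group kept
print(group_soft_threshold(weak_group, 1.0))    # whole group dropped
print(soft_threshold(strong_group, 1.0))        # weak member dropped
```

Bi-level penalties can be thought of as combining both effects: groups are screened jointly, and weak members of a retained group can still be removed.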

Simulation studies show that group-level selection methods, such as Group MCP, are superior to other methods in selecting relevant variable groups but inferior in identifying important individual variables when not all variables in a group are predictive. Bi-level selection methods, such as Group Bridge, handle this situation better. Two-step and collinearity-tolerant approaches, such as the Elastic Net and Ordered Homogeneity Pursuit LASSO, are inferior to knowledge-driven methods but provide comparable results without prior knowledge.

Discussion: Methods in all four categories are suitable for analyzing data whose variables have a natural group structure. The choice of method depends on the objective and on the availability of prior information. If the aim is to identify groups of related variables associated with a response variable, group-level selection and two-step methods are recommended. If the aim is to identify individual variables associated with a given response within a structure of related variables, bi-level selection and collinearity-tolerant methods are appropriate.

Since the simulation results indicate that including prior information improves the selection process, such information should be used when available. In omics datasets, for example, information on coexpression or biological function could be used to group variables.
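As a rough illustration of how coexpression could be turned into variable groups, the following Python sketch links variables whose absolute Pearson correlation exceeds a threshold and returns the resulting connected components. This is an assumption-laden toy, not the procedure of any method in the review: the threshold of 0.8, the function names, and the variable names are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant variable: treat as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def correlation_groups(columns, threshold=0.8):
    """Group variables linked by |correlation| >= threshold.

    columns: dict mapping variable name -> list of measurements.
    Returns a sorted list of sorted groups (connected components).
    Assumption: the threshold is arbitrary and chosen for illustration.
    """
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path compression.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical expression values for four variables in five samples.
expression = {
    "prot_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "prot_b": [2.1, 3.9, 6.2, 8.0, 9.9],   # closely tracks prot_a
    "prot_c": [5.0, 1.0, 4.0, 2.0, 3.0],
    "prot_d": [10.0, 2.0, 8.0, 4.0, 6.0],  # closely tracks prot_c
}

print(correlation_groups(expression))
```

Groups obtained this way could then be passed as pre-defined groups to a group-level or bi-level selection method. Groups based on biological function would instead come from an external annotation source.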

Conclusions: A variety of methods can incorporate a natural group structure of predictors into the selection process. This improves selection, especially when the group structure is known and does not have to be estimated from the correlation structure. Since the identified methods are specialized for different situations, the choice of an appropriate method strongly depends on the research question.

The authors declare that they have no competing interests.

The authors declare that an ethics committee vote is not required.