HGT ensemble image

@Article{Wijaya2026,
  author={Wijaya, Andre Jatmiko
  and An{\v{z}}el, Aleksandar
  and Hattab, Georges},
  title={Evaluating ensemble learning approaches for horizontal gene transfer detection},
  journal={Scientific Reports},
  year={2026},
  month={May},
  day={28},
  volume={16},
  number={1},
  pages={16582},
  abstract={Horizontal gene transfer (HGT) is widely recognized as a major driver of antimicrobial resistance (AMR) dissemination, with genomic islands (GIs) being one of the factors facilitating that spread. Detecting GIs is essential for improving AMR surveillance. Numerous computational approaches have been developed for GI detection, including recent advances in machine learning (ML). Several studies in other fields have shown that ML model performance depends on data representations. Combining multiple data representations in ensemble learning has been shown to improve performance in other genomics tasks. However, this approach has not yet been evaluated for GI detection. To this end, we investigate the efficacy of integrating diverse data representations in ensemble learning for GI detection, particularly for the classification task. Then, we assess its applicability to localizing GIs, which are clusters of genes acquired through HGT, in a genomic sequence. We implemented a two-stage ensemble selection strategy to determine the optimal combination of data representations. Our ensemble selection strategy reveals that combining weakly correlated data representations in an ensemble classifier yields a slightly higher Recall than any individual representation for the classification task, but the improvement is not statistically significant. Moreover, the ensemble classifier did not localize GIs any better, suggesting that cross-task generalizability remains constrained. This finding presents an opportunity for future research to advance the field by redefining the problem formulation of GI detection.},
  issn={2045-2322},
  doi={10.1038/s41598-026-53037-x},
  url={https://doi.org/10.1038/s41598-026-53037-x}
}
