Nguyen_2025_Sci.Rep_15

Reference

Title : Identifying representative sequences of protein families using submodular optimization

Author(s) : Nguyen H, Nguyen P, Luu AN, Cantu DC, Nguyen T

Ref : Sci Rep, 15:1069, 2025

Abstract :

Identifying representative sequences for groups of functionally similar proteins and enzymes poses significant computational challenges. In this study, we applied submodular optimization, a method effective in data summarization, to select representative sequences for thioesterase enzyme families. We introduced and validated two algorithms, Greedy and Bidirectional Greedy, using curated protein sequence data from the ThYme (Thioester-active enzYmes) database. Both algorithms generated sequence subsets that preserved completeness (inclusion of all known family sequences) and specificity (accurate family representation). The Greedy algorithm outperformed the Bidirectional Greedy algorithm and other methods, particularly in reducing redundancy. Our study offers an efficient approach for identifying representative protein sequences within families that have significant sequence similarity, likely to deliver results close to theoretical optima in polynomial time, with the potential to improve the selection and optimization of representative sequences in protein databases.
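
The greedy selection named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the facility-location objective, the function name greedy_representatives, the fixed budget k, and the toy similarity matrix are all assumptions made for illustration. For a monotone submodular objective under a cardinality constraint, this kind of greedy selection is known to reach at least a (1 - 1/e) fraction of the optimum, which is the sense in which such methods stay close to the theoretical optimum in polynomial time.

```python
from typing import List, Sequence


def greedy_representatives(similarity: Sequence[Sequence[float]], k: int) -> List[int]:
    """Pick up to k indices greedily under a facility-location objective.

    The objective (an assumption, not taken from the paper) is the sum, over
    all sequences, of each sequence's best similarity to any selected one.
    """
    n = len(similarity)
    selected: List[int] = []
    # best_cover[i] = highest similarity of sequence i to the current selection
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much each sequence's coverage improves
            gain = sum(max(similarity[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], similarity[i][best_j]) for i in range(n)]
    return selected


# Hypothetical pairwise similarities: sequences 0-2 form one group, 3-4 another.
toy_similarity = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
reps = greedy_representatives(toy_similarity, 2)
print(reps)
```

On this toy input the first pick is the sequence most similar to everything else, and the second pick comes from the group not yet covered, which is how the marginal-gain rule discourages redundant representatives.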

PubMedSearch : Nguyen_2025_Sci.Rep_15_1069

PubMedID: 39774134

Citation formats

Nguyen H, Nguyen P, Luu AN, Cantu DC, Nguyen T (2025)
Identifying representative sequences of protein families using submodular optimization
Sci Rep 15:1069

Nguyen H, Nguyen P, Luu AN, Cantu DC, Nguyen T (2025)
Sci Rep 15:1069
