Title : Identifying representative sequences of protein families using submodular optimization - Nguyen_2025_Sci.Rep_15_1069 |
Author(s) : Nguyen H , Nguyen P , Luu AN , Cantu DC , Nguyen T |
Ref : Sci Rep , 15 :1069 , 2025 |
Abstract :
Identifying representative sequences for groups of functionally similar proteins and enzymes poses significant computational challenges. In this study, we applied submodular optimization, a method effective in data summarization, to select representative sequences for thioesterase enzyme families. We introduced and validated two algorithms, Greedy and Bidirectional Greedy, using curated protein sequence data from the ThYme (Thioester-active enzYmes) database. Both algorithms generated sequence subsets that preserved completeness (inclusion of all known family sequences) and specificity (accurate family representation). The Greedy algorithm outperformed the Bidirectional Greedy algorithm and other methods, particularly in reducing redundancy. Our study offers an efficient approach for identifying representative protein sequences within families that have significant sequence similarity, likely to deliver results close to theoretical optima in polynomial time, with the potential to improve the selection and optimization of representative sequences in protein databases. |
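The greedy selection described in the abstract can be illustrated with a minimal sketch. The abstract does not state the objective function or the similarity measure, so this example assumes a facility-location objective over a precomputed, non-negative pairwise similarity matrix. The names `facility_location`, `greedy_select`, and the toy matrix `sim` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def facility_location(sim: Matrix, selected: Sequence[int]) -> float:
    """Assumed objective: f(S) = sum over items i of max_{j in S} sim[i][j].

    This function is monotone submodular when all similarities are >= 0.
    """
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim: Matrix, k: int) -> List[int]:
    """Pick up to k representatives, adding the largest marginal gain each round."""
    n = len(sim)
    selected: List[int] = []
    # best_cover[i] is the highest similarity of item i to any chosen representative.
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much it improves each item's coverage.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Toy similarity matrix (illustrative values only): items 0-1 and 2-3 form two groups.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
representatives = greedy_select(sim, 2)
print(representatives, facility_location(sim, representatives))
```

For a monotone submodular objective under a cardinality constraint, this kind of greedy procedure is known to reach at least a (1 - 1/e) fraction of the optimal value, which is consistent with the abstract's remark about results close to theoretical optima in polynomial time. The Bidirectional Greedy variant mentioned in the abstract is not sketched here.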
PubMedSearch : Nguyen_2025_Sci.Rep_15_1069 |
PubMedID: 39774134 |
Nguyen H, Nguyen P, Luu AN, Cantu DC, Nguyen T (2025) Identifying representative sequences of protein families using submodular optimization. Sci Rep 15:1069