dbsub.out matches multiple hits per Gene.ID. Do we keep all or the best hit? #149

Open
Jigyasa3 opened this issue Jan 9, 2024 · 2 comments

Comments

@Jigyasa3

Jigyasa3 commented Jan 9, 2024

Hi @linnabrown and @yinlabniu

I am examining the "dbsub.out" file and have about 500 Gene IDs that each have multiple dbCAN.subfam and Substrate entries. Do you recommend keeping all the hits or selecting only the best hit?

For example, in the screenshot below, I am interested in all the chitin-degrading Gene IDs, so I am worried that I might lose that information if I keep only the best hit. Keeping all the hits would suggest that this protein can target cellulose, chitin, and xylan.
Additionally, both GH and CBM annotations are important for determining whether the GH in a Gene ID has an accessory domain. Do you recommend keeping Gene IDs that have only a CBM annotation when selecting by substrate (e.g., a CBM will only have an accessory role in chitin degradation, so I should only examine GHs with a CBM for this substrate)?

I would appreciate suggestions on whether my understanding of the output is correct.

Screenshot 2024-01-08 at 8 02 31 PM
@linnabrown
Owner

@yinlabniu @QiweiGe Could you please answer this question?

@yinlabniu
Collaborator

You should keep all of them. This file has already been parsed, and the parsing takes into account the presence of multiple domains in the same query protein. In the case you show, the protein has four domains and each gave a substrate prediction, so you should keep all of them.

Note that in our new run_dbcan release, the file name dbsub.out has been changed to dbcan-sub.hmm.out. To give you another example, in the following file: https://bcb.unl.edu/dbCAN_tutorial/dataset1-Carter2023/individual_assembly/Dry2014.dbCAN/dbcan-sub.hmm.out, there are 12947 rows but only 11827 proteins. That's because 894 proteins have multiple domains (each domain match has one row in the file). For instance, the protein Dry2014_81126 has three domains: GH43_e159, GH43_e22, GH43_e159, and their positions in the full-length protein (cols 11 and 12) are different.
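
For anyone who wants to keep every row and still pull out the genes linked to one substrate, here is a minimal Python sketch. It assumes the file is tab-separated with a header row, and that the gene and substrate columns are named "Gene ID" and "Substrate"; those header names are an assumption, so check them against your own file and pass different names if needed. The sample rows, gene names, and subfamily names are made up for illustration and are not real dbCAN output.

```python
import csv
import io
from collections import defaultdict


def group_hits(handle, gene_col="Gene ID", substrate_col="Substrate"):
    """Return {gene_id: [substrate, ...]} with one entry per domain row.

    The default column names are assumptions; adjust them to your header.
    """
    hits = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        hits[row[gene_col]].append(row[substrate_col])
    return dict(hits)


def genes_with_substrate(hits, substrate):
    """Gene IDs for which at least one domain row mentions the substrate."""
    wanted = substrate.lower()
    return sorted(
        gene for gene, subs in hits.items()
        if any(wanted in s.lower() for s in subs)
    )


# Made-up rows for illustration only (hypothetical genes and subfamilies).
SAMPLE = (
    "dbCAN subfam\tSubstrate\tGene ID\tGene Start\tGene End\n"
    "GH18_e1\tchitin\tgeneA\t10\t300\n"
    "CBM5_e2\tchitin\tgeneA\t320\t360\n"
    "GH5_e3\tcellulose\tgeneA\t400\t700\n"
    "GH43_e4\txylan\tgeneB\t5\t280\n"
)

hits = group_hits(io.StringIO(SAMPLE))
multi_domain = {gene for gene, subs in hits.items() if len(subs) > 1}
chitin_genes = genes_with_substrate(hits, "chitin")

print(len(hits), "proteins;", len(multi_domain), "with multiple domain rows")
print("chitin-related genes:", chitin_genes)
```

To run it on a real file, replace `io.StringIO(SAMPLE)` with `open("dbcan-sub.hmm.out", newline="")`. Because no row is dropped, a gene with chitin, cellulose, and xylan domains shows up under each of those substrates instead of only under its best hit.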
