finding %completeness of CGC in the microbial genome of interest <Theory question and suggestions> #161

Open
Jigyasa3 opened this issue Feb 6, 2024 · 2 comments

Comments

@Jigyasa3

Jigyasa3 commented Feb 6, 2024

Dear @yinlabniu,

Thank you again for a very important tool to annotate CAZymes and identify CGCs in the microbial genomes of interest.
I am interested in examining how complete the CGCs in my microbial genome of interest are.
For example, suppose dbCAN3 identifies 5 CGCs in my genome. To estimate the %completeness of these CGCs, I extract the nucleotide sequences spanning the start and end coordinates of each CGC, and likewise for each PUL in the dbCAN-PUL database. I then run a BLASTn search of the 5 CGC sequences against the complete dbCAN-PUL database to get %similarity and %coverage.

Is that a correct approach?
My goal is to be able to state, bioinformatically, that we found 5 CGCs in the microbial genome that are XYZ% similar to known PULs and ABC% complete, so that we can speculate that these CGCs are functional. If the similarity and coverage are below ~40% (an arbitrary cutoff), the CGC would be either novel or non-functional.
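To make the proposed calculation concrete, here is a minimal sketch of how %similarity and %coverage could be summarized per CGC-PUL pair from BLASTn tabular output. It assumes the default 12-column `-outfmt 6` layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and that CGC lengths are known separately. The function names, the length-weighted identity, and the reading of the ~40% cutoff as "both values must reach it" are illustrative assumptions, not part of dbCAN3.

```python
from collections import defaultdict


def merged_length(intervals):
    """Number of positions covered by 1-based inclusive intervals that may overlap."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged block (if any) and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def summarize_hits(blast_lines, cgc_lengths):
    """Return {(cgc, pul): (identity %, coverage of the CGC %)} from outfmt 6 lines."""
    hsps = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        pident, length = float(f[2]), int(f[3])
        qstart, qend = sorted((int(f[6]), int(f[7])))
        hsps[(f[0], f[1])].append((pident, length, qstart, qend))

    summary = {}
    for (cgc, pul), rows in hsps.items():
        aligned = sum(r[1] for r in rows)
        # Identity weighted by alignment length, so short HSPs count less.
        identity = sum(r[0] * r[1] for r in rows) / aligned
        # Merge HSPs on the CGC so overlapping hits are not counted twice.
        covered = merged_length([(r[2], r[3]) for r in rows])
        summary[(cgc, pul)] = (identity, 100.0 * covered / cgc_lengths[cgc])
    return summary


def passes_cutoff(identity, coverage, cutoff=40.0):
    """Assumed reading of the cutoff: both values must reach it."""
    return identity >= cutoff and coverage >= cutoff


if __name__ == "__main__":
    # Hypothetical hits for one CGC of length 300 bp against one PUL.
    example = [
        "CGC1\tPUL0001\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t150",
        "CGC1\tPUL0001\t80.0\t100\t20\t0\t51\t150\t200\t299\t1e-40\t120",
    ]
    for pair, (ident, cov) in summarize_hits(example, {"CGC1": 300}).items():
        print(pair, round(ident, 1), round(cov, 1), passes_cutoff(ident, cov))
```

Coverage computed this way depends directly on the CGC length, so any change in the predicted CGC boundaries changes the result.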

Looking forward to your suggestions and reply!
Regards,
Jigyasa

@yinlabniu
Collaborator

The short answer is yes. We used a similar strategy in dbCAN3 to predict substrates for CGCs by BLAST search against dbCAN-PUL, although our parsing thresholds are more relaxed (minimum identity of 20% and at least 2 CAZyme matches to call a CGC-PUL pair). However, I should mention that CGC boundaries (which determine the length of a CGC) have never been rigorously evaluated. PUL boundaries are often experimentally determined (e.g., through RNA-seq differential expression), but CGC boundaries are set arbitrarily by our CGC prediction criteria (default: at least one CAZyme and one transporter, with fewer than 2 inserted non-signature genes; users can customize this). Therefore, in many cases a meaningful %coverage or completeness cutoff like the one you mention is difficult to determine.
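As a rough illustration of the relaxed thresholds mentioned above, the sketch below calls a CGC-PUL pair when at least 2 distinct CAZyme genes in the CGC have a hit to the same PUL at 20% identity or higher. The input tuple layout, the `"CAZyme"` label, and the function name are hypothetical; the actual dbCAN3 parsing may apply further criteria not described in this thread.

```python
from collections import defaultdict


def call_cgc_pul_pairs(hits, min_identity=20.0, min_cazyme_matches=2):
    """Return {(cgc_id, pul_id): number of matched CAZyme genes} for called pairs.

    hits: iterable of (cgc_id, cgc_gene_id, gene_type, pul_id, pident).
    This input layout is an assumption made for the example.
    """
    matched = defaultdict(set)
    for cgc_id, gene_id, gene_type, pul_id, pident in hits:
        # Only CAZyme genes count toward the match threshold.
        if gene_type == "CAZyme" and pident >= min_identity:
            # A set ensures each CGC gene is counted once per PUL.
            matched[(cgc_id, pul_id)].add(gene_id)
    return {
        pair: len(genes)
        for pair, genes in matched.items()
        if len(genes) >= min_cazyme_matches
    }


if __name__ == "__main__":
    # Hypothetical gene-level hits.
    hits = [
        ("CGC1", "g1", "CAZyme", "PUL0001", 35.0),
        ("CGC1", "g2", "CAZyme", "PUL0001", 22.5),
        ("CGC1", "g3", "TC", "PUL0001", 60.0),      # transporter, not counted
        ("CGC1", "g1", "CAZyme", "PUL0002", 45.0),  # only one CAZyme match
        ("CGC2", "g7", "CAZyme", "PUL0001", 18.0),  # below 20% identity
    ]
    print(call_cgc_pul_pairs(hits))
```

Because this criterion counts matched genes and does not use cluster length, it is less sensitive to where the CGC boundaries are drawn than a nucleotide-level coverage cutoff.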

@linnabrown
Owner

@Jigyasa3 Do you still have questions? If not, please close the issue.
