Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund error #179

Open
Ben-41 opened this issue Jun 18, 2024 · 18 comments
Open

dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund error #179

Ben-41 opened this issue Jun 18, 2024 · 18 comments
Labels
bug Something isn't working

Comments

@Ben-41
Copy link

Ben-41 commented Jun 18, 2024

Report

hi, I have encounter issues with the estimation of CGC substrate abundance and CGC abundance.
I followed all the steps from the manual and it ran smoothly, including dbcan_utils fam_abund and dbcan_utils fam_substrate_abund, however, when I ran dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund, error raise:

You are estimating the abundance of CGC/CGC substrate!
Reads are single end!
Total reads count: 218847!
Traceback (most recent call last):
File "/home/cdd/anaconda3/envs/dbcan/bin/dbcan_utils", line 10, in
sys.exit(main())
File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 621, in main
PUL_abundance(args)
File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 492, in PUL_abundance
PUL_abund = CAZyme_Abundance_estimate(paras)
File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 254, in init
seqid2dbcan_annotation,cgcid2cgc_standard = Read_cgc_standard_out(parameters.PUL_annotation)
File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 203, in Read_cgc_standard_out
tmp_record = cgc_standard_line(line.rstrip().split("\t"))
File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 191, in init
self.gene_start = int(lines[4])
ValueError: invalid literal for int() with base 10: 'Gene Start'

Version information

No response

@Ben-41 Ben-41 added the bug Something isn't working label Jun 18, 2024
@ZhengJinfang1220
Copy link

Report

hi, I have encounter issues with the estimation of CGC substrate abundance and CGC abundance. I followed all the steps from the manual and it ran smoothly, including dbcan_utils fam_abund and dbcan_utils fam_substrate_abund, however, when I ran dbcan_utils CGC_substrate_abund and dbcan_utils CGC_abund, error raise:

You are estimating the abundance of CGC/CGC substrate! Reads are single end! Total reads count: 218847! Traceback (most recent call last): File "/home/cdd/anaconda3/envs/dbcan/bin/dbcan_utils", line 10, in sys.exit(main()) File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 621, in main PUL_abundance(args) File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 492, in PUL_abundance PUL_abund = CAZyme_Abundance_estimate(paras) File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 254, in init seqid2dbcan_annotation,cgcid2cgc_standard = Read_cgc_standard_out(parameters.PUL_annotation) File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 203, in Read_cgc_standard_out tmp_record = cgc_standard_line(line.rstrip().split("\t")) File "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py", line 191, in init self.gene_start = int(lines[4]) ValueError: invalid literal for int() with base 10: 'Gene Start'

Version information

No response

It seems the issue happens when the script reads the file "cgc_standard.out". Can you share this file here? So I can debug the code.

Jinfang

@Ben-41
Copy link
Author

Ben-41 commented Jun 19, 2024

Hi Jinfang, the cgc_standard.out looks like this:
$ head cgc_standard.out
CGC# Gene Type Contig ID Protein ID Gene Start Gene Stop Direction Protein Family
CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_47 44478 45509 - 1.A.33.1.5
CGC1 CAZyme Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_48 45726 48518 + GH2|GH2_e50
CGC1 STP Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_65 72550 73521 - SIS+CBS+CBS
CGC1 CAZyme Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_71 77847 79862 - GH36|GH36_e10
CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_76 87219 88157 + 3.A.2.1.7
CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_80 89523 91055 + 3.A.2.1.7
CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_81 91077 91949 + 3.A.2.1.1
CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_82 92024 93445 + 3.A.2.1.2
CGC1 STP Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_87 96421 97605 + Aminotran_1_2

During the prediction of CGCs, the manual said I need to have my own gff file, so I modified the gff file from Prodigal output, which change :
1 Group_1_2_bin.20_contig-100_0 Prodigal_v2.6.3 CDS 3 449 62.4 + 0 ID=0_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.651;conf=100.00;score=62.40;cscore=59.18;sscore=3.22;rsco re=0.00;uscore=0.00;tscore=3.22;
2 Group_1_2_bin.20_contig-100_0 Prodigal_v2.6.3 CDS 658 1245 11.3 - 0 ID=0_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.505;conf=93.04;score=11.28;cscore=6.79;sscore=4.49;rsc ore=1.54;uscore=0.03;tscore=3.57;
to
1 Group_1_2_bin.20_contig-100_0 Prodigal_v2.6.3 CDS 3 449 62.4 + 0 ID=Group_1_2_bin.20_contig-100_0_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.651;conf=100.00;score=62.40;cscore=59.18;sscore=3.22;rsco re=0.00;uscore=0.00;tscore=3.22;
2 Group_1_2_bin.20_contig-100_0 Prodigal_v2.6.3 CDS 658 1245 11.3 - 0 ID=Group_1_2_bin.20_contig-100_0_2;partial=00;start_type=ATG;rbs_motif=None;rbs_spacer=None;gc_cont=0.505;conf=93.04;score=11.28;cscore=6.79;sscore=4.49;rsc ore=1.54;uscore=0.03;tscore=3.57;

I don't know if this is wrong.

@ZhengJinfang1220
Copy link

Hi Jinfang, the cgc_standard.out looks like this: $ head cgc_standard.out CGC# Gene Type Contig ID Protein ID Gene Start Gene Stop Direction Protein Family CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_47 44478 45509 - 1.A.33.1.5 CGC1 CAZyme Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_48 45726 48518 + GH2|GH2_e50 CGC1 STP Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_65 72550 73521 - SIS+CBS+CBS CGC1 CAZyme Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_71 77847 79862 - GH36|GH36_e10 CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_76 87219 88157 + 3.A.2.1.7 CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_80 89523 91055 + 3.A.2.1.7 CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_81 91077 91949 + 3.A.2.1.1 CGC1 TC Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_82 92024 93445 + 3.A.2.1.2 CGC1 STP Group_1_2_bin.20_contig-100_0 Group_1_2_bin.20_contig-100_0_87 96421 97605 + Aminotran_1_2

Thank you, Ben,
It looks normal in your input file.
Can you check the function "Read_cgc_standard_out", this function should defined at line 199 in file "/home/cdd/anaconda3/envs/dbcan/lib/python3.8/site-packages/dbcan/utils/utils.py". The code in line 201 in the same file means to open and read the file "cgc_standard.out" by line. It should skip the 1st line which is the header. Like in the following screenshot. Can you check what these are like in your code?

20240619094256

@ZhengJinfang1220
Copy link

Group_1_2_bin.20_contig-100_0_47

Yes, you did the correct modification on gff file. And you got the output file "cgc_standard.out". Otherwise, you can not get this output.

@Ben-41
Copy link
Author

Ben-41 commented Jun 19, 2024

image

@ZhengJinfang1220
Copy link

'Gene Start

It seems the codes also look normal. So, what happens? Could you check the input file again, to look for another string "Gene Start" except for 1st line? If this still does not work.

Can you send me all the input files? I will debug on my PC.
Here is my email: zhengjinfang1220@gamil.com

@powerby66
Copy link

Hello,i want to ask you some questions please,
dbcan_utils fam_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPMoutput/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
dbcan_utils CGC_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
dbcan_utils CGC_substrate_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...
(/data2/jpc/env/dbcan414) jpc 18:09:17 ~/project/db_test/output/EscheriaColiK12MG1655_abund
dbcan_utils fam_substrate_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...
(/data2/jpc/env/dbcan414) jpc 18:09:18 ~/project/db_test/output/EscheriaColiK12MG1655_abund
dbcan_utils CGC_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPMbash: dbcan_utils: command not found...
(/data2/jpc/env/dbcan414) jpc 18:09:18 ~/project/db_test/output/EscheriaColiK12MG1655_abund
dbcan_utils CGC_substrate_abund -bt /home/jpc/project/db_test/output/EscheriaColiK12MG1655_abund/EscheriaColiK12MG1655.depth.txt -i /home/jpc/project/db_test/output/EscheriaColiK12MG1655_fna.dbCAN -a TPM
bash: dbcan_utils: command not found...
but i cant find the utils.py in /data2/jpc/env/dbcan414/lib/python3.7/site-packages/dbcan/utils,how can i solve this question?
1718964724642

@PaolaDiGianvito
Copy link

Hi, I have found a similar issue.
I'm trying ti follow the tutorial on raw reads. I have shotgun sequencing. I arrived in the tutorial at this poin: P13. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates (i have skipped the point P12 because I don't need a particular region, is it correct?...
when i run this command:
dbcan_utils fam_abund -bt IS1_EF.depth.txt -i ../subs/IS1_ef.dbCAN -a TPM
i have this error:
you are estimating the abundance of CAZyme!
Reads are single end!
Total read count: 156453394!
Can not find read count information for CAZyme: k141_10018_1.
In the directory IS3_ef.dbCAN i have all the 17 files...Can you help me?

@powerby66
Copy link

Hi, I have found a similar issue. I'm trying ti follow the tutorial on raw reads. I have shotgun sequencing. I arrived in the tutorial at this poin: P13. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates (i have skipped the point P12 because I don't need a particular region, is it correct?... when i run this command: dbcan_utils fam_abund -bt IS1_EF.depth.txt -i ../subs/IS1_ef.dbCAN -a TPM i have this error: you are estimating the abundance of CAZyme! Reads are single end! Total read count: 156453394! Can not find read count information for CAZyme: k141_10018_1. In the directory IS3_ef.dbCAN i have all the 17 files...Can you help me?

Hello,Have you successfully solved this problem?I met the question too

@PaolaDiGianvito
Copy link

PaolaDiGianvito commented Jul 1, 2024 via email

@ZhengJinfang1220
Copy link

Not yet, really. I have used metaeuk for genes prediction. Could be this the problem? Il Lun 1 Lug 2024, 11:11 powerby66 @.> ha scritto:

Hi, I have found a similar issue. I'm trying ti follow the tutorial on raw reads. I have shotgun sequencing. I arrived in the tutorial at this poin: P13. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates (i have skipped the point P12 because I don't need a particular region, is it correct?... when i run this command: dbcan_utils fam_abund -bt IS1_EF.depth.txt -i ../subs/IS1_ef.dbCAN -a TPM i have this error: you are estimating the abundance of CAZyme! Reads are single end! Total read count: 156453394! Can not find read count information for CAZyme: k141_10018_1. In the directory IS3_ef.dbCAN i have all the 17 files...Can you help me? Hello,Have you successfully solved this problem?I met the question too — Reply to this email directly, view it on GitHub <#179 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/BBFTCGZWCPNEMRKC3EXOXMLZKEMMFAVCNFSM6AAAAABJQCXDWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJZGYZTIMRRGM . You are receiving this because you commented.Message ID: @.
>

Not yet, really. I have used metaeuk for genes prediction. Could be this the problem? Il Lun 1 Lug 2024, 11:11 powerby66 @.> ha scritto:

Hi, I have found a similar issue. I'm trying ti follow the tutorial on raw reads. I have shotgun sequencing. I arrived in the tutorial at this poin: P13. dbcan_utils to calculate the abundance of CAZyme families, subfamilies, CGCs, and substrates (i have skipped the point P12 because I don't need a particular region, is it correct?... when i run this command: dbcan_utils fam_abund -bt IS1_EF.depth.txt -i ../subs/IS1_ef.dbCAN -a TPM i have this error: you are estimating the abundance of CAZyme! Reads are single end! Total read count: 156453394! Can not find read count information for CAZyme: k141_10018_1. In the directory IS3_ef.dbCAN i have all the 17 files...Can you help me? Hello,Have you successfully solved this problem?I met the question too — Reply to this email directly, view it on GitHub <#179 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/BBFTCGZWCPNEMRKC3EXOXMLZKEMMFAVCNFSM6AAAAABJQCXDWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCOJZGYZTIMRRGM . You are receiving this because you commented.Message ID: @.
>

Hi, guys. We have fixed the bug in the updated version of dbCAN(several months ago). If still use the older version. please follow the steps:
1 please check "line 93, utils.py", the definition of ReadBedtoos
2 please modify line 95: seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[1:]}
to seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[0:]}, a bug ignoring the first line of read count information.

@PaolaDiGianvito
Copy link

Thank you, i have another question.
I have used MetaEuk gor genes prediction, consequently i have to generate file.ffn with bedtools, what file is better to use? The output of metaeuk or the profigal.gff files generated at the substrate prediction? Than you
Paola

@yinlabniu
Copy link
Collaborator

yinlabniu commented Jul 1, 2024 via email

@PaolaDiGianvito
Copy link

PaolaDiGianvito commented Jul 1, 2024 via email

@PaolaDiGianvito
Copy link

PaolaDiGianvito commented Jul 1, 2024 via email

@yinlabniu
Copy link
Collaborator

yinlabniu commented Jul 1, 2024 via email

@PaolaDiGianvito
Copy link

PaolaDiGianvito commented Jul 2, 2024 via email

@PaolaDiGianvito
Copy link

Good morning,
i have tried to run my data from contigs of megahit avoiding metaeuk step and using file.fna and meta for CAZyme annotation, followed by ffn generation from prodigal gffr files and stp 8 and 11. At sep 13 I have ever this error
dbcan_utils fam_abund -bt GC1_D2.depth.txt -i ../GC1_D2.CAZyme -a TPM
You are estimating the abundance of CAZyme!
Reads are single end!
Total reads count: 43824341!
Can not find read count information for CAZyme: k141_111_4
...
I have changed line 95 in utils.py as suggested
def ReadBedtoos(filename):
lines = open(filename).readlines()
seqid2info = {line.split()[0]:bedtools_read_count(line.split()) for line in lines[0:]}
normalized_tpm = 0.
for seqid in seqid2info:
seqid_depth = seqid2info[seqid]
normalized_tpm += seqid_depth.read_count/seqid_depth.length
return seqid2info,normalized_tpm

Can you help me?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants