
High Load Issue: dbcan_sub Creating Excessive Threads #151

Closed
trx296554555 opened this issue Jan 14, 2024 · 2 comments

Comments

@trx296554555

Thank you for your hard work and recent updates. However, I want to bring to your attention an ongoing issue in the latest version, run_dbcan 4.1.1: when processing large input sequence files, dbcan_sub creates an excessive number of threads, resulting in high system load. (See also #117.)

This issue persists even when specifying parameters such as --dbcan_thread and --hmm_cpu, as there seems to be no effective limitation on the number of threads being created.

After reviewing the code of run_dbcan.py, I have identified that the issue lies in the function split_uniInput: this section of code directly launches as many subprocesses as there are small files generated by splitting the large input file. https://github.com/linnabrown/run_dbcan/blob/707aed21a0ef455828126f1afb5820963e8274ca/dbcan/cli/run_dbcan.py#L139C1-L157C22
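To illustrate the pattern described above (this is a hypothetical, simplified sketch, not the actual run_dbcan code; `split_files` and the `true` command stand in for the real split chunks and hmmsearch invocations):

from subprocess import Popen

# Stand-in for the chunks produced by splitting a large input file.
split_files = [f"uniInput.part_{i}" for i in range(4)]

# The problematic pattern: one subprocess per split file, with no upper bound,
# so a large input spawns hundreds of concurrent processes.
procs = [Popen(["true"]) for _ in split_files]
for p in procs:
    p.wait()

With a real multi-gigabyte input, the number of split files (and thus concurrent hmmsearch processes) is unbounded, which is what drives the load spike.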

I modified this section to prevent excessive load when running it myself, by adding a simple thread pool. I'm unsure whether this could affect other parts of the program, so I offer it as a reference only.

from concurrent.futures import ThreadPoolExecutor, as_completed
from subprocess import Popen

def run_command(cmd):
    # Run one command to completion in a subprocess and return it.
    hmmer = Popen(cmd)
    hmmer.wait()
    return cmd

# Cap the number of concurrent hmmsearch subprocesses at --dbcan_thread.
max_workers = dbcan_thread
cmds = []
for j in split_files:
    cmds.append(["hmmsearch", "--domtblout", f"{outPath}d{j}", "--cpu", "2", "-o", "/dev/null",
                 f"{dbDir}dbCAN_sub.hmm", f"{outPath}{j}"])

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = [executor.submit(run_command, cmd) for cmd in cmds]
    for future in as_completed(futures):
        try:
            command = future.result()
            print(f"Command: {' '.join(command)} completed.")
        except Exception as e:
            print(f"An error occurred: {e}")

Best,
Robin

@linnabrown
Owner

Hi Robin,

Thank you so much for bringing this up. We previously used this approach because hmmscan does not support multithreading, but hmmsearch does. Therefore, we will remove the multiprocessing part and rely on the built-in multithreading of hmmsearch instead.

Let me delete and test that code, and I will publish version 4.1.2. Thank you so much!
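Roughly, the planned change could look like this (a hypothetical sketch only: the exact paths, output names, and parameter plumbing in 4.1.2 may differ; here we only build and print the command rather than invoke hmmsearch):

# One hmmsearch call over the whole un-split input, using hmmsearch's
# built-in "--cpu" threading instead of launching many subprocesses.
dbDir, outPath, hmm_cpu = "db/", "output/", 8  # hypothetical values

cmd = ["hmmsearch", "--domtblout", f"{outPath}dtemp.out", "--cpu", str(hmm_cpu),
       "-o", "/dev/null", f"{dbDir}dbCAN_sub.hmm", f"{outPath}uniInput"]
print(" ".join(cmd))

This way the concurrency is bounded by a single, explicit --cpu value instead of by the number of split files.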

Best,
Le

@HaidYi
Collaborator

HaidYi commented Jan 16, 2024

Our 4.1.2 version has been released. Problem solved.

@HaidYi HaidYi closed this as completed Jan 16, 2024