
Parallelize file parsing #247

Open
YuriRomanowski opened this issue Dec 16, 2022 · 2 comments

Comments

@YuriRomanowski
Contributor

YuriRomanowski commented Dec 16, 2022

Clarification and motivation

This topic is part of #221.
After we read the file contents, we should parse the files, which is (in theory) a pure action and can be parallelized.
However, since we use a C library under the hood, parallelization may be tricky. Here we can try several approaches and discuss the results.
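For context, the "pure parsing can be parallelized" idea amounts to something like the following toy sketch using `par`/`pseq` from base (roughly what `parList` from the Eval monad does under the hood). Here `parse` is a hypothetical stand-in for the real markdown parser:

```haskell
import GHC.Conc (par, pseq)

-- Hypothetical stand-in for the pure parsing step; the real code would
-- run the markdown parser over the file contents.
parse :: String -> Int
parse = length

-- Spark the parse of each element while evaluating the rest of the list.
parMapSimple :: (a -> b) -> [a] -> [b]
parMapSimple _ []       = []
parMapSimple f (x : xs) =
  let y  = f x
      ys = parMapSimple f xs
  in y `par` (ys `pseq` (y : ys))

main :: IO ()
main = print (parMapSimple parse ["ab", "abc", ""])  -- [2,3,0]
```

Note that sparks only pay off if the sparked value is actually forced; with a foreign parser the FFI call itself happens eagerly inside the call, which is part of why this may be tricky.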

Acceptance criteria

  • Several parallelization approaches are tried
  • We decide how to handle foreign calls during parsing when parallelizing
  • A speedup is obtained and confirmed with measurements
@YuriRomanowski
Contributor Author

I uploaded some commits where different variants of xrefcheck can be load-tested (in the branch YuriRomanowski/#247-parallelize-file-parsing-scaffolding):

  • Original version (from master) with lazy readFile: 2a959d0
  • Lazy readFile replaced with a strict one: b3368c3
  • File reading forced, then files processed in parallel using the Eval monad: a1d5f56
  • File reading forced, then files processed using mapConcurrently: 8f65374

The latter two produce similar results.
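For readers unfamiliar with the last approach: `mapConcurrently` (from the async package) runs an IO action over each element in its own thread. A minimal base-only sketch of the same idea, with hypothetical names:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Toy version of mapConcurrently using only base: fork one thread per
-- item and collect each result through its own MVar. (The real
-- mapConcurrently also propagates exceptions and cancels siblings;
-- this sketch does not.)
mapConcurrently' :: (a -> IO b) -> [a] -> IO [b]
mapConcurrently' f xs = do
  vars <- mapM (\x -> do
            v <- newEmptyMVar
            _ <- forkIO (f x >>= putMVar v)
            pure v) xs
  -- Results come back in input order, regardless of completion order.
  mapM takeMVar vars

main :: IO ()
main = do
  rs <- mapConcurrently' (\n -> pure (n * (2 :: Int))) [1 .. 4]
  print rs  -- [2,4,6,8]
```

Unlike sparks, forked threads make the FFI parsing calls run in genuinely separate Haskell threads, which is why forcing the file reads first matters: otherwise lazy IO would interleave disk reads with the concurrent parsing.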

@Martoon-00
Member

Thanks for this investigation!

I tried, and from what I can see:

  • Repo scanning time does not differ dramatically across these scenarios (0.9 s / 0.7 s / 0.5 s / 0.5 s).
  • My impression is that in this load test there was simply no room for parallelization (this is what we saw in the picture: the Sparks tab shows that a few sparks were bound to different cores, but most of them went to one core, probably simply because parsing was fast enough to be processed entirely there).
  • I created 4 dummy markdown files, 50 KB each, and the sparks solution then showed 4 cores in use.
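For reproducibility, a sparks/eventlog view like the one below can be obtained by building with the threaded runtime and eventlog support and opening the result in ThreadScope (a sketch; the module name is hypothetical):

```shell
# Build with the threaded RTS and eventlog support.
ghc -O2 -threaded -eventlog -rtsopts Main.hs

# Run on 4 capabilities; -ls writes Main.eventlog for ThreadScope.
./Main +RTS -N4 -ls
```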

(the selected area corresponds to repo scanning time)
[Screenshot from 2023-01-31 21-39-58]

Although I'm not exactly sure why the "Activity" graph at the top shows so little CPU usage.
