Periodically pause copying ledger nodes during online_delete #4907

Open
wants to merge 32 commits into
base: develop

ximinez
Collaborator

@ximinez ximinez commented Jan 31, 2024

High Level Overview of Change

Mitigates disk write contention while old ledgers are being deleted, and specifically while the full ledger is being copied to the "new" node store.

Context of Change

PR #4503 (reverted by #4882) attempted to improve rippled performance by writing batches to NuDB asynchronously. However, it had an unintended side effect: when online_delete writes the entire ledger to disk, it tends to fill the write buffer, which blocks new ledgers from being persisted.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • Performance (increase or change in throughput and/or latency)

Before / After

Cost: This change will cause online_delete to take significantly longer to copy the full ledger from the old node store to the new one, which is the last significant step in the process.
Benefit: rippled should be much less likely to desync, and should put less load on the disk during online_delete.

online_delete is still a demanding process, so this won't be a panacea, but it should be a significant improvement. (The size of the improvement is still to be measured.)

Performance details

  1. This is an improvement to existing functionality, which could be considered a bug fix.
  2. The change impacts node store writes. Specifically, it should reduce contention between online_delete and writing new ledgers.
  3. The impact should be measured in a couple of different ways.
    1. rippled should struggle less to stay synced, and to perform its other functions, during online_delete.
    2. rippled should put less strain/demand on the disk during online_delete.
  4. This change affects concurrent processing, in the sense that multiple threads are writing to the node store, especially during online_delete.

Note that back_off_milliseconds is configurable, defaulting to 100. Node operators can further de-prioritize online_delete operations by increasing this value to whatever they are comfortable with.
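
To get a rough feel for the cost side of this trade-off, the sketch below estimates how much time the copy would spend sleeping if it pauses for `back_off_milliseconds` after every fixed number of copied nodes. `addedDelay` is a hypothetical helper written for this illustration, not code from this PR, and the node count and interval passed to it are assumptions the reader supplies; it ignores the time spent actually writing.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not part of rippled): total time spent sleeping
// when the copy pauses for `backOff` after every `interval` nodes.
// Only counts pauses; the write time itself is not modeled.
constexpr std::chrono::milliseconds
addedDelay(
    std::uint64_t nodeCount,
    std::uint64_t interval,
    std::chrono::milliseconds backOff)
{
    // An interval of zero is treated as "never pause".
    if (interval == 0)
        return std::chrono::milliseconds{0};

    // One pause per completed group of `interval` nodes.
    auto const pauses =
        static_cast<std::chrono::milliseconds::rep>(nodeCount / interval);
    return backOff * pauses;
}
```

For example, `addedDelay(nodeCount, 32768, std::chrono::milliseconds{100})` models the default backoff with a 32k-node interval, where `nodeCount` is whatever estimate of ledger size the operator has. Raising `back_off_milliseconds` scales the result linearly.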

* Gives other processes (notably ledger persistence during consensus)
  more time to complete their writes.
* The period is set as half the node store batch write limit size, so
  that there is plenty of room for other processes.
* Reuses the `back_off_milliseconds` configuration value, which
  is used for other database delays.
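
The mechanism described in the commit message above can be sketched roughly as follows: copy nodes one at a time and sleep for the backoff period every `interval` nodes, so that other writers get a turn. This is a minimal illustration under stated assumptions, not the actual change in SHAMapStoreImp.cpp. `Node`, `CopyStats`, `copyWithBackoff`, and `sleepFor` are hypothetical names, and the store and sleep steps are passed in as callbacks so the loop can be exercised without a real node store.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a ledger node; the real rippled type differs.
using Node = std::uint64_t;

struct CopyStats
{
    std::uint64_t copied = 0;  // nodes handed to the store callback
    std::uint64_t pauses = 0;  // number of times the copy backed off
};

// Copy every node via `store`, calling `sleep(backOff)` after each
// group of `interval` nodes so other writers can make progress.
template <class Store, class Sleep>
CopyStats
copyWithBackoff(
    std::vector<Node> const& nodes,
    Store&& store,
    std::uint64_t interval,
    std::chrono::milliseconds backOff,
    Sleep&& sleep)
{
    assert(interval > 0);  // precondition: a zero interval is meaningless here

    CopyStats stats;
    for (Node const& node : nodes)
    {
        store(node);
        if (++stats.copied % interval == 0)
        {
            sleep(backOff);
            ++stats.pauses;
        }
    }
    return stats;
}

// Real sleep, suitable as the `sleep` argument outside of tests.
inline void
sleepFor(std::chrono::milliseconds d)
{
    std::this_thread::sleep_for(d);
}
```

A caller would pass `sleepFor` as the last argument in production-like use, and a lambda that merely records the requested duration in a test. Taking the sleep as a parameter is a design choice made for this sketch only.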
@ximinez ximinez added the Bug and Perf Attn Needed (Attention needed from RippleX Performance Team) labels Jan 31, 2024
@ximinez ximinez added this to the 2.1.0 (Mar 2024) milestone Jan 31, 2024
@ximinez
Collaborator Author

ximinez commented Jan 31, 2024

Internal tracker: https://ripplelabs.atlassian.net/browse/RPFC-107

@codecov-commenter

codecov-commenter commented Jan 31, 2024

Codecov Report

Attention: Patch coverage is 40.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 71.4%. Comparing base (0f32109) to head (88ce966).


@@           Coverage Diff           @@
##           develop   #4907   +/-   ##
=======================================
  Coverage     71.4%   71.4%           
=======================================
  Files          796     796           
  Lines        67031   67035    +4     
  Branches     10867   10866    -1     
=======================================
+ Hits         47827   47833    +6     
+ Misses       19204   19202    -2     
| Files | Coverage Δ |
|---|---|
| src/xrpld/app/misc/SHAMapStoreImp.h | 96.3% <100.0%> (ø) |
| src/xrpld/app/misc/SHAMapStoreImp.cpp | 73.3% <25.0%> (-1.1%) ⬇️ |

... and 3 files with indirect coverage changes

Collaborator

@mtrippled mtrippled left a comment

👍 LGTM

* upstream/develop:
  Set version to 2.1.0-rc1
  `fixInnerObjTemplate`: set inner object template (4906)
  feat: allow `port_grpc` to be specified in `[server]` stanza (4728)
  build: add headers needed in Conan package for libxrpl (4885)
  `fixNFTokenReserve`: ensure NFT tx fails when reserve is not met (4767)
  fix(libxrpl): change library names in Conan recipe (4831)
  test: check for success/failure of Windows CI unit tests (4871)
* upstream/develop:
  Set version to 2.1.0
  test: guarantee proper lifetime for temporary Rules object: (4917)
Collaborator

@scottschurr scottschurr left a comment

👍 LGTM!

* upstream/develop:
  fix compile error on gcc 13: (4932)
  Price Oracle (XLS-47d): (4789) (4789)
* upstream/develop:
  build: add STCurrency.h to xrpl_core to fix clio build (4939)
  feat: add user version of `feature` RPC (4781)
  Fast base58 codec: (4327)
  Remove default ctors from SecretKey and PublicKey: (4607)
* upstream/develop:
  Update remaining actions (4949)
  Upgrade to xxhash 0.8.2 as a Conan requirement, enable SIMD hashing (4893)
  Fix workflows (4948)
  fix: order book update variable swap: (4890)
  Embed patched recipe for RocksDB 6.29.5 (4947)
* upstream/develop:
  Install more public headers (4940)
* upstream/develop:
  test: Env unit test RPC errors return a unique result: (4877)
* upstream/develop:
  fixXChainRewardRounding: round reward shares down: (4933)
  Don't reach consensus as quickly if no other proposals seen: (4763)
  Write improved `forAllApiVersions` used in NetworkOPs (4833)
  Remove zaphod.alloy.ee hub from default server list: (4903)
  Enforce no duplicate slots from incoming connections: (4944)
  `fixEmptyDID`: fix amendment to handle empty DID edge case: (4950)
  perf: improve `account_tx` SQL query: (4955)
  Fix workflows (4951)
* upstream/develop:
  chore: change Github Action triggers for build/test jobs (4956)
* upstream/develop:
  Set version to 2.1.1
  fix: improper handling of large synthetic AMM offers:
* upstream/develop:
  chore: Default validator-keys-tool to master branch: (4943)
* upstream/develop:
  Set version to 2.2.0-b2
  fix Conan component reference typo
  Remove unused lambdas from MultiApiJson_test
* Reduce the copy backoff interval from 32k to 128
* Increase the backoff time from 100ms to 2s
* upstream/develop:
  fix: resolve database deadlock: (4989)
  test: verify the rounding behavior of equal-asset AMM deposits (4982)
  test: Add tests to raise coverage of AMM (4971)
  chore: Improve codecov coverage reporting (4977)
  test: Unit test for AMM offer overflow (4986)
  fix amendment to add `PreviousTxnID`/`PreviousTxnLgrSequence` (4751)
* upstream/develop:
  fix: Remove redundant STAmount conversion in test (4996)
* upstream/develop:
  Ignore more commits
  Address compiler warnings
  Add markers around source lists
  Fix source lists
  Rewrite includes
  Format formerly .hpp files
  Rename .hpp to .h
  Simplify protobuf generation
  Consolidate external libraries
  Remove packaging scripts
  Remove unused files
* upstream/develop:
  Set version to 2.2.0-rc1
  fix amendment: AMM swap should honor invariants: (5002)
  Add global access to the current ledger rules:
  chore: fix typos (4958)
  test: Add RPC error checking support to unit tests (4987)
* upstream/develop:
  Remove flow assert: (5009)
  Update list of maintainers: (4984)
* upstream/develop:
  Add external directory to Conan recipe's exports (5006)
  Add missing includes (5011)
@intelliot
Collaborator

Note: As of May 28, 2024 - perf testing is still in progress.

@ximinez ximinez force-pushed the shamapstore-backoff branch 2 times, most recently from 7f88297 to bb4c426 on July 1, 2024 22:03
ximinez and others added 9 commits July 1, 2024 18:08
* commit 'c706926': (23 commits)
  Change order of checks in amm_info: (4924)
  Add the fixEnforceNFTokenTrustline amendment: (4946)
  Replaces the usage of boost::string_view with std::string_view (4509)
  docs: explain how to find a clang-format patch generated by CI (4521)
  XLS-52d: NFTokenMintOffer (4845)
  chore: remove repeat words (5041)
  Expose all amendments known by libxrpl (5026)
  fixReducedOffersV2: prevent offers from blocking order books: (5032)
  Additional unit tests for testing deletion of trust lines (4886)
  Fix conan typo: (5044)
  Add new command line option to make replaying transactions easier: (5027)
  Fix compatibility with Conan 2.x: (5001)
  Set version to 2.2.0
  Set version to 2.2.0-rc3
  Add xrpl.libpp as an exported lib in conan (5022)
  Fix Oracle's token pair deterministic order: (5021)
  Set version to 2.2.0-rc2
  Fix last Liquidity Provider withdrawal:
  Fix offer crossing via single path AMM with transfer fee:
  Fix adjustAmountsByLPTokens():
  ...
* commit 'f6879da':
  Add bin/physical.sh (4997)
  Prepare to rearrange sources: (4997)
* upstream/develop:
  fixInnerObjTemplate2 amendment (5047)
  Set version to 2.3.0-b1
  Ignore restructuring commits (4997)
  Recompute loops (4997)
  Rewrite includes (4997)
  Rearrange sources (4997)
  Move CMake directory (4997)
* upstream/develop:
  fix CTID in tx command returns invalidParams on lowercase hex (5049)
  Invariant: prevent a deleted account from leaving (most) artifacts on the ledger. (4663)
  Bump codecov plugin version to version 4.5.0 (5055)
  fix "account_nfts" with unassociated marker returning issue (5045)
Labels
Bug, Perf Attn Needed (Attention needed from RippleX Performance Team)
Projects
Status: 🏗 In progress

5 participants