MSC4162: One-Time Key Reset Endpoint #4162

kegsay · 2024-07-02T13:13:39Z

BillCarsonFr

Some comments

BillCarsonFr · 2024-07-02T13:46:13Z

proposals/4162-otk-reset.md

+
+### Proposal
+
+#### Reset endpoint


There is a list of why OTKs could get desync, and an end-point to reset and re-resync.

But how technically do we detect that there is a desync? Unless I missed it I don't think it's noted here.

Would a new end-point that would allow to check that client/server are in sync would be helpfull?
Or will this be always detected too late?

We can't detect desyncs because the client and server are not continually checking (and they do need to continually check, as DB corruption / rollbacks can happen at arbitrary points in time). Only /keys/upload does some basic sync checks by ensuring that if the client says key ID AAAAA is bytes X that it remains the case on the next upload (assuming the key hasn't been used).

BillCarsonFr · 2024-07-02T13:46:48Z

proposals/4162-otk-reset.md

@@ -0,0 +1,90 @@
+## MSC4162: One-Time Key Reset Endpoint


nit: Looks like this MSC is covering more than just reset.

BillCarsonFr · 2024-07-02T13:49:46Z

proposals/4162-otk-reset.md

+
+This does not define a sort ordering (i.e _which_ key should be given out to the caller). This MSC proposes tweaking this wording to be:
+
+> Claims one-time keys for use in pre-key messages. The key claimed MUST be the lowest lexicographically sorted key ID for that algorithm. For example, given two key IDs of `AAAAAu` and `AAAAHg`, `AAAAAu` MUST be returned before `AAAAHg`.


I think there have been discussion for the KeyId to be the public key and not an "increment" like now. Would it be possible to have conflicts here? or it's not important? What about just using a timestamp ?

There needs to be a way to sort the keys and perhaps more importantly, to generate new keys > old keys. Base 64 public keys don't allow for this, so that would make sorting impossible. We'd probably need to use timestamps.

+1 for timestamps here.

Per element-hq/element-meta#2356 (comment): lexicographical ordering on key IDs turns out to be somewhat disastrous anyway

BillCarsonFr · 2024-07-02T13:52:10Z

proposals/4162-otk-reset.md

+
+### Rationale
+
+The primary motivation of this endpoint is to provide a way for a client and server to resync OTKs. To do this, a client would call `/keys/reset` with `all: true` to delete all OTKs on the server. Once this 200 OKs, the client is then safe to perform `/keys/upload` with a fresh set of OTKs. This functionality is particularly important for existing clients who currently have desynced OTKs.


I wonder if we don't want something a bit different for fallback keys as they have a longer life time (option to not reset them?). As it may interfer with #4081

Fallback keys are not one-time keys, so wouldn't be reset here. Fallback keys can be replaced in /keys/upload so a new endpoint isn't really needed.

turt2live · 2024-07-02T15:31:03Z

proposals/4162-otk-reset.md

+
+
+### Context
+One-time keys (OTKs) consist of two parts: a public part which is stored on the server and a private part which is stored on the client device. One-time keys can be uploaded to the server with the `/keys/upload` endpoint. They can then be consumed via the `/keys/claim` endpoint. The client is kept informed of the total number of OTKs stored on the server via the `/sync` endpoint, with the JSON key `device_one_time_keys_count`. It is critical that the set of keys on the client and server remain in-sync, otherwise an encrypted channel between two devices cannot be established. "In-sync" is defined by the set of public keys on the server having their matching private keys on the client.


Please wrap lines to ~100 characters

turt2live · 2024-07-02T15:32:49Z

proposals/4162-otk-reset.md

Implementation requirements:

Client

Server

turt2live

Early review while I'm here triaging - I haven't reviewed the whole thing yet.

turt2live · 2024-07-02T15:33:35Z

proposals/4162-otk-reset.md

+
+#### Client issues
+
+ - If the request to `/keys/upload` fails, the client must retry (including over restarts) the _same set of keys_ as it is not possible to know if the server received the request and has stored them already. If this doesn't happen, the server may return HTTP 400 errors indicating this. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1415)]


This can probably be fixed without a whole new endpoint?

Yes, but it outlines a common failure mode which causes OTKs to desync.

It would probably be helpful to break this down into "implementation bugs which could theoretically be fixed" vs "real protocol problems" (such as backup rollback)

turt2live · 2024-07-02T15:33:59Z

proposals/4162-otk-reset.md

+ - If the client runs across multiple processes, both processes can upload different sets of OTKs in response to some trigger event (e.g being notified that a OTK has been consumed via `/sync`). [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/3110)]
+ - Clients should only delete a OTK once the key has been successfully used. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1761)]


These two feel like implementation bugs rather than spec concerns.

Yes, but it outlines a common failure mode which causes OTKs to desync.

turt2live · 2024-07-02T15:34:51Z

proposals/4162-otk-reset.md

+ - Server databases may be rolled back (e.g during a bad upgrade process).
+
+#### Protocol issues
+ - Clients only store a limited set of OTKs. When they delete excess OTKs they have no mechanism to tell the server to delete those old OTKs.


Why are they deleting them? If it's just good hygiene, do we add expiry timestamps to the /upload endpoint instead?

Clients delete them to not be forced into storing an unbounded set of OTKs. Clients try to keep to a number (typically 50 or 100) and when the server says the count is < this the client uploads more. But it can't delete any key as there may be in-flight requests. Hence, you can cause clients to use an unbounded amount of disk space by repeatedly telling the client that the OTK count is 0. This is bad. JS SDK and Rust SDK have maximum bounds to the number of OTKs it will store, currently this is set to 5000.

turt2live · 2024-07-02T15:35:24Z

proposals/4162-otk-reset.md

+
+#### Protocol issues
+ - Clients only store a limited set of OTKs. When they delete excess OTKs they have no mechanism to tell the server to delete those old OTKs.
+ - Multiple users could hit `/keys/claim` at the same time as a client is uploading OTKs, which creates race conditions which can cause the same key to be sent to multiple users. [[issue](https://github.com/matrix-org/matrix-spec/issues/1124)]


This feels like an implementation issue, though the spec may need to be clarified how the locking works.

turt2live · 2024-07-02T15:36:10Z

proposals/4162-otk-reset.md

+ - Clients only store a limited set of OTKs. When they delete excess OTKs they have no mechanism to tell the server to delete those old OTKs.
+ - Multiple users could hit `/keys/claim` at the same time as a client is uploading OTKs, which creates race conditions which can cause the same key to be sent to multiple users. [[issue](https://github.com/matrix-org/matrix-spec/issues/1124)]
+
+### Proposal


The problem feels under-described: why is a reset required?

The primary motivation of this endpoint is to provide a way for a client and server to resync OTKs. [...] This functionality is particularly important for existing clients who currently have desynced OTKs.

There's existing devices who have desynced OTKs. This is guaranteed to cause UTDs when those keys get used. It would be nice to fix that.

As I see it, there are basically three usecases for a full reset:

We know we have bugs in clients; we should fix those, but in the meantime we can improve resilience for selected deployments by implementing a healing mechanism.

We know we have bugs in clients; we will fix those in time, but that doesn't help people who are already broken. We will need a healing mechanism for those users.

There are some problems (in particular, client or server rollback from backup) that aren't really bugs, but will result in getting out of sync. Again, we need a healing mechanism.

Our best idea for resolving all three situations is to blow away all the OTKs on the server and start again.

It would be good if the MSC spelled this out.

proposals/4162-otk-reset.md

Co-authored-by: Andy Balaam <andy.balaam@matrix.org>

richvdh · 2024-07-04T15:33:11Z

proposals/4162-otk-reset.md

+
+#### Client issues
+
+ - If the request to `/keys/upload` fails, the client must retry (including over restarts) the _same set of keys_ as it is not possible to know if the server received the request and has stored them already. If this doesn't happen, the server may return HTTP 400 errors indicating this. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1415)]


I don't really follow this. Are you saying "clients might forget to re-upload keys if they do not get a successful response"? Why would that cause a 400?

richvdh · 2024-07-04T15:34:41Z

proposals/4162-otk-reset.md

+
+#### Client issues
+
+ - If the request to `/keys/upload` fails, the client must retry (including over restarts) the _same set of keys_ as it is not possible to know if the server received the request and has stored them already. If this doesn't happen, the server may return HTTP 400 errors indicating this. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1415)]


It would probably be helpful to break this down into "implementation bugs which could theoretically be fixed" vs "real protocol problems" (such as backup rollback)

richvdh · 2024-07-04T15:35:31Z

proposals/4162-otk-reset.md

+
+ - If the request to `/keys/upload` fails, the client must retry (including over restarts) the _same set of keys_ as it is not possible to know if the server received the request and has stored them already. If this doesn't happen, the server may return HTTP 400 errors indicating this. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1415)]
+ - If the client runs across multiple processes, both processes can upload different sets of OTKs in response to some trigger event (e.g being notified that a OTK has been consumed via `/sync`). [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/3110)]
+ - Clients should only delete a OTK once the key has been successfully used. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1761)]


Suggested change

- Clients should only delete a OTK once the key has been successfully used. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1761)]

- Clients may incorrectly delete the private part of an OTK before it is successfully used. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1761)]

richvdh · 2024-07-04T15:38:24Z

proposals/4162-otk-reset.md

+
+This does not define a sort ordering (i.e _which_ key should be given out to the caller). This MSC proposes tweaking this wording to be:
+
+> Claims one-time keys for use in pre-key messages. The key claimed MUST be the lowest lexicographically sorted key ID for that algorithm. For example, given two key IDs of `AAAAAu` and `AAAAHg`, `AAAAAu` MUST be returned before `AAAAHg`.


+1 for timestamps here.

richvdh · 2024-07-04T15:39:20Z

proposals/4162-otk-reset.md

+ A new endpoint `POST /keys/reset` is added. It accepts the following request body:
+
+ ```js
+ {
+    "all": true|false
+    "key_ids": [
+        "<algorithm>:<key_id>", ...
+    ]
+ }
+ ```
+- If `all` is `true`, all stored one-time keys for the calling device are deleted on the server, and `key_ids` is ignored.
+- If `all` is `false`, every key specified in `key_ids` is deleted from the database. An example key ID is `signed_curve25519:AAAAHQ`.


This feels like two separate endpoints to me. Indeed maybe we should make the selective-delete endpoint a separate MSC: it's quite a different usecase

richvdh · 2024-07-04T15:45:41Z

proposals/4162-otk-reset.md

+ - Clients only store a limited set of OTKs. When they delete excess OTKs they have no mechanism to tell the server to delete those old OTKs.
+ - Multiple users could hit `/keys/claim` at the same time as a client is uploading OTKs, which creates race conditions which can cause the same key to be sent to multiple users. [[issue](https://github.com/matrix-org/matrix-spec/issues/1124)]
+
+### Proposal


As I see it, there are basically three usecases for a full reset:

We know we have bugs in clients; we should fix those, but in the meantime we can improve resilience for selected deployments by implementing a healing mechanism.

We know we have bugs in clients; we will fix those in time, but that doesn't help people who are already broken. We will need a healing mechanism for those users.

There are some problems (in particular, client or server rollback from backup) that aren't really bugs, but will result in getting out of sync. Again, we need a healing mechanism.

Our best idea for resolving all three situations is to blow away all the OTKs on the server and start again.

It would be good if the MSC spelled this out.

richvdh · 2024-07-04T15:48:43Z

proposals/4162-otk-reset.md

+
+
+### Context
+One-time keys (OTKs) consist of two parts: a public part which is stored on the server and a private part which is stored on the client device. One-time keys can be uploaded to the server with the `/keys/upload` endpoint. They can then be consumed via the `/keys/claim` endpoint. The client is kept informed of the total number of OTKs stored on the server via the `/sync` endpoint, with the JSON key `device_one_time_keys_count`. It is critical that the set of keys on the client and server remain in-sync, otherwise an encrypted channel between two devices cannot be established. "In-sync" is defined by the set of public keys on the server having their matching private keys on the client.


It is critical that the set of keys on the client and server remain in-sync

I feel like this is misleading. What we really mean is "the set of keys on the server is a subset of those on the client". It's not a problem if the client has extra keys (generally, we assume they've been claimed but not used, but in any case they are harmless, modulo the fact they take up slots in the client).

kegsay added 2 commits July 2, 2024 14:13

Add MSC4162

267add4

spellcheck

4844737

kegsay mentioned this pull request Jul 2, 2024

Epic: One-Time Keys can desync between client and server causing UTDs element-hq/element-meta#2406

Open

BillCarsonFr reviewed Jul 2, 2024

View reviewed changes

turt2live added proposal A matrix spec change proposal client-server Client-Server API kind:maintenance MSC which clarifies/updates existing spec needs-implementation This MSC does not have a qualifying implementation for the SCT to review. The MSC cannot enter FCP. labels Jul 2, 2024

turt2live reviewed Jul 2, 2024

View reviewed changes

proposals/4162-otk-reset.md

Copy link

Member

turt2live Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation requirements:

Client

Server

turt2live reviewed Jul 2, 2024

View reviewed changes

andybalaam reviewed Jul 3, 2024

View reviewed changes

proposals/4162-otk-reset.md Outdated Show resolved Hide resolved

Update proposals/4162-otk-reset.md

5aeac43

Co-authored-by: Andy Balaam <andy.balaam@matrix.org>

richvdh reviewed Jul 4, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MSC4162: One-Time Key Reset Endpoint #4162

MSC4162: One-Time Key Reset Endpoint #4162

kegsay commented Jul 2, 2024 •

edited

Loading

BillCarsonFr left a comment

BillCarsonFr Jul 2, 2024 •

edited

Loading

kegsay Jul 2, 2024

BillCarsonFr Jul 2, 2024

BillCarsonFr Jul 2, 2024

kegsay Jul 2, 2024

richvdh Jul 4, 2024

richvdh Jul 9, 2024

BillCarsonFr Jul 2, 2024

kegsay Jul 2, 2024 •

edited

Loading

turt2live Jul 2, 2024

turt2live Jul 2, 2024

turt2live left a comment

turt2live Jul 2, 2024

kegsay Jul 2, 2024

richvdh Jul 4, 2024

turt2live Jul 2, 2024

kegsay Jul 2, 2024

turt2live Jul 2, 2024

kegsay Jul 2, 2024

turt2live Jul 2, 2024

turt2live Jul 2, 2024

kegsay Jul 2, 2024

richvdh Jul 4, 2024

richvdh Jul 4, 2024

richvdh Jul 4, 2024

richvdh Jul 4, 2024

richvdh Jul 4, 2024

richvdh Jul 4, 2024

richvdh Jul 4, 2024

richvdh Jul 4, 2024


		This does not define a sort ordering (i.e _which_ key should be given out to the caller). This MSC proposes tweaking this wording to be:

		> Claims one-time keys for use in pre-key messages. The key claimed MUST be the lowest lexicographically sorted key ID for that algorithm. For example, given two key IDs of `AAAAAu` and `AAAAHg`, `AAAAAu` MUST be returned before `AAAAHg`.


		### Rationale

		The primary motivation of this endpoint is to provide a way for a client and server to resync OTKs. To do this, a client would call `/keys/reset` with `all: true` to delete all OTKs on the server. Once this 200 OKs, the client is then safe to perform `/keys/upload` with a fresh set of OTKs. This functionality is particularly important for existing clients who currently have desynced OTKs.



		### Context
		One-time keys (OTKs) consist of two parts: a public part which is stored on the server and a private part which is stored on the client device. One-time keys can be uploaded to the server with the `/keys/upload` endpoint. They can then be consumed via the `/keys/claim` endpoint. The client is kept informed of the total number of OTKs stored on the server via the `/sync` endpoint, with the JSON key `device_one_time_keys_count`. It is critical that the set of keys on the client and server remain in-sync, otherwise an encrypted channel between two devices cannot be established. "In-sync" is defined by the set of public keys on the server having their matching private keys on the client.


		#### Client issues

		- If the request to `/keys/upload` fails, the client must retry (including over restarts) the _same set of keys_ as it is not possible to know if the server received the request and has stored them already. If this doesn't happen, the server may return HTTP 400 errors indicating this. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1415)]

		- If the client runs across multiple processes, both processes can upload different sets of OTKs in response to some trigger event (e.g being notified that a OTK has been consumed via `/sync`). [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/3110)]
		- Clients should only delete a OTK once the key has been successfully used. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1761)]

	- Clients should only delete a OTK once the key has been successfully used. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1761)]
	- Clients may incorrectly delete the private part of an OTK before it is successfully used. [[issue](https://github.com/matrix-org/matrix-rust-sdk/issues/1761)]

MSC4162: One-Time Key Reset Endpoint #4162

Are you sure you want to change the base?

MSC4162: One-Time Key Reset Endpoint #4162

Conversation

kegsay commented Jul 2, 2024 • edited Loading

BillCarsonFr left a comment

Choose a reason for hiding this comment

BillCarsonFr Jul 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kegsay Jul 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

turt2live left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kegsay commented Jul 2, 2024 •

edited

Loading

BillCarsonFr Jul 2, 2024 •

edited

Loading

kegsay Jul 2, 2024 •

edited

Loading