MSC4140: Delayed events (Futures) #4140
Conversation
> - One example would be redacting an event. It only makes sense to redact the event
>   if it exists. It might be important to have the guarantee that the redaction is
>   received by the server at the time the original message is sent.
> - In the case of a state event we might want to set the state to `A` and, after a
>   timeout, reset it to `{}`. If we have two separate requests, sending `A` could work
>   but the event with content `{}` could fail. The state would then not automatically
>   reset to `{}`.
>
> For this use case an optional `m.send_now` field can be added to the body.
If we had a generalized way to batch-send Matrix events, this could leverage that, and a Future itself would be JUST the future. The batch-send mechanism would then define the semantics for sending the Future and guarantee that the `send_now` event is also sent.
MSC2716 allows bulk-sending events.
It is limited to application services, however, and focuses on historic data. Since we also need the additional capability to use a template `event_id` parameter, it is probably not a good fit.
> To make this as generic as possible, the proposed solution is to allow sending
> multiple presigned events and delegate control of when to actually send these
> events to an external service. This allows a very flexible way to mark events as
> expired, since the sender can choose which event will be sent once expired.
Is it entire events that need to be signed, or just their content? If it's the former, then `/send/future` should behave more like #4080's `/send_pdus`. This could be done by having `/send/future`'s `send_*` fields accept fully-signed events instead of signed content + other fields.
For this to work, there'd also need to be modified `PUT /send` & `PUT /state` endpoints for retrieving the PDUs that need signing.
That would make the client flow of sending a Future as follows:
- call `PUT /send` / `PUT /state` for each event that's to be sent in a Future (ideally, this could be batched)
- sign each retrieved PDU
- put each signed PDU in a request to `/send/future`, placing them in `send_*` fields as desired
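The batching concern in this flow can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the endpoint paths, the `send_on_timeout`/`send_on_action` field names, and the injected `transport`/`sign` helpers are hypothetical, not part of the MSC text.

```python
def send_future_with_signed_pdus(transport, room_id, events, timeout_ms, sign):
    """Sketch of the three-step flow discussed above (shapes are assumptions)."""
    # 1. Retrieve the PDU that needs signing for every event in the group
    #    (the modified PUT /send and PUT /state endpoints are hypothetical).
    pdus = [transport.put(f"/rooms/{room_id}/send/{ev['type']}", ev["content"])
            for ev in events]
    # 2. Sign each retrieved PDU locally; retrying on failure is omitted here.
    signed = [sign(pdu) for pdu in pdus]
    # 3. Only this last step must be batched: all signed PDUs go out together
    #    in a single /send/future request, in the desired send_* fields.
    body = {"timeout": timeout_ms,
            "send_on_timeout": signed[0],
            "send_on_action": signed[1:]}
    return transport.post(f"/rooms/{room_id}/send/future", body)
```

The point of the sketch is that steps 1 and 2 can fail and be retried per event; only the final request carries atomicity requirements.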
It's also worth mentioning why we want events to be presigned in the first place (for compatibility with Crypto IDs; to ensure that Future events were truly generated by a client and not made up by the homeserver; and possibly other reasons).
> call `PUT /send` / `PUT /state` for each event that's to be sent in a Future (ideally, this could be batched)
> sign each retrieved PDU
> put each signed PDU in a request to `/send/future`, placing them in `send_*` fields as desired

What needs to be batched is mostly putting the PDUs, right? Creating the signed PDUs is okay to not be batched. The client just needs to be sure that it has created all the events before sending the signed PDUs. If creating the PDUs fails, the client just retries until it has the full list of events that need to be sent (because they rely on each other).
So what really would need to be batched is the last step: sending the signed PDUs.
Maybe it's worth exploring just introducing a new type of `PDUInfo`:

```
{
  room_version: string,
  via_server: string, // optional
  pdu: PDU // signed PDU
}
```

If we include a timeout here + action PDUs:

```
{
  room_version: string,
  via_server: string, // optional
  timeout: number, // optional
  future_actions: {
    actionName: PDU // signed alternative PDU in case an action is triggered
  },
  future_id: randomString, // optional
  pdu: PDU // signed PDU
}
```

Response:

```
{
  future_tokens: {
    future_id_0: token,
    future_id_1: ...
  }
}
```

We would not even need a new endpoint, and the homeserver response would need to include tokens for the future PDUs.
That's a really good idea, especially since the future-aware fields don't conflict at all with the base `PDUInfo`. It also allows immediate events to be sent (even several at once!), thus replacing the `send_now` events without needing any extra spec!
One question: what is the optional `future_id` in the request for?
If there are multiple futures in one `send_pdus` call, multiple future tokens need to be issued. The response contains a dictionary mapping `future_id` → token, so the client-chosen IDs are what map each token back to its future.
> The server will respond with a [`400`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400) (`Bad Request`, with a message
> containing the maximum allowed `timeout_duration`) if the
> client tries to send a timeout future with a larger `timeout_duration`.
> - The future is using a `group_id` that belongs to a future group of another user. In this case the homeserver sends a [`405`] (`Not Allowed`).
Even better (and what I've been implementing) is for each user to get their own namespace of group IDs, i.e. for the effective group ID to be the tuple of (user ID, group ID). Benefits include:
- Users cannot guess at other users' group IDs by spamming future requests and waiting to receive an error response.
- More group IDs are available for each user, which is especially useful if we want to allow user-defined group IDs that could otherwise clash.
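A minimal sketch of this namespacing idea, assuming an in-memory store; the `FutureGroups` class and its methods are hypothetical, not Synapse's actual schema:

```python
class FutureGroups:
    def __init__(self):
        # The effective key is the tuple (user_id, group_id), so two users'
        # identical group IDs never collide and never leak to each other.
        self._groups = {}

    def add(self, user_id, group_id, future):
        self._groups.setdefault((user_id, group_id), []).append(future)

    def get(self, user_id, group_id):
        # Another user probing the same group_id simply sees an empty group,
        # instead of a 405 error that would confirm the group exists.
        return self._groups.get((user_id, group_id), [])
```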
I think if the `group_id` is a large UUID we can just use that, and Synapse stores the relation between user and UUID.
But it also does not hurt to include the user ID in the group ID.
In the end it's a homeserver implementation detail. The important bit is just that the homeserver makes sure it's unique within its domain.
For logging purposes: this MSC crossed the SCT's desk as potentially scary, so we've given it an early review here.
The idea of a generic mechanism to address call participant counts, scheduled messages, and self-destructing events is very much appealing. The fewer moving pieces we have to worry about in the spec, the better. This MSC appears to have 3 major concerns which could be classified as 'scary', which all have their own dedicated threads - please ensure discussion happens in those threads. The highlights are:
- Self-destructing events has a metadata component which likely means it will need a dedicated MSC, despite the preference being fewer moving pieces. The user privacy concerns leading to wanting self-destructing events outweigh the idealistic genericism.
- A comparison to MSC3277 is missing from this proposal. MSC3277 uses a DAG-based approach to ensure events are authorized and servers don't have to implement complicated subsystems for the scheduled messages feature.
- Keep alives are unreliable and can have unexpected consequences for users and clients, particularly when a network partition causes a failed ping. For the call participant count use case in particular, the SFU(s) should already know how many connections it has and can reveal that information back to other users. Where network partitions fail the connection to the SFU, the user is dropped from the call regardless. Otherwise, temporary connection issues can ensure the user is reflected as connected with lost audio/video. This system may very well use a keep alive internally (possibly at the TCP layer), but here it would be appropriate compared to event sending.
I've also left several editorial comments to aid understanding of the MSC. I've not done a complete pass on this - these are just the more notable ones.
As always, if any of my comments require clarification or more information, let me know in the threads :)
> - Updating call member events after the user disconnected.
> - Sending scheduled messages (send at a specific time).
> - Creating self-destructing events (by sending a delayed redact).
I don't think this proposal addresses self-destructing events in a way which is useful/safe for users. Aside from a message's content, the second most important detail users want to destroy is the metadata, which this proposal doesn't address. A self-destructing events MSC would most likely erase the event from the DAG entirely.
I'd suggest eliding this use case.
Thread for highlight 1:
I wasn't aware that it was on the table for self-destructing events to use an entirely different concept to redactions.
I was coming from: https://github.com/matrix-org/matrix-spec-proposals/blob/matthew/msc2228/proposals/2228-self-destructing-events.md
which basically does the same thing: it redacts the event based on conditions.
What I like about this proposal is that instead of making a custom event and adding logic to create a synthetic redaction, we generalize the concept of event delays and everything else is a completely normal redaction.
Also:
> erase the event from the DAG entirely

sounds like an anti-pattern for a distributed system. Can you guide me to where I can find information about erasing whole events from the DAG, including all their metadata, without federation conflicts?
> - Every future group needs at least one timeout future to guarantee that all futures expire eventually.
> - If a timeout future is sent without a `future_group_id`, a unique identifier will be generated by the
>   homeserver and is part of the `send_future` response.
> - Group IDs can only be used by one user. The reason for this is that knowing the group ID would otherwise give another Matrix user full control over the future group. It would also require federating futures if the users are not on the same homeserver.
As in, only a single user on the homeserver can have `groupA`? I'm not sure there's an advantage to that - we should copy the transaction IDs behaviour from the existing spec.
The `group_id`s are server-generated UUIDs, so I am not sure how this could be phrased like transaction IDs.
> Polling-based solutions have a big overhead in complexity and network requests on the clients.
> Example:
>
> > A room list with 100 rooms where there has been a call before in every room
> > (or there is an ongoing call) would require the client to send a to-device message
> > (or a request to the SFU) to every user that has an active state event to check if
> > they are still online. Just to display the room tile properly.
The SFU should have a fairly good idea on how many connections it's holding, and this information can be federated when there's multiple SFUs in play. The client shouldn't need to poll for this either: it can likely subscribe directly either as a data stream, or using something like websockets. That subscription can then be used to count the number of 'active' participants.
It could theoretically mean a client connects to get information but isn't producing media, which is something the subscription stream can handle: the client can indicate (or otherwise authenticate) which other media streams it owns for the SFU to count them 'joined'.
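A toy sketch of the counting idea described here, under the assumption that the SFU tracks which media streams each connection has authenticated ownership of; `ParticipantCounter` is hypothetical, not a real SFU API:

```python
class ParticipantCounter:
    """Count 'joined' call participants from the SFU's own live connections.
    A data-only subscriber counts as joined only once it has authenticated
    ownership of at least one media stream."""
    def __init__(self):
        self.connections = {}  # connection_id -> set of owned media stream ids

    def connect(self, conn_id, owned_streams=()):
        self.connections[conn_id] = set(owned_streams)

    def disconnect(self, conn_id):
        self.connections.pop(conn_id, None)

    def joined_count(self):
        # Connections that own no media (pure subscribers) are not 'joined'.
        return sum(1 for streams in self.connections.values() if streams)
```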
Thread for highlight 3:
This is the most important reason why we need a heartbeat-like expiration system to make MatrixRTC performant and reliable!
The main problem we solve here is that we DON'T want the client to connect to each SFU all the time, and we are not talking about calls this client is connected to. This might sound like a possible thing to do, but it is more overhead than one would think (polling or connecting to a socket does not really make a difference here).
Maybe a more detailed description of the current situation is required here:
Without MatrixRTC your client has no idea about any ongoing call when it starts up.
We introduced `call.member` state events, so now we can easily read who is connected to a session in each room.
But due to the nature of state events there is no guarantee that the client will not forget to remove them (set them to `{}`) after it disconnects. Each event a client fails to remove results in a room that looks like there is an ongoing call. So for each of these rooms the client has to:

- Authenticate with the SFU (JWT service token).
- Connect to the SFU websocket.
- Check the current participants.
- Invalidate all the member events locally. (Since clients cannot write to those member events, because they are owned by another member, this has to be done on every client.)

It is very easy to end up with such an invalid event: a user pressing Ctrl+W while in a call in Element Web is enough.
The commented section of the MSC gives an example. If you have multiple rooms with left-over member state events, you need to do these steps for each of them individually.
Workarounds we tried:
Storing a timestamp in the state event and updating the event every 30min. This allows computing an invalidation without connecting to an SFU (but only after a 30min window during which you see a call that is not happening).
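The heartbeat idea can be sketched as a tiny client loop. The `refresh` callback and the injected `sleep`/`should_continue` hooks are hypothetical stand-ins for a real client's networking and lifecycle:

```python
def keep_alive_loop(refresh, interval_s, should_continue, sleep):
    """Client-side heartbeat: keep refreshing the timeout future so the
    homeserver keeps delaying the 'disconnected' state event. When the client
    dies (e.g. Ctrl+W), the refreshes stop and the homeserver sends the
    delayed event on its own - no other client has to clean up stale state."""
    while should_continue():
        refresh()  # e.g. a request to /future/{futureToken} (assumed shape)
        sleep(interval_s)
```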
> This would be elegant, but since those two endpoints are core to Matrix, changes to them might
> be controversial if their return value is altered.
I think this would be fine. Clients would only get a different response if they use the parameters, meaning they should be expecting a different format. It would be different if lack of the future parameters meant a different response body.
That is really nice!
It would be as you describe:
Without future parameters the response would be the same as it is now.
With the parameters the response would be different.
> ## Potential issues
>
> ## Alternatives
This proposal's main competition appears to be MSC3277, where events are scheduled by placing them in the DAG. The homeserver then forwards the previously-created event at a set time, but it is part of the room until that point.
The DAG approach has a few advantages which I think make it more appealing:

- Message delivery is not subject to a third party being online. Specifically, an appservice, SFU, or similar does not need to be looped in to ensure a message gets sent - it's the homeserver's own responsibility to stay online. We may even be able to federate the event outwards with a `soft_fail_until` timestamp for maximum reliability (at the cost of eager delivery and harder cancellation).
- Events are known to be authorized and able to be sent because they've been given an event ID (and thus run through the auth rules).
- The tracking overhead is extremely minimal: servers at most need to remember to un-soft-fail an event after a set time, and they can do that with rough precision. They do not need to track different tokens for different actions - if a user redacts a scheduled event while it's still soft failed, the send is cancelled. The server knows this intrinsically.
We are very open to inserting the events into the DAG when sent. This was discussed and even has a bunch of advantages (mostly that the `event_id` is available immediately).
The biggest reason against it was the assumption that this might be too big a change: it adds the complexity of a state where an event is in the DAG but is not yet (and maybe never) distributed to clients.

> Message delivery is not subject to a third party being online. Specifically, an appservice, SFU, or similar does not need to be looped in to ensure a message gets sent - it's the homeserver's own responsibility to stay online. We may even be able to federate the event outwards with a soft_fail_until timestamp for maximum reliability (at the cost of eager delivery and harder cancellation).

The delivery does not rely on a third party. It relies on the homeserver's timeout computation (which is enforced by the future group!). The example with the client that pings every 10s describes this well: the client getting disconnected and no longer sending the ping lets the homeserver know it can deliver the event now.
To me, it seems that the least intrusive change is to not add events to the DAG and instead just queue them on the homeserver / pretend the HTTP request was received later. But I would really like it if it were possible to add the event to the DAG right after receiving it and immediately compute an `event_id`.
Because Future events can also be canceled, it needs to be valid to have unsent events in the DAG forever. As described in the MSC, in most cases you want to schedule multiple events in one future group and only send one of them. Adding them all to the DAG seems unnecessary.
But other than that, this sounds like a very compatible and nice solution. Do I understand correctly that the homeserver will not send the event to federating homeservers until the timeout condition is met?
The VoIP team had a dedicated meeting to thoroughly investigate the option to use MSC3277's approach of adding the event to the DAG on retrieval.
We have created a summary document here: https://hackmd.io/h0z82KvKSaiW-jYlOnU69w?edit
(this will be posted below as well for visibility)
Reliable State events (Future MSC)
The Future MSC is not sending
Soft fail (DAG) vs Homeserver queuing (not in DAG)

> Message delivery is not subject to a third party being online. Specifically, an appservice, SFU, or similar does not need to be looped in to ensure a message gets sent - it's the homeserver's own responsibility to stay online. We may even be able to federate the event outwards with a soft_fail_until timestamp for maximum reliability (at the cost of eager delivery and harder cancellation).

This is the case for both proposals. The delivery itself is in full control of the homeserver.
In both proposals the homeserver is given additional information alongside the event and will make it real eventually, based on those parameters.
The main differentiator is that with MSC4140, external interaction is optionally possible.
The `/future/{futureToken}` endpoint allows interacting with event scheduling.
This is the main feature we require to get reliable VoIP to work.
It is not important whether we add the event to the DAG immediately, but we need the scheduling interaction of `/future/{futureToken}`.
> Events are known to be authorized and able to be sent because they've been given an event ID (and thus run through the auth rules).

We explicitly only have one auth period: at send time. While scheduled, the auth conditions can change, so the proper time to do the auth computation is at send time.
With MSC3277 we would need to rerun the auth rules when sending the event.
> The tracking overhead is extremely minimal: servers at most need to remember to un-soft-fail an event after a set time, but it can do that with rough precision. They do not need to track different tokens for different actions - if a user redacts a scheduled event while it's still soft failed, the send is cancelled. It knows this intrinsically.

Using this with the required interaction tokens would result in the following:

- Send a scheduled event with 10s.
- On each refresh/reset token retrieval the homeserver would redact the event and send a new one (with the `send_at` updated by 10s).
  - This would become quite complicated with cryptographic identities, since the user would need to sign all of them. Each reset/refresh needs a client-server roundtrip for signing.
- Whenever there is no reset/refresh for 10 seconds (e.g. the device has crashed) the homeserver would un-soft-fail (send) the event automatically.

The main issue we have with this is the large amount of traffic it generates on the homeserver. For clients this would be equivalent (at least without cryptographic identities), but the homeserver needs to insert a new event (and send it over federation) for each refresh.

Comment:

- If we delegate leaves to the SFU, we would configure the timeout to a duration in the hour range. This would make the above much more bearable.
- We still need to decide if we want to federate the delegation tokens. Otherwise we don't get any benefit from federating the scheduled event if we can only interact with it through the sending homeserver.
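One possible shape for the sender-side tracking described above, as a sketch with an injected clock; `FutureScheduler` and its method names are assumptions for illustration, not part of the MSC:

```python
class FutureScheduler:
    """Deadline tracking lives only on the sender's homeserver; nothing is
    inserted into the DAG (or federated) until a deadline actually passes."""
    def __init__(self, send_event, now):
        self.send_event = send_event  # callback that actually sends the PDU
        self.now = now                # injected clock, so the sketch is testable
        self.pending = {}             # future_token -> (deadline_ms, event)

    def schedule(self, token, timeout_ms, event):
        self.pending[token] = (self.now() + timeout_ms, event)

    def refresh(self, token, timeout_ms):
        # A refresh just moves the deadline: no new event, no federation
        # traffic, unlike the redact-and-resend flow sketched for MSC3277.
        _, event = self.pending[token]
        self.pending[token] = (self.now() + timeout_ms, event)

    def tick(self):
        # Rough precision is fine: run this from a periodic background job.
        for token, (deadline, event) in list(self.pending.items()):
            if self.now() >= deadline:
                del self.pending[token]
                self.send_event(event)
```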
Comparison

| | MSC3277 | MSC4140 |
|---|---|---|
| Traffic | Lots of traffic (one event per refresh) between federating homeservers. For the client it is the same. | Since we do not add the event to the DAG until we know it will be sent, there is no traffic caused by the scheduling. |
| Resilience (only scheduling) | Scheduled events are sent even if the sender's homeserver goes down. | The sender's homeserver is required at send time. If it is down at send time, the event will not be sent until it comes back up. |
| Resilience (with interaction) | Not possible: sending an event (based on an interaction) before the scheduled time requires redacting the scheduled one and creating a new one, which cannot be done by a homeserver other than the one hosting the sending user. | Cannot be done. It would also require federating all the interaction tokens and the unsent events, and we end up in the same scenario where a homeserver would need to send an event for a user on a different homeserver. |
| Federation (without interaction) | Works | Doesn't work |
| Federation (with interaction) | We need interaction tokens (`futureTokens`) and they cannot be federated; otherwise federating homeserver admins could interact with the Futures/scheduled messages. With interaction there is no way to make federation work, and we lose all the benefit of inserting the event into the DAG. | The same: we can never send `futureTokens` to other homeservers, since they would be able to interact with the futures. |
| Auth | Interaction could work by redacting the scheduled event and sending a new scheduled event. This is authenticated, but only the client can send the events. If we want interaction that can be delegated to the SFU (which is the best source of truth) we would need to introduce tokens similar to `futureTokens`. | With the `futureToken`-based authentication (which is different from the client auth token) we have an extremely scoped auth mechanism that can be sent to third parties like the SFU. The SFU can notify the homeserver about a user disconnect; it is the best source of truth for that. |
| Tracking overhead | All the schedule-tracking information is stored in the DAG, and a timer needs to run through all potentially expired scheduled messages periodically. This has to happen on ALL homeservers. | Very similar logic, except that the tracking information is stored in a separate Future database only on the sender's homeserver (less overhead). |
TL;DR:
If we don't need interaction, there is a benefit in federating the scheduled message (adding it to the DAG immediately). It increases resilience: the sender's homeserver can disconnect and the scheduled message will still enter the non-soft-failed state (will be sent).
With delegatable interaction (which is the one property we need for reliable state events), we lose the possibility to federate scheduled messages\*, and the two solutions converge to one where we lose the property that the sender's homeserver can lose connection while the scheduled message is still sent by other homeservers.

\* The reason being that this would also require federating (and hence leaking) the interaction tokens, allowing other homeservers to interact with the future without the user explicitly giving consent by sending the tokens.

Conclusion
We are faced with the decision between:

- Do not assume that the sender's homeserver stays online:
  - Federate the scheduled message (resilience) but lose the interaction we need for MatrixRTC reliable call member events.
- Assume that the sender's homeserver stays online:
  - If the sender's homeserver is expected to stay online, there is no reason to federate the scheduled/future event, and we can safely add interaction, which allows us to implement reliable call member events.
> This would redact the message with content `"m.text": "my msg"` after 10 minutes.
>
> ## Potential issues
The keep-alive approach appears to reverse the expectation of scheduled messages a bit, which I think may be one of the core concerns with this proposal. With MSC3277, senders expect that their event will go out at the time they schedule it. This proposal moves the expectation to much later in the send process, and creates an assumption that any event can be cancelled or "undone" at any point. This behaviour leads to "unexpected" consequences, because the sender wasn't expecting there to be an opportunity for their event to never send.
It's a bit subtle, but that change of expectation I think makes MSC3277 more favourable. MSC3277 doesn't really help with the call participant count problem, though. For that, I think the SFU can likely count its connections more reliably than a keep-alive (a single network partition leads to a false count). This is discussed in more detail in the 'MatrixRTC use case' thread I've started.
The core assumption of the VoIP team when writing this MSC is that it is essential that syncing room state is enough to compute all ongoing MatrixRTC sessions, for the reasons described here: #4140 (comment)
So it seems we have the following two scenarios:

- We go with an SFU-based state event validation approach (each client requests a token and then logs into the SFU for each SFU mentioned in any call member event it encounters and verifies its validity, i.e. whether the user is still connected to the session).
- We want the memberships/sessions to be reliably represented in the room state.

If we decide for 1., we really don't need any of the MSCs, and I can see how a static timeout as described in MSC3277 could be a solution, even though having no interaction with the scheduling definitely limits its use cases.
But the experience we gained over a whole year concluded that we should make the state events reliable.
(Other logic: historic session computation, changes in call-related UX, and building MatrixRTC SDKs are all super hard to implement and involve a lot of duct tape if we cannot trust member state events but have to check the SFU all the time. It also means we can only validate state events in real time, but never tell if a state event was invalid in the past.)
This could also supersede MSC2228 (by making it possible to send a redaction with the `/send` endpoint; this is the case as mentioned here).
Implementations: