
MSC4140: Delayed events (Futures) #4140

Open
wants to merge 31 commits into base: main

Conversation

toger5

@toger5 toger5 commented May 7, 2024

Rendered

This could also supersede MSC2228 (by making it possible to send a delayed redaction with the /send endpoint, as mentioned here).

Implementations:

toger5 added 2 commits May 7, 2024 18:52
@toger5 toger5 force-pushed the toger5/expiring-events-keep-alive branch from 2bc07c4 to 0eb1abc on May 7, 2024 17:03
@toger5 toger5 force-pushed the toger5/expiring-events-keep-alive branch from 0eb1abc to 8bf6db7 on May 8, 2024 15:49
@turt2live turt2live changed the title from "Draft for expiring event PR" to "MSC4140: Expiring events with keep alive endpoint" on May 9, 2024
@turt2live turt2live added the proposal, client-server, kind:feature, and needs-implementation labels on May 9, 2024
@toger5 toger5 force-pushed the toger5/expiring-events-keep-alive branch from 3e54c2a to c82adf7 on May 10, 2024 17:54
@toger5 toger5 force-pushed the toger5/expiring-events-keep-alive branch from c82adf7 to 54fff99 on May 10, 2024 18:08
…is used to trigger one of the actions

Add event type to the body
Add event id template variable
Comment on lines 142 to 151
- One example would be redacting an event. It only makes sense to redact the event
if it exists.
It might be important to have the guarantee that the redaction is received
by the server at the time the original message is sent.
- In the case of a state event we might want to set the state to `A` and after a
timeout reset it to `{}`. If we have two separate requests, sending `A` could work
but the event with content `{}` could fail. The state would then not automatically
reset to `{}`.

For this use case an optional `m.send_now` field can be added to the body.
Author

If we had a generalized way to batch-send Matrix events, this could be leveraged so that a future itself is JUST the future, and the batch-send request would define the semantics for sending the future and guaranteeing that the send_now event is also sent.

Author

@toger5 toger5 May 21, 2024

MSC2716 allows bulk sending events.
It is limited to application services however and focuses on historic data. Since we also need the additional capability to use a template event_id parameter, this probably is not a good fit.

proposals/4140-delayed-events-futures.md Outdated Show resolved Hide resolved
Comment on lines 23 to 26
To make this as generic as possible, the proposed solution is to allow sending
multiple presigned events and delegate the control of when to actually send these
events to an external service. This allows for a very flexible way to mark events as expired,
since the sender can choose which event will be sent once expired.
Member

Is it entire events that need to be signed, or just their content? If it's the former, then /send/future should behave more like #4080's /send_pdus. This could be done by having /send/future's send_* fields accept fully-signed events instead of signed content + other fields.

For this to work, there'd also need to be the modified PUT /send & PUT /state endpoints for retrieving the PDUs that need signing.

That would make the client flow of sending a Future as follows:

  • call PUT /send / PUT /state for each event that's to be sent in a Future (ideally, this could be batched)
  • sign each retrieved PDU
  • put each signed PDU in a request to /send/future, placing them in "send_*" fields as desired
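
A minimal sketch of what such a request could end up looking like; the path prefix, the `send_on_timeout` field name, the query parameter, and the PDU fields shown here are illustrative assumptions, not something this MSC has settled on:

```
PUT /_matrix/client/unstable/rooms/{roomId}/send/future?timeout=10000
{
  "send_on_timeout": {
    // a fully-signed PDU, as retrieved from the modified PUT /send endpoint
    "type": "m.call.member",
    "sender": "@alice:example.org",
    "content": {},
    "hashes": { "sha256": "..." },
    "signatures": { "example.org": { "ed25519:key1": "..." } }
  }
}
```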

Member

It's also worth mentioning why we want events to be presigned in the first place (for compatibility with Crypto IDs; to ensure that Future events were truly generated by a client and not made up by the homeserver; and possibly other reasons).

Author

call PUT /send / PUT /state for each event that's to be sent in a Future (ideally, this could be batched)
sign each retrieved PDU
put each signed PDU in a request to /send/future, placing them in "send_*" fields as desired

What mostly needs to be batched is putting the signed PDUs, right? Creating the signed PDUs is okay to not be batched. The client then just needs to be sure that it has created all the events before sending the signed PDUs. If creating the PDUs fails, the client just retries until it has the full list of events that need to be sent (because they rely on each other).

So what really would need to be batched is the last step: sending the signed PDUs.

Author

Maybe it's worth exploring just introducing a new type of
PDUInfo:

{
    room_version: string,
    via_server: string,  // optional
    pdu: PDU  // signed PDU
}

If we include a timeout here + action PDUs:

{
    room_version: string,
    via_server: string,       // optional
    timeout: number,          // optional
    future_actions: {
        actionName: PDU       // signed alternative PDU in case an action is triggered
    },
    future_id: randomString,  // optional
    pdu: PDU                  // signed PDU
}

response:

{
    future_tokens: {
        future_id_0: token,
        future_id_1: ...
    }
}

We would not even need a new endpoint, and the homeserver response would just need to include tokens for the future PDUs.

Member

That's a really good idea, especially since the future-aware fields don't conflict at all with the base PDUInfo. It also allows immediate events to be sent (even several at once!), thus replacing the send_now events without needing any extra spec!

One question: what is the optional future_id in the request for?

Author

If there are multiple futures in one send_pdus call, multiple future tokens need to be issued. In the response, a dictionary mapping future_id -> token allows the client to match each token to its future.

toger5 and others added 2 commits May 31, 2024 09:20
Co-authored-by: Andrew Ferrazzutti <af_0_af@hotmail.com>
@toger5 toger5 force-pushed the toger5/expiring-events-keep-alive branch from 28ddfbb to 49d5294 on June 13, 2024 14:56
The server will respond with a [`400`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400) (`Bad Request`, with a message
containing the maximum allowed `timeout_duration`) if the
client tries to send a timeout future with a larger `timeout_duration`.
- The future is using a group_id that belongs to a future group from another user. In this case the homeserver sends a [`405`] (`Not Allowed`).
Member

Even better (and what I've been implementing) is for each user to get their own namespace of group IDs, i.e. for the effective group ID to be the tuple of (user ID, group ID). Benefits include:

  • Users cannot guess at other users' group IDs by spamming future requests and waiting to receive an error response.
  • More group IDs are available for each user, which is especially useful if we want to allow user-defined group IDs that could otherwise clash.
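
A minimal sketch of that namespacing, purely illustrative (how the homeserver actually stores this is an implementation detail, as noted in the reply below):

```
{
  // the effective key is the (user ID, group ID) tuple, so two users can both
  // use "groupA" without clashing or guessing each other's groups
  "futures_by_group": {
    "@alice:example.org | groupA": ["future_token_1", "future_token_2"],
    "@bob:example.org | groupA":   ["future_token_3"]
  }
}
```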

Author

I think if the groupId is a large UUID we can just use that, and Synapse stores the relation between user and UUID.
But it also does not hurt to include the userId in the group id.
In the end it's a homeserver implementation detail. The important bit is just that the homeserver makes sure it is unique within its domain.

toger5 and others added 2 commits June 13, 2024 23:55
Co-authored-by: Andrew Ferrazzutti <af_0_af@hotmail.com>
@toger5 toger5 force-pushed the toger5/expiring-events-keep-alive branch from 828486e to a663bb4 on June 14, 2024 11:47
@toger5 toger5 marked this pull request as ready for review June 14, 2024 11:53
@toger5 toger5 changed the title from "MSC4140: Expiring events with keep alive endpoint" to "MSC4140: Delayed events (Futures)" on Jun 18, 2024
@turt2live turt2live self-requested a review June 19, 2024 22:24
and use proper msc number in unstable prefix section.
Member

@turt2live turt2live left a comment

For logging purposes: this MSC crossed the SCT's desk as potentially scary, so we've given it an early review here.

The idea of a generic mechanism to address call participant counts, scheduled messages, and self-destructing events is very much appealing. The fewer moving pieces we have to worry about in the spec, the better. This MSC appears to have 3 major concerns which could be classified as 'scary', which all have their own dedicated threads - please ensure discussion happens in those threads. The highlights are:

  1. Self-destructing events has a metadata component which likely means it will need a dedicated MSC, despite the preference being fewer moving pieces. The user privacy concerns leading to wanting self-destructing events outweigh the idealistic genericism.
  2. A comparison to MSC3277 is missing from this proposal. MSC3277 uses a DAG-based approach to ensure events are authorized and servers don't have to implement complicated subsystems for the scheduled messages feature.
  3. Keep alives are unreliable and can have unexpected consequences for users and clients, particularly when a network partition causes a failed ping. For the call participant count use case in particular, the SFU(s) should already know how many connections it has and can reveal that information back to other users. Where network partitions fail the connection to the SFU, the user is dropped from the call regardless. Otherwise, temporary connection issues can ensure the user is reflected as connected with lost audio/video. This system may very well use a keep alive internally (possibly at the TCP layer), but here it would be appropriate compared to event sending.

I've also left several editorial comments to aid understanding of the MSC. I've not done a complete pass on this - these are just the more notable ones.

As always, if any of my comments require clarification or more information, let me know in the threads :)


- Updating call member events after the user disconnected.
- Sending scheduled messages (send at a specific time).
- Creating self-destructing events (by sending a delayed redact).
Member

I don't think this proposal addresses self-destructing events in a way which is useful/safe for users. Aside from a message's content, the second most important detail users want to destroy is the metadata, which this proposal doesn't address. A self-destructing events MSC would most likely erase the event from the DAG entirely.

I'd suggest eliding this use case.

Author

Thread for highlight 1:
I wasn't aware that it is on the table that self-destructing events would use an entirely different concept to redactions.
I was coming from: https://github.com/matrix-org/matrix-spec-proposals/blob/matthew/msc2228/proposals/2228-self-destructing-events.md
Which basically does the same thing. It redacts the event based on conditions.

What I like about this proposal is that instead of making a custom event and adding logic to create a synthetic redaction, we generalize the concept of event delays and everything else is a completely normal redaction.

Also

erase the event from the DAG entirely

Sounds like an anti-pattern for a distributed system. Can you guide me to where I can find information about erasing whole events from the DAG, including all their metadata, without federation conflicts?

- Every future group needs at least one timeout future to guarantee that all futures expire eventually.
- If a timeout future is sent without a `future_group_id`, a unique identifier will be generated by the
homeserver and is part of the `send_future` response.
- Group ids can only be used by one user. The reason for this is that otherwise another Matrix user would gain full control over a future group once they know the group id. It would also require federating futures if the users are not on the same homeserver.
Member

As in only a single user on the homeserver can have groupA? I'm not sure there's advantage to that - we should copy the transaction IDs behaviour from the existing spec.

Author

The group_ids are server-generated UUIDs, so I am not sure how this could be phrased like transaction IDs.

Comment on lines +288 to +294
Polling-based solutions have a big overhead in complexity and network requests on the clients.
Example:

> A room list with 100 rooms where there has been a call before in every room
> (or there is an ongoing call) would require the client to send a to-device message
> (or a request to the SFU) to every user that has an active state event to check if
> they are still online. Just to display the room tile properly.
Member

The SFU should have a fairly good idea on how many connections it's holding, and this information can be federated when there's multiple SFUs in play. The client shouldn't need to poll for this either: it can likely subscribe directly either as a data stream, or using something like websockets. That subscription can then be used to count the number of 'active' participants.

It could theoretically mean a client connects to get information but isn't producing media, which is something the subscription stream can handle: the client can indicate (or otherwise authenticate) which other media streams it owns for the SFU to count them 'joined'.

Author

Thread for highlight 3:

This is the most important reason why we need a heartbeat-like expiration system to make MatrixRTC performant and reliable!

The main problem we solve here is that we DON'T want the client to connect to each SFU all the time, and we are not talking about calls this client is connected to. This might sound like a possible thing to do but is more overhead than one would think (polling or connecting to a socket does not really make a difference here).

Maybe a more detailed description of the current situation is required here:
Without MatrixRTC your client has no idea about any ongoing call when it starts up.
We introduced call.member state events so now we can easily read who is connected to a session in each room.
But due to the nature of state events there is no guarantee that the client will not forget to remove them (set them to `{}`) after it disconnects. Each event a client fails to remove results in a room that looks like it has an ongoing call. So for each of these rooms the client has to:

  • Authenticate with the SFU (JWT service token).
  • Connect to the SFU websocket.
  • Check the current participants.
  • Invalidate all the member events locally. (Since they cannot write to those member events, because each is owned by another member, they have to do this on every client.)
    It is very easy to end up with such an invalid event: a user pressing Ctrl+W while in a call in Element Web is enough.
    The commented section of the MSC gives an example. If you have multiple rooms with left-over member state events, you need to do these steps for each of them individually.

Workarounds we tried:
Storing a timestamp in the state event and updating the event every 30min. This allows other clients to detect an invalid member event without connecting to an SFU (but only after a 30min window during which you see a call that is not happening).
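
For illustration, a rough sketch of that workaround; the event type and field names (in particular `expires_ts`) are written from memory and should be treated as illustrative only:

```
{
  "type": "org.matrix.msc3401.call.member",
  "state_key": "@alice:example.org",
  "content": {
    "memberships": [
      {
        "call_id": "",
        "device_id": "DEVICEID",
        // the client refreshes its own event before this timestamp passes;
        // other clients treat anything older as a stale membership
        "expires_ts": 1718000000000
      }
    ]
  }
}
```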

Comment on lines 360 to 361
This would be elegant, but since those two endpoints are core to Matrix, changes to them might
be controversial if their return value is altered.
Member

I think this would be fine. Clients would only get a different response if they use the parameters, meaning they should be expecting a different format. It would be different if lack of the future parameters meant a different response body.

Author

That is really nice!
It would be as you describe:
Without future parameters the response would be the same as it is now.
With the parameters the response would be different.
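
A rough sketch of what that could look like; the `future_timeout` query parameter and the `future_token` response field are assumptions for illustration, not final names:

```
PUT /_matrix/client/v3/rooms/{roomId}/send/{eventType}/{txnId}
// response today, unchanged when no future parameters are given:
{ "event_id": "$abc123" }

PUT /_matrix/client/v3/rooms/{roomId}/send/{eventType}/{txnId}?future_timeout=10000
// response when the (assumed) future parameters are present:
{ "future_token": "..." }
```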


## Potential issues

## Alternatives
Member

This proposal's main competition appears to be MSC3277, where events are scheduled by placing them in the DAG. The homeserver then forwards the previously-created event at a set time, but it is part of the room until that point.

The DAG approach has a few advantages which I think make it more appealing:

  • Message delivery is not subject to a third party being online. Specifically, an appservice, SFU, or similar does not need to be looped in to ensure a message gets sent - it's the homeserver's own responsibility to stay online. We may even be able to federate the event outwards with a soft_fail_until timestamp for maximum reliability (at the cost of eager delivery and harder cancellation).
  • Events are known to be authorized and able to be sent because they've been given an event ID (and thus run through the auth rules).
  • The tracking overhead is extremely minimal: servers at most need to remember to un-soft-fail an event after a set time, but it can do that with rough precision. They do not need to track different tokens for different actions - if a user redacts a scheduled event while it's still soft failed, the send is cancelled. It knows this intrinsically.

Author

We are very open to inserting the events into the DAG when sent. This was discussed and even has a bunch of advantages (mostly that the event_id is available immediately).

The biggest reason against it was the assumption that this might be too big of a change: it adds the complexity of a state where an event is in the DAG but is not yet (and maybe never) distributed to clients.

Message delivery is not subject to a third party being online. Specifically, an appservice, SFU, or similar does not need to be looped in to ensure a message gets sent - it's the homeserver's own responsibility to stay online. We may even be able to federate the event outwards with a soft_fail_until timestamp for maximum reliability (at the cost of eager delivery and harder cancellation).

The delivery does not rely on a third party. It relies on the homeserver's timeout computation (which is enforced by the future group). The example with the client that pings every 10s describes this well: the client getting disconnected and not sending the ping anymore lets the homeserver know it can deliver the event now.

To me, it seems that the least intrusive change is to not add events to the DAG and instead just queue them on the homeserver/pretend the http request was received later. But I really like it if it's possible to add them to the DAG right after receiving the event and immediately compute an event_id.
Because Future events can also be canceled, it needs to be valid to have unsent events in the DAG forever. As described in the MSC, in most cases you want to schedule multiple events in one future group and only send one of them. Adding them all to the DAG seems unnecessary.
But other than that, this sounds like a very compatible and nice solution. Do I understand it correctly that the homeserver will not send the event to federating homeservers until the timeout condition is met?

Author

The VoIP team had a dedicated meeting to thoroughly investigate the option to use MSC3277's approach of adding the event to the DAG on retrieval.

We have created a summary document here: https://hackmd.io/h0z82KvKSaiW-jYlOnU69w?edit
(this will be posted below as well for visibility)

Author

Reliable State events (Future MSC)

The Future MSC is not sending

Soft fail (DAG) vs Homeserver queuing (not in DAG)

MSC4140 and MSC3277

#4140 (comment)

Message delivery is not subject to a third party being online. Specifically, an appservice, SFU, or similar does not need to be looped in to ensure a message gets sent - it's the homeserver's own responsibility to stay online. We may even be able to federate the event outwards with a soft_fail_until timestamp for maximum reliability (at the cost of eager delivery and harder cancellation).

This is the case for both proposals. The delivery itself is in full control of the homeserver.

For both proposals the homeserver is given additional information alongside the event and will make it real eventually based on those parameters.

The main differentiator is that with MSC4140 external interaction is optionally possible.
The /future/{futureToken} endpoint allows interacting with event scheduling.

This is the main feature we require to get Reliable VoIP to work.

It is not important if we add the event to the DAG immediately, but we need the scheduling interaction of /future/{futureToken}.
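
As a rough illustration of that interaction (the exact endpoint shape and action names are still open in this MSC; `refresh`, `send` and `cancel` are assumptions), a client or a delegate such as the SFU holding the token could keep the scheduled event from firing or trigger it early:

```
POST /future/{futureToken}
{
  // assumed action names: "refresh" pushes the timeout back,
  // "send" delivers the scheduled event immediately, "cancel" drops it
  "action": "refresh"
}
```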

Events are known to be authorized and able to be sent because they've been given an event ID (and thus run through the auth rules).

We explicitly only have one auth period: at send time. While scheduled, the auth conditions can change, so the proper time to do auth computation is at send time.

With MSC3277 we would need to rerun the auth rules when sending the event.

The tracking overhead is extremely minimal: servers at most need to remember to un-soft-fail an event after a set time, but it can do that with rough precision. They do not need to track different tokens for different actions - if a user redacts a scheduled event while it's still soft failed, the send is cancelled. It knows this intrinsically.

Using this with the required interaction tokens would result in the following:

  • Send a scheduled event with a 10s timeout.
  • On each refresh/reset token retrieval the homeserver would redact the event and send a new one (with the send_at time pushed back by 10s).
    • This would become quite complicated with cryptographic identities, since the user would need to sign all of them. Each reset/refresh needs a client-server round trip for signing.
  • Whenever there is no reset/refresh for 10 seconds (e.g. the device has crashed) the homeserver would automatically un-soft-fail (send) the event.

The main issue we have with this is the large amount of traffic it generates on the homeserver. For clients this would be equivalent (at least without cryptographic identities), but the homeserver needs to insert a new event (and send it over federation) for each refresh.

Comment:

  • If we delegate leaves to the SFU we would configure the timeout to a duration in the hour range. This would make the above much more bearable.
  • We still need to decide if we want to federate the delegation tokens. Otherwise we don't get any benefit from federating the scheduled event if we can only interact with it through the sending homeserver.

Comparison

| | MSC3277 | MSC4140 |
| --- | --- | --- |
| Traffic | Lots of traffic (one event per refresh) between federating homeservers. For the client it is the same. | Since we do not add the event to the DAG until we know it will be sent, there is no traffic caused by the scheduling. |
| Resilience (only scheduling) | Scheduled events are sent even if the sender's homeserver goes down. | The sender's homeserver is required at send time to send the event. If it is down at send time the event will not be sent until it comes back up. |
| Resilience (with interaction) | Not possible: sending an event (based on an interaction) before the scheduled time requires redacting the scheduled one and creating a new one, which cannot be done by a different homeserver than the one hosting the sending user. | Cannot be done either. It would require federating all the interaction tokens and the unsent events, and we end up in the same scenario where a homeserver would need to send an event for a user on a different homeserver. |
| Federation (without interaction) | Works | Doesn't work |
| Federation (with interaction) | We need interaction tokens (futureTokens) and they cannot be federated, otherwise federating homeserver admins could interact with the Futures/scheduled messages. With interaction there is no way to make federation work and we lose all the benefit of inserting the event into the DAG. | The same: we can never send futureTokens to other homeservers, since they would be able to interact with the futures. |
| Auth | Interaction could work by redacting the scheduled event and sending a new scheduled event. This is authenticated, but only the client can send the events. If we want interaction that can be delegated to the SFU (which is the best source of truth) we would need to introduce tokens similar to futureTokens. | With futureToken-based authentication (different from the client auth token) we have an extremely scoped auth mechanism that can be sent to third parties like the SFU. The SFU can notify the homeserver about a user disconnect; it is the best source of truth for that. |
| Tracking overhead | All the schedule tracking information is stored in the DAG and a timer needs to run through all potentially expired scheduled messages periodically. This has to happen on ALL homeservers. | Very similar logic, except that the tracking information is stored in a separate Future database only on the sender's homeserver (less overhead). |

TLDR:

If we don't need interaction, there is a benefit in federating the scheduled message (adding it to the DAG immediately). It increases resilience: the sender's homeserver can disconnect and the scheduled message will still enter the non-soft-failed state (will be sent).

With delegatable interaction (which is the one property we need for reliable state events), we lose the possibility to federate scheduled messages*, and the two solutions converge to one where we lose the property that the sender's homeserver can lose connection while the scheduled message is still sent via other homeservers.

* the reason being that this would also require federating (and hence leaking) the interaction tokens, allowing other homeservers to interact with the future without the user explicitly giving consent by sending them the tokens.

Conclusion

We are faced with the decision between:

  • Do not assume that the sender's homeserver stays online:
    • Federate the scheduled message (resilience), but lose the interaction we need for MatrixRTC reliable call member events.
  • Assume that the sender's homeserver stays online:
    • If the sender's homeserver is expected to stay online, there is no reason to federate the scheduled/future event, and we can safely add interaction, which allows us to implement reliable call member events.


This would redact the message with content: `"m.text": "my msg"` after 10 minutes.

## Potential issues
Member

The keep alive approach appears to reverse the expectation of scheduled messages a bit, which I think may be one of the core concerns with this proposal. With MSC3277, senders expect that their event will go out at the time they schedule it. This proposal moves the expectation to much later in the send process, and creates an assumption that any event can be cancelled or "undone" at any point. This behaviour leads to "unexpected" consequences, because the sender was expecting there to be an opportunity for their event to never send.

It's a bit subtle, but that expectation changing I think makes MSC3277 more favourable. MSC3277 doesn't really help with the call participant count problem though. For that, I think the SFU can likely count its connections more reliably than a keep alive (a single network partition leads to a false count). This is discussed in more detail in the 'MatrixRTC use case' thread I've started.

Author

@toger5 toger5 Jun 24, 2024

The core assumption of the VoIP team when writing this MSC is that it is essential that syncing room state is enough to compute all ongoing MatrixRTC sessions, for the reasons described here: #4140 (comment)

So it seems we have the following two scenarios:

  • We go with an SFU-based state event validation approach (each client requests a token, then logs into each SFU mentioned in any call member event it encounters and verifies its validity, i.e. whether the user is still connected to the session).
  • We want the memberships/sessions to be reliably represented in the room state.

If we decide on 1. we really don't need any of the MSCs, and I can see how a static timeout as described in MSC3277 could be a solution, even though having no way to interact with the scheduling definitely limits its use cases.

But the experience we gained over a whole year led us to conclude that we should make the state events reliable.
(Other logic such as historic session computation, changes in call-related UX, and building MatrixRTC SDKs is all super hard to implement and involves a lot of duct tape if we cannot trust member state events but have to check the SFU all the time. It also means we can only validate state events in real time but can never tell if a state event was invalid in the past.)

toger5 and others added 4 commits June 22, 2024 08:49
Labels
client-server, kind:feature, needs-implementation, proposal

8 participants