Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TNFS sequence out of sequence errors are not handled correctly #741

Open
trekawek opened this issue May 25, 2024 · 0 comments
Open

TNFS sequence out of sequence errors are not handled correctly #741

trekawek opened this issue May 25, 2024 · 0 comments

Comments

@trekawek
Copy link
Contributor

In the logs uploaded to Discord by @tschak909, there are following 2 cases of the out of sequence problems:

17:42:21.704 > FileHandlerLocal::seek
17:42:23.705 > Timeout after 2000 milliseconds. Retrying
17:42:23.708 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 14, Expected: 15
17:42:23.714 > Timeout after 2000 milliseconds. Retrying
17:42:24.701 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 15, Expected: 14
17:42:26.700 > Timeout after 2000 milliseconds. Retrying
17:42:27.702 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 14, Expected: 15
17:42:27.742 > FileHandlerLocal::read

and

17:47:21.652 > sending block packet ...
17:47:21.681 > Command Packet:
17:47:21.681 > fc ff ff ff ff c3 81 80 80 80 80 82 81 80 81 83 b0 80 d2 a2 f3 80 88 80 ab ef c8 cc 
17:47:21.682 > Decoding 9 bytes
17:47:21.682 > handling read block command
17:47:21.683 > Drive 81  Read block 0073a2
17:47:21.683 > iwm_readblock NORMAL READ
17:47:21.686 > FileHandlerLocal::seek
17:47:21.735 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 90, Expected: 91
17:47:21.735 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 3d, Expected: 91
17:47:21.736 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 3f, Expected: 91
17:47:21.738 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 92
17:47:21.739 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 91
17:47:21.740 > TNFS OUT OF ORDER SEQUENCE! Rcvd: ee, Expected: 92
17:47:21.740 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 91
17:47:21.742 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 92
17:47:21.743 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 48, Expected: 91
17:47:21.744 > TNFS OUT OF ORDER SEQUENCE! Rcvd: e5, Expected: 92
17:47:21.746 > TNFS OUT OF ORDER SEQUENCE! Rcvd: d, Expected: 91
17:47:21.746 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 75, Expected: 92
...

In the first case it seems that the session was ultimately recovered, in the second not really.

In both cases, the same packet is received twice, while the client is expecting the next one already:

17:42:23.708 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 14, Expected: 15
17:47:21.735 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 90, Expected: 91

I'm not sure what's the reason of double delivery, but the firmware should be able to retry the request and deal with this problem. However, it seems it can't do it. There are two issues here.

Root cause - sending response as a request

Let's take a look on the code validating the sequence number:

// Out of order packet received.
if (pkt.sequence_num != current_sequence_num)
{
Debug_printf("TNFS OUT OF ORDER SEQUENCE! Rcvd: %x, Expected: %x\r\n", pkt.sequence_num, current_sequence_num);
// Fall through and let retry logic handle it.
}

this causes the fall through to the:

do
{
if (SYSTEM_BUS.getShuttingDown())
{
Debug_println("TNFS Breakout due to Shutdown");
return true; // false success just to get out
}
#ifndef ESP_PLATFORM
fnSystem.delay_microseconds(2000); // wait short time for (local) data to arrive
#endif
int l = _tnfs_recv(&udp, m_info, pkt);

unless the recv() doesn't return anything and the timeout occurs, then it logs

17:42:23.714 > Timeout after 2000 milliseconds. Retrying

and goes to:

while (retry < m_info->max_retries)
{
#ifdef DEBUG
_tnfs_debug_packet(pkt, payload_size);
#endif
// Send packet
bool sent = _tnfs_send(&udp, m_info, pkt, payload_size);

Please notice, that we're sending the pkt here. However, it's no longer the request packet. It's been overwritten by the tcp->recv() and now it's the response packet, the one with the wrong sequence number.

To fix this, we should keep request and response separately. Also, we should go immediately to send() if the received response is wrong.

Room for improvement - exhausting buffer on wrong response

The bug above explains the first log report. The second report is caused by a different issue. Apparently server sends a large response with a wrong number. The client reads it chunk by chunk and treats each chunk as a atomic response. This causes the random Rcvd numbers - these are not really sequence numbers, but random data from the middle of the data block.

To make this better, we may try to read all pending data from the TCP connection on the wrong sequence number and then send the original request again (see the description of the bug).

Tech debt - method complexity

_tnfs_transaction method is very complex, over 140 lines with two nested loops and multiple if/else. To make it easier to reason about and avoid bugs as above, we should split it into smaller chunks.

Plan

So, we need to following things:

  1. Split the _tnfs_transaction into smaller methods.
  2. On the wrong sequence number
    • exhaust buffer,
    • send the request again,
    • make sure we don't send response as a request.
trekawek referenced this issue May 27, 2024
This should prevent reading unrelated data as a response packet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant