TNFS sequence out of sequence errors are not handled correctly #741

trekawek · 2024-05-25T16:06:15Z

In the logs uploaded to Discord by @tschak909, there are following 2 cases of the out of sequence problems:

17:42:21.704 > FileHandlerLocal::seek
17:42:23.705 > Timeout after 2000 milliseconds. Retrying
17:42:23.708 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 14, Expected: 15
17:42:23.714 > Timeout after 2000 milliseconds. Retrying
17:42:24.701 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 15, Expected: 14
17:42:26.700 > Timeout after 2000 milliseconds. Retrying
17:42:27.702 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 14, Expected: 15
17:42:27.742 > FileHandlerLocal::read

and

17:47:21.652 > sending block packet ...
17:47:21.681 > Command Packet:
17:47:21.681 > fc ff ff ff ff c3 81 80 80 80 80 82 81 80 81 83 b0 80 d2 a2 f3 80 88 80 ab ef c8 cc 
17:47:21.682 > Decoding 9 bytes
17:47:21.682 > handling read block command
17:47:21.683 > Drive 81  Read block 0073a2
17:47:21.683 > iwm_readblock NORMAL READ
17:47:21.686 > FileHandlerLocal::seek
17:47:21.735 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 90, Expected: 91
17:47:21.735 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 3d, Expected: 91
17:47:21.736 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 3f, Expected: 91
17:47:21.738 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 92
17:47:21.739 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 91
17:47:21.740 > TNFS OUT OF ORDER SEQUENCE! Rcvd: ee, Expected: 92
17:47:21.740 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 91
17:47:21.742 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 0, Expected: 92
17:47:21.743 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 48, Expected: 91
17:47:21.744 > TNFS OUT OF ORDER SEQUENCE! Rcvd: e5, Expected: 92
17:47:21.746 > TNFS OUT OF ORDER SEQUENCE! Rcvd: d, Expected: 91
17:47:21.746 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 75, Expected: 92
...

In the first case it seems that the session was ultimately recovered, in the second not really.

In both cases, the same packet is received twice, while the client is expecting the next one already:

17:42:23.708 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 14, Expected: 15
17:47:21.735 > TNFS OUT OF ORDER SEQUENCE! Rcvd: 90, Expected: 91

I'm not sure what's the reason of double delivery, but the firmware should be able to retry the request and deal with this problem. However, it seems it can't do it. There are two issues here.

Root cause - sending response as a request

Let's take a look on the code validating the sequence number:

fujinet-firmware/lib/TNFSlib/tnfslib.cpp

Lines 1384 to 1389 in 1b0c424

    
           // Out of order packet received. 
        
           if (pkt.sequence_num != current_sequence_num) 
        
           { 
        
               Debug_printf("TNFS OUT OF ORDER SEQUENCE! Rcvd: %x, Expected: %x\r\n", pkt.sequence_num, current_sequence_num); 
        
               // Fall through and let retry logic handle it. 
        
           }

this causes the fall through to the:

fujinet-firmware/lib/TNFSlib/tnfslib.cpp

Lines 1360 to 1371 in 1b0c424

    
                       do 
        
                       { 
        
                           if (SYSTEM_BUS.getShuttingDown()) 
        
                           { 
        
                               Debug_println("TNFS Breakout due to Shutdown"); 
        
                               return true; // false success just to get out 
        
                           } 
        
           #ifndef ESP_PLATFORM 
        
                           fnSystem.delay_microseconds(2000); // wait short time for (local) data to arrive 
        
           #endif 
        
                           int l = _tnfs_recv(&udp, m_info, pkt);

unless the recv() doesn't return anything and the timeout occurs, then it logs

17:42:23.714 > Timeout after 2000 milliseconds. Retrying

and goes to:

fujinet-firmware/lib/TNFSlib/tnfslib.cpp

Lines 1339 to 1346 in 1b0c424

    
               while (retry < m_info->max_retries) 
        
               { 
        
           #ifdef DEBUG 
        
                   _tnfs_debug_packet(pkt, payload_size); 
        
           #endif 
        
                   // Send packet 
        
                   bool sent = _tnfs_send(&udp, m_info, pkt, payload_size);

Please notice, that we're sending the pkt here. However, it's no longer the request packet. It's been overwritten by the tcp->recv() and now it's the response packet, the one with the wrong sequence number.

To fix this, we should keep request and response separately. Also, we should go immediately to send() if the received response is wrong.

Room for improvement - exhausting buffer on wrong response

The bug above explains the first log report. The second report is caused by a different issue. Apparently server sends a large response with a wrong number. The client reads it chunk by chunk and treats each chunk as a atomic response. This causes the random Rcvd numbers - these are not really sequence numbers, but random data from the middle of the data block.

To make this better, we may try to read all pending data from the TCP connection on the wrong sequence number and then send the original request again (see the description of the bug).

Tech debt - method complexity

_tnfs_transaction method is very complex, over 140 lines with two nested loops and multiple if/else. To make it easier to reason about and avoid bugs as above, we should split it into smaller chunks.

Plan

So, we need to following things:

Split the _tnfs_transaction into smaller methods.
On the wrong sequence number
- exhaust buffer,
- send the request again,
- make sure we don't send response as a request.

The text was updated successfully, but these errors were encountered:

This should prevent reading unrelated data as a response packet.

This was referenced May 25, 2024

Put the TNFS transaction in a mutex. #742

Merged

Improve the UDP stability #745

Merged

trekawek referenced this issue May 27, 2024

Exhaust the buffer on out of order packet.

9d72875

This should prevent reading unrelated data as a response packet.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TNFS sequence out of sequence errors are not handled correctly #741

TNFS sequence out of sequence errors are not handled correctly #741

trekawek commented May 25, 2024

TNFS sequence out of sequence errors are not handled correctly #741

TNFS sequence out of sequence errors are not handled correctly #741

Comments

trekawek commented May 25, 2024