Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid behavior when using same NATS connection for both NATS and STAN subscriptions #184

Open
dorny opened this issue Aug 31, 2020 · 10 comments
Assignees

Comments

@dorny
Copy link
Contributor

dorny commented Aug 31, 2020

I had encountered quite invalid behavior:
We have single instance of NATS IConnection. We use it to setup NATS subscriptions and the same instance is also used in IStanConnection (passed via StanOptions.NatsConn).

While setting up STAN subscriptions via IStanSubscription.Subscribe(subject, options, handler) the NATSTimeoutException is thrown. However our handler is still being executed, processing the incoming messages. However as the exception is thrown from the Subscribe(...) method, there is no instance of IStanSubscription we could dispose.

We are encountering this behavior almost all the time during application startup, when most of the subscriptions are configured. However if we add new STAN subscription later, it works as expected.

I tried various things - using locks to serialize access to connection, tested it with or without actual data stored in NATS Streaming (we use FILE store). Nothing helped.
Finally I get rid of this behavior once I stopped passing existing IConnection to STAN and let it create it's own connection.

Is reusing of same connection for both NATS and STAN supported? To my understanding, STAN is using NATS just as a transport layer and it should work just fine. If it's not the case, better mention that in a docs

Assuming it should be supported, then:

  1. There is a bug causing the STAN subscription end with NATSTimeoutException even when it's actually established
  2. If exception is throwing from Subscribe(...) method, it shouldn't actually establish the subscription and invoke the provided handler
@ColinSullivan1
Copy link
Member

Reusing the same connection is supported - we may have a bug here. I'm not sure why you'd be getting a timeout exception upon establishing a subscription. It is concerning that you have a "partial" subscription, meaning the client didn't clean up.

Which version of the client are you using, and would you be able to reproduce this in a simple app you could share?

@dorny
Copy link
Contributor Author

dorny commented Sep 1, 2020

Sure, I will try to reproduce it and post here link to the repo.
However as it happens only in one of our services and only during the startup, it's hard to say how easy it will be to reproduce it.
Maybe it's a threading issue. We will see.

I'm using the latest version of both NATS and STAN clients.

@ColinSullivan1
Copy link
Member

@dorny, We'd appreciate it, it'd really help us out. If you have a full stack trace from the NATSTimeoutException and server logs from the time this happened might shed some light on this.

BTW, I found an issue where the subscription timeout is hardcoded to 2s. That needs to be made configurable. I'll create an issue for that.

@dorny
Copy link
Contributor Author

dorny commented Sep 11, 2020

Minimal example: https://gist.github.com/dorny/6255e9a6679a2f582c98e3fa93b91df0

My initial assumption that it's caused by some threading issue was wrong.
Bug is reproducible once you use IConnection.RequestAsync(...) method.

Output of the code in gist:

Handler is invoked even if subscribe throws exception!
Bug successfully reproduced!
Timeout occurred.
   at NATS.Client.Connection.requestSync(String subject, Byte[] data, Int32 offset, Int32 count, Int32 timeout)
   at NATS.Client.Connection.request(String subject, Byte[] data, Int32 offset, Int32 count, Int32 timeout)
   at NATS.Client.Connection.Request(String subject, Byte[] data, Int32 timeout)
   at STAN.Client.AsyncSubscription.subscribe(String subRequestSubject, String subject, String qgroup, EventHandler`1 handler)
   at STAN.Client.Connection.subscribe(String subject, String qgroup, EventHandler`1 handler, StanSubscriptionOptions options)
   at STAN.Client.Connection.Subscribe(String subject, StanSubscriptionOptions options, EventHandler`1 handler)
   at StanNetBug.Program.Main(String[] args) in C:\Users\dha\Workspace\KiMES\repos\StanNetBug\Program.cs:line 37

@ColinSullivan1 ColinSullivan1 self-assigned this Sep 25, 2020
@ColinSullivan1
Copy link
Member

Unfortunately your test code (ported to tests) doesn't seem reproduce the error, I'll revisit that to be sure.

Here are my thoughts...

STAN subscriptions are created by:

  1. Creating an inbox to represent the subscription channel (STAN subject)
  2. Create a subscription on that inbox
  3. Make a NATS request to the streaming server to perform a streaming "subscribe"
  4. Wait for the response and process it.

I suspect the problem lies in a timeout on request in step 3, where the NATS streaming server receives the message and starts forwards data, but the response is lost before the client times out. I'll continue to look at this.

Regardless, I'll suggest increasing the timeouts - which is impossible at the moment - so I'll add #185 then we can test to see if a longer timeout resolves the issue for you.

When this happens does the call to STAN.Client.Subscribe take more than 2s?

@dorny
Copy link
Contributor Author

dorny commented Sep 26, 2020

I'm pretty sure this is not an performance issue so increasing timeout won't help here.
Of course making timeouts configurable would be good, it's just a different topic.

I tried my test code again and the bug is always reproduced.
I'm running nats-streaming locally in docker.
It's a fresh instance using memory store. No other client is connected.
It's running on localhost so there are no network issues as well.
I got same behavior also when running nats-streaming as windows service so it's not an docker (WSL on windows) issue.

Subscribe() method call blocks for exactly 2s. Then the handler is invoked and just after that the Timeout exception is thrown.
If the line with call to RequestAsync() is commented out, Subscribe() takes 4ms and no exception is thrown.

There must be some bug on at least one side of the connection - this client or NATS Streaming server itself.

@ColinSullivan1
Copy link
Member

I'll keep looking... thanks @dorny.

@ColinSullivan1
Copy link
Member

FYI, outside of xunit I was able to reproduce this with .NET core 3.1. When stepping through things work fine, so likely a timing/threading issue. I noticed .NET core framework itself crash a few times as well, not good.

Wow, you've found a good one... :) I'll let you know what I find.

@ColinSullivan1
Copy link
Member

ColinSullivan1 commented Sep 30, 2020

Interesting find - if get the task and wait... success.

            //await nc.RequestAsync("rpc.test", dummyMessage);
            var t = nc.RequestAsync("rpc.test", dummyMessage);
            t.Wait();

And this fails...

            var t = nc.RequestAsync("rpc.test", dummyMessage);
            await t;

@ColinSullivan1
Copy link
Member

I believe I've found the issue. It's a problem that's been resolved in the core NATS library. See: nats-io/nats.net#404. Building with current source and upgrading the core NATS client to 0.11.0-pre2 cleared this up.

Because the new release of the core NATS client will require net 4.6 compatibility, I'll need to coordinate releases for both NATS.Client and STAN.Client releases. I'm planning on getting a release out of the core NATS client soon (hopefully this week). Then I'll get a pre-release of this client so you can verify.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants