Possible deadlock with Java 21 Virtual Threads #10747
Hey @Moscagus, thanks for the detailed analysis! The root cause of this issue is that we try to acquire a contended lock while a thread is about to mount or unmount. I've experienced the same issue while playing around with JVMTI mount / unmount callbacks. Note that this issue doesn't occur (yet) with […].

The deadlock you are seeing here is, as you already analyzed, because the virtual thread implementation internally seems to use a standard thread pool. This thread pool is instrumented by the OpenTelemetry agent for context propagation. IMO the best solution would be to not instrument those internal pools. That is however not really easily doable to my knowledge, because it uses standard classes which we can't just exclude. Another safe option would be to make the instrumentation no-op when it runs in a Virtual Thread mounting / unmounting context.

A simple workaround for now would be to move the cleanup of the VirtualField weak maps' stale entries to a separate thread which performs a periodic cleanup. This will prevent the issue from occurring because the contended lock is no longer touched on the mount / unmount path. Another way of circumventing this would be to use wrapping of […].

I'd like some more feedback here from the devs on which route to take. Based on that I should be able to find the time to implement a fix.
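For reference, a minimal sketch of the "wrapping" alternative mentioned above, using the public Context API (the class name WrappingPropagationSketch and the plain fixed thread pool are illustrative assumptions, not the agent's internals): the submitted task carries its Context explicitly instead of having a PropagatedContext attached to it through a VirtualField weak map.

```java
import io.opentelemetry.context.Context;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WrappingPropagationSketch {
  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    Runnable task = () -> System.out.println("context in task: " + Context.current());
    // Context.current().wrap(...) captures the caller's context and makes it current while
    // the task runs; no per-task entry in a weak map, so nothing to expunge on access.
    pool.submit(Context.current().wrap(task));
    pool.shutdown();
  }
}
```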
Hi @JonasKunz, I think there are 2 issues:

1 - expungeStaleEntries: I agree with you that a good workaround would be to move the cleanup of the VirtualField WeakConcurrentHashMap's stale entries to a separate thread which performs a periodic cleanup.

2 - LOOKUP_KEY_CACHE is ThreadLocal: I may be wrong in my analysis, but I'm wondering if LOOKUP_KEY_CACHE, which should be local to the Virtual Thread, is okay to be consumed by the Carrier Thread (LOOKUP_KEY_CACHE.get()).
We already have a background thread that cleans these maps (https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/AbstractWeakConcurrentMap.java), so removing the inline expunge is an option.
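A rough sketch of what such a dedicated cleaner looks like (illustrative class and method names, not the actual AbstractWeakConcurrentMap code): a single daemon thread blocks on ReferenceQueue.remove() and expunges stale entries, so readers never have to touch the queue and its lock on their own, potentially mounting/unmounting, thread.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BackgroundExpungeSketch<K, V> {
  private final Map<WeakReference<K>, V> target = new ConcurrentHashMap<>();
  private final ReferenceQueue<K> referenceQueue = new ReferenceQueue<>();

  BackgroundExpungeSketch() {
    Thread cleaner = new Thread(this::expungeLoop, "weak-map-cleaner");
    cleaner.setDaemon(true);
    cleaner.start();
  }

  void put(K key, V value) {
    target.put(new WeakReference<>(key, referenceQueue), value);
  }

  private void expungeLoop() {
    try {
      while (true) {
        // remove() blocks until a key has been collected; only this thread drains the queue
        Reference<? extends K> stale = referenceQueue.remove();
        target.remove(stale);
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```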
I think that shouldn't be a problem here: the way the […]
I'll try to do that and open a PR (soon hopefully). I'll also see if I can reproduce the deadlock in a unit test first, so that we can more easily detect problems with future iterations of Project Loom here.
Unfortunately it's still using "expungeStaleEntries". I'll keep investigating.
I think the problem is in:

[…]
}

Therefore, every time you call find():

.computeIfAbsent(fieldType, c -> new CacheBasedVirtualField<>());

and computeIfAbsent in WeakConcurrentMap.WithInlinedExpunction […]
On the other hand, the find() implementations are generated by Byte Buddy, and from what I see they will always use "Cache.weak()".
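To make the "inlined expunction" point concrete, here is a toy sketch (not the shaded WeakConcurrentMap implementation; the names and the linear lookup are simplifications): every read first drains the ReferenceQueue via poll(), and on JDK 21 ReferenceQueue.poll() acquires a ReentrantLock internally, which is the lock visible in the stack trace in the issue description.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class InlinedExpunctionSketch<K, V> {
  private final Map<WeakReference<K>, V> target = new ConcurrentHashMap<>();
  private final ReferenceQueue<K> referenceQueue = new ReferenceQueue<>();

  void put(K key, V value) {
    target.put(new WeakReference<>(key, referenceQueue), value);
  }

  V getIfPresent(K key) {
    expungeStaleEntries(); // "inlined": runs on the calling thread, on every single access
    for (Map.Entry<WeakReference<K>, V> entry : target.entrySet()) {
      if (entry.getKey().get() == key) {
        return entry.getValue();
      }
    }
    return null;
  }

  private void expungeStaleEntries() {
    Reference<? extends K> reference;
    // ReferenceQueue.poll() takes the queue's internal lock (a ReentrantLock on JDK 21)
    while ((reference = referenceQueue.poll()) != null) {
      target.remove(reference);
    }
  }
}
```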
I missed out on the fact that […]. I've fixed this now in my PR. @Moscagus, could you try again with the new snapshot?
Update after a discussion with @laurit: my fix likely won't actually solve the problem, even if it doesn't surface anymore.
Exactly as you mention, the problem is still the same as before: […]
@Moscagus could you try with https://github.com/open-telemetry/opentelemetry-java-instrumentation/actions/runs/8277540924/artifacts/1325242993 and let us know whether this changes anything?
@laurit after several consecutive tests I confirm that the solution does not work. I'll keep investigating.
Could it be that since "VirtualThread" is final it cannot be instrumented?
No, being final does not prevent a class from being instrumented.
@laurit what do you think about changing the check to: !Thread.currentThread().isVirtual()
If this is valid, there would be no need for the "ThreadLocal propagationDisabled" variable or the "VirtualThreadInstrumentation" class.
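A minimal sketch of what that check could look like (VirtualThreadGuard and shouldPropagateContext are hypothetical names, not the agent's ExecutorAdviceHelper API): the propagation helpers would simply skip their work whenever the calling thread is virtual.

```java
public final class VirtualThreadGuard {
  private VirtualThreadGuard() {}

  // Thread.isVirtual() is a stable API since Java 21; on older JVMs the agent would have
  // to guard this call or keep the existing propagationDisabled mechanism.
  public static boolean shouldPropagateContext() {
    return !Thread.currentThread().isVirtual();
  }
}
```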
Sorry, I'm thinking as if all applications were on Java 21 using virtual threads only. I don't think what I just mentioned helps. |
Java 21
Opentelemetry Java Agent 2.1.0
Opentelemetry SDK 1.35.0
I have an application on Java 21 that makes heavy use of virtual threads. After a stress test, it ended up stuck. After an investigation I verified that the problem came from the OpenTelemetry Java agent, even in its latest version 2.1.0. The problem arises because the agent leaves the carrier thread in an inconsistent state.
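For context, a minimal load sketch along these lines (hypothetical, not the reporter's actual service): many virtual threads doing short timed parks, which exercises VirtualThread.parkNanos and the cancellation of its internal scheduled unpark task, the path visible in the stack trace below.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.LockSupport;

public class VirtualThreadParkLoad {
  public static void main(String[] args) {
    try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
      for (int i = 0; i < 10_000; i++) {
        executor.submit(() -> {
          for (int j = 0; j < 1_000; j++) {
            // a timed park on a virtual thread schedules an unpark task internally and
            // cancels it afterwards (VirtualThread.parkNanos -> ScheduledFutureTask.cancel)
            LockSupport.parkNanos(1_000_000L); // 1 ms
          }
        });
      }
    } // close() waits for all submitted tasks to complete
  }
}
```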
jcmd-otel.txt
jstack-otel.txt
1 - "jstack-otel.txt" show the problem. For example:
...
"ForkJoinPool-1-worker-1" #134 [271] daemon prio=5 os_prio=0 cpu=70474.35ms elapsed=722.35s allocated=14189M defined_classes=272 tid=0x000055d55a76ef20 [0x00007f75af255000]
Carrying virtual thread #134 --> Bug: same TID as carrier
..
2 - "jcmd-otel.txt" show the stack with problem. For example:
..
"tid": "159185",
"name": "virtual-Tn3EventsConsumerService-158213",
"stack": [
**"java.base/jdk.internal.misc.Unsafe.park(Native Method)", --> Bug: VT use java.lang.VirtualThread.park, no jdk.internal.misc.Unsafe.park
"java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)",
"java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)",
"java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)",
"java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)",
"java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)",
"java.base/java.lang.ref.ReferenceQueue.poll(ReferenceQueue.java:182)",
"io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap.expungeStaleEntries(AbstractWeakConcurrentMap.java:235)", --> opentelemetry java agent
"io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.WeakConcurrentMap$WithInlinedExpunction.getIfPresent(WeakConcurrentMap.java:193)",
"io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.WeakLockFreeCache.get(WeakLockFreeCache.java:26)",
"io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.mapGet(VirtualFieldImplementationsGenerator.java:298)",
"io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.realGet(VirtualFieldImplementationsGenerator.java)",
"io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.get(VirtualFieldImplementationsGenerator.java:280)",
"io.opentelemetry.javaagent.bootstrap.executors.ExecutorAdviceHelper.cleanPropagatedContext(ExecutorAdviceHelper.java:92)",
"java.base/java.util.concurrent.FutureTask.cancel(FutureTask.java:181)",
"java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.cancel(ScheduledThreadPoolExecutor.java:291)",
"java.base/java.lang.VirtualThread.cancel(VirtualThread.java:705)",
"java.base/java.lang.VirtualThread.parkNanos(VirtualThread.java:628)",
...
3 - VirtualThread.java: https://github.com/openjdk/loom/blob/fibers/src/java.base/share/classes/java/lang/VirtualThread.java#L681
https://github.com/openjdk/loom/blob/fibers/src/java.base/share/classes/java/lang/VirtualThread.java#L767
4 - AbstractWeakConcurrentMap.java
https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/release/v2.1.x/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/AbstractWeakConcurrentMap.java#L232
REFERENCE_QUEUE.poll() -------> the poll() that ends up in jdk.internal.misc.Unsafe.park, leaving the application stuck, since (as explained in point 1) the carrier thread is still carrying the virtual thread:
"ForkJoinPool-1-worker-1" #134 [271] daemon prio=5 os_prio=0 cpu=70474.35ms elapsed=722.35s allocated=14189M defined_classes=272 tid=0x000055d55a76ef20 [0x00007f75af255000]
Carrying virtual thread #134 --> Bug: same TID as carrier