
Posible deadlock with Java 21 Virtual Threads #10747

Closed
Moscagus opened this issue Mar 4, 2024 · 17 comments · Fixed by #10854 or #10881

Comments

@Moscagus

Moscagus commented Mar 4, 2024

Java 21
OpenTelemetry Java Agent 2.1.0
OpenTelemetry SDK 1.35.0

I have an application on Java 21 that makes heavy use of virtual threads. After a stress test it ended up stuck. After investigating, I verified that the problem comes from the OpenTelemetry Java agent, even in its latest version 2.1.0. The problem arises because the agent leaves the carrier thread in an inconsistent state.

jcmd-otel.txt
jstack-otel.txt

1 - "jstack-otel.txt" show the problem. For example:
...
"ForkJoinPool-1-worker-1" #134 [271] daemon prio=5 os_prio=0 cpu=70474.35ms elapsed=722.35s allocated=14189M defined_classes=272 tid=0x000055d55a76ef20 [0x00007f75af255000]
Carrying virtual thread #134 --> Bug: same TID as carrier
..

2 - "jcmd-otel.txt" show the stack with problem. For example:
..
"tid": "159185",
"name": "virtual-Tn3EventsConsumerService-158213",
"stack": [
**"java.base/jdk.internal.misc.Unsafe.park(Native Method)", --> Bug: VT use java.lang.VirtualThread.park, no jdk.internal.misc.Unsafe.park
"java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)",
"java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)",
"java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)",
"java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)",
"java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)",
"java.base/java.lang.ref.ReferenceQueue.poll(ReferenceQueue.java:182)",
"io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap.expungeStaleEntries(AbstractWeakConcurrentMap.java:235)", --> opentelemetry java agent
"io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.WeakConcurrentMap$WithInlinedExpunction.getIfPresent(WeakConcurrentMap.java:193)",
"io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.WeakLockFreeCache.get(WeakLockFreeCache.java:26)",
"io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.mapGet(VirtualFieldImplementationsGenerator.java:298)",
"io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.realGet(VirtualFieldImplementationsGenerator.java)",
"io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.get(VirtualFieldImplementationsGenerator.java:280)",
"io.opentelemetry.javaagent.bootstrap.executors.ExecutorAdviceHelper.cleanPropagatedContext(ExecutorAdviceHelper.java:92)",
"java.base/java.util.concurrent.FutureTask.cancel(FutureTask.java:181)",
"java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.cancel(ScheduledThreadPoolExecutor.java:291)",
"java.base/java.lang.VirtualThread.cancel(VirtualThread.java:705)",
"java.base/java.lang.VirtualThread.parkNanos(VirtualThread.java:628)",
...

3 - VirtualThread.java: https://github.com/openjdk/loom/blob/fibers/src/java.base/share/classes/java/lang/VirtualThread.java#L681
https://github.com/openjdk/loom/blob/fibers/src/java.base/share/classes/java/lang/VirtualThread.java#L767

@Override
void parkNanos(long nanos) {
    ...
    if (nanos > 0) {
        ...
        try {
            yielded = yieldContinuation();  // may throw
        } finally {
            ...
            cancel(unparker);
        }
        ...
    }
}

@ChangesCurrentThread
private void cancel(Future<?> future) {
    ...
    if (!future.isDone()) {                --------> when this condition holds, the problem is triggered
        // need to switch to current carrier thread to avoid nested parking
        switchToCarrierThread();     --------> here the carrier TID and the carried virtual thread's TID become the same, as shown in point 1 (jstack-otel.txt)
        try {
            future.cancel(false);        --------> here the OpenTelemetry Java agent's instrumentation kicks in (bug)
        } finally {
            switchToVirtualThread(this);  ---------> execution never gets here
        }
    }
}

4 - AbstractWeakConcurrentMap.java
https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/release/v2.1.x/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/AbstractWeakConcurrentMap.java#L232

public static void expungeStaleEntries() {
  Reference<?> reference;
  while ((reference = REFERENCE_QUEUE.poll()) != null) {
     removeWeakKey((WeakKey<?>) reference);
  }
}

REFERENCE_QUEUE.poll() -------> this poll() is what ends up in jdk.internal.misc.Unsafe.park, leaving the application stuck, because as explained in point 1:
"ForkJoinPool-1-worker-1" #134 [271] daemon prio=5 os_prio=0 cpu=70474.35ms elapsed=722.35s allocated=14189M defined_classes=272 tid=0x000055d55a76ef20 [0x00007f75af255000]
Carrying virtual thread #134 --> Bug: same TID as carrier

@Moscagus Moscagus changed the title (the issue description, originally pasted as the title) Deadlock with Java 21 Virtual Threads Mar 4, 2024
@Moscagus Moscagus changed the title Deadlock with Java 21 Virtual Threads Posible deadlock with Java 21 Virtual Threads Mar 4, 2024
@JonasKunz
Contributor

JonasKunz commented Mar 5, 2024

Hey @Moscagus, thanks for the detailed analysis!
To my understanding this issue is actually slightly different from #10181 because here only certain threads hang instead of the entire JVM, correct?

The root cause of this issue is that we try to acquire a contended lock while a thread is about to mount or unmount.
When such a lock is contended, the JVM will try to unmount the current thread while it is already in the process of being mounted / unmounted. This causes the thread to basically deadlock with itself.

I've experienced the same issue while playing around with JVMTI mount / unmount callbacks.

Note that this issue doesn't occur (yet) with synchronized locking, because those currently do not cause virtual threads to be unmounted. However, to my knowledge the JVM devs are working on lifting this restriction, so we'll see what the future holds here.

The deadlock you are seeing here occurs, as you already analyzed, because the virtual thread implementation internally seems to use a standard ScheduledThreadPool for scheduling virtual threads,
see https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/VirtualThread.java#L1237.

This thread-pool is instrumented by the opentelemetry agent for context propagation.

IMO the best solution would be to not instrument those internal pools. That is, however, not easily doable to my knowledge, because the JDK uses standard classes which we can't simply exclude.
We would need to build some exclusion at the instance level.

Another safe option would be to make the instrumentation a no-op when it runs in a virtual-thread mounting/unmounting context.
To my knowledge this would require instrumenting very low-level and frequently called methods, such as VirtualThread.parkNanos, or walking the stack, which would be pretty expensive.
In addition, this would make us highly dependent on the internal Loom implementation.
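
For illustration only, a rough sketch of such a guard flag (my own assumption of how it could look; the names propagationDisabled and VirtualThreadInstrumentation are borrowed from later comments in this thread, and the exact instrumentation points are a guess, not the agent's actual code):

public final class VirtualThreadGuard {

  // set while the JVM is switching between a virtual thread and its carrier;
  // the advice placement around the JDK-internal switch methods is a guess
  private static final ThreadLocal<Boolean> propagationDisabled =
      ThreadLocal.withInitial(() -> Boolean.FALSE);

  // would be called from advice entering the mount/unmount path
  public static void disablePropagation() {
    propagationDisabled.set(Boolean.TRUE);
  }

  // would be called from advice leaving the mount/unmount path
  public static void enablePropagation() {
    propagationDisabled.set(Boolean.FALSE);
  }

  // the executor instrumentation would consult this before touching VirtualFields
  public static boolean isPropagationDisabled() {
    return propagationDisabled.get();
  }

  private VirtualThreadGuard() {}
}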

A simple workaround for now would be to move the cleanup of the VirtualField WeakConcurrentMap's stale entries to a separate thread which performs a periodic cleanup.
Here is an example implementation of how to do this.
This has the side benefit of reducing the latency of get/put calls performed on application threads; for this reason the elastic-apm-agent has already been using a separate cleaner thread.

This will prevent the issue from occurring, because the ReferenceQueue won't be polled anymore during mount/unmount of a virtual thread.
The underlying WeakHashMap might still be contended; however, to my knowledge it uses synchronized instead of other locks, making this a non-issue.
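
For illustration, a minimal sketch of such a periodic cleaner (my own sketch, assuming each map registers its expunge routine as a task; this is not the agent's actual implementation):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class WeakMapCleaner implements Runnable {
  // tasks registered here would call e.g. AbstractWeakConcurrentMap.expungeStaleEntries()
  private static final Set<Runnable> CLEANABLE = ConcurrentHashMap.newKeySet();

  static void register(Runnable expungeTask) {
    CLEANABLE.add(expungeTask);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      for (Runnable task : CLEANABLE) {
        task.run(); // drain the ReferenceQueue away from application threads
      }
      try {
        Thread.sleep(1_000); // periodic cleanup interval
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  static void start() {
    Thread cleaner = new Thread(new WeakMapCleaner(), "weak-map-cleaner");
    cleaner.setDaemon(true);
    cleaner.start();
  }
}

With this pattern, the get/put paths on application (virtual) threads never call ReferenceQueue.poll(), which is exactly the call that parks during mount/unmount in the traces above.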

Another way of circumventing this would be to wrap Runnables / Futures instead of using VirtualFields, for which an issue already exists: #9324.

I'd like some more feedback here from the devs on which route to take. Based on that I should be able to find the time to implement a fix.

@Moscagus
Author

Moscagus commented Mar 5, 2024

Hi @JonasKunz, I think there are 2 issues:

1 - expungeStaleEntries: I agree with you that a good workaround would be to move the cleanup of the VirtualField WeakConcurrentMap's stale entries to a separate thread which performs a periodic cleanup.

2 - getIfPresent:
https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/release/v2.1.x/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/WeakConcurrentMap.java#L191

@Override
public V getIfPresent(K key) {
  expungeStaleEntries();    ------> It would be moved to a separate thread which performs a periodic cleanup.
  return super.getIfPresent(key);    ------->  AbstractWeakConcurrentMap.java
}

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/release/v2.1.x/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/AbstractWeakConcurrentMap.java#L111

public V getIfPresent(K key) {
  if (key == null) {
    throw new NullPointerException();
  }
  L lookupKey = getLookupKey(key);    ----> WeakConcurrentMap.java
  try {
    return target.get(lookupKey);
  } finally {
    resetLookupKey(lookupKey);
  }
}

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/release/v2.1.x/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/WeakConcurrentMap.java#L119

@Override
@SuppressWarnings("unchecked")
protected LookupKey<K> getLookupKey(K key) {
  LookupKey<K> lookupKey;
  if (reuseKeys) {
    lookupKey = (LookupKey<K>) LOOKUP_KEY_CACHE.get();
  } else {
    lookupKey = new LookupKey<>();
  }
  return lookupKey.withValue(key);
}

LOOKUP_KEY_CACHE is a ThreadLocal:

private static final ThreadLocal<LookupKey<?>> LOOKUP_KEY_CACHE =
    new ThreadLocal<LookupKey<?>>() {
      @Override
      protected LookupKey<?> initialValue() {
        return new LookupKey<>();
      }
    };

I may be wrong in my analysis, but I'm wondering whether LOOKUP_KEY_CACHE, which should be local to the virtual thread, is okay to be consumed by the carrier thread (LOOKUP_KEY_CACHE.get()).
If I'm not mistaken, this is another point to resolve.

@laurit
Contributor

laurit commented Mar 5, 2024

A simple workaround for now would be to move the cleanup of VirtualField WeakConcurrentHashMap's stale entries to a separate thread which performs a periodic cleanup.

We already have a background thread that cleans these maps https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/AbstractWeakConcurrentMap.java so removing the inline expunge is an option.
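
To illustrate, removing the inline expunge could look roughly like this (a fragment sketched from the WithInlinedExpunction override quoted earlier in this thread, not the actual change; the override could equally just be deleted so the base-class method is used directly):

@Override
public V getIfPresent(K key) {
  // no expungeStaleEntries() here; the background cleaner thread drains the
  // ReferenceQueue instead, so this path never touches ReferenceQueue.poll()
  return super.getIfPresent(key);
}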

@JonasKunz
Contributor

I may be wrong in my analysis, but I'm wondering whether LOOKUP_KEY_CACHE, which should be local to the virtual thread, is okay to be consumed by the carrier thread (LOOKUP_KEY_CACHE.get()).
If I'm not mistaken, this is another point to resolve.

I think that shouldn't be a problem here: the way the LOOKUP_KEY_CACHE is used, it is impossible for the thread to be unmounted while it is currently using the cached LookupKey for a lookup. The WeakHashMap implementation doesn't care about the values of the cached LookupKey before or after this usage.

so removing the inline expunge is an option.

I'll try to do that and open a PR (soon, hopefully). I'll also try to reproduce the deadlock in a unit test first, so that we can more easily detect problems with future iterations of Project Loom here.

@JonasKunz
Contributor

@Moscagus Unfortunately I haven't been able to reproduce the issue.
Could you try the snapshot from #10768 and verify that it fixes the issue?

@Moscagus
Author

Moscagus commented Mar 7, 2024

Unfortunately it's still using "expungeStaleEntries".
I decompiled the jar and did see the changes, but at runtime it still creates "WeakConcurrentMap$WithInlinedExpunction".

jcmd-otel.txt
jstack-otel.txt

I'll keep investigating.

@Moscagus
Author

Moscagus commented Mar 7, 2024

I think the problem is in:

private static final class CacheBasedVirtualFieldSupplier implements VirtualFieldSupplier {

  private final Cache<Class<?>, Cache<Class<?>, VirtualField<?, ?>>>
      ownerToFieldToImplementationMap = Cache.weak();    ---> Cache.weak() = new WeakLockFreeCache<>(true), whose delegate = new WeakConcurrentMap.WithInlinedExpunction()

  @Override
  @SuppressWarnings("unchecked")
  public <U extends T, V extends F, T, F> VirtualField<U, V> find(
      Class<T> type, Class<F> fieldType) {
    return (VirtualField<U, V>)
        ownerToFieldToImplementationMap
            .computeIfAbsent(type, c -> Cache.weak())
            .computeIfAbsent(fieldType, c -> new CacheBasedVirtualField<>());
  }

}

Therefore, every time you call find():

.computeIfAbsent(fieldType, c -> new CacheBasedVirtualField<>());

computeIfAbsent in WeakConcurrentMap.WithInlinedExpunction:

public V computeIfAbsent(K key, Function<? super K, ? extends V> mappingFunction) {
  expungeStaleEntries(); ---> POLL again
  return (V) super.computeIfAbsent(key, mappingFunction);
}

@Moscagus
Author

Moscagus commented Mar 7, 2024

On the other hand, the find() calls are rewritten by Byte Buddy, and from what I see they will always use "Cache.weak()":

    "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.mapGet(VirtualFieldImplementationsGenerator.java:298)",
          "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.realGet(VirtualFieldImplementationsGenerator.java)",
          "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.get(VirtualFieldImplementationsGenerator.java:280)",

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/javaagent-tooling/src/main/java/io/opentelemetry/javaagent/tooling/field/VirtualFieldFindRewriter.java#L42

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/javaagent-tooling/src/main/java/io/opentelemetry/javaagent/tooling/field/VirtualFieldImplementationsGenerator.java#L269

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/javaagent-tooling/src/main/java/io/opentelemetry/javaagent/tooling/field/VirtualFieldImplementationsGenerator.java#L312

@JonasKunz
Contributor

JonasKunz commented Mar 12, 2024

I missed the fact that RuntimeVirtualFieldSupplier.CacheBasedVirtualField isn't used for Advice VirtualField.find calls, even if it was not possible for the agent to inject fields into the target class.

I've fixed this now in my PR; @Moscagus, could you try again with the new snapshot?

@JonasKunz
Contributor

Update after a discussion with @laurit: my fix likely won't actually solve the problem, even if it no longer surfaces.
ConcurrentHashMap doesn't rely on synchronized only, but also calls LockSupport.park in contended cases.
That would result in the same nested parking attempts under contention as polling the ReferenceQueue.

@Moscagus
Author

Exactly as you mention, the problem is still the same as before:

          "io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap.expungeStaleEntries(AbstractWeakConcurrentMap.java:235)",
          "io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.WeakConcurrentMap$WithInlinedExpunction.getIfPresent(WeakConcurrentMap.java:193)",
          "io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.WeakLockFreeCache.get(WeakLockFreeCache.java:30)",
          "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.mapGet(VirtualFieldImplementationsGenerator.java:298)",
          "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.realGet(VirtualFieldImplementationsGenerator.java)",
          "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.get(VirtualFieldImplementationsGenerator.java:280)",
          "io.opentelemetry.javaagent.bootstrap.executors.ExecutorAdviceHelper.cleanPropagatedContext(ExecutorAdviceHelper.java:92)",

@laurit
Contributor

laurit commented Mar 14, 2024

@Moscagus
Author

@laurit after several consecutive tests I confirm that the solution does not work.
jcmd.txt
jstack.txt

I'll keep investigating.

@laurit laurit reopened this Mar 15, 2024
@laurit
Contributor

laurit commented Mar 15, 2024

@Moscagus
Author

@laurit what do you think about changing the check to: !Thread.currentThread().isVirtual()

private static boolean isPropagationDisabled() {
  return (!Thread.currentThread().isVirtual());
}

If this is valid, there would be no need for the "ThreadLocal propagationDisabled" variable or the "VirtualThreadInstrumentation" class.
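
For reference, a tiny self-contained example (my own illustration, not agent code) of what that check evaluates to on platform versus virtual threads:

public class IsVirtualCheckDemo {
  static boolean isPropagationDisabled() {
    return !Thread.currentThread().isVirtual();
  }

  public static void main(String[] args) throws Exception {
    // on an ordinary platform thread the check disables propagation
    System.out.println("platform thread: " + isPropagationDisabled()); // true
    // on a virtual thread it keeps propagation enabled
    Thread t = Thread.ofVirtual().start(
        () -> System.out.println("virtual thread: " + isPropagationDisabled())); // false
    t.join();
  }
}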

@Moscagus
Author

Sorry, I was thinking as if all applications were on Java 21 and used only virtual threads.

I don't think what I just mentioned helps.
