
Posible deadlock with Java 21 Virtual Threads #10747

Closed
Moscagus opened this issue Mar 4, 2024 · 17 comments · Fixed by #10854 or #10881

Comments

@Moscagus

Moscagus commented Mar 4, 2024

Java 21
OpenTelemetry Java Agent 2.1.0
OpenTelemetry SDK 1.35.0

I have an application on Java 21 that makes heavy use of virtual threads. After a stress test it ended up stuck. After investigating, I verified that the problem comes from the OpenTelemetry Java agent, even in its latest version 2.1.0. The problem arises because the agent leaves the carrier thread in an inconsistent state.

jcmd-otel.txt
jstack-otel.txt

1 - "jstack-otel.txt" show the problem. For example:
...
"ForkJoinPool-1-worker-1" #134 [271] daemon prio=5 os_prio=0 cpu=70474.35ms elapsed=722.35s allocated=14189M defined_classes=272 tid=0x000055d55a76ef20 [0x00007f75af255000]
Carrying virtual thread #134 --> Bug: same TID as carrier
..

2 - "jcmd-otel.txt" show the stack with problem. For example:
..
"tid": "159185",
"name": "virtual-Tn3EventsConsumerService-158213",
"stack": [
**"java.base/jdk.internal.misc.Unsafe.park(Native Method)", --> Bug: VT use java.lang.VirtualThread.park, no jdk.internal.misc.Unsafe.park
"java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)",
"java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)",
"java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)",
"java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)",
"java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)",
"java.base/java.lang.ref.ReferenceQueue.poll(ReferenceQueue.java:182)",
"io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap.expungeStaleEntries(AbstractWeakConcurrentMap.java:235)", --> opentelemetry java agent
"io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.WeakConcurrentMap$WithInlinedExpunction.getIfPresent(WeakConcurrentMap.java:193)",
"io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.WeakLockFreeCache.get(WeakLockFreeCache.java:26)",
"io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.mapGet(VirtualFieldImplementationsGenerator.java:298)",
"io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.realGet(VirtualFieldImplementationsGenerator.java)",
"io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.get(VirtualFieldImplementationsGenerator.java:280)",
"io.opentelemetry.javaagent.bootstrap.executors.ExecutorAdviceHelper.cleanPropagatedContext(ExecutorAdviceHelper.java:92)",
"java.base/java.util.concurrent.FutureTask.cancel(FutureTask.java:181)",
"java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.cancel(ScheduledThreadPoolExecutor.java:291)",
"java.base/java.lang.VirtualThread.cancel(VirtualThread.java:705)",
"java.base/java.lang.VirtualThread.parkNanos(VirtualThread.java:628)",
...

3 - VirtualThread.java: https://github.com/openjdk/loom/blob/fibers/src/java.base/share/classes/java/lang/VirtualThread.java#L681
https://github.com/openjdk/loom/blob/fibers/src/java.base/share/classes/java/lang/VirtualThread.java#L767

@Override
void parkNanos(long nanos) {
    ...
    if (nanos > 0) {
        ...
        try {
            yielded = yieldContinuation();  // may throw
        } finally {
            ...
            cancel(unparker);
        }
        ...
    }
}

@ChangesCurrentThread
private void cancel(Future<?> future) {
    ...
    if (!future.isDone()) {                --------> when this condition holds, the problem is triggered
        // need to switch to current carrier thread to avoid nested parking
        switchToCarrierThread();     --------> here the carrier TID and the carried virtual thread's TID become the same, as shown in point 1 (jstack-otel.txt)
        try {
            future.cancel(false);        --------> here the OpenTelemetry Java agent's instrumentation kicks in (bug)
        } finally {
            switchToVirtualThread(this);  ---------> execution never gets here
        }
    }
}

4 - AbstractWeakConcurrentMap.java
https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/release/v2.1.x/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/AbstractWeakConcurrentMap.java#L232

public static void expungeStaleEntries() {
  Reference<?> reference;
  while ((reference = REFERENCE_QUEUE.poll()) != null) {
     removeWeakKey((WeakKey<?>) reference);
  }
}

REFERENCE_QUEUE.poll() -------> this poll() is what ends up in jdk.internal.misc.Unsafe.park, leaving the application stuck, because as explained in point 1:
"ForkJoinPool-1-worker-1" #134 [271] daemon prio=5 os_prio=0 cpu=70474.35ms elapsed=722.35s allocated=14189M defined_classes=272 tid=0x000055d55a76ef20 [0x00007f75af255000]
Carrying virtual thread #134 --> Bug: same TID as carrier

@Moscagus Moscagus changed the title (the issue description, originally pasted as the title) Deadlock with Java 21 Virtual Threads Mar 4, 2024
@Moscagus Moscagus changed the title Deadlock with Java 21 Virtual Threads Posible deadlock with Java 21 Virtual Threads Mar 4, 2024
@JonasKunz
Contributor

JonasKunz commented Mar 5, 2024

Hey @Moscagus, thanks for the detailed analysis!
To my understanding this issue is actually slightly different from #10181 because here only certain threads hang instead of the entire JVM, correct?

The root cause of this issue is that we try to acquire a contended lock while a thread is about to mount or unmount.
When such a lock is contended, the JVM will try to unmount the current thread while it is already in the process of being mounted / unmounted. This causes the thread to basically deadlock with itself.

I've experienced the same issue while playing around with JVMTI mount / unmount callbacks.

Note that this issue doesn't occur (yet) with synchronized locking, because those currently do not cause virtual threads to be unmounted. However, to my knowledge the JVM devs are working on lifting this restriction, so we'll see what the future holds here.

The deadlock you are seeing here occurs, as you already analyzed, because the virtual thread implementation internally seems to use a standard ScheduledThreadPool for scheduling virtual threads,
see https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/VirtualThread.java#L1237.

This thread-pool is instrumented by the opentelemetry agent for context propagation.

IMO the best solution would be to not instrument those internal pools. That is, however, not easily doable to my knowledge, because the JDK uses standard classes which we can't simply exclude.
We would need to build some exclusion at the instance level.

Another safe option would be to make the instrumentation a no-op when it runs in a virtual-thread mounting/unmounting context.
To my knowledge this would require instrumenting very low-level and frequently called methods, such as VirtualThread.parkNanos, or walking the stack, which would be pretty expensive.
In addition, this would make us highly dependent on the internal Loom implementation.
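
For illustration only, a rough sketch of such a guard flag (my own assumption of how it could look; the names propagationDisabled and VirtualThreadInstrumentation are borrowed from later comments in this thread, and the exact instrumentation points are a guess, not the agent's actual code):

public final class VirtualThreadGuard {

  // set while the JVM is switching between a virtual thread and its carrier;
  // the advice placement around the JDK-internal switch methods is a guess
  private static final ThreadLocal<Boolean> propagationDisabled =
      ThreadLocal.withInitial(() -> Boolean.FALSE);

  // would be called from advice entering the mount/unmount path
  public static void disablePropagation() {
    propagationDisabled.set(Boolean.TRUE);
  }

  // would be called from advice leaving the mount/unmount path
  public static void enablePropagation() {
    propagationDisabled.set(Boolean.FALSE);
  }

  // the executor instrumentation would consult this before touching VirtualFields
  public static boolean isPropagationDisabled() {
    return propagationDisabled.get();
  }

  private VirtualThreadGuard() {}
}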

A simple workaround for now would be to move the cleanup of the VirtualField WeakConcurrentMap's stale entries to a separate thread which performs a periodic cleanup.
Here is an example implementation of how to do this.
This has the side benefit of reducing the latency of get/put calls performed on application threads; for this reason the elastic-apm-agent has already been using a separate cleaner thread.

This will prevent the issue from occurring, because the ReferenceQueue won't be polled anymore during mount/unmount of a virtual thread.
The underlying WeakHashMap might still be contended; however, to my knowledge it uses synchronized instead of other locks, making this a non-issue.
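
For illustration, a minimal sketch of such a periodic cleaner (my own sketch, assuming each map registers its expunge routine as a task; this is not the agent's actual implementation):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class WeakMapCleaner implements Runnable {
  // tasks registered here would call e.g. AbstractWeakConcurrentMap.expungeStaleEntries()
  private static final Set<Runnable> CLEANABLE = ConcurrentHashMap.newKeySet();

  static void register(Runnable expungeTask) {
    CLEANABLE.add(expungeTask);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      for (Runnable task : CLEANABLE) {
        task.run(); // drain the ReferenceQueue away from application threads
      }
      try {
        Thread.sleep(1_000); // periodic cleanup interval
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  static void start() {
    Thread cleaner = new Thread(new WeakMapCleaner(), "weak-map-cleaner");
    cleaner.setDaemon(true);
    cleaner.start();
  }
}

With this pattern, the get/put paths on application (virtual) threads never call ReferenceQueue.poll(), which is exactly the call that parks during mount/unmount in the traces above.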

Another way of circumventing this would be to wrap Runnables / Futures instead of using VirtualFields, for which an issue already exists: #9324.

I'd like some more feedback here from the devs on which route to take. Based on that I should be able to find the time to implement a fix.

@Moscagus
Author

Moscagus commented Mar 5, 2024

Hi @JonasKunz, I think there are 2 issues:

1 - expungeStaleEntries: I agree with you that a good workaround would be to move the cleanup of the VirtualField WeakConcurrentMap's stale entries to a separate thread which performs a periodic cleanup.

2 - getIfPresent:
https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/release/v2.1.x/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/WeakConcurrentMap.java#L191

@Override
public V getIfPresent(K key) {
  expungeStaleEntries();    ------> It would be moved to a separate thread which performs a periodic cleanup.
  return super.getIfPresent(key);    ------->  AbstractWeakConcurrentMap.java
}

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/release/v2.1.x/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/AbstractWeakConcurrentMap.java#L111

public V getIfPresent(K key) {
  if (key == null) {
    throw new NullPointerException();
  }
  L lookupKey = getLookupKey(key);    ----> WeakConcurrentMap.java
  try {
    return target.get(lookupKey);
  } finally {
    resetLookupKey(lookupKey);
  }
}

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/release/v2.1.x/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/WeakConcurrentMap.java#L119

@Override
@SuppressWarnings("unchecked")
protected LookupKey<K> getLookupKey(K key) {
  LookupKey<K> lookupKey;
  if (reuseKeys) {
    lookupKey = (LookupKey<K>) LOOKUP_KEY_CACHE.get();
  } else {
    lookupKey = new LookupKey<>();
  }
  return lookupKey.withValue(key);
}

LOOKUP_KEY_CACHE is a ThreadLocal:

private static final ThreadLocal<LookupKey<?>> LOOKUP_KEY_CACHE =
    new ThreadLocal<LookupKey<?>>() {
      @Override
      protected LookupKey<?> initialValue() {
        return new LookupKey<>();
      }
    };

I may be wrong in my analysis, but I'm wondering whether LOOKUP_KEY_CACHE, which should be local to the virtual thread, is okay to be consumed by the carrier thread (LOOKUP_KEY_CACHE.get()).
If I'm not mistaken, this is another point to resolve.

@laurit
Contributor

laurit commented Mar 5, 2024

A simple workaround for now would be to move the cleanup of VirtualField WeakConcurrentHashMap's stale entries to a separate thread which performs a periodic cleanup.

We already have a background thread that cleans these maps https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation-api/src/main/java/io/opentelemetry/instrumentation/api/internal/cache/weaklockfree/AbstractWeakConcurrentMap.java so removing the inline expunge is an option.
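
To illustrate, removing the inline expunge could look roughly like this (a fragment sketched from the WithInlinedExpunction override quoted earlier in this thread, not the actual change; the override could equally just be deleted so the base-class method is used directly):

@Override
public V getIfPresent(K key) {
  // no expungeStaleEntries() here; the background cleaner thread drains the
  // ReferenceQueue instead, so this path never touches ReferenceQueue.poll()
  return super.getIfPresent(key);
}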

@JonasKunz
Contributor

I may be wrong in my analysis, but I'm wondering whether LOOKUP_KEY_CACHE, which should be local to the virtual thread, is okay to be consumed by the carrier thread (LOOKUP_KEY_CACHE.get()).
If I'm not mistaken, this is another point to resolve.

I think that shouldn't be a problem here: the way the LOOKUP_KEY_CACHE is used, it is impossible for the thread to be unmounted while it is currently using the cached LookupKey for a lookup. The WeakHashMap implementation doesn't care about the values of the cached LookupKey before or after this usage.

so removing the inline expunge is an option.

I'll try to do that and open a PR (soon, hopefully). I'll also try to reproduce the deadlock in a unit test first, so that we can more easily detect problems with future iterations of Project Loom here.

@JonasKunz
Contributor

@Moscagus Unfortunately I haven't been able to reproduce the issue.
Could you try the snapshot from #10768 and verify that it fixes the issue?

@Moscagus
Author

Moscagus commented Mar 7, 2024

Unfortunately it's still using "expungeStaleEntries".
I decompiled the jar and did see the changes, but at runtime it still creates "WeakConcurrentMap$WithInlinedExpunction".

jcmd-otel.txt
jstack-otel.txt

I'll keep investigating.

@Moscagus
Author

Moscagus commented Mar 7, 2024

I think the problem is in:

private static final class CacheBasedVirtualFieldSupplier implements VirtualFieldSupplier {

  private final Cache<Class<?>, Cache<Class<?>, VirtualField<?, ?>>>
      ownerToFieldToImplementationMap = Cache.weak();    ---> Cache.weak() = new WeakLockFreeCache<>(true), whose delegate = new WeakConcurrentMap.WithInlinedExpunction()

  @Override
  @SuppressWarnings("unchecked")
  public <U extends T, V extends F, T, F> VirtualField<U, V> find(
      Class<T> type, Class<F> fieldType) {
    return (VirtualField<U, V>)
        ownerToFieldToImplementationMap
            .computeIfAbsent(type, c -> Cache.weak())
            .computeIfAbsent(fieldType, c -> new CacheBasedVirtualField<>());
  }

}

Therefore, every time you call find():

.computeIfAbsent(fieldType, c -> new CacheBasedVirtualField<>());

computeIfAbsent in WeakConcurrentMap.WithInlinedExpunction:

public V computeIfAbsent(K key, Function<? super K, ? extends V> mappingFunction) {
  expungeStaleEntries(); ---> POLL again
  return (V) super.computeIfAbsent(key, mappingFunction);
}

@Moscagus
Author

Moscagus commented Mar 7, 2024

On the other hand, the find() calls are rewritten by Byte Buddy, and from what I see they will always use "Cache.weak()":

    "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.mapGet(VirtualFieldImplementationsGenerator.java:298)",
          "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.realGet(VirtualFieldImplementationsGenerator.java)",
          "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.get(VirtualFieldImplementationsGenerator.java:280)",

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/javaagent-tooling/src/main/java/io/opentelemetry/javaagent/tooling/field/VirtualFieldFindRewriter.java#L42

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/javaagent-tooling/src/main/java/io/opentelemetry/javaagent/tooling/field/VirtualFieldImplementationsGenerator.java#L269

https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/javaagent-tooling/src/main/java/io/opentelemetry/javaagent/tooling/field/VirtualFieldImplementationsGenerator.java#L312

@JonasKunz
Contributor

JonasKunz commented Mar 12, 2024

I missed the fact that RuntimeVirtualFieldSupplier.CacheBasedVirtualField isn't used for Advice VirtualField.find calls, even if it was not possible for the agent to inject fields into the target class.

I've fixed this now in my PR; @Moscagus, could you try again with the new snapshot?

@JonasKunz
Contributor

Update after a discussion with @laurit: my fix likely won't actually solve the problem, even if it no longer surfaces.
ConcurrentHashMap doesn't rely on synchronized only, but also calls LockSupport.park in contended cases.
That would result in the same nested parking attempts under contention as polling the ReferenceQueue.

@Moscagus
Author

Exactly as you mention, the problem is still the same as before:

          "io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap.expungeStaleEntries(AbstractWeakConcurrentMap.java:235)",
          "io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.WeakConcurrentMap$WithInlinedExpunction.getIfPresent(WeakConcurrentMap.java:193)",
          "io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.WeakLockFreeCache.get(WeakLockFreeCache.java:30)",
          "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.mapGet(VirtualFieldImplementationsGenerator.java:298)",
          "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.realGet(VirtualFieldImplementationsGenerator.java)",
          "io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$util$concurrent$Future$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.get(VirtualFieldImplementationsGenerator.java:280)",
          "io.opentelemetry.javaagent.bootstrap.executors.ExecutorAdviceHelper.cleanPropagatedContext(ExecutorAdviceHelper.java:92)",

@laurit
Contributor

laurit commented Mar 14, 2024

@Moscagus
Author

@laurit after several consecutive tests I confirm that the solution does not work.
jcmd.txt
jstack.txt

I'll keep investigating.

@laurit laurit reopened this Mar 15, 2024
@laurit
Contributor

laurit commented Mar 15, 2024

@Moscagus
Author

@laurit what do you think about changing the check to: !Thread.currentThread().isVirtual()

private static boolean isPropagationDisabled() {
  return (!Thread.currentThread().isVirtual());
}

If this is valid, there would be no need for the "ThreadLocal propagationDisabled" variable or the "VirtualThreadInstrumentation" class.
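
For reference, a tiny self-contained example (my own illustration, not agent code) of what that check evaluates to on platform versus virtual threads:

public class IsVirtualCheckDemo {
  static boolean isPropagationDisabled() {
    return !Thread.currentThread().isVirtual();
  }

  public static void main(String[] args) throws Exception {
    // on an ordinary platform thread the check disables propagation
    System.out.println("platform thread: " + isPropagationDisabled()); // true
    // on a virtual thread it keeps propagation enabled
    Thread t = Thread.ofVirtual().start(
        () -> System.out.println("virtual thread: " + isPropagationDisabled())); // false
    t.join();
  }
}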

@Moscagus
Author

Sorry, I was thinking as if all applications were on Java 21 and used only virtual threads.

I don't think what I just mentioned helps.
