Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCS/ASYNC: Skip released handler in poll missed process #9942

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jeynmann
Copy link

@jeynmann jeynmann commented Jun 7, 2024

What

Skip events in poll missed process if its handler has already been removed.

Why ?

Async handler might be removed while events missed.
When the handler be called in poll missed process and same memory is reused by the other worker, the handler will try to access wrong resources.

How ?

Set MPMC queue element's value to -1 if handler removed, then skip the events in poll missed process.

@brminich
Copy link
Contributor

@jeynmann, pls fix commit titles to trigger CI.
Seems the 1st commit format is incorrect - need to use capitl letters in the very beginning of the title, like described here:

https://github.com/openucx/ucx/wiki/Guidance-for-contributors

@brminich
Copy link
Contributor

pls check CI failures. Coverity error looks relevant:

+ '[' 1 -gt 0 ']'
+ cov-format-errors --dir /home/swx-azure-svc_azpcontainer/82170-20240610.12/build/cov_build_devel --emacs-style
/__w/1/s/test/gtest/ucs/test_async.cc:388
  Type: Calling risky function (DC.WEAK_CRYPTO)

/__w/1/s/test/gtest/ucs/test_async.cc:388:
  dont_call: "rand" should not be used for security-related applications, because linear congruential algorithms are too easy to break.
/__w/1/s/test/gtest/ucs/test_async.cc:388:
  remediation: Use a compliant random number generator, such as "/dev/random" or "/dev/urandom" on Unix-like systems, and CNG (Cryptography API: Next Generation) on Windows.

+ cp -ar /home/swx-azure-svc_azpcontainer/82170-20240610.12/build/cov_build_devel /__w/1/s/cov_build_devel
+ echo 'not ok 1 Coverity Detected 1 failures'

@jeynmann jeynmann changed the title [WIP] Skip released handler in poll missed process UCS/ASYNC: Skip released handler in poll missed process Jun 16, 2024
@jeynmann
Copy link
Author

@evgeny-leksikov
Please help to take a review when you're free.

Comment on lines 586 to 587
ucs_mpmc_queue_for_each(elem, &async->missed)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ucs_mpmc_queue_for_each(elem, &async->missed)
{
ucs_mpmc_queue_for_each(elem, &async->missed) {

ucs_mpmc_queue_for_each(elem, &async->missed)
{
if (ucs_async_missed_event_unpack_id(elem->value) == handler->id) {
elem->value = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a constant for special value (-1)

void create_event() {
ucs_status_t status = ucs_async_pipe_create(&m_event_pipe);
ASSERT_UCS_OK(status);
EXPECT_EQ(ucs_atomic_cswap32(&m_event_valid, 0, 1), 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here and below: expected value goes first in arg list, it makes matter in output of failed test

Suggested change
EXPECT_EQ(ucs_atomic_cswap32(&m_event_valid, 0, 1), 0);
EXPECT_EQ(0, ucs_atomic_cswap32(&m_event_valid, 0, 1));

le->set_event();
suspend(20 - wait);
le->check_miss();
++round;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this value used anywhere?

src/ucs/datastruct/mpmc.h Outdated Show resolved Hide resolved
src/ucs/async/async.c Outdated Show resolved Hide resolved
async = handler->async;
if (async != NULL) {
ucs_mpmc_queue_block(&async->missed);
ucs_mpmc_queue_for_each(elem, &async->missed) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add ucs_mpmc_queue_for_each in .clang-format (ForEachMacros)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe instead of exposing ucs_mpmc_queue_lock, ucs_mpmc_queue_unlock and foreach, can add ucs_mpmc_remove_if(mpmc, predicate_cb, arg) to take a callback that returns 0/1 whether the given element should be removed?

src/ucs/datastruct/mpmc.h Outdated Show resolved Hide resolved
/**
* Lock MPMC queue.
*/
void ucs_mpmc_queue_block(ucs_mpmc_queue_t *mpmc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename no lock/unlock?

test/gtest/ucs/test_async.cc Show resolved Hide resolved
test/gtest/ucs/test_async.cc Outdated Show resolved Hide resolved
@@ -811,6 +889,19 @@ UCS_TEST_SKIP_COND_P(test_async_event_mt, multithread,
EXPECT_GE(min_count, exp_min_count);
}

UCS_TEST_SKIP_COND_P(test_async_event_mt, multithread_create_destory,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: multithread_create_destroy

@@ -110,6 +126,7 @@ class base_event : public base_async {
}

ucs_async_pipe_t m_event_pipe;
volatile uint32_t m_event_valid = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: maybe init it in ctor

EXPECT_EQ(0, ucs_atomic_cswap32(&m_event_valid, 0, 1));
}

void destory_event() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: destroy_event

void reset() {
ucs_async_pipe_drain(&m_event_pipe);
}

void create_event() {
ucs_status_t status = ucs_async_pipe_create(&m_event_pipe);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: What happens when we call this function on already "created" event?
Shouldn't we check if object is already initialized, and if so - release the previous state first?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, double initialization is an error.
The life time of event is managed by base_event, since we call create/destroy method inside the constructor/destructor.

    base_event(ucs_async_mode_t mode) : base_async(mode) {
        create_event();
    }
    ...
    virtual ~base_event() {
        destroy_event();
    }

To avoid double initialization, I added assertion inside the create_event method, see

EXPECT_EQ(0, ucs_atomic_cswap32(&m_event_valid, 0, 1));

In addition, if we explicitly call create_event, which means we want to manage the event by ourself, then we're responsible for destroy_event.

check_is_blocked(le, false);

int result = le->count();
delete le;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls consider using std::scoped_ptr to get rid of explicit delete
You can also just return le->count(); in this case without saving in tmp var

async = handler->async;
if (async != NULL) {
ucs_mpmc_queue_block(&async->missed);
ucs_mpmc_queue_for_each(elem, &async->missed) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe instead of exposing ucs_mpmc_queue_lock, ucs_mpmc_queue_unlock and foreach, can add ucs_mpmc_remove_if(mpmc, predicate_cb, arg) to take a callback that returns 0/1 whether the given element should be removed?

test/gtest/ucs/test_async.cc Outdated Show resolved Hide resolved
test/gtest/ucs/test_async.cc Show resolved Hide resolved
test/gtest/ucs/test_async.cc Outdated Show resolved Hide resolved
test/gtest/ucs/test_async.cc Show resolved Hide resolved
@yosefe
Copy link
Contributor

yosefe commented Jul 1, 2024

@jeynmann pls force push the last PR to fix the commit title

src/ucs/async/async.c Outdated Show resolved Hide resolved
src/ucs/async/async.c Outdated Show resolved Hide resolved
src/ucs/async/async.c Outdated Show resolved Hide resolved
src/ucs/datastruct/mpmc.c Outdated Show resolved Hide resolved
yosefe
yosefe previously approved these changes Jul 1, 2024
src/ucs/async/async.c Outdated Show resolved Hide resolved
src/ucs/datastruct/mpmc.c Outdated Show resolved Hide resolved
src/ucs/async/async.c Outdated Show resolved Hide resolved
yosefe
yosefe previously approved these changes Jul 2, 2024
@leibin2014
Copy link
Contributor

/azp run

Copy link

Azure Pipelines successfully started running 4 pipeline(s).

@yosefe
Copy link
Contributor

yosefe commented Jul 4, 2024

@jeynmann pls squash

UCS/ASYNC: avoid call handler if already remove

GTEST/UCS: rename test func name

GTEST/UCS: use ucs rand

UCS/ASYNC: fix test code + format

GTEST/UCS: fix format

UCS/ASYNC: fix code according to comments

UCS/ASYNC: fix code according to comments

UCS/ASYNC: fix code according to comments

UCS/ASYNC: fix code according to comments

UCS/ASYNC: fix code according to comments

UCS/ASYNC: fix code according to CI

UCS/ASYNC: fix code according to CI

UCS/ASYNC: fix code according to CI

UCS/ASYNC: fix code according to comments

UCS/ASYNC: fix code according to comments

UCS/ASYNC: fix code according to comments

GTEST/UCS: fix code according to comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants