Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ci] [R-package] Fix memory leaks found by valgrind #3443

Merged
merged 81 commits into from
Oct 18, 2020

Conversation

jameslamb
Copy link
Collaborator

In #3338 , {lightgbm} was rejected from CRAN for multiple failed submission attempts. As of that submission, the package was failing tests with a version of R instrumented to use valgrind.

https://cran-archive.r-project.org/web/checks/2020/2020-10-02_check_results_lightgbm.html

image

This PR attempts to add a CI job that replicates these checks.

Notes

@jameslamb jameslamb mentioned this pull request Oct 8, 2020
12 tasks
@jameslamb
Copy link
Collaborator Author

Ok, I can see a few things from the first run (https://github.com/microsoft/LightGBM/pull/3443/checks?check_run_id=1224028806)

  1. the issues with valgrind are successfully reproduced by this job!
==1312== Memcheck, a memory error detector
==1312== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1312== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==1312== Command: /usr/local/RDvalgrind/lib/R/bin/exec/R --no-readline --vanilla -f testthat.R
==1312== 
Loading required package: R6
==1312== Invalid write of size 8
==1312==    at 0x154571E0: LGBM_BoosterGetNumPredict_R (lightgbm_R.cpp:509)
==1312==    by 0x4941B8C: R_doDotCall (dotcode.c:607)
==1312==    by 0x494CCC6: do_dotcall (dotcode.c:1281)
==1312==    by 0x499FB01: bcEval (eval.c:7078)
==1312==    by 0x498B67F: Rf_eval (eval.c:727)
==1312==    by 0x498E414: R_execClosure (eval.c:1895)
==1312==    by 0x498E0C7: Rf_applyClosure (eval.c:1821)
==1312==    by 0x498BE7A: Rf_eval (eval.c:850)
==1312==    by 0x49920BF: do_set (eval.c:2967)
==1312==    by 0x498BAD6: Rf_eval (eval.c:802)
==1312==    by 0x4990B42: do_begin (eval.c:2515)
==1312==    by 0x498BAD6: Rf_eval (eval.c:802)
==1312==  Address 0xf8a70d0 is 6,832 bytes inside a block of size 7,960 alloc'd
==1312==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1312==    by 0x49E1CDB: GetNewPage (memory.c:946)
==1312==    by 0x49F102E: Rf_allocVector3 (memory.c:2784)
==1312==    by 0x49D4EE1: Rf_allocVector (Rinlinedfuns.h:593)
==1312==    by 0x49631FC: duplicate1 (duplicate.c:336)
==1312==    by 0x4962871: duplicate_child (duplicate.c:206)
==1312==    by 0x4962AEA: duplicate_list (duplicate.c:268)
==1312==    by 0x4962F0D: duplicate1 (duplicate.c:312)
==1312==    by 0x4962DAD: duplicate1 (duplicate.c:305)
==1312==    by 0x49627A1: Rf_duplicate (duplicate.c:139)
==1312==    by 0x49D32FB: R_compute_identical (identical.c:107)
==1312==    by 0x49D31D5: do_identical (identical.c:76)
==1312== 
==1312== Conditional jump or move depends on uninitialised value(s)
==1312==    at 0x49CF138: gregexpr_Regexc (grep.c:2439)
==1312==    by 0x49D1F13: do_regexpr (grep.c:3100)
==1312==    by 0x49A0058: bcEval (eval.c:7121)
==1312==    by 0x498B67F: Rf_eval (eval.c:727)
==1312==    by 0x498E414: R_execClosure (eval.c:1895)
==1312==    by 0x498E0C7: Rf_applyClosure (eval.c:1821)
==1312==    by 0x499FC8C: bcEval (eval.c:7089)
==1312==    by 0x498B67F: Rf_eval (eval.c:727)
==1312==    by 0x498B1CB: forcePromise (eval.c:555)
==1312==    by 0x49963AB: FORCE_PROMISE (eval.c:5142)
==1312==    by 0x4996566: getvar (eval.c:5183)
==1312==    by 0x499D1A5: bcEval (eval.c:6873)
==1312== 
==1312== 
==1312== HEAP SUMMARY:
==1312==     in use at exit: 274,972,869 bytes in 54,848 blocks
==1312==   total heap usage: 2,454,254 allocs, 2,399,406 frees, 5,660,846,940 bytes allocated
==1312== 
==1312== LEAK SUMMARY:
==1312==    definitely lost: 1,308,789 bytes in 7,847 blocks
==1312==    indirectly lost: 1,102 bytes in 2 blocks
==1312==      possibly lost: 9,768 bytes in 90 blocks
==1312==    still reachable: 273,653,210 bytes in 46,909 blocks
==1312==                       of which reachable via heuristic:
==1312==                         newarray           : 4,264 bytes in 1 blocks
==1312==         suppressed: 0 bytes in 0 blocks
==1312== Rerun with --leak-check=full to see details of leaked memory
==1312== 
==1312== Use --track-origins=yes to see where uninitialised values come from
==1312== For lists of detected and suppressed errors, rerun with: -s
==1312== ERROR SUMMARY: 6 errors from 2 contexts (suppressed: 0 from 0)
  1. the job took 40+ minutes to run
  2. despite these failures, the script I introduced did not actually cause an ❌ in CI

I'll try some fixes tonight

@guolinke
Copy link
Collaborator

guolinke commented Oct 9, 2020

@jameslamb you can simply try this:

  // R_INT64_PTR(out)[0] = len;
  R_INT_PTR(out)[0] = static_cast<int>(len);

@jameslamb
Copy link
Collaborator Author

jameslamb commented Oct 10, 2020

@jameslamb you can simply try this:

  // R_INT64_PTR(out)[0] = len;
  R_INT_PTR(out)[0] = static_cast<int>(len);

thanks! to be clear, I know how to get valgrind to report errors, I just must have made some mistakes in getting the CI job to actually fail. I'll figure it out, I'm sure it was something simple

UPDATE: nevermind, I read too fast. Now I see you meant "as a way to fix the issue in LGBM_BoosterGetNumPredict_R". Thanks!

@jameslamb
Copy link
Collaborator Author

ok as of e80b68d, I got the valgrind test to fail!!!

image

Notice in https://github.com/microsoft/LightGBM/pull/3443/checks?check_run_id=1236732464 that all other R tests are passing.


Now that I know this test catches what we want, next I'm going to merge in the fix-memory-leakage branch. We can now work on fixing all valgrind errors here.


Once the test is passing, I'll move it to one that is triggered by a comment (not run on every build), since it takes about 30m to run.

@guolinke
Copy link
Collaborator

it seems the error is not related to lightgbm,

==1312== Conditional jump or move depends on uninitialised value(s)
==1312==    at 0x49CF138: gregexpr_Regexc (grep.c:2439)
==1312==    by 0x49D1F13: do_regexpr (grep.c:3100)
==1312==    by 0x49A0058: bcEval (eval.c:7121)
==1312==    by 0x498B67F: Rf_eval (eval.c:727)
==1312==    by 0x498E414: R_execClosure (eval.c:1895)
==1312==    by 0x498E0C7: Rf_applyClosure (eval.c:1821)
==1312==    by 0x499FC8C: bcEval (eval.c:7089)
==1312==    by 0x498B67F: Rf_eval (eval.c:727)
==1312==    by 0x498B1CB: forcePromise (eval.c:555)
==1312==    by 0x49963AB: FORCE_PROMISE (eval.c:5142)
==1312==    by 0x4996566: getvar (eval.c:5183)
==1312==    by 0x499D1A5: bcEval (eval.c:6873)

can we fix them?

Copy link
Collaborator

@guolinke guolinke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these shrinkage_to_fit is not efficiency, let us try to remove them.

R-package/src/lightgbm_R.cpp Outdated Show resolved Hide resolved
src/boosting/gbdt_model_text.cpp Outdated Show resolved Hide resolved
src/boosting/gbdt_model_text.cpp Outdated Show resolved Hide resolved
src/boosting/gbdt_model_text.cpp Outdated Show resolved Hide resolved
src/boosting/gbdt_model_text.cpp Outdated Show resolved Hide resolved
@jameslamb
Copy link
Collaborator Author

@jameslamb

  1. two valgrind errors were caused by tests that used testthat::expect_error(). I've added testthat::skip() statements to these so they will not be a problem on CRAN. Both errors were small memory leaks, not more serious problems like illegal writes, so I think it is ok to skip them for now to get to CRAN, and come back to them later when we have time

Could you please create new issue for this task to not lose it?

added #3462

@jameslamb
Copy link
Collaborator Author

@jameslamb Looks like we have a "chocolate" issue again:

...
-- Check for working C compiler: C:/Rtools/mingw_64/bin/gcc.exe - broken
...
Run Build Command(s):C:/ProgramData/chocolatey/bin/mingw32-make.exe cmTC_48deb/fast && C:/ProgramData/chocolatey/lib/mingw/tools/install/mingw64/bin/mingw32-make  -f CMakeFiles\cmTC_48deb.dir\build.make CMakeFiles/cmTC_48deb.dir/build

I guess make should come from Rtools not from chocolatey...

Can you please point me to a log where you see this? I looked through all the Windows jobs and couldn't find that, maybe I'm missing something.

@StrikerRUS
Copy link
Collaborator

@jameslamb

Can you please point me to a log where you see this? I looked through all the Windows jobs and couldn't find that, maybe I'm missing something.

Sure, I can! Here it is: https://github.com/microsoft/LightGBM/runs/1264985038?check_suite_focus=true.

Also, please take a look at master builds. Looks like we have a false positive for WARNINGS:

Warning: package 'R6' was built under R version 4.0.3

@StrikerRUS
Copy link
Collaborator

Looks like we have a false positive for WARNINGS:

I believe we need the same quick fix: #3225. Can we add that warning message (Warning: package * was built under R version *) into whitelist to not block our development process with each R patch version release?

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameslamb Thanks a lot for addressing review comments! LGTM, except one lint error about line length and some other nits.Feel free to merge after green CI.

.github/workflows/r_valgrind.yml Outdated Show resolved Hide resolved
.github/workflows/r_valgrind.yml Outdated Show resolved Hide resolved
.github/workflows/r_valgrind.yml Outdated Show resolved Hide resolved
R-package/tests/testthat/test_utils.R Outdated Show resolved Hide resolved
@jameslamb
Copy link
Collaborator Author

Looks like we have a false positive for WARNINGS:

I believe we need the same quick fix: #3225. Can we add that warning message (Warning: package * was built under R version *) into whitelist to not block our development process with each R patch version release?

yep I just noticed that too. Will do it in a separate PR right now

@jameslamb
Copy link
Collaborator Author

@jameslamb Looks like we have a "chocolate" issue again:

...
-- Check for working C compiler: C:/Rtools/mingw_64/bin/gcc.exe - broken
...
Run Build Command(s):C:/ProgramData/chocolatey/bin/mingw32-make.exe cmTC_48deb/fast && C:/ProgramData/chocolatey/lib/mingw/tools/install/mingw64/bin/mingw32-make  -f CMakeFiles\cmTC_48deb.dir\build.make CMakeFiles/cmTC_48deb.dir/build

I guess make should come from Rtools not from chocolatey...

I opened #3469 , let's move this discussion there. Thanks for letting me know! I would have missed it.

@jameslamb
Copy link
Collaborator Author

thanks for the help and reviews! I hope that with these changes, we'll be able to get {lightgbm} onto CRAN in November

@jameslamb jameslamb merged commit 81d7611 into master Oct 18, 2020
@jameslamb jameslamb deleted the ci/r-valgrind-test branch October 18, 2020 04:04
@@ -393,7 +393,7 @@ std::string GBDT::SaveModelToString(int start_iteration, int num_iteration, int
ss << loaded_parameter_ << "\n";
ss << "end of parameters" << '\n';
}
return ss.str();
return std::move(ss.str());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guolinke @jameslamb I just noticed that std::move here produces the following compilation warnings:

-- The C compiler identification is AppleClang 9.1.0.9020039
-- The CXX compiler identification is AppleClang 9.1.0.9020039
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /Applications/Xcode_9.4.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /Applications/Xcode_9.4.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -Xclang -fopenmp (found version "3.1") 
-- Found OpenMP_CXX: -Xclang -fopenmp (found version "3.1") 
-- Found OpenMP: TRUE (found version "3.1")  
-- Performing Test MM_PREFETCH
-- Performing Test MM_PREFETCH - Success
-- Using _mm_prefetch
-- Performing Test MM_MALLOC
-- Performing Test MM_MALLOC - Success
-- Using _mm_malloc
-- Configuring done
-- Generating done
-- Build files have been written to: /Users/runner/work/1/s/build
Scanning dependencies of target _lightgbm
[  3%] Building CXX object CMakeFiles/_lightgbm.dir/src/application/application.cpp.o
[  6%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/boosting.cpp.o
[  9%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/gbdt_model_text.cpp.o
[ 12%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/gbdt.cpp.o
/Users/runner/work/1/s/src/boosting/gbdt_model_text.cpp:396:10: warning: moving a temporary object prevents copy elision [-Wpessimizing-move]
  return std::move(ss.str());
         ^
/Users/runner/work/1/s/src/boosting/gbdt_model_text.cpp:396:10: note: remove std::move call here
  return std::move(ss.str());
         ^~~~~~~~~~        ~
/Users/runner/work/1/s/src/boosting/gbdt_model_text.cpp:621:10: warning: moving a local object in a return statement prevents copy elision [-Wpessimizing-move]
  return std::move(feature_importances);
         ^
/Users/runner/work/1/s/src/boosting/gbdt_model_text.cpp:621:10: note: remove std::move call here
  return std::move(feature_importances);
         ^~~~~~~~~~                   ~
[ 15%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/gbdt_prediction.cpp.o
2 warnings generated.
[ 18%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/prediction_early_stop.cpp.o
[ 21%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/bin.cpp.o
[ 24%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/config.cpp.o
[ 27%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/config_auto.cpp.o
[ 30%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/dataset.cpp.o
[ 33%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/dataset_loader.cpp.o
[ 36%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/file_io.cpp.o
[ 39%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/json11.cpp.o
[ 42%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/metadata.cpp.o
[ 45%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/parser.cpp.o
[ 48%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/tree.cpp.o
[ 51%] Building CXX object CMakeFiles/_lightgbm.dir/src/metric/dcg_calculator.cpp.o
[ 54%] Building CXX object CMakeFiles/_lightgbm.dir/src/metric/metric.cpp.o
[ 57%] Building CXX object CMakeFiles/_lightgbm.dir/src/network/ifaddrs_patch.cpp.o
[ 60%] Building CXX object CMakeFiles/_lightgbm.dir/src/network/linker_topo.cpp.o
[ 63%] Building CXX object CMakeFiles/_lightgbm.dir/src/network/linkers_mpi.cpp.o
[ 66%] Building CXX object CMakeFiles/_lightgbm.dir/src/network/linkers_socket.cpp.o
[ 69%] Building CXX object CMakeFiles/_lightgbm.dir/src/network/network.cpp.o
[ 72%] Building CXX object CMakeFiles/_lightgbm.dir/src/objective/objective_function.cpp.o
[ 75%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/cuda_tree_learner.cpp.o
[ 78%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/data_parallel_tree_learner.cpp.o
[ 81%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/feature_parallel_tree_learner.cpp.o
[ 84%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/gpu_tree_learner.cpp.o
[ 87%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/serial_tree_learner.cpp.o
[ 90%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/tree_learner.cpp.o
[ 93%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/voting_parallel_tree_learner.cpp.o
[ 96%] Building CXX object CMakeFiles/_lightgbm.dir/src/c_api.cpp.o
[100%] Linking CXX shared library ../lib_lightgbm.so
[100%] Built target _lightgbm

I wonder is there any way to avoid both valgrind error and compilation warning?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh I'm not sure! Think @guolinke will have to answer this one

Copy link
Collaborator

@guolinke guolinke Oct 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In c++11, the compliers should be able to auto move the "local" object by return call, so it is safe to remove std::move.
However, it seems apple clang doesn't have this optimization...
To fix the warning, we can remove these std::move

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, it seems apple clang doesn't have this optimization...

I also saw these warnings with ordinary Clang on Ubuntu.

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants