Generic node executions are not cached by Kubeflow Pipelines #2717

Open
ptitzler opened this issue May 6, 2022 · 3 comments
Labels
inactive:deferred platform: pipeline-Kubeflow Related to usage of Kubeflow Pipelines as pipeline runtime status:Needs Discussion

Comments

@ptitzler
Member

ptitzler commented May 6, 2022

Describe the issue
While investigating options for supporting partial pipeline execution, I noticed that nodes implemented using generic components are not cached by Kubeflow Pipelines. This is an unintentional side effect of an environment variable that we pass to the container. In a nutshell, this variable contains a unique run id, which is constant across all nodes in the same pipeline run but differs from one run to the next, so KFP never sees a node as identical to one it has already executed.
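
To illustrate the effect, here is a minimal sketch in plain Python. It assumes the cache fingerprint is a hash over the container specification, including its environment variables. The `cache_key` helper and the `RUN_ID` variable name are hypothetical stand-ins, not KFP's or Elyra's actual code.

```python
import hashlib
import json


def cache_key(image, command, env):
    """Hash a simplified container spec (hypothetical stand-in for a cache fingerprint)."""
    spec = {"image": image, "command": command, "env": dict(sorted(env.items()))}
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode("utf-8")).hexdigest()


# Two runs of the same node; only the (hypothetical) RUN_ID variable differs.
run_1 = cache_key("python:3.9", ["python", "process.py"], {"RUN_ID": "run-0001"})
run_2 = cache_key("python:3.9", ["python", "process.py"], {"RUN_ID": "run-0002"})

# The differing variable value alone is enough to change the key.
cache_hit = run_1 == run_2
print("cache hit" if cache_hit else "cache miss")
```

Under that assumption, every run produces a new key for every generic node, so a previously cached result is never reused.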

Caching can have significant benefits with respect to resource usage and performance. Imagine, if you will, a notebook that downloads a data set archive, extracts it, and performs some processing. If the archive content doesn't change at the source, downloading and processing it again is completely unnecessary, because the produced outputs are identical across runs.

I've created this issue to discuss how we should deal with caching in general:

  • Should the problematic variable be removed or somehow encapsulated to hide its unique value from KFP? If we did that, caching would work as expected, assuming no other constraints are violated that would result in a cache miss.
  • Should we give users the ability to disable caching for any generic or custom node, if the runtime environment supports this? In certain scenarios, forcing a re-run of a component might be desirable. Going back to the earlier download/extract example: if the archive content had changed since it was first downloaded and nothing else had changed, KFP would not re-run the node. As a result, any downstream node would operate on the old output, which doesn't reflect the current state of the data set archive.
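
The two options above could be modeled roughly as follows. This is a plain-Python sketch of the intended behavior only, not KFP or Elyra code. The names `VOLATILE_ENV_VARS`, `node_fingerprint`, `RUN_ID`, `DATASET_URL`, and the `cache_enabled` flag are all hypothetical.

```python
import hashlib
import json

# Hypothetical: variables whose values are unique per run and should not affect caching.
VOLATILE_ENV_VARS = {"RUN_ID"}


def node_fingerprint(image, command, env, cache_enabled=True):
    """Return a cache fingerprint for a node, or None if caching is disabled for it."""
    if not cache_enabled:
        # Option 2: the user opted out, so the caller should always execute the node.
        return None
    # Option 1: hide per-run values so they no longer influence the fingerprint.
    stable_env = {k: v for k, v in sorted(env.items()) if k not in VOLATILE_ENV_VARS}
    spec = {"image": image, "command": command, "env": stable_env}
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode("utf-8")).hexdigest()


image, command = "python:3.9", ["python", "download.py"]
url = "https://example.com/archive.zip"  # placeholder URL

first = node_fingerprint(image, command, {"RUN_ID": "run-0001", "DATASET_URL": url})
second = node_fingerprint(image, command, {"RUN_ID": "run-0002", "DATASET_URL": url})
forced = node_fingerprint(image, command, {"RUN_ID": "run-0003", "DATASET_URL": url},
                          cache_enabled=False)

print(first == second)  # the run id no longer causes a miss
print(forced is None)   # the node is always re-run
```

Note that the first option alone would reproduce the stale-archive problem described in the second bullet: the fingerprint stays the same even if the remote archive changes, which is why a per-node opt-out may still be needed.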
@ptitzler ptitzler added status:Needs Discussion status:Needs Triage platform: pipeline-Kubeflow Related to usage of Kubeflow Pipelines as pipeline runtime labels May 6, 2022
@ptitzler
Member Author

Did some more digging but haven't been able to identify a straightforward solution that would provide a unique run id without losing the benefits of node output caching. Since no users have reported that the lack of caching support for generic components poses an issue, no action will be taken at this time.

@akfmdl

akfmdl commented Aug 2, 2023

I have the same issue!

@akfmdl

akfmdl commented Aug 6, 2023

I think this issue occurs when using the Kubeflow pipeline editor.
