Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Webhook storage (fixes #2835) #3000

Merged
merged 21 commits into from
Aug 4, 2020
Merged

Add Webhook storage (fixes #2835) #3000

merged 21 commits into from
Aug 4, 2020

Conversation

jameslamb
Copy link

@jameslamb jameslamb commented Jul 20, 2020

  • adds new tests (if appropriate)
  • add a changelog entry in the changes/ directory (if appropriate)
  • updates docstrings for any new functions or function arguments, including docs/outline.toml for API reference docs (if appropriate)

What does this PR change?

This PR proposes a new type of storage, WebHook. WebHook storage is a type of storage where build() makes an HTTP request to store a flow and get_flow() makes another HTTP request to retrieve it.

This proposal is documented in #2835 and in the Prefect community Slack thread it links to.

Why is this PR important?

#2835 has more detail on why this is valuable, but the basic thread that led to this is as follows:

  1. Because Storage objects have to be deserializable from the JSON sent to Prefect Cloud, users cannot write their own arbitrary Python code: all storage has to use a class in prefect
  2. Many integrations could be generalized as "issue an HTTP request to store the flow, issue another to retrieve it". This includes external services (like DropBox and Box) and custom services written by users or their companies
  3. Creating a storage object that makes those requests would allow users to more easily plug into services that expose "move files over HTTP / HTTTPS", without needing a new Prefect storage class for each of those services

Notes for reviewers

  • In addition to the unit tests here, I added some integration tests to be sure this is working as expected with real services. See https://github.com/jameslamb/webhook-storage-integration-tests for examples of two integrations I wanted to test:
    • storing flows in DropBox
    • storing flows in a service where you have to first store the flow, capture an ID from that response, then later use that ID in get_flow()
  • this implementation will only work in cases where you have long-lived credentials (like an API key) or where it's feasible to get short-lived credentials in an ad hoc way and then do a single flow run manually. It won't support, for example, any authentication that requires exchanging tokens at runtime. I think that's ok for a first version of this.
  • this implementation assumes that a single request is enough to move a flow over the wire and doesn't support multipart upload. I don't think that is too limiting, but I don't have a good idea of distribution of flow sizes...most of the ones I've written have been a handful of tasks and a few kB when serialized
  • this implementation assumes that headers are the only place where sensitive information is stored. Is that a good assumption? While searching around, I didn't find any mainstream services that do things like POST service.whatever/upload?apiKey={apikey}, but I have worked with such services in the past.
    • if reviewers think that's too limiting, I could take the approach of allowing templating in build_kwargs and `get_flow_kwargs
  • the o and t keys on my keyboard have some stuff under them so please keep an eye out for extra letters in my docs 馃槵

Thanks for your time and consideration!

@codecov
Copy link

codecov bot commented Jul 20, 2020

Codecov Report

Merging #3000 into master will increase coverage by 0.05%.
The diff coverage is 98.47%.

@joshmeek
Copy link

@jameslamb just letting you know this has not been forgotten. I will take a deeper look at this sometime tomorrow!

@jameslamb
Copy link
Author

@jameslamb just letting you know this has not been forgotten. I will take a deeper look at this sometime tomorrow!

thanks! no problem

Copy link

@joshmeek joshmeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really really solid! Allow me to give this a true test with an API that I can upload some flows to. In the meantime I actually think this storage type could be expanded to also support the file-based storage options that the other storage types now also support as of the 0.12.5 release 馃

@jameslamb
Copy link
Author

...I actually think this storage type could be expanded to also support the file-based storage options that the other storage types now also support as of the 0.12.5 release

ooooo I haven't looked at file-based storage yet! I found #2840 and #2944 and took a look.

Is this what you're thinking for WebHook?

  • if storage_as_file is True:
    • build_* arguments describe an HTTP request that sends the contents of a .py file over the wire to some place that stores it.
    • get_flow_* arguments describe an HTTP request the retrieves the contents of a .py file. get_flow() executes that request, passes that file content to extract_flow_from_file()

I'm happy to add that if all Storage objects are expected to have it.

@joshmeek
Copy link

joshmeek commented Jul 24, 2020

@jameslamb Yep that is exactly it! It doesn't necessarily need to be added to WebHook but I see no reason why it couldn't handle it 馃檪

@jameslamb
Copy link
Author

cool ok, I can do that!

I just looked and there is actually a text/x-python MIME type for that!

@jameslamb
Copy link
Author

I've added support for stored_as_script in 59bcc06.

I'm proposing this design for file-based storage with Webhook:

  • provide a flow_file_path in the constructor, a path to a .py file with the flow
    • that file path is not serialized, so it's never sent to Prefect Cloud. File names can leak things like usernames and other sensitive information.
  • build() reads in the contents of the file, converts that string to bytes, and attaches it to the request as data
  • get_flow() expects to receive binary data that it .decode()-es into a a string, which is then passed to extract_flow_from_file()

I decided on this approach because, in my experience, most services that read and write files expect binary content, not application/text, text/x-python, etc. If someone asks in the future for support for writing / reading script files with text MIME types, a flag could be added to control whether or not this storage should run .encode() / .decode() on the script contents. For now, I don't think that complexity is worth it.

To test this added stored_as_script support, I added new unit tests (in that same commit) and tested and end-to-end flow build, register, and run by adding a stored_as_script=True test to the integration tests I set up to help on this PR: https://github.com/jameslamb/webhook-storage-integration-tests#script

@jameslamb
Copy link
Author

I just pushed one other small change in 8c0e32d, hope it's ok...I think I prefer Webhook to WebHook (no capital H).

That is the spelling used everywhere I looked, including Wikipedia, GitHub help, Atlassian Help, and IFTTT's docs.

@jameslamb jameslamb changed the title Add WebHook storage (fixes #2835) Add Webhook storage (fixes #2835) Jul 27, 2020
},
},
get_flow_http_method="POST",
build_secret_config={
Copy link

@jcrist jcrist Jul 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this kwarg a bit confusing and limiting (it only works on headers? what if you need to set a query parameter or part of the body? what if you need to modify the secret before transmitting (e.g. adding a Bearer prefix)). I wonder if we might instead support templating in the (recursive) values in get_flow_kwargs/build_flow_kwargs. The semantics might be:

  • Recursively search the values (not keys) of dicts passed to build_kwargs and get_flow_kwargs for strings
  • Replace any template strings found in those values, using environment variables first and falling back to secrets.
storage = Webhook(
    build_kwargs={
        "url": "https://content.dropboxapi.com/2/files/upload",
        "headers": {
            "Content-Type": "application/octet-stream",
            "Dropbox-API-Arg": json.dumps(
                {
                    "path": "/Apps/prefect-test-app/dropbox-flow.flow",
                    "mode": "overwrite",
                    "autorename": False,
                    "strict_conflict": True,
                }
            ),
        },
        "Authorization": "${DBOX_OAUTH2_TOKEN}",
    },
    build_http_method="POST",
    get_flow_kwargs={
        "url": "https://content.dropboxapi.com/2/files/download",
        "headers": {
            "Accept": "application/octet-stream",
            "Dropbox-API-Arg": json.dumps(
                {"path": "/Apps/prefect-test-app/dropbox-flow.flow"}
            ),
        },
    },
    get_flow_http_method="POST",
)

One way of doing this would be to make use of string.Template and a magic mapping to handle dynamically looking up fields. We'd might want to change the regex to drop the $ prefix to make it similar to str.format not (or maybe not? not sure what's clearer) but this works. (note that str.format converts the mapping to a dict before formatting, so we can't use that to dynamically load secrets/environment variables unfortunately).

In [13]: from collections.abc import Mapping

In [14]: class Magic(Mapping):
    ...:     def __getitem__(self, key):
    ...:         print("Could lookup environment variable or secret here")
    ...:         return "hello-world"
    ...:     def __iter__(self):
    ...:         return iter([])
    ...:     def __len__(self):
    ...:         return 0
    ...:

In [15]: magic_map = Magic()

In [16]: template = string.Template("Bearer ${token}")

In [17]: template.substitute(magic_map)
Could lookup environment variable or secret here
Out[17]: 'Bearer hello-world'

Could also use the regex module directly, which might be simpler 馃し.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also find the get_flow and build prefixes on these kwargs a bit off. I know they correspond to the requests for the build and get_flow methods, but without the word request in there build_kwargs looks like kwargs to pass to build to me. Feels too tied to the interface method names and not tied to what the requests are actually doing (storing and loading bytes). Perhaps?

store_request_kwargs=...,
store_request_method=...,
load_request_kwargs=...,
load_request_method=...,

I'd use put and get, except those conflict with the http methods. Not a strong opinion, just a thought.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this was sufficiently complex that I shouldn't make it part of the first pass, but if you think it's necessary I'm happy to add it!

I'm a little worried about free-form templating everything though...that's going to be a problem if you have JSON jammed in a string, like the DropBox API requires (https://github.com/jameslamb/webhook-storage-integration-tests/blob/3bc93bf2ce4b9a0539306045f2f6a82bc3325c53/test-dropbox.py#L47). That opens you up to needing to know how to escape the right }, which doesn't sound fun.

maybe it would be simpler to, instead of templating individual string fields, just allow people to replace the entire value of any build_kwarg or get_flow_kwarg with the value of an environment variable / secret?

what if you need to modify the secret before transmitting (e.g. adding a Bearer prefix))

I did think about this specific case. If your API token's literal value is abc, there's no reason you couldn't put Bearer abc into an environment variable / secret, right? Without needing to have any code run to add Bearer to the front.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels non-intuitive to me to require storing a full authorization header/url/etc... in a secret to make proper use of it. If we keep the $ prefix requirement that string.Template uses, that (I believe) avoids the issue of accidentally templating things that just happen to contain {} characters, since they won't match the regex.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooooo ook! I like that a lot. I'll add that to this PR.

As for the names, I feel that there's value in coupling to the method names actually. build() and get_flow() are important to understand when using a Storage object, I think, and I'd rather couple to those than invent another thing people have to reason about. But I do like adding _request to make that clearer. How do you feel about build_request_*and get_flow_request_*?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I do like adding _request to make that clearer.

Makes sense to me!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I just attempted the templating thing. Awesome suggestion, I like this a lot better than the secret_config approach.

Commit: 937dd84

I also updated my integration tests and ran them to be sure it's working as expected: https://github.com/jameslamb/webhook-storage-integration-tests.

Note for reviewers

I think it could be valuable to offer more explicit control over whether environment variables or Prefect secrets are used, to avoid issues caused by conflicting names.

I think that could be done in a backwards-compatible way in the future, by adding a render_preferences argument that is like {"SOME_VARIABLE": "environment"}, which changes the behavior for rendering ${SOME_VARIABLE} from "env --> secret --> error-if-absent" to "env --> error-if-absent".

I thought that complexity wasn't worth it for a first pass, but I'd be happy to add something like it if reviewers think it's a good idea.

@jameslamb jameslamb requested review from joshmeek and jcrist and removed request for lauralorenz August 3, 2020 04:33
Copy link

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one small tweak and this looks good to me! Thanks for working on this!

src/prefect/environments/storage/webhook.py Outdated Show resolved Hide resolved
Copy link

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, happy to merge on CI pass.

@jcrist jcrist merged commit 549b086 into PrefectHQ:master Aug 4, 2020
@jameslamb
Copy link
Author

Thanks for the thorough reviews and for teaching me the string.Template + magic mapping combination!

@jameslamb jameslamb mentioned this pull request Aug 6, 2020
3 tasks
@jameslamb jameslamb deleted the feat/webhook-storage branch September 15, 2020 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants