X Tutup
The Wayback Machine - https://web.archive.org/web/20200914053145/https://github.com/github/octodns/pull/211
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement, document, and test --cautious mode #211

Open
wants to merge 1 commit into
base: master
from
Open

Conversation

@ross
Copy link
Contributor

ross commented Feb 25, 2018

Normal plan:

(env)otter:octodns ross$ PYTHONPATH=. ./octodns/cmds/sync.py --config-file=config/dev.yaml
2018-02-24T16:18:12  [140736612635584] INFO  Manager __init__: config_file=config/dev.yaml
2018-02-24T16:18:12  [140736612635584] INFO  Manager __init__:   max_workers=4
2018-02-24T16:18:12  [140736612635584] INFO  Manager __init__:   include_meta=True
2018-02-24T16:18:12  [140736612635584] INFO  Manager sync: eligible_zones=[], eligible_targets=[], dry_run=True, force=False, cautious=False
2018-02-24T16:18:12  [140736612635584] INFO  Manager sync:   zone=githubtest.net.
2018-02-24T16:18:12  [140736612635584] INFO  Manager sync:   sources=['config'] -> targets=['route53']
2018-02-24T16:18:12  [123145418510336] INFO  YamlProvider[config] populate:   found 17 records
2018-02-24T16:18:12  [123145418510336] INFO  Route53Provider[route53] plan: desired=githubtest.net.
2018-02-24T16:18:13  [123145418510336] INFO  Route53Provider[route53] populate:   found 18 records
2018-02-24T16:18:13  [123145418510336] INFO  Route53Provider[route53] plan:   Creates=1, Updates=1, Deletes=1, Existing Records=18
2018-02-24T16:18:13  [140736612635584] INFO  Manager
********************************************************************************
* githubtest.net.
********************************************************************************
* route53 (Route53Provider)
*   Update
*     <ARecord A 1, test.githubtest.net., ['1.1.1.1', '1.1.1.2', '1.1.1.4']> ->
*     <ARecord A 600, test.githubtest.net., ['1.1.1.1', '1.1.1.2', '1.1.1.3']> (config)
*   Delete <ARecord A 3600, cx.githubtest.net., ['1.2.3.4']>
*   Create <ARecord A 3600, cx2.githubtest.net., ['1.2.3.4']> (config)
*   Summary: Creates=1, Updates=1, Deletes=1, Existing Records=18
********************************************************************************

Cautious plan

(env)otter:octodns ross$ PYTHONPATH=. ./octodns/cmds/sync.py --config-file=config/dev.yaml --cautious
2018-02-24T16:18:29  [140736612635584] INFO  Manager __init__: config_file=config/dev.yaml
2018-02-24T16:18:29  [140736612635584] INFO  Manager __init__:   max_workers=4
2018-02-24T16:18:29  [140736612635584] INFO  Manager __init__:   include_meta=True
2018-02-24T16:18:29  [140736612635584] INFO  Manager sync: eligible_zones=[], eligible_targets=[], dry_run=True, force=False, cautious=True
2018-02-24T16:18:29  [140736612635584] INFO  Manager sync:   zone=githubtest.net.
2018-02-24T16:18:29  [140736612635584] INFO  Manager sync:   sources=['config'] -> targets=['route53']
2018-02-24T16:18:29  [123145376927744] INFO  YamlProvider[config] populate:   found 17 records
2018-02-24T16:18:29  [123145376927744] INFO  Route53Provider[route53] plan: desired=githubtest.net.
2018-02-24T16:18:30  [123145376927744] INFO  Route53Provider[route53] populate:   found 18 records
2018-02-24T16:18:30  [123145376927744] INFO  Route53Provider[route53] plan:   Creates=1, Updates=1, Deletes=1, Existing Records=18
2018-02-24T16:18:30  [140736612635584] INFO  Manager
********************************************************************************
* githubtest.net.
********************************************************************************
* route53 (Route53Provider)
*   Update
*     <ARecord A 1, test.githubtest.net., ['1.1.1.1', '1.1.1.2', '1.1.1.4']> ->
*     <ARecord A 60, test.githubtest.net., ['1.1.1.1', '1.1.1.2', '1.1.1.3']> (config)
*   Delete <ARecord A 3600, cx.githubtest.net., ['1.2.3.4']>
*   Create <ARecord A 60, cx2.githubtest.net., ['1.2.3.4']> (config)
*   Summary: Creates=1, Updates=1, Deletes=1, Existing Records=18
********************************************************************************

This automates something we've done a couple times now to "cautiously" roll out changes in a way where we have a quick path to revert. We've been doing this with two rounds of changes were we manually set the TTLs on the 😨 bits to 60s in the first round and then to the real value we want in the subsequent. This just builds that concept into octoDNS itself.

It seems useful, but feedback/thoughts solicated.

/cc @theojulienne @joewilliams @yzguy @vanbroup

@ross ross self-assigned this Feb 25, 2018
@ross ross requested review from joewilliams and theojulienne Feb 25, 2018
@yzguy
Copy link
Collaborator

yzguy commented Feb 25, 2018

I believe this would be a helpful feature for sure. Some thoughts.

I think there's two workflow for changes:

  1. Lower TTL, change value, wait time enough for ~all clients to be using new TTL + value.

Workflow 2 is the one that the "cautious" rollout works for, this would be a smaller percentage of clients taking the change over a longer time period, however a quicker rollback for the clients that took the change already, resulting in lower impact.

  1. Lower TTL, wait time enough for ~all clients to be using new TTL, change value

Workflow 1 is common, but fits the two round rollout rather than a "cautious" rollout. It gives you a shorter time to convergence across all clients and quick rollback. However at the same time, a larger percentage of clients are taking the change in a shorter amount of time, so if it's a bad change then the impact is bigger.

One thing I've experienced that comes to mind that caused a big issue using Terraform, is that when you do "cautious", and apply, you have changed the state on the provider, but you haven't changed the state locally. This is similar to making changes outside of Terraform or making changes and applying, but not pushing up your changes.

Two cases I can think of that would become an issue:

  1. You have pushed up the changes, someone else goes to make a change, they apply your "normal record", which removes the "cautious record", effectively increasing the time it takes to rollback should you need to.

  2. You don't push up the changes, someone else goes to make a change, and deletes/reverts your record because they don't have your changes, effectively causing an outage in the deletion case and an unknown recovery time as resolver behavior around NXDOMAIN records differs. In the revert case, the results vary, but there's some risk of impact.

I know in the case of Github, you're pushing up changes, there's a noop process on a branch, review, and then a deploy of the branch, then merge. In the case of the cautious, there is a period where the branch does not reflect the true state. If someone else made a branch, did a noop, I guess the "removing cautious" would hopefully be caught during review, before their deploy went out. However, if you're banking on having some time before your record was updated to "normal" to protect your rollback, you effectively have to halt all DNS deploys until you've either rolled back or went forward with bringing it to "normal". That period could be a while, especially if you're waiting for traffic to shift, and even then you could have a issue that only comes up when you're under load, which you'd definitely want the ability to rollback.

I think as far as the feature, it's definitely a great addition to OctoDNS, the above points I believe are process problems, not a tool problem. Just curious if you have any thoughts/have run into any issues like that if you've been using the cautious feature internally for some time.

I'll pull the branch down in the morning and give it a try :)

@vanbroup
Copy link
Contributor

vanbroup commented Feb 26, 2018

Thanks @ross, this is a great feature!

I agree with @yzguy that it comes with a workflow challenge. I was thinking that an intermediate branch could push a "cautious" state and the master branch would set the normal state. But this would require another (manual) merge and you don't want the records in a "cautious" state for too long either.

def make_cautious(self):
for change in self.changes:
if change.new:
change.new.ttl = 60

This comment has been minimized.

@joewilliams

joewilliams Feb 26, 2018

Member

My only nit, could the TTL be a global or maybe configurable?

@ross
Copy link
Contributor Author

ross commented Mar 3, 2018

Lower TTL, change value, wait time enough for ~all clients to be using new TTL + value.

This seems like it describes the use-case this was intended for. Changes are rolled out with a short TTL so that if something goes wrong and you revert the reversion will take effect quickly, thus "cautious."

Lower TTL, wait time enough for ~all clients to be using new TTL, change value

I think I'd call this one aggressive more than cautious. Lowering the TTL before making a change would cause all the clients to pick it up quickly so if there was a problem, everyone would see it before you'd have a chance to revert. You could at least still revert quickly assuming you left the lower TTL in place when the value change happened.

Two cases I can think of that would become an issue:

You have pushed up the changes, someone else goes to make a change, they apply your "normal record", which removes the "cautious record", effectively increasing the time it takes to rollback should you need to.

...

Our deployment system ensures that only one branch deploy is active at a time. In general i think there are a lot of problems you could run into without "doit" locking of some sort. Terraform's serial numbers and saved plans style setup could help with that, but that was a direction we consciously didn't end up taking to avoid the need for state of some sort (and the extra complexity.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.
X Tutup