
Downtime-free deploy: second modification of crash-looping bundle is ignored #294

Open
ulidtko opened this issue Jul 19, 2024 · 7 comments

ulidtko commented Jul 19, 2024

Steps to reproduce (a scripted version of these steps follows below):

  • Have an `app.keter` in `incoming/` up and running.
  • Without stopping it, deploy a bogus version of `app.keter` that crashes immediately on start.
    • Keter keeps the old version running, as expected;
    • the new bundle gets unpacked under `temp/` and goes into a crash loop;
    • keter logs `Process restarting too quickly, waiting before trying again...` and then restarts it ~20 ms later anyway, spamming monitoring 😂
  • Now "fix" `app.keter` by removing the crash, and deploy the bundle to `incoming/` again.
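To make the sequence concrete, here is a minimal sketch of these steps as a script. It assumes keter's default `/opt/keter` layout, and the source bundle names (`good.keter`, `bad.keter`, `fixed.keter`) are placeholders for your own builds:

```python
# Hypothetical repro script for the steps above. Assumes keter's default
# /opt/keter layout; good/bad/fixed .keter bundles are placeholders.
import shutil
import time

INCOMING = "/opt/keter/incoming/app.keter"  # deploying = copying here

shutil.copy("good.keter", INCOMING)   # 1. healthy version, up & running
time.sleep(30)                        # let it start and serve traffic

shutil.copy("bad.keter", INCOMING)    # 2. bogus version: crashes on start
time.sleep(30)                        # old process keeps serving; new one crash-loops

shutil.copy("fixed.keter", INCOMING)  # 3. fixed version: this second
                                      #    modification is silently ignored
```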

Expected: the second modification of `app.keter` gets unpacked under `temp/` again and starts from there as usual; once fully started, it replaces the previous process.

Actual: keter just logs `Watched file modified` and nothing happens.

I managed to stop the alert flood with `rm -rf temp/app-1` and `touch incoming/app.keter`, but that's it: `temp/app-0` keeps running, `temp/app-2` is unpacked, but nothing gets launched from there. I will need to arrange a maintenance window now, just to be able to restart keter 🥲

This is on version 2.0.1 (vendored slightly), but I'm pretty sure the bug also exists in latest upstream.

ulidtko commented Jul 19, 2024

... I should add that this actually breaks the running app.

As it turns out,

> temp/app-0 keeps running

it does, yet proxying stops:

`2024-07-19 18:23:58.71: Caught a proxy exception --[ NoResponseDataReceived ]-- on Request {requestMethod = "OPTIONS", [...]`

(where the log comes from this vendor patch)

jappeace commented

What is the bogus version of `app.keter`? How does it crash? Is it just not a tarball or something?

ulidtko commented Jul 23, 2024

@jappeace I'll try to minimize this down to a self-contained repro case.

ulidtko commented Jul 23, 2024

So, to see the idea, @jappeace, check this sequence of three `dummy-app.keter` versions: dummy-app-keter#294.zip

  • `dummy-app-v0.keter` "works" (it's a substitute dummy, but it runs and serves requests);
  • `dummy-app-v1.keter` is "bogus": it's broken and crashes on start;
  • `dummy-app-v2.keter` is "fixed" and works again.

The expected result: with v0 running, deploying v1 keeps v0 running and serving requests; then deploying v2 starts v2 and, once its startup is complete, replaces the v0 process with v2. In the production scenario, that did not happen.


Edit: the bogus version crashes via an intentional `sys.exit(1)`, as a test. Please check the zip I attached; a sketch of the entrypoints follows below.
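For reference, here is a minimal sketch of what the v0/v1 entrypoints look like. It is a reconstruction assuming the dummy is a plain Python HTTP server, not the exact contents of the zip; keter passes the listen port to webapp stanzas via the `PORT` environment variable.

```python
# Sketch of the dummy-app entrypoint (reconstruction, not the exact zip
# contents). dummy-app-v1.keter differs only by the intentional
# sys.exit(1) before serving; dummy-app-v2.keter removes it again.
import os
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

BOGUS = False  # set to True to build the crashing v1 bundle

if BOGUS:
    sys.exit(1)  # v1: crash immediately on start, as a test

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"dummy-app alive\n")

# keter tells webapp stanzas which port to bind via the PORT env var
port = int(os.environ["PORT"])
HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```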

ulidtko commented Jul 24, 2024

Basically, I saw a regression of #64 (on version 2.0.1, patched). Users there described the exact same scenario and behavior.

However, on current master I can't reproduce even one successful reload:

  • start `dummy.keter`
  • `touch incoming/dummy.keter`
  • -> `Error occured when launching bundle "dummy": openFile: resource busy (file is locked)`

ulidtko commented Jul 24, 2024

OK, so that issue has been identified; it's a separate bug, filed as #296.

Back to the suspected regression of #64 here.

When retested on master 6b7f1e4 with `rotate-logs: False` and a "broken" bundle that fails immediately (without a 10 s startup delay), there's some inconsistency, perhaps a race condition.

After the roughly 20 restart attempts (which take about 2 minutes) and the `ensureAlive failed` log line, keter gives up on trying to launch the bundle. This is not immediately obvious from the logs (on the spot, they look like a regular eternal crash loop). Proxying to the previous running version continues the whole time.

One inconsistency is the following:

  • Fresh bundle updates are ignored until the crash loop backs off and the long `ensureAlive failed` message is logged.
    After that, the accumulated incoming updates do get picked up and deployed.
    It just takes 2 minutes (or more, depending on bundle size and server load).

What I think would resolve this point of confusion nicely is a clear indication in the logs of this "circuit-breaker" restart counter, something like `Reactivating app foobar [restart 5/18]` (sketched below).
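Keter itself is Haskell, so purely as an illustration of the proposed log line, here is a Python-flavored sketch of a supervision loop with a visible restart counter; the limit of 18 attempts, the back-off delay, and the exact message wording are all assumptions, not keter's actual code.

```python
# Illustration of the proposed "circuit-breaker" log line; NOT keter's
# actual (Haskell) implementation. The attempt limit and delay are
# assumptions chosen to mimic the ~2-minute crash loop described above.
import logging
import time

MAX_RESTARTS = 18

def keep_alive(app_name: str, launch) -> bool:
    for attempt in range(1, MAX_RESTARTS + 1):
        # Make the counter visible, so a crash loop that is about to
        # back off looks different from an eternal one in the logs.
        logging.info("Reactivating app %s [restart %d/%d]",
                     app_name, attempt, MAX_RESTARTS)
        if launch():   # launch() = start the bundle and check it stays alive
            return True
        time.sleep(6)  # pause between attempts
    logging.error("ensureAlive failed: giving up on app %s", app_name)
    return False
```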

I'll propose a patch, sometime next week at the latest.

jappeace commented

I'll see if I can help you over the weekend.
