
Downtime-free deploy: second modification of crash-looping bundle is ignored #294

Open
ulidtko opened this issue Jul 19, 2024 · 7 comments

ulidtko commented Jul 19, 2024

Steps to reproduce (a scripted version of these steps follows below):

  • Have an `app.keter` in `incoming/` up and running.
  • Without stopping it, deploy a bogus version of `app.keter` that crashes immediately on start.
    • Keter keeps the old version running, as expected;
    • the new bundle gets unpacked under `temp/` and goes into a crash loop;
    • keter logs `Process restarting too quickly, waiting before trying again...` and then restarts it ~20 ms later anyway, spamming monitoring 😂
  • Now "fix" `app.keter` by removing the crash, and deploy the bundle to `incoming/` again.
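To make the sequence concrete, here is a minimal sketch of these steps as a script. It assumes keter's default `/opt/keter` layout, and the source bundle names (`good.keter`, `bad.keter`, `fixed.keter`) are placeholders for your own builds:

```python
# Hypothetical repro script for the steps above. Assumes keter's default
# /opt/keter layout; good/bad/fixed .keter bundles are placeholders.
import shutil
import time

INCOMING = "/opt/keter/incoming/app.keter"  # deploying = copying here

shutil.copy("good.keter", INCOMING)   # 1. healthy version, up & running
time.sleep(30)                        # let it start and serve traffic

shutil.copy("bad.keter", INCOMING)    # 2. bogus version: crashes on start
time.sleep(30)                        # old process keeps serving; new one crash-loops

shutil.copy("fixed.keter", INCOMING)  # 3. fixed version: this second
                                      #    modification is silently ignored
```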

Expected: the second modification of `app.keter` gets unpacked under `temp/` again and starts from there as usual; once fully started, it replaces the previous process.

Actual: keter just logs `Watched file modified` and nothing happens.

I managed to stop the alert flood with `rm -rf temp/app-1` and `touch incoming/app.keter`, but that's it: `temp/app-0` keeps running, `temp/app-2` is unpacked, but nothing gets launched from there. I will need to arrange a maintenance window now, just to be able to restart keter 🥲

This is on version 2.0.1 (vendored slightly), but I'm pretty sure the bug also exists in latest upstream.

ulidtko commented Jul 19, 2024

... I should add that this actually breaks the running app.

As it turns out,

> temp/app-0 keeps running

it does, yet proxying stops:

`2024-07-19 18:23:58.71: Caught a proxy exception --[ NoResponseDataReceived ]-- on Request {requestMethod = "OPTIONS", [...]`

(where the log comes from this vendor patch)

jappeace commented

What is the bogus version of `app.keter`? How does it crash? Is it just not a tarball or something?

ulidtko commented Jul 23, 2024

@jappeace I'll try to minimize this down to a self-contained repro case.

ulidtko commented Jul 23, 2024

So, to see the idea, @jappeace, check this sequence of three `dummy-app.keter` versions: dummy-app-keter#294.zip

  • `dummy-app-v0.keter` "works" (it's a substitute dummy, but it runs and serves requests);
  • `dummy-app-v1.keter` is "bogus": it's broken and crashes on start;
  • `dummy-app-v2.keter` is "fixed" and works again.

The expected result: with v0 running, deploying v1 keeps v0 running and serving requests; then deploying v2 starts v2 and, once its startup is complete, replaces the v0 process with v2. In the production scenario, that did not happen.


Edit: the bogus version crashes via an intentional `sys.exit(1)`, as a test. Please check the zip I attached; a sketch of the entrypoints follows below.
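For reference, here is a minimal sketch of what the v0/v1 entrypoints look like. It is a reconstruction assuming the dummy is a plain Python HTTP server, not the exact contents of the zip; keter passes the listen port to webapp stanzas via the `PORT` environment variable.

```python
# Sketch of the dummy-app entrypoint (reconstruction, not the exact zip
# contents). dummy-app-v1.keter differs only by the intentional
# sys.exit(1) before serving; dummy-app-v2.keter removes it again.
import os
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer

BOGUS = False  # set to True to build the crashing v1 bundle

if BOGUS:
    sys.exit(1)  # v1: crash immediately on start, as a test

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"dummy-app alive\n")

# keter tells webapp stanzas which port to bind via the PORT env var
port = int(os.environ["PORT"])
HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```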

ulidtko commented Jul 24, 2024

Basically, I saw a regression of #64 (on version 2.0.1, patched). Users there described the exact same scenario and behavior.

However, on current master I can't reproduce even one successful reload:

  • start `dummy.keter`
  • `touch incoming/dummy.keter`
  • -> `Error occured when launching bundle "dummy": openFile: resource busy (file is locked)`

ulidtko commented Jul 24, 2024

OK, so that issue has been identified; it's a separate bug, filed as #296.

Back to the suspected regression of #64 here.

When retested on master 6b7f1e4 with `rotate-logs: False` and a "broken" bundle that fails immediately (without a 10 s startup delay), there's some inconsistency, perhaps a race condition.

After the roughly 20 restart attempts (which take about 2 minutes) and the `ensureAlive failed` log line, keter gives up on trying to launch the bundle. This is not immediately obvious from the logs (on the spot, they look like a regular eternal crash loop). Proxying to the previous running version continues the whole time.

One inconsistency is the following:

  • Fresh bundle updates are ignored until the crash loop backs off and the long `ensureAlive failed` message is logged.
    After that, the accumulated incoming updates do get picked up and deployed.
    It just takes 2 minutes (or more, depending on bundle size and server load).

What I think would resolve this point of confusion nicely is a clear indication in the logs of this "circuit-breaker" restart counter, something like `Reactivating app foobar [restart 5/18]` (sketched below).
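Keter itself is Haskell, so purely as an illustration of the proposed log line, here is a Python-flavored sketch of a supervision loop with a visible restart counter; the limit of 18 attempts, the back-off delay, and the exact message wording are all assumptions, not keter's actual code.

```python
# Illustration of the proposed "circuit-breaker" log line; NOT keter's
# actual (Haskell) implementation. The attempt limit and delay are
# assumptions chosen to mimic the ~2-minute crash loop described above.
import logging
import time

MAX_RESTARTS = 18

def keep_alive(app_name: str, launch) -> bool:
    for attempt in range(1, MAX_RESTARTS + 1):
        # Make the counter visible, so a crash loop that is about to
        # back off looks different from an eternal one in the logs.
        logging.info("Reactivating app %s [restart %d/%d]",
                     app_name, attempt, MAX_RESTARTS)
        if launch():   # launch() = start the bundle and check it stays alive
            return True
        time.sleep(6)  # pause between attempts
    logging.error("ensureAlive failed: giving up on app %s", app_name)
    return False
```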

I'll propose a patch, sometime next week at the latest.

jappeace commented

I'll see if I can help you over the weekend.
