Corrupted metadata JSON files caused by bug #297 #303
Comments
If you're in a hurry, you can make a build yourself with this:
|
Curious what evidence you have that this and #297 are related -- how do you know that superfluous ARI requests are corrupting files? |
I’m using certmagic as a library for a web server that serves front-end files and supports custom domains. When the error occurred, I checked the logs and file modification times and found that an extra ‘ Prior to the error, I was already preparing to migrate from NFS to S3 and implemented a custom storage to access S3. After the error, I expedited the migration process. Since S3 writes are atomic, it prevented the issue, even though redundant ARI requests still remain. |
NFS has known bugs related to synchronization, that might be the actual problem. S3 does not provide atomic operations for us to be able to safely offer synchronization, even if writes are synced. I recommend using a database like MySQL/Postgres/Redis for high concurrency distributed storage. |
Moving another discussion with @Zenexer into here:
Originally posted by @Zenexer in #297 Anyway, @lqs, from what you're saying:
This actually checks out with known issues with both of those storage backends (as noted just above). NFS has sync/flush issues when it comes to concurrent users over a network; and S3 doesn't provide atomic operations, so proper locking/syncing of an operation like an ARI request is impossible. @Zenexer, are you also using NFS perchance? |
I spent about a dozen hours debugging this yesterday, and I believe my initial comment was incorrect: rather than the bug persisting, I believe I just hadn't sufficiently cleaned all of the existing corrupt files. There were situations in which there were two trailing bytes at the end of a file ( I am using NFS, but it does appear to support locking correctly with my current mount options--or, at least, in a way that is compatible with this patch. I'm not a huge fan of NFS and generally don't trust it, but it should work with this lock/write pattern. I doubt it would ever make sense to officially support NFS given how fickle it is, but the locking code in certmagic is straightforward enough that I should be able to debug and patch it if there are further issues. The one thing that still has me a little worried is a disconcerting number of requests to the on-demand |
That's good to hear! Yours was the only feedback so far that it didn't fix the issue, so it's reassuring that it was an oversight. |
That's a relief, thanks for the follow-up.
The ask endpoint can be busy... we could potentially ease this with a bloom filter or something, that we just reset every 5 or 10 minutes (or something like that). But ideally I'd rather the ask endpoint itself do the caching since it knows better logic than we can guess. I'd be curious if the ARI log entries are redundant (same hostname) or not. I really want that to be fixed (AFAIK it should be already). |
I would assume that any Caddy user with that sort of traffic probably has caching on their ask endpoint anyway and can keep it fast. From my perspective, though, I'd like to be able to log exactly when I've told Caddy it was authorized to go out and request a certificate: having that logging on my application helps me with troubleshooting, since I can use that to determine where in the stack a problem is arising that might be leading to excessive certificate requests. As it currently stands, I don't know whether an
I'll try to figure that out, but Caddy is spitting out gigabytes of log data with on-demand TLS enabled, so I'm still trying to sort out what's important and what's not. |
I think most of the log entries are the result of various hosting providers and registrars trying to request or renew certificates for domains that no longer point to them, with no checking on their end prior to starting the ACME challenge process. That makes it really difficult to tell the difference between legitimate ACME-related error messages and errors that I can safely ignore. (Ugh, I really wish CAs wouldn't waive rate limiting on challenge failures for large integrators--it hurts everyone.) I don't think that should affect ARI log messages, so I'll let |
I'm not seeing any overlap between ARI requests so far. Each one is unique. |
To make sure I understand, you want a way for the 'ask' request to distinguish whether a certificate is being obtained or something else? The only times the 'ask' endpoint is invoked are currently when a certificate needs to be obtained or renewed. It does not guard ARI requests or other maintenance, per-se, though in theory it should be guarding them implicitly, because if you cannot obtain or renew a cert, you cannot maintain it either. Technically, 'ask' is invoked before even trying to load a certificate from storage (as that can be expensive depending on the storage backend). So I guess, to your request, I would say that it shouldn't matter, but I'm open to discussing this further if desired.
I was one who advocated for exemptions to the rate limits when conforming to ARI, out of concerns that certificate renewals would be rejected -- sometimes past their expiration -- on account of rate limits, even though it was the CA who specified the renewal window. So to ensure certificates can be renewed even if they have to be squished into a narrow window, Let's Encrypt (rightly) exempts clients from rate limits in that situation. Why do you think it hurts everyone? Is Caddy attempting to renew lots of certificates for you and failing?
That's good, so it sounds like the synchronization is working. 👍 Thanks for checking on that. |
Yes, mostly for debugging purposes. If I see that multiple Caddy instances are all trying to get certs at the same time, that's a sign something is amiss. It would likely help when troubleshooting future concurrency issues, but a lightweight plugin could probably serve the same purpose. I'm still not 100% confident the remaining errors I'm seeing are benign, but I'll have more data over the next few days. |
Sorry, I realized I forgot to answer this question. I don't have an opinion on the scenario you mentioned. What appears to be happening is twofold:
Access logs have since shown that the second issue accounts for the majority of the "no challenge data found" warnings I was seeing. Disabling HTTP-01 mostly resolved that. The first issue is far more problematic: I can't stop other hosting providers from requesting certs, and they're wasting PKI resources. It also wastes my time because they cause concerning log entries. These large hosting providers don't really have any incentive to check whether hosts point to them before requesting a cert; they just offload that to the CA. Meanwhile, I have to check that a domain points to me--and only to me--when Caddy hits the |
@Zenexer It sounds like your ask endpoint needs to check to make sure you are expecting to be getting a certificate for those domain names.
Not exactly; why not check your database (or whatever is relevant to your application/service) to see if you should be expecting to maintain a certificate for a hostname? That's the purpose of the ask endpoint and it should resolve the rate limit problems, yeah? |
It already does that. I am expecting to get certificates for those domains. Scenario 1: I control Scenario 2: I control
I do. However, I'm running a free service to which non-technical users can point their domains. They might use my service for a while, decide they don't like it, and point their domain elsewhere. I can't trust that the users are going to maintain a list of domains that point to me, so I do have to validate the A/AAAA records before requesting a certificate. The result gets cached in Redis for a while, so subsequent calls to the ask endpoint are very fast. |
I think to simplify this, what @Zenexer is stating is that a third party actor can initiate a request to I believe for security reasons caddy already needs to keep track of the list of tokens it has generated through applying for ACME challenges. Checking for a string in a list of strings is probably significantly faster to do first, before sending a request to the Would it make sense to flip the order of operations here? Caddy does the initial sanity check (e.g. "I don't even know what this token is, trash it?") I guess if you have a very slow storage driver, then maybe that operation will be slower... but maybe my naive view is that the storage driver is likely to be faster than the Beyond that there's also the problem that |
It does, but that's kept in the storage. What Matt is saying is that in some setups, the storage lookup is more expensive than the Caddy can be run in a cluster, so it must use the storage to see whether another Caddy instance initiated issuance. It can't rely on an in-memory cache. The
That's what it was originally until about a year ago, storage was checked first, but that was bad for some users. |
That's impossible. It has to do DNS lookups at any sort of scale, despite the documentation. There's just no use case for on-demand TLS that doesn't necessitate frequently double-checking DNS. Calling it incorrect isn't helpful. |
You should not be doing DNS checks. That doesn't make sense. It's not your |
Not really, this fully depends on the use case defined here imo. In your use case, you're going to need to do a DNS lookup, but that's not necessarily universally applicable.
Yeah that makes sense tbh. Computers suck. Maybe the ability to choose at which stage |
I'm having a hard time envisioning common use cases for on-demand TLS in which the operator of Caddy has exclusive ownership and control over all of the domain names pointed to it. The obvious use case seems to be a hosting provider or integrator. They need to regularly verify that any domains provided by end users actually, truly point to them before requesting certificates. Caching that information for too long is risky, especially since DNS is prone to misconfiguration. |
Company setting where they have one global load balancer, but a secondary system checks that a domain is owned by the company before being put into a given database. This is different from your use case where you let arbitrary users point their domains to your infrastructure with no registration/signup process (e.g. you don't know what's even linked to you until you get a request!). Both of these use cases should work IMO. |
@Zenexer What you should do is have your customers register their domain via your app's settings, and you add it to your DB allow list. Then all the |
@Zenexer’s use case allows people to point their domains to their infrastructure without any registration, etc. From what I’m understanding, there is no registration or direct user involvement. The only involvement is going to DNS and changing name servers or A records. |
That's the erroneous assumption that leads to issue 1. Hosting providers assume that just because a domain is registered with them, they will pass DV. I don't make that assumption. Hosting providers often do, which is why we're here. Caddy sees ACME challenges that were started by other hosting providers who don't bother to check DNS before requesting a cert. |
I don't understand. Your |
This happens dozens of times per second. I'm Hosting Provider B in this scenario. If I rely on my database, I become Hosting Provider A. |
There is no external database to check in this circumstance. There is no UI. There is no app. There is a page with instructions: if you want to park your domain with ${whatever}, please point your A and AAAA records to ${whatever}. The database is DNS. The answer here may be that this is not a supported use case of caddy, but imo this can very easily be a supported use case with slight modifications. |
Just catching up after feeding the baby and running some kids around... sorry! I've been drafting this reply while several new replies have come in -- I wish GitHub would show that someone was replying or at least show the new replies. I feel like the conversation went off-track and got confused by some things, but maybe I did instead. In any case, here's my attempt to bring it back: Thanks for clarifying above. It seems to me you have an extraordinary situation that is not common from what I know of existing large-scale CertMagic deployments. That's not bad, just something that is worthy of discussion/understanding.
So, in this case, the hosting provider would fail their own ACME challenge, but your server would likely get pinged with a TLS handshake or HTTP request in an attempt to solve the challenge. The HTTP-01 challenge does not use TLS, so those would not issue a certificate. You'd see junk in your logs, but what's new (it's the Internet). The TLS-ALPN-01 challenge does use TLS, but with a special ALPN value. When it sees a handshake of this sort, it only follows a special code path that serves the challenge solution certificate (if it doesn't find one it just returns an error and aborts the handshake). Neither case will initiate an ACME challenge that you end up failing and getting rate limited for. (If they are, that's a bug I'd like more details on, likely in a separate issue.)
The ask endpoint is only invoked if the client tries to establish a TLS handshake using a domain name it does not have a certificate for (and isn't itself a challenge handshake). But this endpoint is for the HTTP-01 challenge, which is HTTP-only. Are the servers accessing this plaintext endpoint over HTTPS? That would be the only way this is possible, but is extremely broken / in violation of spec. |
That seems to run contrary to what was said elsewhere in this thread: it seems that the ask endpoint is being hit prior to checking storage for the existence of a certificate. If that's not the case, there might still be a concurrency issue somewhere that needs debugging. |
Catching up on the flood of new replies while I was typing... and also revisiting some of the earlier replies...
This should only be negatively affecting them (and I guess the CAs, which is why they rate limit them). If they initiate an ACME challenge for a domain pointed to you, you will see junk log entries at most. Or what wasted PKI resources are you referring to?
I should have mentioned that this is what I was referring to that is a bit unconventional. Typically, users sign up with their domain name(s) they are going to use, then your service knows what the domains are. If setting DNS records is itself the act of "signing up" with your service, then it makes sense that you have to check DNS records. This is the only time I have heard of this being the case... and for a few reasons (complexity, reliability, etc) I don't recommend this... unless, perhaps, you do like what our Caddy homepage demo does, where a user can point a specific subdomain ("caddydemo" in our case) to your IP, and then no signup is required. This still prevents abuse because it only works for one specific identifier per registered domain, hence there's a relatively significant cost barrier to abuse it.
To clarify once more, if someone else initiates an ACME challenge for a hostname that fails repeatedly, that doesn't rate limit you. It only rate limits their ACME account, not yours. @aaomidi (and of course @francislavoie) Thanks for chiming into this discussion. Your feedback and perspectives are much appreciated!!
When CertMagic initiates an ACME challenge, it puts the challenge info in storage so that any other instances in the cluster can solve the challenge (as opposed to just in process memory). So when a challenge request comes in, we don't know whether it's junk until we access storage.
This is true, and we could add more info here; however, I've tried to keep the abstraction as pure as possible with the understanding that it shouldn't matter: is this identifier known to / allowed by the server or not? In theory, I don't think other information should matter. But I'm open to exploring this more.
Ah, I think I see where the confusion is. What I said here is true, and doesn't preclude what I said earlier. 'Ask' is consulted before checking to see if storage has a certificate to satisfy the TLS handshake (I'm talking about normal handshakes here, not TLS-ALPN-01 challenge handshakes which use only a very specific, minimal code path). If 'ask' returns 200, storage is checked. If storage returns no cert, then an ACME challenge is initiated to obtain one. Does that make sense? |
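The order of operations described in this comment can be written out as a small Go sketch. This is only an illustration of the described flow for a normal handshake with no certificate already loaded, not CertMagic's code; `onDemandDecision`, `ask`, and `inStorage` are hypothetical stand-ins.

```go
package main

import "fmt"

type outcome string

const (
	rejectHandshake outcome = "reject handshake"
	serveStored     outcome = "serve certificate from storage"
	obtainViaACME   outcome = "obtain certificate via ACME"
)

// onDemandDecision follows the order described above: ask first, then
// storage, then ACME. ask and inStorage are hypothetical callbacks.
func onDemandDecision(name string, ask, inStorage func(string) bool) outcome {
	if !ask(name) {
		return rejectHandshake // storage is never consulted
	}
	if inStorage(name) {
		return serveStored
	}
	return obtainViaACME
}

func main() {
	allowAll := func(string) bool { return true }
	denyAll := func(string) bool { return false }
	stored := func(name string) bool { return name == "known.example" }

	fmt.Println(onDemandDecision("known.example", allowAll, stored))
	fmt.Println(onDemandDecision("new.example", allowAll, stored))
	fmt.Println(onDemandDecision("known.example", denyAll, stored))
}
```

Putting `ask` first means a denied name costs one ask call and no storage access, which is the trade-off debated in this thread.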
@Zenexer Sorry, one more thing:
Step 13 is not (or should not be) the case. A request over HTTP does not invoke the ask endpoint because no certificate is required because there is no TLS handshake to complete. In other words, this part does not invoke or utilize on-demand TLS at all. If a plaintext HTTP request is in fact invoking on-demand TLS / your 'ask' endpoint, then there is a bug and we should open a new issue to discuss. So, if my understanding is correct, your main concern is spammy log entries? |
To be clear, the only reason I'm bringing up any of this is because it's made debugging this concurrency issue difficult. If everything worked perfectly as described all the time, it would just be a matter of ignoring the logs--the same as I was already doing. It's only a problem because I can't tell which log entries and ask requests are a result of my Caddy instances trying to get certs (or encountering concurrency issues), versus which log entries are from someone else's lazy misuse of ACME. By "wasting PKI resources," I'm referring to the fact that DV isn't free, especially at scale. There's an expectation that integrators should make a reasonable effort to ensure a domain actually points to them (and that an ACME challenge will succeed) prior to trying to get a cert.
I think we're getting a bit caught up with my specific use case, which isn't too relevant here. Let's use the example I provided in a recent comment with two hosting providers.
Correct. I'm not being rate limited and am not proposing that as a concern. If I follow the instructions in the docs and stop performing DNS checks, then I become Hosting Provider A in my example, and I will get rate limited.
Yes, that was my initial impression. That design strongly disincentivizes ask endpoints from performing sufficient checks. That leads to an ecosystem problem as Caddy becomes more popular. It invites Caddy users to put themselves in the position of Hosting Provider A above. |
I just want to reiterate. This is not okay to do. The problem is that any bad actor can point a wildcard domain to your server and then infinitely make requests like
Caddy's internal rate limiting should prevent that from being a concern. It slows itself down if too many attempts are happening.
No, because only a TLS handshake that actually reaches your server (meaning the DNS is already configured correctly) can trigger On-Demand TLS. Any domain which doesn't point to your server will not renew when the time comes because only a TLS handshake would trigger a renewal attempt. I've never seen any evidence of the use case you're describing being a real issue. |
I don't understand. Caddy will check storage in this situation without querying the ask endpoint? There's a misunderstanding or bug here somewhere, because if I go to http://example.com/.well-known/acme-challenge/foobar in this scenario, the error logs indicate that it's doing something. |
Customers misconfigure DNS in such a way that it will hit both hosting providers all the time (e.g., adding Hosting Provider B's nameservers without removing A's). There are also a plethora of automated services out there that will continue making requests to the old IP address, because that's just what they do. Most are security and OSINT companies. Some are malicious actors. In any case, Hosting Provider A is still going to see TLS handshakes for a long time to come. |
That's understandable. There's a lot of moving parts (under the hood, at least) that we have to wade through.
CertMagic emits logs when it -- and not a third party -- initiates an ACME challenge. Are you referring to log entries for HTTP challenge resources specifically? How are the logs ambiguous? If CertMagic didn't initiate the ACME challenge you won't see log entries that indicate trying to solve challenge. So if you see other logs for ACME challenges but no "trying to solve challenge" logs, then it's a third-party.
I see... I suppose you might be using a loose definition of the term "PKI" in that you're talking about CA resources in general, because a failed ACME challenge doesn't allocate any PKI resources (at least, not publicly -- the website still generates a CSR I guess): no certificates, no precertificates, no CT logs, no revocations/CRLs, no OCSP staples, etc. We've discussed this a lot in the past, including with Let's Encrypt themselves, and with their community, and the overall consensus seems to be that because DNS is a matter of perspective, doing the lookup yourself is seldom helpful. And we've seen from experience (with the lego project, before we diverted from it) that doing DNS lookups oneself got in the way more often than it helped.
Fair; but I still care about helping you make your service functional. (We may need to discuss a sponsorship to really dig in and figure out how we can improve the situation for your infrastructure, as it's not a common case at all.) That said, the example with steps 1-14 doesn't seem quite right to me (see last comment).
Ah, I see. In that case, when a domain is pointed away from a server but the host still thinks it serves that domain, TLS handshakes naturally stop coming in because DNS lookups resolve somewhere else; and CertMagic's On-Demand TLS lets the certificate expire and then it gets deleted. It does not keep trying to renew the cert -- that only happens if TLS handshakes keep coming in (and ask returns 200).
Okay, I think I see what you mean with this. But the implementation of On-Demand TLS should account for this as it naturally lets certs expire that are no longer pointing clients at it.
It checks storage for a challenge token, not for a certificate. This flow has nothing to do with On-Demand TLS.
Well, of course it'll do something 😃 What are the full logs related to such a request? It should be looking up challenge info to see if it has to solve it, and it should find that there isn't any, and respond to the request.
And this problem is orthogonal to Caddy/CertMagic, and it exists across the Web no matter what server or cert automation you're using. Ultimately the CA has to decide whether the DNS configuration authorizes a certificate for the server, mess or no mess. |
If I'm not mistaken, Let's Encrypt's rate limiting is much stricter for requests that fail DV--that is, you're likely to run into rate limiting if you're requesting a lot of certs and not doing DNS checks on your own.
I've since confirmed this: the ask endpoint won't receive a request. I think a self-contained example would be helpful here. It should go without saying, but to anyone coming across this discussion: this example does not use best practices; it's intended to demonstrate a particular issue with minimal code. Do not use this in production.
Caddyfile
docker-compose.yml
services:
  caddy:
    image: "library/caddy:latest"
    ports:
      - "127.0.0.1:8080:80"
      - "127.0.0.1:8443:443"
      - "127.0.0.1:8443:443/udp"
    volumes:
      - "./Caddyfile:/etc/caddy/Caddyfile"
In my use case, I'm getting a lot of requests to Command:
Caddy logs:
For comparison, The most notable log message here is As you pointed out, In my case, there were a combination of issues:
I knew there were likely multiple issues, which is why debugging was so complicated. I do think the PR here resolves #303, although it doesn't fix the already-broken metadata. My only complaint was that the design decisions surrounding |
Caddy was designed such that it shouldn't be an issue in practice, both because of the internal rate limiting, and the fallback to staging on first failure attempt. See https://caddyserver.com/docs/automatic-https#errors
Thanks for the detailed writeup! I think that aligns with our view of it as well. I agree changes could probably be made for better introspection. What would you think would help in terms of logging etc? |
Interesting discussion. I'd like to better understand the consequences of a world full of providers like Hosting Provider A, whereby they only check that their user added the domain and disregard misconfigured DNS, which it sounds like is the vast majority of Caddy users? Mixing name servers of various providers ("multi-provider DNS"), both accidentally and intentionally, is common enough to warrant a solution for it; the data on its prevalence can be observed across registry zone files, or aggregation tools like DomainTools.com, DNS.Coffee, etc. Let's assume most of those multi-provider occurrences are by users who have added the domains to each hosting provider that they use (e.g., Wix and Shopify). Perhaps they want to A/B test the two providers without realizing this is the wrong way to do it. In this example, both Wix and Shopify should fail DV, but they won't realize that if all they're checking is whether a user added the domain to an account with them. Not a big deal if it's rare, but it's not that rare.
@mholt, at what point does the CA decide that too many are failing and that the provider should be taking proactive steps to reduce those failures? What are the consequences of that CA's decision on hosting providers?
@francislavoie, what if there are thousands of domains with this issue being visited daily? That's entirely plausible for a medium-sized provider. |
Changing the log level for I don't think it makes sense to query the |
Catching up after the weekend...
Somewhat. From Let's Encrypt Rate Limits:
Note that it's per account, per hostname, (and per hour). So, kind of, but if one person's domain fails to verify, that won't block others.
Thanks so much for investigating this! And thanks for the follow-up. I'll see if we can make things clearer. @bracketforward Good questions.
I think Let's Encrypt's rate limits are a good example of this. They have many rate limits, and if you reach them, you could take steps to reduce those failures. On the other hand, they exist to prevent excessive use of resources, so hitting them isn't "bad" per-se, especially when the domain is truly misconfigured (i.e. trying again won't help). The only time it's not good is when it's a "false positive" like if DNS records have been set but haven't propagated yet (it's not really a "false positive" either, I just don't know a better term). I've seen LE staff reach out to integrators that are having notable difficulty. But it's always been reasonable IMO. My browser is crashing so I'll brb to finish this.
The internal rate limits are pretty generous now, I think it's 1 per second. The main thing is the CA sets their rate limits, we just want to avoid slamming CAs with thousands of requests per minute.
That's not a bad idea. Maybe DEBUG.
We have specifically had requests -- from large integrators -- to gate this behind the ask endpoint. It is fast, but it's expensive. The well-known lookups are a good point, though I haven't had complaints about that yet (other than, in a way, this thread, I suppose). |
To be exact, 10 per 10 seconds, but effectively the same. |
I am having the same issue. I tried the following steps:
but it's not working.
FROM caddy:2.8.4-builder AS builder
RUN xcaddy build --with github.com/caddyserver/certmagic=github.com/caddyserver/certmagic@16e2e0b
RUN xcaddy build --with github.com/caddyserver/transform-encoder
FROM caddy:2.8.4
RUN apk update && apk add nss-tools
COPY --from=builder /usr/bin/caddy /usr/bin/caddy
COPY Caddyfile /etc/caddy/Caddyfile
HEALTHCHECK --start-period=5s --timeout=5s CMD wget -q --spider http://localhost/up || exit 1
On
But the certificate is not written to disk and the request fails. |
The CA has not issued the certificate yet; CM is still waiting for it. |
The bug #297 frequently leads to corrupted JSON files when multiple instances mount a shared NFS directory as storage. We’ve encountered numerous cases in production where this causes certificate failures with the error message
decoding certificate metadata: invalid character '}' after top-level value
, rendering the affected sites completely unusable. The number of corrupted files is increasing, and I can’t restart the service.
I see the bug fix is in the master branch. Please release a new version that includes this fix. Additionally, how can I identify and repair the already corrupted files among thousands without deleting all files? Alternatively, is there a way to ignore corrupted files during loading?
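On the question of finding the corrupted files: one possible approach, sketched below in Go, is to walk the storage directory and flag every `.json` file that is not exactly one JSON value. This is an unofficial sketch, not a CertMagic tool; `firstJSONValue` and `scan` are hypothetical names, and it assumes the metadata files end in `.json`. It only reports paths and does not modify anything.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// firstJSONValue returns the first top-level JSON value in data and whether
// anything other than whitespace follows it (e.g. a stray trailing "}").
func firstJSONValue(data []byte) (first json.RawMessage, trailing bool, err error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	if err = dec.Decode(&first); err != nil {
		return nil, false, err // empty, truncated, or otherwise invalid
	}
	var extra json.RawMessage
	if err2 := dec.Decode(&extra); errors.Is(err2, io.EOF) {
		return first, false, nil // clean: exactly one value
	}
	return first, true, nil // more data after the first value
}

// scan lists .json files under root that fail to parse or have trailing data.
func scan(root string) (bad []string, err error) {
	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, werr error) error {
		if werr != nil {
			return werr
		}
		if d.IsDir() || !strings.HasSuffix(path, ".json") {
			return nil
		}
		data, rerr := os.ReadFile(path)
		if rerr != nil {
			return rerr
		}
		if _, trailing, derr := firstJSONValue(data); derr != nil || trailing {
			bad = append(bad, path)
		}
		return nil
	})
	return bad, err
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: scan <storage-dir>")
		os.Exit(2)
	}
	bad, err := scan(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "scan failed:", err)
		os.Exit(1)
	}
	for _, p := range bad {
		fmt.Println(p)
	}
}
```

For files flagged only because of trailing bytes, the first value returned by `firstJSONValue` could be inspected and written back, but whether that value is the intended metadata is an assumption to verify by hand before overwriting anything.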