Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Max structured document size #22

Closed
tplooker opened this issue Jun 17, 2020 · 19 comments · Fixed by #62
Closed

Max structured document size #22

tplooker opened this issue Jun 17, 2020 · 19 comments · Fixed by #62
Assignees
Labels
ready for PR Ready for Pull Request

Comments

@tplooker
Copy link
Member

Currently in the spec under structured documents it stipulates that the maximum document size is limited to 16MB, however this could be a decision reserved for a storage provider for which they could advertise their maximum allowed document size instead.

Suggestion is to remove this maximum document size from the spec and instead add language around how a storage provider server can advertise what their maximum size is, the spec could offer recommendations on how this should be set

@OR13
Copy link
Contributor

OR13 commented Jun 22, 2020

are we walking about JSON Documents? because if we are, I think its better to just define the limit once, and not allow for different storage providers to use different values... for example:

https://stackoverflow.com/questions/1262376/is-there-a-limit-on-how-much-json-can-hold

I'd prefer to see the spec specify a set of maximum values for various things, and for vendors who want to exceed those for performance reasons to have to declare their incompatibility if its created by exceeding the spec guidelines.

@dlongley
Copy link
Contributor

I think we need to define a limit for interoperability. We should do the same for chunk sizes for streams.

@dlongley
Copy link
Contributor

Without a limit, a number of cases around replication or changing storage providers could get very messy or result in lock in that the standard is trying to help prevent.

@dlongley
Copy link
Contributor

Next we need to agree on language to put into the spec that declares a limit on document size that all encrypted data vaults must abide by (and must support). The language could be something to the effect: "In order to encourage interoperability, a storage server MUST support document sizes of up to 16 MiB in size and MUST reject documents that are larger than this size. For data that is larger than 16 MiB, chunked streams MUST be used instead." Bikeshedding/PRs are welcome.

@dmitrizagidulin dmitrizagidulin transferred this issue from decentralized-identity/confidential-storage May 13, 2021
@dmitrizagidulin
Copy link
Contributor

Discussed on Jun 10, 2021 call.

Tobias and Dave: less options is better, let's just pick a max size for version 1 of the spec, instead of having it be a config field. We don't want vaults to vary in that setting (otherwise, a document that fits on one vault can't migrate to another).

@dmitrizagidulin
Copy link
Contributor

No objections from the group to keeping 16MB size.

@msporny
Copy link
Contributor

msporny commented Jun 10, 2021

We need to look up what the max document size is for MongoDB and express it in MiBs and then update the spec.

@msporny msporny added the ready for PR Ready for Pull Request label Jun 10, 2021
@dlongley
Copy link
Contributor

Another reason for a simple limit: Migrating from EDV A to EDV B could become challenging (or not possible) without the same limits. So let's keep a single limit for version 1 of the spec.

@OR13
Copy link
Contributor

OR13 commented Jun 10, 2021

https://docs.mongodb.com/manual/core/document/#:~:text=Document%20Size%20Limit,MongoDB%20provides%20the%20GridFS%20API.

The maximum BSON document size is 16 megabytes.

Saddly, BSON != JSON.

@dlongley
Copy link
Contributor

Do we know if BSON size is always smaller than JSON? Btw, EDV implementers would be measuring by JWE size as it's the only common thing to measure.

@dlongley
Copy link
Contributor

If we can't determine that BSON will be smaller than JSON, we should choose 10 MiB as a reasonable hedge. Multiples of 10 are easy to reason about quickly and it's likely not too small for nearly all use cases. As usual, if it's too small for a specific use case, streams can be used. There are also advantages to having the size be smaller; AEAD (authenticated encryption) requires the entire data chunk to be processed before determining whether to report an error or accept it.

@DRK3
Copy link

DRK3 commented Jul 5, 2021

It seems that BSON can be smaller or larger than JSON, depending on the data: https://stackoverflow.com/questions/24114932/which-one-is-lighter-json-or-bson

@DRK3
Copy link

DRK3 commented Jul 5, 2021

I'm going to create a PR shortly to address this issue. What I'm seeing so far from the group is:

  • No objections to 16 MiB size, which is a reasonable default based on MongoDB.
  • Only issue is that the 16 MiB size in MongoDB is based on BSON. In EDV here we use JSON, and the size of the BSON representation is not guaranteed to be smaller... so having a 16 MiB limit for EDV may not play nicely with a MongoDB implementation.
  • 10MiB would be a reasonable hedge

Please feel free to continue discussing, and let me know if there are any objections to a 10 MiB limit!

@msporny
Copy link
Contributor

msporny commented Jul 5, 2021

Please feel free to continue discussing, and let me know if there are any objections to a 10 MiB limit!

Your reasoning sounds solid to me, @DRK3 -- +1 to 10MiB limit.

@DRK3
Copy link

DRK3 commented Jul 5, 2021

The spec currently has a size limit on Structured Documents, but not on Encrypted Documents. I think this should be the other way around - the limit should be on the Encrypted Documents themselves, for a few reasons:

  • The Encrypted Document is the actual data that the EDV server will see, and therefore whatever storage implementation it uses will have to contend with that.
  • Encrypted Documents can be bigger than their corresponding Structured Document. A Structured Document that fits within a certain limit may be too big once it's packaged in an Encrypted Document. Perhaps this depends on the cipher used? Even with different ciphers, though, Encrypted Documents can have Encrypted Indexes defined in them, which also increases their size compared to the Structured Document.

(On a somewhat related point: since an EDV server never sees the unencrypted data (AKA Structured Document), technically I think nothing really forces you to use Structured Documents in the first place since the server won't know the difference, which begs the question: why do Structured Documents exist? I made an issue about this awhile ago: #32)

@msporny
Copy link
Contributor

msporny commented Jul 6, 2021

why do Structured Documents exist?

Structured Documents exist because the client needs to know what to expect when decrypting the encrypted document, and needs to be able to encrypt documents in a way that other clients will be able to read. For example, for streams, you need to be able to encode the chunks, media type, and each chunk location in the EDV. You have to encode that information /somewhere/, and the Structured Document is that /somewhere/. The Structured Document is also has a free-form (but hopefully well specified) metadata storage location in addition to the file content.

All this to say, we have Structured Documents because there has to be some place that we put file metadata and content.

@DRK3
Copy link

DRK3 commented Jul 6, 2021

@msporny Makes sense! Thanks for explaining it to me.

What do you think about having the 10 MiB limit be on the Encrypted Document size, rather than the Structured Document size? (for the reasons listed in #22 (comment))

@msporny
Copy link
Contributor

msporny commented Jul 6, 2021

What do you think about having the 10 MiB limit be on the Encrypted Document size, rather than the Structured Document size? (for the reasons listed in #22 (comment))

Yes, +1 for the 10 MiB limit to be on the Encrypted Document size.

@dlongley
Copy link
Contributor

dlongley commented Jul 6, 2021

Yes, +1 for the 10 MiB limit to be on the Encrypted Document size.

Yes, this was always the intent, so if the spec says otherwise it's just a typo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready for PR Ready for Pull Request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants