Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Platform-wide caching (experimental) #108

Open
ccbrown opened this issue Aug 12, 2016 · 6 comments
Open

Platform-wide caching (experimental) #108

ccbrown opened this issue Aug 12, 2016 · 6 comments

Comments

@ccbrown
Copy link
Owner

ccbrown commented Aug 12, 2016

A consolidation of issues with the experimental platform-wide caching feature:

  • zero byte files seem to occasionally get placed into the cache
  • directory sources aren't properly cached since their source isn't currently factored into their key (related: Support for "watching" directories or files #110)
  • keying based on factors such as environment variables or paths severely hurts cachability

And potential features to consider:

  • remote cache implementation
  • caching of sources (downloads, repos, etc.)
  • caching of universal binaries
@ccbrown ccbrown changed the title Platform-wide caching (Experimental) Platform-wide caching (experimental) Aug 12, 2016
@vmrob
Copy link
Contributor

vmrob commented Aug 12, 2016

  • Android configuration hash should be relocatable
  • Influential environment variables should invalidate cache (see autoconf's ./configure --help for common ones)

@ccbrown
Copy link
Owner Author

ccbrown commented Aug 22, 2016

We definitely want the directory cache to be cleared aggressively if disk space is low.

@ccbrown
Copy link
Owner Author

ccbrown commented Nov 8, 2016

Before we expand this any further, I want to revise what we currently have. If possible I want to simplify things a lot to make things more maintainable.

The requirement of storing the policies and manifest in the cache itself complicates things. If we were to remove those concepts from BuildCache, we would have far fewer consistency concerns (e.g. we wouldn't have to worry about the lock-read-write-unlock pattern that's completely infeasible on S3).

Both of those things are currently only used for garbage collection purposes. So can we do garbage collection some other way?

For filesystems, I think it's fair to limit filesystem cache support to filesystems that record access times. So no need for the manifest at all there. But we still need to intelligently decide when to actually perform garbage collection. So maybe we keep a file there that contains some metadata such as last garbage collection time, but keep it internal and invisible to BuildCache.

For S3, we can't trivially get access times. But it's trivial to set lifecycle policies instead, so I think it's fine if we don't worry about garbage collection there.

So I think the Cache interface should be reduced to simply get, set, and maintain. The maintain function performing garbage collection, or in the case of S3, either doing nothing or informing the user that they should set lifecycle policies.

@ccbrown
Copy link
Owner Author

ccbrown commented Nov 8, 2016

I also need to make this caching much more transparent. I want it to be something that everyone can benefit from. Not just a massive chunk of code with narrow application that when finished only benefits those that seek it out.

If we can make the above changes, I'd like to utilize part of this for the feature mentioned in-person: The addition of a configuration option in the needs file that specifies a team or organization-wide cache.

That would be a good start towards getting some of the caching code out of the experimental phase and shaped into something more presentable.

@vmrob
Copy link
Contributor

vmrob commented Nov 8, 2016

The position I was attempting to express in person is that I want there to be a 1:1 relationship between the needs definition of a library and fully-qualified cache keys. If I make a modification to a library definition, that change must be reflected in the thing that needy produces--cached or not.

With that in mind, I think setting cache keys per library is a great idea, but I would want them to act as some sort of namespace to the fully-qualified key. Example:

args:
    download: https://github.com/Taywee/args/archive/6.0.2.tar.gz
    checksum: 4be736f11aa2008820d0836bac4595d889048242
    cache-key: s3://my-teams-ci-bucket/Taywee/args
    project:
        build-steps: cp args.hxx {build_directory}/include/

If Needy took that yaml definition, removed the cache-key from it, hashed it up, added whatever Needy qualifiers were required to accurately represent the build parameters, and then searched for a cache object at s3://my-teams-ci-bucket/Taywee/args/macos/x86_64/$sha256/distribution.tar.gz, then it would perfectly satisfy the 1:1 relationship requirement.

While I would like Needy to provide a means to assist in uploading (or at least aggregating) newly built objects, I don't think it's a strict requirement. I don't think it's going to be possible to satisfy the 1:1 requirement without Needy at least producing keys for resulting objects. Without at least the aggregation functionality, I have a feeling that users that want to integrate this into CI are going to have a lot of duplicate deployment code in each applicable repo and that's something I want to avoid.


I really dislike the manifest and policy and the file locking dances that happen around them. I think my original implementation lacked a lot of that, but requirements at the time dictated its necessity.

The cache should be treated as a key-value store and Needy should treat the inability to access a key as if it doesn't exist. I believe we currently do this so very little, if anything, would be lost if the manifest were removed altogether.

To go one step further, I think existing cache objects should be immutable. If those properties apply (and if key paths make sense), there's no need for a manifest at all nor is there any need for file locking. If a successful open call occurs and then another Needy process deletes it during garbage collection or because the user rm'ed the cache directory, there still won't be a problem during the current read.

To guarantee that new object creation occurs atomically on a local filesystem, we can just create a temporary staging directory next to the cache, write the object to that with a random name, and then move the file into place. Nothing has to be done to ensure this takes occurs on S3--all GETS and PUTS are atomic (albeit not transactional). At most, a race condition could occur between an object at a key being deleted, a new object for that key being uploaded, and the object at that key being downloaded. All of those operations would succeed atomically and the entire object retrieved each time, but there would be no way to determine if the downloaded object was the object before or after the delete. This isn't a problem as long as keys represent their objects exactly.

As for the cache policy, I think a policy is important, but this could be done at the repo-level in .needyconfig.


TLDR: remove all of the file locking, eliminate the cache policy and manifest, and add some logic to guarantee that cache objects are never modified once stored.

@ccbrown
Copy link
Owner Author

ccbrown commented Nov 8, 2016

#144 simplifies things a lot without compromising features. Now I'm much more comfortable expanding on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants