Digital Collections Cloud Replicate


Report Bug | Request Feature

About The Project

This tool aims to bring digital preservation transparency to the process of copying digital collections files from local storage to S3. Amazon's awscli provides high-level tools like sync and low-level tools like put-object that claim to validate fixity on uploaded objects behind the scenes, but that validation is not visible to the user.

In our digital collections processes, we generate fixity using md5deep as soon as possible after file capture or creation. We use that fixity digest to verify files on each move. This tool does the following:

  • Verify that all files in the fixity manifest exist in the filesystem
  • Verify that all files in the filesystem are listed in the manifest
  • Ignore some files like Thumbs.db
  • Verify that the MD5 fixity for each file matches the fixity recorded in the manifest
  • Replicate files from local storage to AWS S3
  • Request that AWS validate the MD5 to ensure the file in S3 matches the local file
  • Configure metadata on the S3 object with the MD5 of the file
  • Log all of these actions to a log file for review
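
The manifest and fixity checks listed above can be sketched roughly as follows. This is a minimal illustration, not the code in s3-replicate.py. It assumes the manifest holds one `<md5>  <relative path>` record per line (the md5deep style), and the names `IGNORED_NAMES`, `md5_of`, `read_manifest`, `scan_directory`, and `verify` are hypothetical.

```python
import hashlib
from pathlib import Path

# Assumption: the tool's real ignore list may contain other names.
IGNORED_NAMES = {"Thumbs.db"}


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_manifest(manifest_path):
    """Map relative path -> expected MD5, skipping ignored file names."""
    records = {}
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        digest, name = line.split(None, 1)
        relative = Path(name.strip())
        if relative.name in IGNORED_NAMES:
            continue
        # as_posix() also drops a leading "./" such as "./dir/file.tif"
        records[relative.as_posix()] = digest.lower()
    return records


def scan_directory(root, manifest_name):
    """Map relative path -> file path for every non-ignored file under root."""
    root = Path(root)
    found = {}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        if path.name in IGNORED_NAMES or path.name == manifest_name:
            continue
        found[path.relative_to(root).as_posix()] = path
    return found


def verify(root, manifest_name="checksums-md5.txt"):
    """Compare manifest and filesystem, then compare MD5 digests."""
    manifest = read_manifest(Path(root) / manifest_name)
    on_disk = scan_directory(root, manifest_name)
    shared = set(manifest) & set(on_disk)
    return {
        "missing_from_disk": sorted(set(manifest) - set(on_disk)),
        "missing_from_manifest": sorted(set(on_disk) - set(manifest)),
        "mismatched": sorted(
            rel for rel in shared if md5_of(on_disk[rel]) != manifest[rel]
        ),
    }
```

In this sketch, replication would proceed only when all three lists in the returned report are empty.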

Built With

Getting Started

To get a local copy up and running, follow these steps.

Installation

  1. Clone the repo

    git clone https://github.com/VTUL/digital-collections-cloud-replicate.git
  2. Install libraries

    pip install -r requirements.txt  

Usage

See options using -h

$ ./s3-replicate.py -h
usage: s3-replicate.py [-h] [-c CONFIG] -d DIRECTORY [-f] [-l LOG] [-m MANIFEST] [-p PROFILE] -u URI [-v]

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        path to aws credentials file. E.g., /home/user/.aws/credentials. Default is ~/.aws/credentials
  -d DIRECTORY, --directory DIRECTORY
                        path to digital collections directory. E.g., /some/path
  -f, --fixity          perform fixity validation against manifest
  -l LOG, --log LOG     directory to save logfile. E.g., /some/path. Default is POSIX temp directory
  -m MANIFEST, --manifest MANIFEST
                        name of manifest file if not "checksums-md5.txt"
  -p PROFILE, --profile PROFILE
                        aws profile name. E.g., default. Default is default.
  -u URI, --uri URI     S3 URI. E.g. s3://vt-testbucket/SpecScans/IAWA3/JDW/
  -v, --verbose         print verbose output to console

Run with options

$ ./s3-replicate.py -u s3://imgagestore/SpecScans/IAWA/JDW/ -d /home/jjt/Downloads/ingest_test/in_jdw/ -m checksums-md5-jdw.txt -f -v

Review logfile

$ cat /tmp/s3-replicate_in_jdw_JDW_2021-09-20-12-44-04.log
2021-09-20 14:43:43,092 - INFO - Replicating files from /home/jjt/Downloads/ingest_test/in_jdw to s3://imagestore/SpecScans/IAWA/JDW/
2021-09-20 14:43:43,102 - INFO - Found credentials in shared credentials file: ~/.aws/credentials
2021-09-20 14:43:43,380 - INFO - User has write access to S3 bucket
2021-09-20 14:43:43,384 - INFO - Ignoring manifest entry matching ignore list: ./jdwst001001/Thumbs.db
2021-09-20 14:43:43,385 - INFO - Found 5 records in manifest file
2021-09-20 14:43:43,385 - INFO - Using 4 manifest records after matching ignored files
2021-09-20 14:43:43,385 - INFO - Scanning files at /home/jjt/Downloads/ingest_test/in_jdw.  Generating fixity will take time
2021-09-20 14:43:43,386 - INFO - Ignoring file checksums-md5-jdw.txt
2021-09-20 14:43:43,675 - INFO - Found 4 files in /home/jjt/Downloads/ingest_test/in_jdw after ignoring 1 files
2021-09-20 14:43:43,675 - INFO - Filesystem and manifest match.
2021-09-20 14:43:43,675 - INFO - Initiating file replication to s3://imgagestore/SpecScans/IAWA/JDW/
2021-09-20 14:43:46,191 - INFO - {'ResponseMetadata': {'RequestId': '3M752071KR8DS1YW', 'HostId': 'ueyoxW3Wkdff6SJan2S1zv6Mkm1wbMQb/lfy9hq97m4AlGRQFFe4DMDFUuqSdrqR+6dvl03QgNk=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'ueyoxW3Wkdfe6SJan2S1zv6Mkm1wbMQb/lfy9hq97m4AlGRQFFe4DMDFUuqSdjqR+6xvl03QgNk=', 'x-amz-request-id': '3M752971KK8DS1YW', 'date': 'Mon, 20 Sep 2021 18:43:44 GMT', 'etag': '"7034b2e690d2e04bc50a6ce8a8be392e"', 'server': 'AmazonS3', 'content-length': '0'}, 'RetryAttempts': 0}, 'ETag': '"7034b2e690d2e04bc50a6ce8a8be392e"'}
...
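
The put_object response in the log above carries an ETag. The sketch below shows one way the MD5 validation and metadata steps could look with boto3. It is an assumption-laden illustration, not the tool's actual code: the helper names and the `md5` metadata key are hypothetical, and the file is read fully into memory for brevity. `ContentMD5` and `Metadata` are standard put_object parameters. The ETag equals the hex MD5 only for single-part uploads that are unencrypted or use SSE-S3.

```python
import base64
import hashlib


def build_put_args(path, bucket, key):
    """Build put_object arguments carrying the file's MD5 (hypothetical helper)."""
    with open(path, "rb") as handle:
        body = handle.read()
    digest = hashlib.md5(body)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # S3 rejects the upload if the received bytes do not match this digest.
        "ContentMD5": base64.b64encode(digest.digest()).decode("ascii"),
        # Assumption: "md5" is an illustrative metadata key name.
        "Metadata": {"md5": digest.hexdigest()},
    }


def etag_matches(response, expected_hex):
    """Compare the quoted ETag in a put_object response with a hex MD5."""
    return response.get("ETag", "").strip('"') == expected_hex


def upload(path, bucket, key):
    """Upload one file and confirm the returned ETag matches the local MD5."""
    import boto3  # imported here so the helpers above work without boto3

    args = build_put_args(path, bucket, key)
    response = boto3.client("s3").put_object(**args)
    if not etag_matches(response, args["Metadata"]["md5"]):
        raise RuntimeError(f"ETag mismatch for s3://{bucket}/{key}")
    return response
```

Sending `ContentMD5` asks S3 to perform the check server-side, while comparing the ETag gives a second, client-side confirmation that can be written to the log.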

Roadmap

See the open issues for a list of proposed features (and known issues).

License

Distributed under the Apache 2.0 License. See LICENSE for more information.

Contact

summer@braggtown.com

Project Link: https://github.com/VTUL/digital-collections-cloud-replicate