This lightweight Python script helps you compare the contents of one root folder with those of another when the folders are not tracked by a version control system such as Git.
Sometimes you or a team member changes a file in a subfolder of a directory tree. Suppose this directory tree contains raw data that feeds into a dataset or some other process, and that for some reason it is not meant to be versioned, so it is not part of a repository. Such a change can easily go unnoticed. How can you reduce that risk?
One approach is to generate checksums for all files in a directory tree and save them in a single manifest file, which is what this script does. You can then compare manifests to spot differences between directory states. Suppose you have the same directory tree on another machine. The manifest can then be placed on a shared network drive that both machines can access.
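The manifest idea can be illustrated with a minimal sketch. This is not the code in `ldc.py`; the helper names `file_sha256` and `build_manifest`, the choice of SHA-256, and the dictionary layout are assumptions made for illustration only.

```python
import hashlib
import os
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root, POSIX style) to its checksum.

    Hypothetical helper: the real script may use a different hash
    algorithm and manifest layout.
    """
    root = Path(root)
    manifest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            full_path = Path(dirpath) / name
            relative = full_path.relative_to(root).as_posix()
            manifest[relative] = file_sha256(full_path)
    return manifest
```

Storing relative POSIX-style paths keeps the keys identical on both machines even if the root folder sits in a different location or the operating systems differ.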
This solution can be part of a data processing or training pipeline where you would apply the comparison to assert equality before loading the unversioned data for further processing.
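As a rough sketch of such a pipeline gate, the snippet below compares two manifests represented as plain dictionaries of relative path to checksum. The names `diff_manifests` and `assert_unchanged` are hypothetical, and the dictionary format is an assumption; the script's own manifest format and comparison output may differ.

```python
def diff_manifests(expected, actual):
    """Return files that were added, removed, or changed between manifests.

    Both arguments are assumed to be dicts of relative path -> checksum.
    """
    expected_paths = set(expected)
    actual_paths = set(actual)
    return {
        "added": sorted(actual_paths - expected_paths),
        "removed": sorted(expected_paths - actual_paths),
        "changed": sorted(
            path
            for path in expected_paths & actual_paths
            if expected[path] != actual[path]
        ),
    }


def assert_unchanged(expected, actual):
    """Raise before any data is loaded if the directory state has drifted."""
    diff = diff_manifests(expected, actual)
    if any(diff.values()):
        raise RuntimeError(f"Directory differs from manifest: {diff}")
```

Calling `assert_unchanged` as the first pipeline step stops the run early, before unversioned data that has silently changed reaches later stages.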
Clone the repository to your local machine:
git clone https://github.com/flacle/lightdatacomparator.git
The goal is to keep this lightweight, so no third-party dependencies are needed. The script uses only Python's built-in libraries.
NOTE: The script was written for Python 3.12.7. It may work with other versions, but those have not been tested.
Run with the following command:
python ldc.py <directory> [options]
directory
: Root directory to compute checksums for.

--password
: Password to encrypt the manifest file (required unless --debug is set).

--debug
: Runs in debug mode where the manifest is not encrypted (optional).

--output
: Output file path for the manifest (use only when not comparing).

--compare
: Path to an existing manifest file to compare against.
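For readers who want to see how options like these fit together, here is a minimal `argparse` sketch. It is an assumption about how such an interface could be declared, not the parser used in `ldc.py`; in particular, treating `--output` and `--compare` as mutually exclusive is an interpretation of "use only when not comparing".

```python
import argparse


def build_parser():
    """Declare the documented options (illustrative, not the script's own parser)."""
    parser = argparse.ArgumentParser(
        prog="ldc.py",
        description="Create or compare a checksum manifest for a directory tree.",
    )
    parser.add_argument("directory", help="Root directory to compute checksums for.")
    parser.add_argument("--password", help="Password used to obfuscate the manifest.")
    parser.add_argument(
        "--debug",
        action="store_true",
        help="Write the manifest without encryption.",
    )
    # Assumption: a run either writes a new manifest or compares against one.
    target = parser.add_mutually_exclusive_group()
    target.add_argument("--output", help="Output file path for the manifest.")
    target.add_argument("--compare", help="Existing manifest to compare against.")
    return parser


def parse_args(argv=None):
    """Parse arguments and enforce 'password required unless --debug'."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if not args.debug and not args.password:
        parser.error("--password is required unless --debug is set")
    return args
```

Checking the password rule after parsing keeps the rule in one place, since `argparse` has no built-in way to make one option required only when another is absent.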
- The password is for obfuscation purposes only. Do not use this for sensitive data.
- The manifest file is saved with a custom .comparator extension.
- You can use the --output option to generate a manifest file, for example on a shared network drive.
- This repository contains a unit test script with a sample directory to test the script.
Generate a manifest file for a directory:
python ldc.py /path/to/directory --password your_password --output manifest_hash.comparator
Compare a directory with an existing manifest file:
python ldc.py /path/to/directory --password your_password --compare previous_manifest_hash.comparator
Contributions are welcome! Please open an issue or submit a pull request.
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.