S3 Sync
Paul Campbell f35ea9795d
Create and use a cache of hashes for local files (#249)
* [domain] Define Hashes in domain package

* [filesystem] Load and parse any .thorp.cache files found

* [filesystem] Use cached file data when available and up-to-date

* [lib] FileScanner refactoring

* [filesystem] scan sub-dirs first to minimise time cache is on heap

* [filesystem] Write new cache data to temp file

* [lib] replace cache file when finished updating

* [filesystem] AppendLines to correct file with new lines

* [domain] decode HashType from String

* [filesystem] Store last modified time as epoch milliseconds

* [filesystem] parse lastmodified as a long

* [filesystem] use all hash values in cache

* [lib] FileScanner rearrange code

* [lib] Create and use a single cache file per source

* [storage-aws] Use ETag hash from cache when available

* [filesystem] Merge file data together correctly

* [filesystem] Handle exceptions thrown by Files.mode correctly

* [readme] Add section on caching

* [changelog] updated

* [changelog] add pending dependencies notes

* [lib] Filters should not name methods after their defining object

* [lib] Fix up test
2019-10-27 19:53:00 +00:00

thorp

Synchronisation of files with S3 using the hash of the file contents.

Originally based on Alex Kudlick's aws-s3-sync-by-hash.

The normal aws s3 sync ... command decides which files need to be copied from their size and timestamp alone. This utility instead compares the MD5 hash of the file contents.
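
As a rough illustration of hashing file contents, the sketch below computes a hex-encoded MD5 digest with the JDK's `MessageDigest`. The names `Md5Sketch` and `md5Hex` are hypothetical and are not taken from the Thorp code base. The file variant reads the whole file into memory, which is a simplification that a real tool would probably avoid for large files.

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest

// Hypothetical helper, for illustration only; not Thorp's own implementation.
object Md5Sketch {

  // Hex-encoded MD5 digest of a byte array.
  def md5Hex(bytes: Array[Byte]): String =
    MessageDigest
      .getInstance("MD5")
      .digest(bytes)
      .map(b => "%02x".format(b & 0xff))
      .mkString

  // Assumption: the file is small enough to be read into memory in one go.
  def md5Hex(path: Path): String =
    md5Hex(Files.readAllBytes(path))
}
```

Two files with identical contents produce the same digest whatever their timestamps, which is what allows unchanged files to be skipped.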

Usage

  thorp
  Usage: thorp [options]

    -V, --version         Display the version and quit
    -B, --batch           Enable batch-mode
    -s, --source <value>  Source directory to sync to S3
    -b, --bucket <value>  S3 bucket name
    -p, --prefix <value>  Prefix within the S3 Bucket
    -P, --parallel <value> Maximum parallel upload/copy operations
    -i, --include <value> Include matching paths
    -x, --exclude <value> Exclude matching paths
    -d, --debug           Enable debug logging
    --no-global           Ignore global configuration
    --no-user             Ignore user configuration

If you don't provide a source, the current directory will be used.

The --include and --exclude parameters can be used more than once.

The --source parameter can be used more than once, in which case, all files in all sources will be consolidated into the same bucket/prefix.

Batch mode

Batch mode disables the ANSI console display and logs simple messages that can be written to a file.

Configuration

Configuration will be read from these files:

  • Global: /etc/thorp.conf
  • User: ~/.config/thorp.conf
  • Source: ${source}/.thorp.conf

Command line arguments override those in Source, which override those in User, which override those in Global, which override any built-in config.

When there is more than one source, only the first ".thorp.conf" file found will be used.

Built-in config consists of using the current working directory as the source.

Note that include and exclude are cumulative across all configuration files.
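
The precedence rules above can be sketched as a merge over configuration layers. This is a simplified model under stated assumptions: `Layer`, its fields, and `resolve` are hypothetical names, only a few options are modelled, and the real implementation may differ. Single-valued settings take the first value found, searching from the highest-priority layer down, while include and exclude patterns accumulate across all layers.

```scala
// Hypothetical, simplified model; not Thorp's actual configuration types.
object ConfigPrecedenceSketch {

  // One configuration layer (command line, source, user, global or built-in).
  final case class Layer(
      bucket: Option[String] = None,
      prefix: Option[String] = None,
      includes: List[String] = Nil,
      excludes: List[String] = Nil
  )

  // Layers must be supplied highest priority first:
  // command line, source, user, global, built-in.
  def resolve(layers: List[Layer]): Layer =
    Layer(
      // First defined value wins for single-valued settings.
      bucket = layers.flatMap(_.bucket).headOption,
      prefix = layers.flatMap(_.prefix).headOption,
      // Include and exclude patterns are cumulative across layers.
      includes = layers.flatMap(_.includes),
      excludes = layers.flatMap(_.excludes)
    )
}
```

For example, a bucket given on the command line would hide one set in the user file, while an exclude pattern from the user file would still apply alongside any given on the command line.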

Caching

The last modified time of each file is used to decide whether its hash values need to be calculated. If a file has not been updated, then the hash values stored in the `.thorp.cache` file located in the root of the source are used. Otherwise the file will be read to calculate the new hashes.
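
A minimal sketch of that lookup follows. It assumes the cache records a last modified time in epoch milliseconds per file and that "not updated" means the recorded time equals the file's current one. `HashCacheSketch`, `Entry`, and `hashesFor` are hypothetical names, and the on-disk format of `.thorp.cache` is not modelled here.

```scala
// Hypothetical model of the cache check; not Thorp's actual implementation.
object HashCacheSketch {

  // Last modified time (epoch milliseconds) seen when the hashes were
  // calculated, plus the hash values keyed by hash type.
  final case class Entry(lastModified: Long, hashes: Map[String, String])

  // Returns the hashes to use and a flag saying whether the file was re-read.
  // Assumption: a cache entry is valid only when its timestamp matches exactly.
  def hashesFor(
      path: String,
      lastModified: Long,
      cache: Map[String, Entry],
      compute: String => Map[String, String]
  ): (Map[String, String], Boolean) =
    cache.get(path) match {
      case Some(entry) if entry.lastModified == lastModified =>
        (entry.hashes, false) // unchanged: reuse the cached hashes
      case _ =>
        (compute(path), true) // new or modified: read the file again
    }
}
```

Passing the hashing step in as `compute` keeps the sketch independent of any particular hash algorithm.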

Behaviour

When considering a local file, the following table governs what should happen:

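| # | local file | remote key | hash of same key | hash of other keys | action              |
|---+------------+------------+------------------+--------------------+---------------------|
| 1 | exists     | exists     | matches          | -                  | do nothing          |
| 2 | exists     | is missing | -                | matches            | copy from other key |
| 3 | exists     | is missing | -                | no matches         | upload              |
| 4 | exists     | exists     | no match         | matches            | copy from other key |
| 5 | exists     | exists     | no match         | no matches         | upload              |
| 6 | is missing | exists     | -                | -                  | delete              |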
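
The six rows can also be read as a single decision function. The sketch below is one possible encoding, not the project's actual code: `SyncActionSketch`, `Action`, and `decide` are hypothetical names. A missing local file or remote key is modelled as `None`, and the case where both are missing, which the table does not cover, is assumed to need no action.

```scala
// Hypothetical encoding of the behaviour table; not Thorp's actual code.
object SyncActionSketch {

  sealed trait Action
  case object DoNothing extends Action
  case object CopyFromOtherKey extends Action
  case object Upload extends Action
  case object Delete extends Action

  // localHash:      hash of the local file, None if the local file is missing
  // remoteHash:     hash stored for the same remote key, None if the key is missing
  // otherKeyHashes: hashes of the objects stored under all other remote keys
  def decide(
      localHash: Option[String],
      remoteHash: Option[String],
      otherKeyHashes: Set[String]
  ): Action =
    (localHash, remoteHash) match {
      case (Some(l), Some(r)) if l == r               => DoNothing        // row 1
      case (Some(l), _) if otherKeyHashes.contains(l) => CopyFromOtherKey // rows 2 and 4
      case (Some(_), _)                               => Upload           // rows 3 and 5
      case (None, Some(_))                            => Delete           // row 6
      case (None, None)                               => DoNothing        // assumption: not in the table
    }
}
```

Rows 2 and 4 share a case because the action depends only on whether some other key already holds the same content, not on whether the remote key itself exists.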

Executable JAR

To build an executable JAR, run `sbt assembly`.

This will create the file `cli/target/scala-2.13/thorp`

Copy this file to a directory on your `PATH`.