* thorp
Synchronisation of files with S3 using the hash of the file contents.

[[https://www.codacy.com/app/kemitix/thorp][file:https://img.shields.io/codacy/grade/c1719d44f1f045a8b71e1665a6d3ce6c.svg?style=for-the-badge]]
[[https://search.maven.org/search?q=net.kemitix.thorp][file:https://img.shields.io/maven-central/v/net.kemitix.thorp/thorp_2.12.svg?style=for-the-badge]]

Originally based on Alex Kudlick's [[https://github.com/akud/aws-s3-sync-by-hash][aws-s3-sync-by-hash]].

The normal ~aws s3 sync ...~ command decides which files need to be
copied from their size and timestamp, not their contents. This utility
compares the MD5 hash of the file contents instead.
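
As a rough illustration of comparing by content rather than by timestamp, the following Python sketch computes the MD5 hash of a file's contents. It is not thorp's own implementation (thorp is written in Scala); the function name ~md5_of_file~ and the chunk size are assumptions made for this example.

#+begin_src python
import hashlib


def md5_of_file(path, chunk_size=8192):
    """Return the hex MD5 digest of a file's contents (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
#+end_src

Two files with identical contents give the same digest whatever their modification times, so a file that has only been touched or renamed does not need to be uploaded again.
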
* Usage
#+begin_example
thorp
Usage: thorp [options]

  -V, --version          Display the version and quit
  -B, --batch            Enabled batch-mode
  -s, --source <value>   Source directory to sync to S3
  -b, --bucket <value>   S3 bucket name
  -p, --prefix <value>   Prefix within the S3 Bucket
  -i, --include <value>  Include matching paths
  -x, --exclude <value>  Exclude matching paths
  -d, --debug            Enable debug logging
  --no-global            Ignore global configuration
  --no-user              Ignore user configuration
#+end_example
If you don't provide a ~source~, the current directory will be used.

The ~--include~ and ~--exclude~ parameters can be used more than once.
** Batch mode
Batch mode disables the ANSI console display and logs simple messages
that can be written to a file.
* Configuration
Configuration will be read from these files:
- Global: ~/etc/thorp.conf~
- User: =~/.config/thorp.conf=
- Source: ~${source}/.thorp.conf~

Command line arguments override those in Source, which override those
in User, which override those in Global, which override any built-in
config.

Built-in config consists of using the current working directory as the
~source~.

Note that ~include~ and ~exclude~ are cumulative across all
configuration files.
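
The precedence rules above can be sketched as follows. This is a minimal Python illustration, not thorp's code: the ~merge_configs~ function, the dictionary keys, and all sample values are hypothetical, and the on-disk format of the configuration files is not shown here. Layers are passed from lowest to highest priority; single-valued settings are replaced by later layers, while ~include~ and ~exclude~ accumulate.

#+begin_src python
def merge_configs(layers):
    """Merge config layers given in order of increasing priority."""
    cumulative_keys = ("include", "exclude")
    merged = {key: [] for key in cumulative_keys}
    for layer in layers:
        for key, value in layer.items():
            if key in cumulative_keys:
                # include/exclude patterns add up across every layer.
                merged[key].extend(value)
            else:
                # Any other setting is overridden by the later layer.
                merged[key] = value
    return merged


# Placeholder values for illustration only.
built_in = {"source": "."}
global_conf = {"bucket": "global-bucket", "exclude": ["tmp"]}
user_conf = {"bucket": "user-bucket"}
source_conf = {"prefix": "site", "exclude": ["drafts"]}
command_line = {"source": "/srv/www"}

config = merge_configs(
    [built_in, global_conf, user_conf, source_conf, command_line]
)
#+end_src

Here ~config["bucket"]~ comes from the User layer because it overrides Global, ~config["source"]~ comes from the command line, and ~config["exclude"]~ holds the patterns from both the Global and Source layers.
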
* Behaviour
When considering a local file, the following table governs what should happen:
|---+------------+------------+------------------+--------------------+---------------------|
| # | local file | remote key | hash of same key | hash of other keys | action              |
|---+------------+------------+------------------+--------------------+---------------------|
| 1 | exists     | exists     | matches          | -                  | do nothing          |
| 2 | exists     | is missing | -                | matches            | copy from other key |
| 3 | exists     | is missing | -                | no matches         | upload              |
| 4 | exists     | exists     | no match         | matches            | copy from other key |
| 5 | exists     | exists     | no match         | no matches         | upload              |
| 6 | is missing | exists     | -                | -                  | delete              |
|---+------------+------------+------------------+--------------------+---------------------|
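
The table can also be read as a small decision function. The Python sketch below is one possible encoding of rows 1 to 6, not thorp's actual code; the function name ~choose_action~ and its boolean parameters are assumptions made for illustration. The case where both the local file and the remote key are missing is not covered by the table, so returning "do nothing" for it is likewise an assumption.

#+begin_src python
def choose_action(local_exists, remote_exists,
                  same_key_hash_matches, other_key_hash_matches):
    """Return the action for one file, following rows 1-6 of the table."""
    if not local_exists:
        # Row 6: the remote key has no local counterpart.
        # (Both missing is outside the table; "do nothing" is an assumption.)
        return "delete" if remote_exists else "do nothing"
    if remote_exists and same_key_hash_matches:
        # Row 1: the remote key already holds the same contents.
        return "do nothing"
    if other_key_hash_matches:
        # Rows 2 and 4: the contents already exist under another key.
        return "copy from other key"
    # Rows 3 and 5: the contents are not in the bucket at all.
    return "upload"
#+end_src

Checking for a matching hash under another key before falling back to an upload is what lets a renamed file be handled as a copy within S3 rather than a fresh transfer.
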
* Executable JAR
To build an executable JAR, run ~sbt assembly~.

This will create the file ~cli/target/scala-2.12/thorp~.

Copy this file to a directory on your ~PATH~.