Reducing overhead of large S3 file syncs with s3cmd (Misc)

I make quite extensive use of s3cmd from s3tools. Historically, I found it much more usable than Amazon's own CLI (though that has since improved), so the habits set in.

The sync functionality in particular is useful and I've written in the past about reimplementing it for encrypted incremental backups.

The sync argument works much like rsync. Its utility for incremental backups is obvious, but it also means that you can resume a multi-file upload if it's interrupted (for whatever reason) simply by re-running the same command, as sketched below.
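As a quick illustration (the directory and bucket names here are just placeholders, not from the original example), resuming is a matter of re-running the same command; anything already uploaded gets skipped:

# Initial upload - suppose this gets interrupted part-way through
s3cmd sync --recursive my_photos s3://example-bucket/photos/

# Re-running the same command resumes, skipping files already in the bucket
s3cmd sync --recursive my_photos s3://example-bucket/photos/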

However, if you're syncing large files, there can be quite a startup delay built in, as the tool first needs to calculate an MD5 for every file being synced (these are used to help verify that AWS has received an uncorrupted copy). When you "resume" your upload, you'll incur that time again.
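If you want a rough idea of how much wall-clock time goes on that hashing step, wrapping the command in time works well enough (again, the directory and bucket below are placeholders):

# Without a cache file, every run - including a "resume" - starts by
# recalculating the MD5 of each local file before any transfer begins
time s3cmd sync --progress --recursive big_files s3://example-bucket/backups/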

This snippet details how to use the --cache-file option to avoid that.

Details

  • Language: Misc

Snippet

s3cmd sync --progress --cache-file=$path_to_cache_file --recursive $dir_to_push s3://$bucket/$des_path/

Usage Example

# Sync dir images into bucket foobar
s3cmd sync --progress --cache-file=/tmp/s3cmd_cache --recursive images s3://foobar/

# The file specified in --cache-file gets overwritten (not appended to) by different syncs
# so if you're breaking your sync up, you might want to do something like

for i in *
do
    s3cmd sync --progress --cache-file=/tmp/s3cmd_cache-$i --recursive $i s3://foobar/my_backups
done
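
# With a cache file per directory, re-running the loop (or just the command
# for the directory that failed) after an interruption can reuse the stored
# checksums for anything that already completed, rather than re-hashing it.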