How I automatically backup Google Drive documents to Amazon S3 Glacier IR
In today’s digital landscape, data backup is crucial for businesses and individuals alike. My partner and I use Google Drive to store and share our documents, for things like a scanned copy of a hotel reservation for an upcoming trip, along with our dogs’ vaccine certificates. This has been incredibly useful over the years, especially for offline access, but I do worry about data loss.
Keeping a secure copy of Google Drive files in Amazon S3 Glacier can add an extra layer of protection, especially for critical documents. This guide outlines a simple automated process that I implemented to sync Google Drive with Amazon S3 (Glacier Instant Retrieval) using Python and rclone. This setup allows for an automated backup solution that runs on a cron schedule, making data backups a breeze.
OK, let’s get into it…
Project Overview
A simple Python script orchestrates the setup configuration and credential retrieval, then runs rclone, which performs the synchronization between Google Drive and an S3 bucket. For added convenience, the entire operation is scheduled to run as a GitHub workflow, ensuring regular backups with minimal user intervention; more on that later.
Key Components
RClone reads from Google Drive: rclone is the core of this solution. For this use case, I configured it to access Google Drive with the drive.readonly permission to grant read-only access to all the files. The rclone scopes are mapped to Google Drive API scopes.
RClone stores into S3 Storage (Instant Retrieval class): The S3 Glacier storage class family is ideal for long-lived data archiving with minimal read access, making it a low-cost option. I chose S3 Glacier Instant Retrieval (IR) because it offers millisecond retrievals, albeit at a higher retrieval API call cost. See the S3 pricing page for full details.
S3 Standard: storage cost $0.023 per GB; GET requests $0.0004 per 1,000
S3 Glacier IR: storage cost $0.004 per GB; GET requests $0.01 per 1,000
The assumption here is that we are using S3 as redundancy, and we really don’t need to retrieve data except for testing and in case the Google Drive account becomes unavailable for any reason.
Scheduled Workflow (optional, but recommended): By setting up a scheduled job, the backup process automatically runs at predefined intervals, ensuring your data remains up-to-date in S3 without manual oversight. To keep it simple and minimize cost, I chose GitHub Actions for this, but you can just as easily run it on your laptop.
Step-by-Step Guide
1. Install RClone
To get started, download and install rclone.
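For example, on macOS it can be installed with Homebrew (brew install rclone); on Linux, the official install script works:
curl https://rclone.org/install.sh | sudo bash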
2. Set up Google Drive access
In a terminal, run:
rclone config
Follow the on-screen prompts to set up access to your Google Drive. Ensure you grant only drive.readonly permissions to limit access strictly to reading files, which reduces the chance of a security incident or accidental writes.
The generated ~/.rclone.conf will contain a block similar to this, where GDrive is the name I chose:
[GDrive]
type = drive
scope = drive.readonly
token = {"access_token": "abcd1234",
"token_type":"Bearer",
"refresh_token": "wxyz9876",
"expiry":"2024-11-09T19:21:34.729137-04:00"}
3. Set up S3 access
You’ll need an AWS account with IAM and S3 access. I also recommend enabling object versioning as an additional safety measure.
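If you manage the bucket with the AWS CLI, enabling versioning looks something like this, with bucket-name as a placeholder:
aws s3api put-bucket-versioning --bucket bucket-name --versioning-configuration Status=Enabled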
Use rclone config to add AWS S3 as a remote storage option. rclone will guide you through options like object owner and storage class. For the credentials, it’s best practice to avoid static IAM credentials, so leverage env_auth with AWS STS to assume a role and obtain temporary credentials.
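For example, one way to do this locally is to assume a role with the AWS CLI and export the temporary credentials so env_auth picks them up; the role ARN below is a placeholder and the snippet assumes jq is installed:
creds=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/rclone-backup \
  --role-session-name rclone-backup \
  --query 'Credentials' --output json)
export AWS_ACCESS_KEY_ID=$(echo "$creds" | jq -r .AccessKeyId)
export AWS_SECRET_ACCESS_KEY=$(echo "$creds" | jq -r .SecretAccessKey)
export AWS_SESSION_TOKEN=$(echo "$creds" | jq -r .SessionToken)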
Once done, the ~/.rclone.conf will contain a block similar to this, where AWSS3 is the name I chose:
[AWSS3]
type = s3
provider = AWS
region = us-east-1
env_auth = true
acl = private
storage_class = GLACIER_IR
bucket_acl = private
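To confirm the remote works, you can list the buckets it has access to:
rclone lsd AWSS3: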
4. Test it out!
At this point, we should be able to test out the synchronization process to see what would happen.
We can run a dry run to sync all documents in your Google Drive to the sync/ prefix in your bucket:
rclone sync --dry-run GDrive: AWSS3:bucket-name/sync/
Bonus points for setting up a cron job to do this on a schedule.
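For example, a crontab entry like the one below, where the schedule and log path are just an illustration, would run the sync every Sunday at 3 a.m.:
0 3 * * 0 rclone sync GDrive: AWSS3:bucket-name/sync/ >> ~/rclone-backup.log 2>&1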
The argument for a Python orchestrator
This is not necessary if you are simply running this locally, but I wanted to make it repeatable in GitHub Actions while storing the Google API tokens in AWS Secrets Manager.
As an overview, the GitHub Actions workflow is authorized with OIDC to get temporary credentials to access S3 and Secrets Manager. The Python script uses these credentials to fetch the Google API token, then dynamically generates the rclone.conf, which allows GitHub Actions to execute the rclone sync command. Finally, the Python script takes the refreshed Google API token and stores it back in Secrets Manager.
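To make that flow concrete, here is a minimal sketch of such an orchestrator. It is not the exact script from the repository, and the secret name, bucket, and prefix are placeholders:

import json
import subprocess

import boto3

SECRET_NAME = "gdrive-backup/rclone-token"  # placeholder secret name
DEST = "AWSS3:bucket-name/sync/"  # placeholder bucket and prefix
RCLONE_CONF = "/tmp/rclone.conf"

secrets = boto3.client("secretsmanager")

# 1. Fetch the stored Google API token from Secrets Manager.
token = secrets.get_secret_value(SecretId=SECRET_NAME)["SecretString"]

# 2. Dynamically generate the rclone config with both remotes.
with open(RCLONE_CONF, "w") as f:
    f.write(
        "[GDrive]\ntype = drive\nscope = drive.readonly\n"
        f"token = {token}\n\n"
        "[AWSS3]\ntype = s3\nprovider = AWS\nregion = us-east-1\n"
        "env_auth = true\nacl = private\nstorage_class = GLACIER_IR\nbucket_acl = private\n"
    )

# 3. Run the sync; AWS credentials come from the environment (OIDC in GitHub Actions).
subprocess.run(["rclone", "sync", "GDrive:", DEST, "--config", RCLONE_CONF], check=True)

# 4. rclone may have refreshed the Google token during the run; store it back.
dump = subprocess.run(
    ["rclone", "config", "dump", "--config", RCLONE_CONF],
    check=True, capture_output=True, text=True,
)
new_token = json.loads(dump.stdout)["GDrive"]["token"]
secrets.put_secret_value(SecretId=SECRET_NAME, SecretString=new_token)

In the workflow, the OIDC-provided credentials are exposed as environment variables, which both boto3 and rclone’s env_auth pick up automatically.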
Final Thoughts
Using this setup, I can simplify my data backup routine by leveraging the power of rclone, GitHub Actions, and the reliability of AWS S3. Not only does this save time, but it also creates a dependable safety net for my data.
Full code is available on my GitHub repository. If you have improvement suggestions, feel free to reach out on LinkedIn or open a GitHub issue!
— —
I’m trying to get better about writing about things I do. Let me know if you found this useful 🙂 .
Connect with me on LinkedIn.