Moving Data Across Clouds: A Practical Pattern for Multi-Account AWS to GCP Pipelines
Cross-cloud data movement is one of those things that sounds harder than it is — until you actually try to do it. Then it's exactly as hard as it sounds.
At Six Column Solutions, we recently worked through a pattern that comes up more than you'd think: a client running workloads across multiple AWS accounts needed to consolidate data into a centralized GCP analytics environment. The data lived in relational databases — MySQL and Aurora — scattered across accounts. The destination was a columnar data store on GCP built for analytics and aggregation at scale.
Here's the architecture pattern we landed on, why we made the choices we did, and what you need to know before you build something similar.
Why You Can't Go Directly from AWS to GCS
The first thing to understand is that AWS Database Migration Service (DMS) is an AWS-native service. Its supported target endpoints are AWS services such as S3, Redshift, DynamoDB, and Aurora, along with a set of self-managed database engines. Google Cloud Storage is not on that list, and since these are competing cloud providers, it is unlikely to be added.
That means any pipeline pulling data out of an AWS-hosted database and landing it in GCS needs a staging layer. Amazon S3 is the natural choice — DMS can write directly to it, and GCP's Storage Transfer Service knows how to read from it.
The Architecture
The pattern has four stages:
- Source databases (Aurora MySQL / RDS MySQL) across multiple AWS accounts
- AWS DMS writing Parquet files to Amazon S3
- GCP Storage Transfer Service pulling from S3 into Google Cloud Storage
- Downstream analytics processing from GCS
It's not complicated, but each handoff has details that matter.
DMS, CDC, and Parquet: What You Need to Know
Binary logging is non-negotiable.
For DMS to capture ongoing changes (inserts, updates, deletes) rather than just doing a one-time full load, MySQL needs binary logging enabled in ROW format. On RDS and Aurora, this means setting binlog_format to ROW in your DB parameter group (the DB cluster parameter group on Aurora) and, critically, having automated backups enabled. Binary logging in RDS depends on backup retention being greater than zero. If that's not already in place, plan for a parameter group change and a maintenance window.
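The two preconditions above are easy to check before you create a DMS task. Here is a minimal sketch of that check in plain Python. The `MySqlSourceSettings` class and its field names are hypothetical; in practice you would fill them from whatever inventory of parameter groups and instance settings you already have.

```python
from dataclasses import dataclass


@dataclass
class MySqlSourceSettings:
    """Hypothetical snapshot of the settings that matter for DMS CDC."""
    binlog_format: str          # value of the MySQL binlog_format parameter
    backup_retention_days: int  # automated backup retention on the instance


def cdc_readiness_problems(settings: MySqlSourceSettings) -> list[str]:
    """Return human-readable problems; an empty list means ready for CDC."""
    problems = []
    if settings.binlog_format.upper() != "ROW":
        problems.append(
            f"binlog_format is {settings.binlog_format!r}; CDC needs ROW"
        )
    if settings.backup_retention_days <= 0:
        problems.append("backup retention is 0; enable automated backups")
    return problems


ready = MySqlSourceSettings(binlog_format="ROW", backup_retention_days=7)
not_ready = MySqlSourceSettings(binlog_format="MIXED", backup_retention_days=0)

print(cdc_readiness_problems(ready))
for problem in cdc_readiness_problems(not_ready):
    print(problem)
```

Running a check like this across every source account up front tells you which databases need a parameter group change before any replication work starts.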
Parquet output from DMS works, but test your data types.
DMS can write Parquet directly to S3, which is what you want for downstream analytics workloads. The format is efficient, columnar, and widely supported. That said, watch your data types carefully during testing. Decimals, timestamps, and NULLs are the usual suspects for type mapping issues between MySQL and Parquet. Validate schema consistency early — it's much easier to fix at the DMS endpoint configuration level than after you've got data landing in GCS.
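As a rough illustration of what "the DMS endpoint configuration level" means, the sketch below builds the settings for an S3 target endpoint as a Python dict. The key names follow the S3 settings structure of the DMS API as we understand it, so verify them against the current documentation. The bucket name, folder, role ARN, and the `dms_commit_ts` column name are placeholders, not values from a real deployment.

```python
import json


def parquet_s3_settings(bucket: str, folder: str, role_arn: str) -> dict:
    """Build an S3-target settings dict that asks DMS for Parquet output."""
    return {
        "BucketName": bucket,
        "BucketFolder": folder,
        "ServiceAccessRoleArn": role_arn,
        "DataFormat": "parquet",
        "ParquetVersion": "parquet-2-0",
        "CompressionType": "gzip",
        # Adds a timestamp column so downstream jobs can order changes.
        # The column name here is a hypothetical choice.
        "TimestampColumnName": "dms_commit_ts",
        # Controls the precision used for TIMESTAMP values in the files.
        "ParquetTimestampInMillisecond": True,
    }


settings = parquet_s3_settings(
    bucket="example-dms-staging",                            # placeholder
    folder="orders-db",                                      # placeholder
    role_arn="arn:aws:iam::111122223333:role/example-dms",   # placeholder
)
print(json.dumps(settings, indent=2))
```

Keeping these settings in code makes it easier to apply the same type-related choices to every source account and to change them in one place when testing turns up a mapping problem.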
The DMS task mode matters.
If you need a one-time migration, full load is straightforward. If you need ongoing replication — which most real-world analytics pipelines do — you want "Migrate existing data and replicate ongoing changes." This runs a full load first, then switches to CDC mode. DMS writes change files to S3 as they come in, which the Storage Transfer job then picks up.
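If you create tasks through the API rather than the console, the task mode is expressed as a migration type string. To our knowledge the three values are `full-load`, `cdc`, and `full-load-and-cdc`. The helper below is a hypothetical sketch of how you might pick one from two yes/no questions.

```python
def pick_migration_type(needs_existing_data: bool, needs_ongoing: bool) -> str:
    """Map pipeline requirements to a DMS migration type string."""
    if needs_existing_data and needs_ongoing:
        # Console label: "Migrate existing data and replicate ongoing changes"
        return "full-load-and-cdc"
    if needs_ongoing:
        return "cdc"
    if needs_existing_data:
        return "full-load"
    raise ValueError("the task must load existing data, replicate changes, or both")


print(pick_migration_type(needs_existing_data=True, needs_ongoing=True))
print(pick_migration_type(needs_existing_data=True, needs_ongoing=False))
```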
Moving from S3 to GCS: Storage Transfer Service
GCP's Storage Transfer Service handles the cross-cloud leg cleanly. You configure it with an AWS IAM user that has read-only access to the source S3 bucket, point it at your destination GCS bucket, and set a transfer schedule. It handles the object-level sync — picking up new Parquet files as DMS lands them.
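To make the configuration concrete, here is a sketch of a transfer job definition built as a Python dict. The field names are modeled on the Storage Transfer Service REST resource as we understand it, so treat them as assumptions and check the current API reference. The project ID, bucket names, and start date are placeholders, and the AWS credentials are deliberately left out; supply them through your secret management process rather than in code.

```python
import datetime
import json


def transfer_job_body(project_id: str, s3_bucket: str, gcs_bucket: str,
                      start: datetime.date, repeat_seconds: int = 3600) -> dict:
    """Describe a recurring S3-to-GCS transfer job (credentials omitted)."""
    return {
        "description": f"S3 {s3_bucket} to GCS {gcs_bucket}",
        "projectId": project_id,
        "status": "ENABLED",
        "schedule": {
            "scheduleStartDate": {
                "year": start.year, "month": start.month, "day": start.day,
            },
            # How often the job repeats, as a duration string in seconds.
            "repeatInterval": f"{repeat_seconds}s",
        },
        "transferSpec": {
            "awsS3DataSource": {"bucketName": s3_bucket},
            "gcsDataSink": {"bucketName": gcs_bucket},
        },
    }


job = transfer_job_body(
    project_id="example-analytics-project",   # placeholder
    s3_bucket="example-dms-staging",          # placeholder
    gcs_bucket="example-analytics-landing",   # placeholder
    start=datetime.date(2025, 1, 15),         # placeholder
)
print(json.dumps(job, indent=2))
```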
A few things worth paying attention to here:
- IAM scoping: The AWS IAM user you create for GCP should have read-only access (s3:GetObject, s3:ListBucket) scoped to only the transfer bucket. Don't reuse the DMS role.
- Transfer frequency: Storage Transfer Service runs on a schedule you define. Depending on your latency requirements, you can run it hourly, every few hours, or continuously. For near-real-time analytics, push toward continuous or high-frequency runs.
- Egress costs: AWS charges for data leaving its network. This is often the surprise line item in cross-cloud architectures. Size your data volumes before committing — it's not a reason to avoid this pattern, but it should be in your cost model.
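The IAM scoping point can be sketched as a policy document. The function below generates a read-only policy for a single bucket using the two actions named above. The bucket name is a placeholder, and Google's documentation may list additional required permissions (for example, one to read the bucket's location), so confirm the exact set before deploying.

```python
import json


def read_only_transfer_policy(bucket: str) -> dict:
    """Build an IAM policy that allows listing and reading one bucket only."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListTransferBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arn,          # bucket-level action
            },
            {
                "Sid": "ReadTransferObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"{bucket_arn}/*",   # object-level action
            },
        ],
    }


policy = read_only_transfer_policy("example-dms-staging")  # placeholder name
print(json.dumps(policy, indent=2))
```

Note the split between the bucket ARN and the object ARN. Listing applies to the bucket itself, while reading applies to the objects inside it, which is a common source of "access denied" errors when the two are mixed up.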
The Multi-Account Wrinkle
When source databases live across multiple AWS accounts, you have a few options for how to structure the S3 staging layer:
- One S3 bucket per account, multiple Storage Transfer jobs into a single GCS bucket. Clean IAM boundaries, straightforward to reason about.
- Cross-account S3 access from a central AWS account, with DMS replication instances in each source account writing to a shared staging bucket. More complex IAM setup, but fewer Storage Transfer jobs to manage.
The right choice depends on how many source accounts you're working with and what your IAM governance model looks like. For a small number of accounts, option one is simpler and easier to audit. As the number of sources grows, centralizing the staging layer starts to make more sense.
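For option one, the main thing to get right is that each account's files land under their own prefix in the shared GCS bucket so they never collide. A minimal sketch of that planning step follows. The account IDs, bucket names, and the `aws/<account>/` prefix layout are all hypothetical.

```python
def plan_transfer_jobs(account_buckets: dict[str, str],
                       gcs_bucket: str) -> list[dict]:
    """Plan one transfer job per AWS account, each with its own GCS prefix."""
    jobs = []
    for account_id, s3_bucket in sorted(account_buckets.items()):
        jobs.append({
            "source_account": account_id,
            "s3_bucket": s3_bucket,
            "gcs_bucket": gcs_bucket,
            # A per-account prefix keeps files from different sources apart.
            "gcs_prefix": f"aws/{account_id}/",
        })
    return jobs


plan = plan_transfer_jobs(
    {
        "222233334444": "example-staging-billing",   # placeholder
        "111122223333": "example-staging-orders",    # placeholder
    },
    gcs_bucket="example-analytics-landing",          # placeholder
)
for job in plan:
    print(job["s3_bucket"], "to", job["gcs_bucket"] + "/" + job["gcs_prefix"])
```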
Validation: Don't Skip It
Before you declare the pipeline operational, run a validation pass against the data in GCS. That means row counts against source tables, schema consistency checks, and timestamp handling verification. CDC pipelines in particular can produce subtle issues — duplicate events, out-of-order changes — that don't surface until you start querying the data downstream. Build validation into your go-live checklist, not your incident response process.
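Two of these checks are simple enough to sketch in plain Python: comparing row counts, and collapsing CDC events into the latest state per key so that duplicates and out-of-order changes do not distort the result. The sketch assumes each change record carries an `Op` column (I, U, or D) and a commit timestamp column; the column names `id` and `commit_ts` are hypothetical, and events that share a key and timestamp would need a better tie-breaker than the one used here.

```python
def row_count_mismatches(source_counts: dict, landed_counts: dict) -> dict:
    """Return {table: (source, landed)} for every table whose counts differ."""
    tables = set(source_counts) | set(landed_counts)
    return {
        t: (source_counts.get(t), landed_counts.get(t))
        for t in sorted(tables)
        if source_counts.get(t) != landed_counts.get(t)
    }


def latest_state(events: list[dict], key: str = "id",
                 order_by: str = "commit_ts") -> dict:
    """Collapse CDC events to the newest row per key, dropping deleted rows."""
    latest = {}
    for event in events:
        k = event[key]
        # Keep the event with the highest timestamp, whatever the arrival order.
        if k not in latest or event[order_by] >= latest[k][order_by]:
            latest[k] = event
    return {k: e for k, e in latest.items() if e["Op"] != "D"}


events = [
    {"Op": "U", "id": 1, "commit_ts": 3, "name": "c"},  # arrives early
    {"Op": "I", "id": 1, "commit_ts": 1, "name": "a"},
    {"Op": "I", "id": 1, "commit_ts": 1, "name": "a"},  # duplicate event
    {"Op": "I", "id": 2, "commit_ts": 2, "name": "x"},
    {"Op": "D", "id": 2, "commit_ts": 4, "name": "x"},  # row later deleted
]
state = latest_state(events)
mismatches = row_count_mismatches({"orders": 100, "users": 10},
                                  {"orders": 100, "users": 9})
print(state)
print(mismatches)
```

Row counts are only meaningful after this kind of collapse. Counting raw change records will overstate the table size as soon as updates and deletes start flowing.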
When to Consider a Different Approach
This DMS + S3 + Storage Transfer pattern works well, but it's not the only option. If you're already heavily invested in GCP tooling, GCP's Datastream service can replicate directly from MySQL and PostgreSQL sources into GCS or BigQuery without the AWS staging layer. It requires network connectivity from GCP to your AWS-hosted databases — typically via VPN or Direct Connect/Interconnect — but it eliminates one hop and one set of IAM credentials to manage.
The DMS-first approach tends to win when: you're already using DMS for other workloads, you want to keep data movement within AWS as long as possible, or you need the flexibility of S3 as an intermediate landing zone for other consumers.
Bottom Line
Cross-cloud data pipelines aren't exotic anymore. As organizations run workloads on multiple cloud providers, patterns like this, with AWS as the source and GCP as the analytics platform, are becoming standard operating procedure. The plumbing is straightforward once you understand the constraints: DMS cannot write to GCS, so S3 is your bridge, and GCP's Storage Transfer Service handles the crossing cleanly.
If you're building something similar or running into issues with an existing cross-cloud pipeline, Six Column Solutions works on exactly these kinds of problems. Reach out — we're happy to talk through your architecture.