Chengchang Yu

AWS E-Commerce Platform Architecture for Million-Scale Users 🛒

If you were to design an AWS E-Commerce Platform capable of handling millions of users, what key architectural decisions would you make? I explored this challenge with Gemini × Claude, and together we mapped out a scalable, event-driven solution.

AWS E-Commerce Platform - Million Scale Architecture (generated via Claude Sonnet 4.5)


Architecture Highlights 🎯

High Availability Design

Multi-AZ Deployment

3 Availability Zones (us-east-1a, 1b, 1c)
├─ Application: ECS Fargate across all AZs
├─ Database: Aurora Multi-AZ with auto-failover
├─ Cache: ElastiCache Redis with replicas
└─ NAT Gateways: One per AZ (no single point of failure)
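
To make the multi-AZ layout concrete, here is a minimal AWS CDK (Python) sketch of the VPC: three AZs, one NAT Gateway per AZ, and separate public/app/data subnet tiers. Construct names and CIDR masks are illustrative assumptions, not part of the original design.

from aws_cdk import App, Stack, aws_ec2 as ec2
from constructs import Construct

class NetworkStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # 3 AZs, one NAT Gateway per AZ so a single-AZ failure cannot break egress
        ec2.Vpc(
            self, "EcommerceVpc",
            max_azs=3,
            nat_gateways=3,
            subnet_configuration=[
                ec2.SubnetConfiguration(name="public", subnet_type=ec2.SubnetType.PUBLIC, cidr_mask=24),
                ec2.SubnetConfiguration(name="app", subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS, cidr_mask=22),
                ec2.SubnetConfiguration(name="data", subnet_type=ec2.SubnetType.PRIVATE_ISOLATED, cidr_mask=24),
            ],
        )

app = App()
NetworkStack(app, "ecommerce-network")
app.synth()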

Cross-Region DR

Primary: us-east-1 (Active)
Secondary: eu-west-1 (Standby + Read Traffic)
├─ Aurora Cross-Region Replica (< 1s lag)
├─ DynamoDB Global Tables (active-active)
└─ Automated Failover: RTO < 5 min, RPO < 1 min
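
For the DynamoDB side of the DR story, adding a replica region is a single UpdateTable call once streams are enabled. A minimal boto3 sketch, with a hypothetical table name:

import time
import boto3

TABLE = "orders"  # hypothetical table name
ddb = boto3.client("dynamodb", region_name="us-east-1")

# Add an eu-west-1 replica (Global Tables 2019.11.21 requires streams enabled
# with NEW_AND_OLD_IMAGES on the source table).
ddb.update_table(
    TableName=TABLE,
    ReplicaUpdates=[{"Create": {"RegionName": "eu-west-1"}}],
)

# Poll until the new replica reports ACTIVE before sending it traffic
while True:
    replicas = ddb.describe_table(TableName=TABLE)["Table"].get("Replicas", [])
    if any(r["RegionName"] == "eu-west-1" and r["ReplicaStatus"] == "ACTIVE" for r in replicas):
        break
    time.sleep(15)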

Auto-Scaling Strategy

ECS Fargate:
├─ Target Tracking: CPU 70%, Memory 80%
├─ Step Scaling: +10 tasks when CPU > 85%
└─ Scheduled Scaling: +20% capacity during peak hours

Aurora Serverless v2:
├─ Min: 0.5 ACU (idle)
├─ Max: 16 ACU (peak load)
└─ Scale in 0.5 ACU increments

Read Replicas:
└─ Auto-scaling: 0-15 replicas based on CPU/connections
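
A minimal boto3 sketch of the ECS target-tracking policy described above (average CPU at 70%); the cluster/service names and capacity bounds are assumptions:

import boto3

aas = boto3.client("application-autoscaling", region_name="us-east-1")
resource_id = "service/ecommerce-cluster/order-service"  # hypothetical names

aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=10,
    MaxCapacity=500,
)

# Target tracking on average CPU utilization at 70%
aas.put_scaling_policy(
    PolicyName="cpu-target-70",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ECSServiceAverageCPUUtilization"},
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)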

Low Latency Optimization

Global Edge Network

CloudFront CDN:
├─ 300+ Edge Locations worldwide
├─ Cache Hit Ratio Target: 80%+
├─ TTL Strategy:
│   ├─ Static Assets: 1 year
│   ├─ Product Images: 7 days
│   ├─ API Responses: 5 minutes (for product listings)
│   └─ Dynamic Content: No cache
└─ Regional Edge Caches for better cache efficiency
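
TTLs like these are typically expressed as CloudFront cache policies attached to each behavior. A minimal boto3 sketch for the 7-day product-image tier (the policy name is an assumption):

import boto3

cf = boto3.client("cloudfront")

# One cache policy per content class; this one applies the 7-day image TTL
cf.create_cache_policy(
    CachePolicyConfig={
        "Name": "product-images-7d",
        "MinTTL": 0,
        "DefaultTTL": 7 * 24 * 3600,
        "MaxTTL": 7 * 24 * 3600,
        "ParametersInCacheKeyAndForwardedToOrigin": {
            "EnableAcceptEncodingGzip": True,
            "EnableAcceptEncodingBrotli": True,
            "HeadersConfig": {"HeaderBehavior": "none"},
            "CookiesConfig": {"CookieBehavior": "none"},
            "QueryStringsConfig": {"QueryStringBehavior": "none"},
        },
    }
)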

Caching Strategy (Multi-Layer)

Layer 1: CloudFront (Edge) - 50-100ms global latency
Layer 2: ElastiCache Redis (Regional) - 1-5ms latency
Layer 3: DynamoDB DAX (Optional) - sub-millisecond
Layer 4: Application Memory Cache - microseconds
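
In application code, layers 2 and 3 usually show up as a cache-aside read path: check Redis first, fall back to DynamoDB on a miss, then backfill the cache. A minimal sketch, assuming a hypothetical products table and ElastiCache endpoint:

import json
import boto3
import redis  # redis-py client pointed at the ElastiCache endpoint

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
products = dynamodb.Table("products")  # hypothetical table name
cache = redis.Redis(host="my-cluster.xxxxxx.use1.cache.amazonaws.com", port=6379)

def get_product(product_id: str) -> dict | None:
    """Cache-aside read: Redis first (1-5 ms), DynamoDB on miss, then backfill."""
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached:
        return json.loads(cached)

    item = products.get_item(Key={"product_id": product_id}).get("Item")
    if item:
        # Short TTL keeps the cache fresh while absorbing most read traffic
        cache.setex(key, 300, json.dumps(item, default=str))
    return item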

Database Read Optimization

Read-Heavy Workloads:
├─ Product Catalog: DynamoDB (single-digit ms)
├─ User Sessions: DynamoDB (TTL auto-cleanup)
├─ Shopping Cart: DynamoDB + Redis cache
└─ Order History: Aurora Read Replicas (read scaling)

Write-Heavy Workloads:
├─ Order Processing: Aurora Primary + SQS buffering
├─ Inventory Updates: DynamoDB (1000s writes/sec)
└─ Analytics Events: Kinesis (real-time streaming)
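
The "Aurora Primary + SQS buffering" pattern means the API never writes orders to Aurora directly; it enqueues them and lets consumers drain the queue at a rate the database can sustain. A minimal producer sketch, with a hypothetical FIFO queue:

import json
import uuid
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/order-processing.fifo"  # hypothetical

def enqueue_order(order: dict) -> str:
    """Buffer the write in SQS so Aurora sees a smooth, bounded write rate."""
    order_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"order_id": order_id, **order}),
        MessageGroupId=order["user_id"],   # FIFO: preserve per-user ordering
        MessageDeduplicationId=order_id,   # FIFO: idempotent enqueue
    )
    return order_id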

Latency Targets

P50 Latency: < 100ms (API responses)
P95 Latency: < 300ms
P99 Latency: < 500ms
Time to First Byte (TTFB): < 200ms globally

Cost Optimization

Monthly Cost Estimate: $14,000 - $22,000 (Budgetary Estimate)

Compute (40% = $5,600 - $8,800):
├─ ECS Fargate: ~50 tasks on average (Note: this scales significantly higher during peak periods, which raises the average cost)
│   └─ 0.5-1 vCPU, 1-2GB RAM per task
├─ Lambda: 10M invocations/month (mostly free tier)
└─ Savings: Use Fargate Spot for non-critical workloads (-70%)

Database (35% = $4,900 - $7,700):
├─ Aurora Serverless v2: 4-8 ACU average (Note: sustained peak usage pushes costs toward the higher end of this estimate)
├─ DynamoDB: On-Demand pricing (pay per request) (Note: a significant cost component given the high read/write volume)
├─ ElastiCache: 3x cache.r6g.large (reserved instances -40%)
└─ OpenSearch: 3x r6g.large.search (reserved instances)

Storage (7% = $1,000 - $1,500):
├─ S3: 5TB storage + 10TB transfer
├─ CloudFront: 20TB data transfer
└─ Lifecycle policies: Auto-tier to Glacier after 90 days

Networking (10% = $1,500-2,500):
├─ NAT Gateways: 3x $45/month
├─ ALB: 2x $23/month + LCU charges
└─ Data Transfer: Inter-AZ + Internet egress (Note: A major driver of networking costs due to Multi-AZ architecture and high traffic volumes)

Other (8% = $1,000-1,500):
├─ Route 53, WAF, CloudWatch, X-Ray
├─ Secrets Manager, KMS
└─ Third-party: Stripe (2.9% + $0.30), SendGrid, Twilio

Cost Optimization Strategies

1. Compute Savings

✅ Use Fargate Spot for batch jobs (70% savings)
✅ Right-size containers: Start small, scale based on metrics
✅ Lambda provisioned concurrency only for critical functions
✅ Schedule scale-down during off-peak hours (2am-6am)
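
The off-peak scale-down can be wired up as scheduled actions on the same Application Auto Scaling target. A minimal boto3 sketch (service name and capacity bounds are assumptions):

import boto3

aas = boto3.client("application-autoscaling", region_name="us-east-1")
resource_id = "service/ecommerce-cluster/order-service"  # hypothetical

# Lower the ECS service floor overnight, raise it again before morning traffic
for name, cron, min_cap, max_cap in [
    ("night-scale-down", "cron(0 2 * * ? *)", 10, 100),
    ("morning-scale-up", "cron(0 6 * * ? *)", 30, 500),
]:
    aas.put_scheduled_action(
        ServiceNamespace="ecs",
        ScheduledActionName=name,
        ResourceId=resource_id,
        ScalableDimension="ecs:service:DesiredCount",
        Schedule=cron,
        ScalableTargetAction={"MinCapacity": min_cap, "MaxCapacity": max_cap},
    )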

2. Database Savings

✅ Aurora Serverless v2: Pay only for active capacity
✅ DynamoDB On-Demand: No idle capacity costs
✅ Reserved Instances for ElastiCache (1-year, 40% off)
✅ Read replicas only during peak hours

3. Storage Savings

✅ S3 Intelligent-Tiering: Auto-optimize storage class
✅ CloudFront: Optimize cache hit ratio to reduce origin requests
✅ Image compression: WebP format (30% smaller)
✅ Lifecycle policies: Delete old logs after 90 days
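
Intelligent-Tiering and log expiry are both plain S3 lifecycle rules. A minimal boto3 sketch, with a hypothetical bucket name and logs/ prefix:

import boto3

s3 = boto3.client("s3")

# Tier objects automatically and expire old logs after 90 days
s3.put_bucket_lifecycle_configuration(
    Bucket="ecommerce-assets",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "intelligent-tiering-all",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
            },
            {
                "ID": "expire-old-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Expiration": {"Days": 90},
            },
        ]
    },
)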

4. Networking Savings

✅ Use VPC Endpoints for S3/DynamoDB (no NAT Gateway charges)
✅ CloudFront: Reduce origin requests with longer TTLs
✅ Compress responses: gzip/brotli (reduce transfer costs)
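
Gateway endpoints for S3 and DynamoDB are free and route that traffic around the NAT Gateways entirely. A minimal boto3 sketch, with hypothetical VPC and route-table IDs:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoints keep S3/DynamoDB traffic off the NAT Gateways
# (no per-GB NAT processing charges); IDs below are hypothetical.
for service in ("s3", "dynamodb"):
    ec2.create_vpc_endpoint(
        VpcId="vpc-0123456789abcdef0",
        VpcEndpointType="Gateway",
        ServiceName=f"com.amazonaws.us-east-1.{service}",
        RouteTableIds=["rtb-0aaa1111bbbb2222c"],
    )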

Performance Benchmarks 📊

Capacity Planning

Total Users: 1,000,000
├─ Peak Concurrent: 50,000 users
├─ Average Session: 15 minutes
└─ Requests per User: 20 requests/session

Total Traffic:
├─ Peak QPS: 10,000 requests/second
├─ Average QPS: 2,000 requests/second
├─ Daily Requests: ~170 million
└─ Monthly Requests: ~5 billion

Database Load:
├─ Aurora: 5,000 reads/sec, 500 writes/sec
├─ DynamoDB: 10,000 reads/sec, 2,000 writes/sec
└─ Cache Hit Rate: 95% (reduces DB load by 20x)
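
A quick back-of-the-envelope check that these figures hang together (pure arithmetic, no AWS calls):

# Sanity-check the traffic figures above
AVG_QPS = 2_000
PEAK_QPS = 10_000
CACHE_HIT_RATE = 0.95

daily_requests = AVG_QPS * 86_400             # ~172.8M, quoted as ~170 million
monthly_requests = daily_requests * 30        # ~5.2B, quoted as ~5 billion
peak_to_avg = PEAK_QPS / AVG_QPS              # 5x peak-to-average ratio
db_load_reduction = 1 / (1 - CACHE_HIT_RATE)  # 95% hit rate -> 20x fewer DB reads

print(f"{daily_requests/1e6:.0f}M req/day, {monthly_requests/1e9:.1f}B req/month, "
      f"peak/avg = {peak_to_avg:.0f}x, DB load reduced {db_load_reduction:.0f}x")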

Scalability Limits

Current Architecture Supports:
├─ 100,000 concurrent users (2x headroom)
├─ 20,000 QPS (2x headroom)
├─ 10,000 orders/hour
└─ 1 million products in catalog

Scale-Up Path:
├─ Add more ECS tasks (up to 500+)
├─ Add Aurora read replicas (up to 15)
├─ Increase ElastiCache shards (up to 500 nodes)
└─ DynamoDB auto-scales to millions of requests/sec

Security Best Practices 🔒

Network Security:
├─ WAF: Block SQL injection, XSS, bad bots
├─ Security Groups: Least privilege (only required ports)
├─ NACLs: Subnet-level firewall rules
└─ Private Subnets: No direct internet access for app/data tiers

Data Security:
├─ Encryption at Rest: KMS for all data stores
├─ Encryption in Transit: TLS 1.3 everywhere
├─ Secrets Management: No hardcoded credentials
└─ Database Encryption: Aurora + DynamoDB native encryption

Identity & Access:
├─ Cognito: User authentication with MFA
├─ IAM Roles: Service-to-service authentication
├─ Least Privilege: Each service has minimal permissions
└─ Audit Logging: CloudTrail for all API calls

Application Security:
├─ Input Validation: Server-side validation
├─ Rate Limiting: API Gateway throttling
├─ CORS: Strict origin policies
└─ Dependency Scanning: ECR image scanning
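
As one concrete example of the least-privilege security groups above, the app tier should only accept traffic from the ALB's security group on the container port. A minimal boto3 sketch with hypothetical group IDs and port:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# App-tier security group accepts only the ALB's security group on port 8080
ec2.authorize_security_group_ingress(
    GroupId="sg-apptier0123456789",  # hypothetical app-tier SG
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 8080,
            "ToPort": 8080,
            "UserIdGroupPairs": [{"GroupId": "sg-alb0123456789abc"}],  # hypothetical ALB SG
        }
    ],
)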

Monitoring & Alerting 📊

Key Metrics Dashboard

Golden Signals:
├─ Latency: P50, P95, P99 response times
├─ Traffic: Requests per second by service
├─ Errors: 4xx/5xx error rates
└─ Saturation: CPU, memory, database connections

Business Metrics:
├─ Orders per minute
├─ Cart abandonment rate
├─ Payment success rate
├─ Search conversion rate
└─ Revenue per hour

Critical Alarms

🚨 P1 Alarms (Immediate Response):
├─ API Gateway 5xx > 1% for 5 minutes
├─ Payment Service error rate > 0.5%
├─ Database CPU > 90% for 10 minutes
├─ Order processing queue depth > 10,000
└─ CloudFront 5xx > 5% for 5 minutes

⚠️ P2 Alarms (15-min Response):
├─ ECS task failure rate > 5%
├─ Cache hit rate < 80%
├─ API latency P95 > 500ms
└─ Lambda concurrent executions > 800

📊 P3 Alarms (Review Next Day):
├─ Daily cost > $500
├─ S3 storage growth > 20% week-over-week
├─ Unusual traffic patterns (anomaly detection)
└─ Security Group changes (CloudTrail)
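
The P1 alarms above map directly onto CloudWatch metric alarms. A minimal boto3 sketch for the "API Gateway 5xx > 1% for 5 minutes" alarm (REST API name and SNS topic are assumptions):

import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

cw.put_metric_alarm(
    AlarmName="p1-apigw-5xx-rate",
    Namespace="AWS/ApiGateway",
    MetricName="5XXError",                       # averages to an error rate (0-1)
    Dimensions=[{"Name": "ApiName", "Value": "ecommerce-api"}],  # hypothetical API
    Statistic="Average",
    Period=60,
    EvaluationPeriods=5,                          # 5 consecutive minutes
    Threshold=0.01,                               # 1%
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:p1-pagers"],  # hypothetical topic
)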

Deployment Strategy 🚀

CI/CD Pipeline

Code Commit β†’ GitHub
    ↓
CodePipeline Triggered
    ↓
CodeBuild (Parallel):
├─ Run Unit Tests
├─ Run Integration Tests
├─ Build Docker Image
├─ Scan for Vulnerabilities
└─ Push to ECR
    ↓
Deploy to Dev Environment
    ↓
Automated E2E Tests
    ↓
Manual Approval (for Prod)
    ↓
Blue/Green Deployment to Prod
├─ Deploy to "Green" environment
├─ Run smoke tests
├─ Shift 10% traffic → Monitor
├─ Shift 50% traffic → Monitor
└─ Shift 100% traffic → Complete
    ↓
Rollback if error rate > 1%

Zero-Downtime Deployment

ECS Blue/Green:
├─ New task definition deployed
├─ Health checks pass
├─ ALB shifts traffic gradually
└─ Old tasks drained and terminated

Database Migrations:
├─ Backward-compatible changes only
├─ Schema changes in separate deployment
├─ Use Aurora zero-downtime patching
└─ Test on read replica first

Disaster Recovery Plan 🆘

RTO & RPO Targets

Tier 1 Services (Critical):
├─ Order Processing, Payment
├─ RTO: < 5 minutes
├─ RPO: < 1 minute
└─ Strategy: Active-Active cross-region

Tier 2 Services (Important):
├─ Product Catalog, User Service
├─ RTO: < 15 minutes
├─ RPO: < 5 minutes
└─ Strategy: Warm standby in secondary region

Tier 3 Services (Non-Critical):
├─ Analytics, Recommendations
├─ RTO: < 1 hour
├─ RPO: < 1 hour
└─ Strategy: Backup and restore

Failover Procedures

Automated Failover:
├─ Route 53 health checks detect failure
├─ Traffic automatically routed to eu-west-1
├─ Aurora promotes read replica to primary
├─ DynamoDB Global Tables (already active)
└─ CloudWatch alarm triggers runbook

Manual Failover (if needed):
1. Promote Aurora replica in DR region
2. Update Route 53 to point to DR region
3. Scale up ECS tasks in DR region
4. Verify all services healthy
5. Communicate to stakeholders
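
The automated path above hinges on Route 53 health checks plus failover routing. A minimal boto3 sketch, using hypothetical domain names and hosted zone ID:

import boto3

r53 = boto3.client("route53")

# Health check against the primary region's API endpoint (domain is hypothetical)
hc = r53.create_health_check(
    CallerReference="api-us-east-1-hc-1",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "api-us-east-1.example.com",
        "ResourcePath": "/health",
        "RequestInterval": 10,
        "FailureThreshold": 3,
    },
)

# Failover record pair: Route 53 serves PRIMARY while the health check passes,
# and flips to SECONDARY (eu-west-1) automatically when it fails.
r53.change_resource_record_sets(
    HostedZoneId="Z0HYPOTHETICAL",
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
            "SetIdentifier": "primary", "Failover": "PRIMARY",
            "HealthCheckId": hc["HealthCheck"]["Id"],
            "ResourceRecords": [{"Value": "api-us-east-1.example.com"}]}},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
            "SetIdentifier": "secondary", "Failover": "SECONDARY",
            "ResourceRecords": [{"Value": "api-eu-west-1.example.com"}]}},
    ]},
)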

Migration Roadmap 🗺️

Phase 1: Foundation (Week 1-2)

✅ Set up VPC, subnets, security groups
✅ Deploy Aurora, DynamoDB, ElastiCache
✅ Configure CloudFront, Route 53, WAF
✅ Set up IAM roles, Secrets Manager

Phase 2: Core Services (Week 3-4)

✅ Deploy Product Service (read-only first)
✅ Deploy User Service + Cognito
✅ Deploy Cart Service
✅ Migrate product catalog to DynamoDB
✅ Set up monitoring and alarms

Phase 3: Transactions (Week 5-6)

✅ Deploy Order Service
✅ Deploy Payment Service (Stripe integration)
✅ Deploy Inventory Service
✅ Set up event-driven architecture (SQS, SNS, EventBridge)
✅ Load testing (simulate 10K QPS)

Phase 4: Optimization (Week 7-8)

✅ Fine-tune caching strategy
✅ Optimize database queries
✅ Set up cross-region replication
✅ Implement auto-scaling policies
✅ Chaos engineering tests

Phase 5: Go-Live (Week 9)

✅ Final load testing (50K concurrent users)
✅ DR failover drill
✅ Cutover plan execution
✅ Monitor for 48 hours
✅ Post-launch optimization