- Published on
AWS E-Commerce Platform Architecture for Million-Scale Users
- Authors

- Name
- Chengchang Yu
- @chengchangyu

AWS E-Commerce Platform Architecture
If you were to design an AWS e-commerce platform capable of handling millions of users, what key architectural decisions would you make? I explored this challenge with Gemini × Claude, and together we mapped out a scalable, event-driven solution.

AWS E-Commerce Platform - Million Scale Architecture (generated via Claude Sonnet 4.5)
Architecture Highlights 🎯
High Availability Design
Multi-AZ Deployment
3 Availability Zones (us-east-1a, 1b, 1c)
├─ Application: ECS Fargate across all AZs
├─ Database: Aurora Multi-AZ with auto-failover
├─ Cache: ElastiCache Redis with replicas
└─ NAT Gateways: One per AZ (no single point of failure)
Cross-Region DR
Primary: us-east-1 (Active)
Secondary: eu-west-1 (Standby + Read Traffic)
├─ Aurora Cross-Region Replica (< 1s lag)
├─ DynamoDB Global Tables (active-active)
└─ Automated Failover: RTO < 5 min, RPO < 1 min
Auto-Scaling Strategy
ECS Fargate:
├─ Target Tracking: CPU 70%, Memory 80%
├─ Step Scaling: +10 tasks when CPU > 85%
└─ Scheduled Scaling: +20% capacity during peak hours
Aurora Serverless v2:
├─ Min: 0.5 ACU (idle)
├─ Max: 16 ACU (peak load)
└─ Scale in 0.5 ACU increments
Read Replicas:
└─ Auto-scaling: 0-15 replicas based on CPU/connections
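To make the target-tracking numbers concrete, here is a minimal sketch of the proportional rule that target tracking is commonly described with: scale the task count by the ratio of the observed metric to its target. The function names, the min/max bounds, and the sample values are assumptions for illustration; the real Application Auto Scaling service also applies cooldowns and its own alarm logic.

```python
import math


def desired_task_count(current_tasks: int, observed: float, target: float,
                       min_tasks: int = 2, max_tasks: int = 500) -> int:
    """Proportional estimate of the tasks needed to bring a metric back to target."""
    if current_tasks <= 0 or target <= 0:
        raise ValueError("current_tasks and target must be positive")
    needed = math.ceil(current_tasks * observed / target)
    # Clamp to the (assumed) service bounds.
    return max(min_tasks, min(max_tasks, needed))


def scale_decision(current_tasks: int, cpu_pct: float, mem_pct: float) -> int:
    """CPU (70%) and memory (80%) each have a target; keep the larger requirement."""
    by_cpu = desired_task_count(current_tasks, cpu_pct, target=70.0)
    by_mem = desired_task_count(current_tasks, mem_pct, target=80.0)
    return max(by_cpu, by_mem)
```

With 50 tasks at 85% CPU and 60% memory, the CPU policy dominates and the sketch asks for 61 tasks.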
Low Latency Optimization
Global Edge Network
CloudFront CDN:
├─ 300+ Edge Locations worldwide
├─ Cache Hit Ratio Target: 80%+
├─ TTL Strategy:
│  ├─ Static Assets: 1 year
│  ├─ Product Images: 7 days
│  ├─ API Responses: 5 minutes (for product listings)
│  └─ Dynamic Content: No cache
└─ Regional Edge Caches for better cache efficiency
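One way to express the TTL strategy is as `Cache-Control` headers set by the origin. The sketch below maps a request path to a header value using the four TTLs listed above. The file extensions and the `/api/products` prefix are hypothetical; a real setup might configure these as CloudFront cache policies instead.

```python
ONE_YEAR = 365 * 24 * 3600    # static assets
SEVEN_DAYS = 7 * 24 * 3600    # product images
FIVE_MINUTES = 5 * 60         # product listing API responses

# Hypothetical extension lists and API prefix, for illustration only.
STATIC_EXT = (".js", ".css", ".woff2")
IMAGE_EXT = (".jpg", ".jpeg", ".png", ".webp")
LISTING_PREFIX = "/api/products"


def cache_control_for(path: str) -> str:
    """Return a Cache-Control value matching the TTL strategy."""
    p = path.lower()
    if p.endswith(STATIC_EXT):
        return f"public, max-age={ONE_YEAR}, immutable"
    if p.endswith(IMAGE_EXT):
        return f"public, max-age={SEVEN_DAYS}"
    if p.startswith(LISTING_PREFIX):
        return f"public, max-age={FIVE_MINUTES}"
    # Everything else is treated as dynamic content and never cached.
    return "no-store"
```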
Caching Strategy (Multi-Layer)
Layer 1: CloudFront (Edge) - 50-100ms global latency
Layer 2: ElastiCache Redis (Regional) - 1-5ms latency
Layer 3: DynamoDB DAX (Optional) - sub-millisecond
Layer 4: Application Memory Cache - microseconds
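The layers are consulted from fastest to slowest, and a miss at one layer backfills the layers above it. Here is a minimal cache-aside sketch of that lookup order. `TTLCache` is a hypothetical in-process stand-in used for both the application memory layer and the Redis layer, and `load_from_db` is an assumed loader function.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-process cache with per-entry expiry (stand-in for one cache layer)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # drop the stale entry
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._data[key] = (time.monotonic() + self.ttl, value)


def read_through(key: str, local: TTLCache, regional: TTLCache,
                 load_from_db: Callable[[str], str]) -> Tuple[str, str]:
    """Check the fastest layer first; on a miss, fall back and backfill."""
    value = local.get(key)
    if value is not None:
        return value, "local"
    value = regional.get(key)
    if value is not None:
        local.set(key, value)
        return value, "regional"
    value = load_from_db(key)
    regional.set(key, value)
    local.set(key, value)
    return value, "database"
```

Giving the in-process layer a much shorter TTL than the regional layer limits how long a single task can serve stale data.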
Database Read Optimization
Read-Heavy Workloads:
├─ Product Catalog: DynamoDB (single-digit ms)
├─ User Sessions: DynamoDB (TTL auto-cleanup)
├─ Shopping Cart: DynamoDB + Redis cache
└─ Order History: Aurora Read Replicas (read scaling)
Write-Heavy Workloads:
├─ Order Processing: Aurora Primary + SQS buffering
├─ Inventory Updates: DynamoDB (1000s writes/sec)
└─ Analytics Events: Kinesis (real-time streaming)
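For the session store, DynamoDB's TTL feature expects a numeric attribute holding an expiry time in epoch seconds. The sketch below builds such an item as a plain dictionary. The attribute names and the 30-minute lifetime are assumptions. Because TTL deletion runs in the background and is not immediate, the sketch also checks expiry on read.

```python
import time
from typing import Any, Dict, Optional

SESSION_TTL_SECONDS = 30 * 60  # assumed session lifetime


def build_session_item(session_id: str, user_id: str,
                       now: Optional[float] = None,
                       ttl_seconds: int = SESSION_TTL_SECONDS) -> Dict[str, Any]:
    """Build a session item whose `expires_at` can serve as the table's TTL attribute."""
    issued_at = int(time.time() if now is None else now)
    return {
        "session_id": session_id,
        "user_id": user_id,
        "issued_at": issued_at,
        "expires_at": issued_at + ttl_seconds,  # epoch seconds
    }


def is_expired(item: Dict[str, Any], now: Optional[float] = None) -> bool:
    """Treat an item as expired even if background cleanup has not removed it yet."""
    current = int(time.time() if now is None else now)
    return current >= item["expires_at"]
```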
Latency Targets
P50 Latency: < 100ms (API responses)
P95 Latency: < 300ms
P99 Latency: < 500ms
Time to First Byte (TTFB): < 200ms globally
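To check measured latencies against these targets, one simple option is a nearest-rank percentile over a window of samples. This is a sketch only: the sample data is invented, and monitoring tools may compute percentiles with a different interpolation method.

```python
import math
from typing import Dict, List

# Targets from the list above, in milliseconds.
TARGETS_MS = {50: 100.0, 95: 300.0, 99: 500.0}


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_latency(samples: List[float]) -> Dict[str, bool]:
    """Map each target (P50, P95, P99) to whether the samples meet it."""
    return {f"P{p}": percentile(samples, p) < limit for p, limit in TARGETS_MS.items()}


# Invented sample window: mostly fast, with a slow tail.
window = [80.0] * 90 + [250.0] * 8 + [900.0] * 2
result = check_latency(window)
```

For this window, P50 and P95 pass, while P99 fails because the two slowest requests take 900 ms.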
Cost Optimization
Monthly Cost Estimate: $14,000 - $22,000 (Budgetary Estimate)
Compute (40% = $5,600 - $8,800):
├─ ECS Fargate: ~50 tasks on average (peak periods scale well beyond this and raise the average cost)
│  └─ 0.5-1 vCPU, 1-2GB RAM per task
├─ Lambda: 10M invocations/month (low cost; the free tier covers the first 1M requests)
└─ Savings: Use Fargate Spot for non-critical workloads (up to 70% savings)
Database (35% = $4,900 - $7,700):
├─ Aurora Serverless v2: 4-8 ACU average (sustained peak usage pushes cost toward the upper end of this estimate)
├─ DynamoDB: On-Demand pricing (pay per request); a significant cost component given the high read/write volume
├─ ElastiCache: 3x cache.r6g.large (reserved instances, about 40% off)
└─ OpenSearch: 3x r6g.large.search (reserved instances)
Storage (7% = $1,000 - $1,500):
├─ S3: 5TB storage + 10TB transfer
├─ CloudFront: 20TB data transfer
└─ Lifecycle policies: Auto-tier to Glacier after 90 days
Networking (10% = $1,500 - $2,500):
├─ NAT Gateways: 3x $45/month
├─ ALB: 2x $23/month + LCU charges
└─ Data Transfer: Inter-AZ + Internet egress (a major networking cost driver given the Multi-AZ design and high traffic volumes)
Other (8% = $1,000 - $1,500):
├─ Route 53, WAF, CloudWatch, X-Ray
├─ Secrets Manager, KMS
└─ Third-party: Stripe (2.9% + $0.30 per transaction), SendGrid, Twilio
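The Compute line (40% = $5,600 - $8,800) implies a monthly total of roughly $14,000 to $22,000. The sketch below applies the stated percentage shares to that range so the per-category figures can be sanity-checked. Strict percentages give slightly different numbers for Storage, Networking, and Other than the rounded ranges quoted above.

```python
from typing import Dict, Tuple

# Percentage shares as stated in the estimate.
SHARE_PCT = {"Compute": 40, "Database": 35, "Storage": 7, "Networking": 10, "Other": 8}


def breakdown(low_total: int, high_total: int) -> Dict[str, Tuple[int, int]]:
    """Split a (low, high) monthly total into per-category dollar ranges."""
    return {
        name: (round(low_total * pct / 100), round(high_total * pct / 100))
        for name, pct in SHARE_PCT.items()
    }


ranges = breakdown(14_000, 22_000)
for name, (low, high) in ranges.items():
    print(f"{name}: ${low:,} - ${high:,}")
```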
Cost Optimization Strategies
1. Compute Savings
✅ Use Fargate Spot for batch jobs (up to 70% savings)
✅ Right-size containers: Start small, scale based on metrics
✅ Lambda provisioned concurrency only for critical functions
✅ Schedule scale-down during off-peak hours (2am-6am)
2. Database Savings
✅ Aurora Serverless v2: Pay only for active capacity
✅ DynamoDB On-Demand: No idle capacity costs
✅ Reserved Instances for ElastiCache (1-year, 40% off)
✅ Read replicas only during peak hours
3. Storage Savings
✅ S3 Intelligent-Tiering: Auto-optimize storage class
✅ CloudFront: Optimize cache hit ratio to reduce origin requests
✅ Image compression: WebP format (about 30% smaller)
✅ Lifecycle policies: Delete old logs after 90 days
4. Networking Savings
✅ Use VPC gateway endpoints for S3/DynamoDB (traffic bypasses the NAT Gateways and their data processing charges)
✅ CloudFront: Reduce origin requests with longer TTLs
✅ Compress responses: gzip/brotli (reduce transfer costs)
Performance Benchmarks
Capacity Planning
Total Users: 1,000,000
├─ Peak Concurrent: 50,000 users
├─ Average Session: 15 minutes
└─ Requests per User: 20 requests/session
Total Traffic:
├─ Peak QPS: 10,000 requests/second
├─ Average QPS: 2,000 requests/second
├─ Daily Requests: ~170 million
└─ Monthly Requests: ~5 billion
Database Load:
├─ Aurora: 5,000 reads/sec, 500 writes/sec
├─ DynamoDB: 10,000 reads/sec, 2,000 writes/sec
└─ Cache Hit Rate: 95% (reduces DB load by 20x)
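The traffic figures can be reproduced with a few lines of arithmetic. The sketch below derives daily and monthly request counts from the average QPS, estimates the request rate implied by the session figures, and computes the database load reduction from the cache hit rate. The session-derived rate (about 1,100 requests/second) is well below the stated 10,000 peak QPS, so the peak figure presumably includes burst headroom or traffic beyond the 20 requests per session. That reading is an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_requests(average_qps: int) -> int:
    return average_qps * SECONDS_PER_DAY


def monthly_requests(average_qps: int, days: int = 30) -> int:
    return daily_requests(average_qps) * days


def session_driven_qps(concurrent_users: int, requests_per_session: int,
                       session_minutes: int) -> float:
    """Request rate if every concurrent user spreads requests evenly over a session."""
    return concurrent_users * requests_per_session / (session_minutes * 60)


def db_load_reduction(cache_hit_rate: float) -> float:
    """Factor by which a cache cuts database reads (only misses reach the database)."""
    if not 0.0 <= cache_hit_rate < 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - cache_hit_rate)


print(daily_requests(2_000))                        # 172,800,000 (~170 million)
print(monthly_requests(2_000))                      # 5,184,000,000 (~5 billion)
print(round(session_driven_qps(50_000, 20, 15)))    # 1111
print(round(db_load_reduction(0.95)))               # 20
```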
Scalability Limits
Current Architecture Supports:
├─ 100,000 concurrent users (2x headroom)
├─ 20,000 QPS (2x headroom)
├─ 10,000 orders/hour
└─ 1 million products in catalog
Scale-Up Path:
├─ Add more ECS tasks (up to 500+)
├─ Add Aurora read replicas (up to 15)
├─ Increase ElastiCache shards (up to 500 nodes)
└─ DynamoDB auto-scales to millions of requests/sec
Security Best Practices
Network Security:
├─ WAF: Block SQL injection, XSS, bad bots
├─ Security Groups: Least privilege (only required ports)
├─ NACLs: Subnet-level firewall rules
└─ Private Subnets: No direct internet access for app/data tiers
Data Security:
├─ Encryption at Rest: KMS for all data stores
├─ Encryption in Transit: TLS 1.3 everywhere
├─ Secrets Management: No hardcoded credentials
└─ Database Encryption: Aurora + DynamoDB native encryption
Identity & Access:
├─ Cognito: User authentication with MFA
├─ IAM Roles: Service-to-service authentication
├─ Least Privilege: Each service has minimal permissions
└─ Audit Logging: CloudTrail for all API calls
Application Security:
├─ Input Validation: Server-side validation
├─ Rate Limiting: API Gateway throttling
├─ CORS: Strict origin policies
└─ Dependency Scanning: ECR image scanning
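API Gateway throttling is based on the token bucket algorithm: a steady-state rate refills the bucket and a burst size caps it. The sketch below is a minimal version of that idea, useful for reasoning about rate and burst settings. The class name and the sample numbers are assumptions, and the caller passes in the clock value so the behavior is easy to test.

```python
class TokenBucket:
    """Minimal token bucket: `rate_per_sec` refills, `burst` caps the bucket."""

    def __init__(self, rate_per_sec: float, burst: int) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.tokens = float(burst)   # start with a full bucket
        self.updated_at = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would respond with HTTP 429


bucket = TokenBucket(rate_per_sec=2.0, burst=3)
burst_results = [bucket.allow(now=0.0) for _ in range(4)]  # fourth request is throttled
```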
Monitoring & Alerting
Key Metrics Dashboard
Golden Signals:
├─ Latency: P50, P95, P99 response times
├─ Traffic: Requests per second by service
├─ Errors: 4xx/5xx error rates
└─ Saturation: CPU, memory, database connections
Business Metrics:
├─ Orders per minute
├─ Cart abandonment rate
├─ Payment success rate
├─ Search conversion rate
└─ Revenue per hour
Critical Alarms
🚨 P1 Alarms (Immediate Response):
├─ API Gateway 5xx > 1% for 5 minutes
├─ Payment Service error rate > 0.5%
├─ Database CPU > 90% for 10 minutes
├─ Order processing queue depth > 10,000
└─ CloudFront 5xx > 5% for 5 minutes
⚠️ P2 Alarms (15-min Response):
├─ ECS task failure rate > 5%
├─ Cache hit rate < 80%
├─ API latency P95 > 500ms
└─ Lambda concurrent executions > 800
P3 Alarms (Review Next Day):
├─ Daily cost > $500
├─ S3 storage growth > 20% week-over-week
├─ Unusual traffic patterns (anomaly detection)
└─ Security Group changes (CloudTrail)
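Alarms such as "5xx > 1% for 5 minutes" fire only when the threshold is breached for a sustained window, not on a single spike. The sketch below shows that evaluation over one-minute datapoints. The one-minute period and the sample series are assumptions for illustration.

```python
from typing import Sequence


def sustained_breach(datapoints: Sequence[float], threshold: float, periods: int) -> bool:
    """True when the most recent `periods` datapoints all exceed the threshold."""
    if periods <= 0:
        raise ValueError("periods must be positive")
    if len(datapoints) < periods:
        return False  # not enough data to decide
    return all(value > threshold for value in datapoints[-periods:])


# Hypothetical 5xx percentages, one datapoint per minute.
spike = [0.2, 0.3, 4.0, 0.2, 0.3, 0.2]
outage = [0.2, 1.5, 1.2, 1.1, 1.3, 1.4]

page_on_spike = sustained_breach(spike, threshold=1.0, periods=5)    # False
page_on_outage = sustained_breach(outage, threshold=1.0, periods=5)  # True
```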
Deployment Strategy
CI/CD Pipeline
Code Commit → GitHub
↓
CodePipeline Triggered
↓
CodeBuild (Parallel):
├─ Run Unit Tests
├─ Run Integration Tests
├─ Build Docker Image
├─ Scan for Vulnerabilities
└─ Push to ECR
↓
Deploy to Dev Environment
↓
Automated E2E Tests
↓
Manual Approval (for Prod)
↓
Blue/Green Deployment to Prod
├─ Deploy to "Green" environment
├─ Run smoke tests
├─ Shift 10% traffic → Monitor
├─ Shift 50% traffic → Monitor
└─ Shift 100% traffic → Complete
↓
Rollback if error rate > 1%
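The traffic-shifting stage boils down to a loop: move a share of traffic to green, observe the error rate, then either continue or roll back. The sketch below models that control flow. `error_rate_at` is a hypothetical callback standing in for whatever monitoring query a real pipeline would run after each shift.

```python
from typing import Callable, List, Tuple

STEPS = (10, 50, 100)    # percent of traffic sent to green at each stage
MAX_ERROR_RATE = 0.01    # roll back above 1%


def shift_traffic(error_rate_at: Callable[[int], float]) -> Tuple[str, List[int]]:
    """Walk through the shift steps; return the outcome and the weights applied."""
    applied: List[int] = []
    for pct in STEPS:
        applied.append(pct)
        if error_rate_at(pct) > MAX_ERROR_RATE:
            applied.append(0)  # rollback: all traffic returns to blue
            return "rolled_back", applied
    return "completed", applied


healthy = shift_traffic(lambda pct: 0.002)
failing = shift_traffic(lambda pct: 0.03 if pct >= 50 else 0.001)
```

In the failing case the deployment stops at the 50% step and the weights end at 0, which is why monitoring between steps matters more than the final cutover.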
Zero-Downtime Deployment
ECS Blue/Green:
├─ New task definition deployed
├─ Health checks pass
├─ ALB shifts traffic gradually
└─ Old tasks drained and terminated
Database Migrations:
├─ Backward-compatible changes only
├─ Schema changes in separate deployment
├─ Use Aurora zero-downtime patching
└─ Test on read replica first
Disaster Recovery Plan
RTO & RPO Targets
Tier 1 Services (Critical):
├─ Order Processing, Payment
├─ RTO: < 5 minutes
├─ RPO: < 1 minute
└─ Strategy: Active-Active cross-region
Tier 2 Services (Important):
├─ Product Catalog, User Service
├─ RTO: < 15 minutes
├─ RPO: < 5 minutes
└─ Strategy: Warm standby in secondary region
Tier 3 Services (Non-Critical):
├─ Analytics, Recommendations
├─ RTO: < 1 hour
├─ RPO: < 1 hour
└─ Strategy: Backup and restore
Failover Procedures
Automated Failover:
├─ Route 53 health checks detect failure
├─ Traffic automatically routed to eu-west-1
├─ Aurora promotes read replica to primary
├─ DynamoDB Global Tables (already active)
└─ CloudWatch alarm triggers runbook
Manual Failover (if needed):
1. Promote Aurora replica in DR region
2. Update Route 53 to point to DR region
3. Scale up ECS tasks in DR region
4. Verify all services healthy
5. Communicate to stakeholders
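The automated path hinges on one decision: when do failed health checks count as a regional failure? A common approach is to require several consecutive failures before switching. The sketch below models that rule for the two regions named above. The threshold of three is an assumption, not a value taken from this architecture.

```python
from typing import Sequence

PRIMARY = "us-east-1"
SECONDARY = "eu-west-1"


def active_region(health_results: Sequence[bool], failure_threshold: int = 3) -> str:
    """Route to the secondary once the last `failure_threshold` checks all failed."""
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    recent = health_results[-failure_threshold:]
    if len(recent) == failure_threshold and not any(recent):
        return SECONDARY
    return PRIMARY


# True = health check passed, False = failed (most recent last).
steady = active_region([True, True, False, True])       # stays on us-east-1
failed = active_region([True, False, False, False])     # switches to eu-west-1
```

A single failed check does not trigger failover, which avoids flapping between regions on transient errors.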
Migration Roadmap 🗺️
Phase 1: Foundation (Week 1-2)
✅ Set up VPC, subnets, security groups
✅ Deploy Aurora, DynamoDB, ElastiCache
✅ Configure CloudFront, Route 53, WAF
✅ Set up IAM roles, Secrets Manager
Phase 2: Core Services (Week 3-4)
✅ Deploy Product Service (read-only first)
✅ Deploy User Service + Cognito
✅ Deploy Cart Service
✅ Migrate product catalog to DynamoDB
✅ Set up monitoring and alarms
Phase 3: Transactions (Week 5-6)
✅ Deploy Order Service
✅ Deploy Payment Service (Stripe integration)
✅ Deploy Inventory Service
✅ Set up event-driven architecture (SQS, SNS, EventBridge)
✅ Load testing (simulate 10K QPS)
Phase 4: Optimization (Week 7-8)
✅ Fine-tune caching strategy
✅ Optimize database queries
✅ Set up cross-region replication
✅ Implement auto-scaling policies
✅ Chaos engineering tests
Phase 5: Go-Live (Week 9)
✅ Final load testing (50K concurrent users)
✅ DR failover drill
✅ Cutover plan execution
✅ Monitor for 48 hours
✅ Post-launch optimization