How we achieve 99.9% uptime on a team of eight
Our infrastructure philosophy, our on-call rotation, and the three incidents that taught us the most in our first year of operation.
SyncTec has maintained 99.9% uptime since launch. We're a team of 8 people.
How do we do it? Infrastructure philosophy, good tooling, and learning from failures.
Our Philosophy
**1. Simple beats clever**
We could have built a microservices architecture. We didn't. We run a monolith.
Microservices are great at scale, but they add operational complexity. With 8 people, simple wins.
**2. Boring technology**
We use: Node.js, PostgreSQL, Redis, AWS. Nothing exotic.
Boring technology means lots of documentation, mature tooling, and easy hiring.
**3. Automate everything that can fail repeatedly**
If something fails more than twice, we automate the fix.
Examples: database failover, worker restarts, log rotation, SSL certificate renewal.
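As a flavor of what those automations look like, here's a minimal watchdog sketch that restarts a worker after repeated health-check failures. The endpoint, thresholds, and restart command are hypothetical placeholders, not our production setup:

```typescript
// watchdog.ts -- minimal sketch; the endpoint, thresholds, and restart
// command below are hypothetical placeholders, not our production values.
import { execFile } from "node:child_process";

const HEALTH_URL = "http://localhost:3000/healthz"; // hypothetical endpoint
const MAX_FAILURES = 3;           // restart after this many consecutive failures
const CHECK_INTERVAL_MS = 30_000;

let failures = 0;

async function checkOnce(): Promise<void> {
  try {
    const res = await fetch(HEALTH_URL, { signal: AbortSignal.timeout(5_000) });
    if (!res.ok) throw new Error(`status ${res.status}`);
    failures = 0; // healthy: reset the counter
  } catch (err) {
    failures += 1;
    console.error(`health check failed (${failures}/${MAX_FAILURES}):`, err);
    if (failures >= MAX_FAILURES) {
      failures = 0;
      // "systemctl restart worker" stands in for whatever restarts your process.
      execFile("systemctl", ["restart", "worker"], (e) => {
        if (e) console.error("restart failed; this is where we page a human:", e);
      });
    }
  }
}

setInterval(checkOnce, CHECK_INTERVAL_MS); // global fetch requires Node 18+
```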
**4. Alerts should be actionable**
We don't alert on 'CPU at 60%'. We alert on 'API response time over 2 seconds for 5 minutes'.
Every alert should answer: 'What's broken?' and 'What do I do about it?'
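The "sustained condition" part is what keeps alerts actionable. We configure this in Datadog, but the underlying logic is just a threshold held over a window. An illustrative sketch (the metric and numbers mirror the example above; the rest is made up):

```typescript
// Sustained-threshold alerting: fire only when the condition has held for
// the whole window, not on a single spike. Illustrative logic only -- our
// real monitors live in Datadog, not in app code.
interface Sample {
  timestampMs: number;
  p95LatencyMs: number;
}

const THRESHOLD_MS = 2_000;       // "over 2 seconds"
const WINDOW_MS = 5 * 60 * 1_000; // "for 5 minutes"

function shouldAlert(samples: Sample[], nowMs: number): boolean {
  const recent = samples.filter((s) => nowMs - s.timestampMs <= WINDOW_MS);
  if (recent.length === 0) return false; // missing data is its own alert
  return recent.every((s) => s.p95LatencyMs > THRESHOLD_MS);
}
```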
Infrastructure
**Load balancers:** AWS ALB (2 instances, multi-AZ)
**API servers:** 5 EC2 instances (autoscale to 10)
**Workers:** 20 EC2 instances (autoscale to 50)
**Database:** RDS PostgreSQL (primary + 2 read replicas, multi-AZ)
**Cache:** ElastiCache Redis (3-node cluster)
**Storage:** S3 (for logs and backups)
**Monitoring:** Datadog
**On-call:** PagerDuty
**Total monthly cost:** ~$8,000 (we're profitable, so this is sustainable)
On-Call Rotation
We have 4 engineers. Each person is on-call for 1 week per month.
**On-call responsibilities:**
- Respond to pages within 15 minutes
- Fix issues or escalate
- Document incidents
- Write a post-mortem for anything that causes downtime
**Compensation:**
- $500 bonus for on-call week
- Extra pay if you're actually paged (rare)
Incident 1: The Redis OOM Kill (February 2024)
**What happened:**
Redis ran out of memory at 3 a.m. The Linux OOM killer terminated the Redis process, taking our job queue down with it. All syncs stopped.
**Impact:**
- 45 minutes of downtime
- 2,000 failed syncs
- Angry customers
**Root cause:**
We were storing full product JSON in Redis as a cache. One customer synced 10,000 products with large image payloads, which exhausted Redis's memory.
**Fix:**
- Increased Redis memory (short-term)
- Changed caching strategy to store product IDs only, not full JSON (long-term)
- Added memory alerts
**What we learned:**
Don't cache things that can grow unbounded. IDs are fixed size. JSON is not.
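Concretely, the long-term fix changed what we put in the cache. A rough sketch with ioredis (the key scheme and TTL here are illustrative, not our actual ones):

```typescript
import Redis from "ioredis";

const redis = new Redis(); // defaults to localhost:6379

// Before (unbounded): value size grows with product payloads.
//   await redis.set(`product:${id}`, JSON.stringify(fullProductJson));

// After (bounded): cache only IDs, and attach a TTL so entries expire
// instead of accumulating until the OOM killer shows up again.
async function cacheProductIds(customerId: string, ids: string[]): Promise<void> {
  const key = `customer:${customerId}:product-ids`; // hypothetical key scheme
  await redis.set(key, JSON.stringify(ids), "EX", 3600); // 1-hour TTL
}

async function getCachedProductIds(customerId: string): Promise<string[] | null> {
  const raw = await redis.get(`customer:${customerId}:product-ids`);
  return raw ? (JSON.parse(raw) as string[]) : null;
}
```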
Incident 2: The Shopify API Timeout Storm (April 2024)
**What happened:**
Shopify's API started timing out at a high rate (20% of requests). Our workers retried aggressively. This made it worse. Shopify rate-limited us. Everything stopped.
**Impact:**
- 2 hours of degraded service
- 10,000 failed syncs
- Shopify support ticket
**Root cause:**
Our retry logic was too aggressive. When Shopify slowed down, we hammered them harder.
**Fix:**
- Implemented exponential backoff with jitter
- Added circuit breakers (stop retrying if error rate > 50%)
- Reduced concurrency during Shopify incidents
**What we learned:**
When external APIs fail, back off. Don't make it worse.
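For the curious, the shape of that fix looks something like this: full-jitter exponential backoff plus a crude circuit breaker keyed off the observed error rate. A sketch, not our production code (a real breaker also needs a half-open state to probe for recovery):

```typescript
// Full-jitter exponential backoff plus a minimal circuit breaker.
// Thresholds are examples; a production breaker also needs a half-open
// state so it can probe for recovery instead of staying open forever.
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 60_000;
const MAX_ATTEMPTS = 5;

function backoffDelay(attempt: number): number {
  const cap = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  return Math.random() * cap; // full jitter: uniform in [0, cap)
}

class CircuitBreaker {
  private outcomes: boolean[] = []; // sliding window, true = success

  record(success: boolean): void {
    this.outcomes.push(success);
    if (this.outcomes.length > 100) this.outcomes.shift();
  }

  isOpen(): boolean {
    if (this.outcomes.length < 20) return false; // not enough signal yet
    const failures = this.outcomes.filter((ok) => !ok).length;
    return failures / this.outcomes.length > 0.5; // "error rate > 50%"
  }
}

const breaker = new CircuitBreaker();

async function callWithRetry<T>(fn: () => Promise<T>): Promise<T> {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    if (breaker.isOpen()) throw new Error("circuit open: skipping Shopify call");
    try {
      const result = await fn();
      breaker.record(true);
      return result;
    } catch (err) {
      breaker.record(false);
      if (attempt === MAX_ATTEMPTS - 1) throw err;
      await new Promise((r) => setTimeout(r, backoffDelay(attempt)));
    }
  }
  throw new Error("unreachable");
}
```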
Incident 3: The Database Failover (September 2024)
**What happened:**
AWS performed maintenance on our primary database. Automatic failover to a replica took 30 seconds, but our app didn't handle it gracefully.
**Impact:**
- 5 minutes of errors
- Some customers saw failed syncs
- No data loss (failover worked correctly)
**Root cause:**
Our app assumed the database was always available and didn't handle connection errors gracefully.
**Fix:**
- Added connection pooling with auto-reconnect
- Graceful degradation (queue jobs locally during database downtime)
- Better error messages to customers
**What we learned:**
Everything fails eventually. Plan for it.
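The connection-handling fix, sketched with node-postgres (pool sizes, timeouts, and retry counts are illustrative; the "queue jobs locally" piece sits a layer above this and isn't shown):

```typescript
import { Pool } from "pg";

// A pool hands out connections on demand and replaces dead ones, so a
// 30-second failover degrades some requests instead of wedging the process.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,
  connectionTimeoutMillis: 5_000,
});

// Idle clients get killed during a failover; log and move on rather than
// crashing -- the pool creates fresh connections on the next query.
pool.on("error", (err) => {
  console.error("idle client error (likely failover):", err.message);
});

// Retry wrapper to ride out the brief failover window.
async function queryWithRetry(sql: string, params: unknown[] = [], retries = 3) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await pool.query(sql, params);
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: surface the error
      await new Promise((r) => setTimeout(r, 1_000 * (attempt + 1)));
    }
  }
}
```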
Current Reliability Numbers
**Uptime (last 12 months):** 99.92%
**Incidents:** 8 total
**Downtime:** 7 hours total (about 0.08% of the ~8,760 hours in a year, which is where 99.92% comes from)
**Mean time to detect:** 2 minutes
**Mean time to resolve:** 22 minutes
Tools We Rely On
**Monitoring:** Datadog (metrics, logs, APM)
**Alerting:** PagerDuty (on-call rotation)
**Logging:** CloudWatch + Datadog
**Error tracking:** Sentry
**Status page:** StatusPage.io (customers can check status)
**Backups:** Automated daily snapshots to S3
What We Don't Do
**We don't do 24/7 support**
On-call is for incidents, not support tickets. Support tickets wait until business hours.
**We don't do blue-green deployments**
Too complex for our needs. We deploy during low-traffic hours with a rollback plan.
**We don't do multi-region**
Everything runs in us-east-1. Multi-region adds complexity we don't need yet.
Future Plans
**Next 6 months:**
- Implement canary deployments
- Add more integration tests
- Move to managed Kubernetes (current EC2 approach is getting complex)
**Next 12 months:**
- Multi-region (EU for data residency)
- 99.95% uptime target
The Bottom Line
99.9% uptime doesn't require a huge team or exotic infrastructure.
It requires: simple architecture, good monitoring, fast response, and learning from failures.
We're 8 people. We've had 8 incidents in 12 months. We learned from each one. That's how you get to 99.9%.