Engineering
February 12, 2025
10 min read

How we achieve 99.9% uptime on a team of eight

Our infrastructure philosophy, our on-call rotation, and the three incidents that taught us the most in our first year of operation.

James Okafor

SyncTec has maintained 99.9% uptime since launch. We're a team of eight.

How do we do it? Infrastructure philosophy, good tooling, and learning from failures.

Our Philosophy

**1. Simple beats clever**

We could have built a microservices architecture. We didn't. We run a monolith.

Microservices are great at scale, but they add operational complexity. With 8 people, simple wins.

**2. Boring technology**

We use: Node.js, PostgreSQL, Redis, AWS. Nothing exotic.

Boring technology means: lots of documentation, mature tooling, easy to hire for.

**3. Automate everything that can fail repeatedly**

If something fails more than twice, we automate the fix.

Examples: Database failover, worker restarts, log rotation, SSL renewal.
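
As an illustration, here's a minimal sketch of the worker-restart case, assuming a hypothetical `/healthz` endpoint and a systemd unit named `sync-worker` (both are illustrative, not our real setup):

```typescript
import { exec } from "node:child_process";

// If a worker's health endpoint stops answering, restart its systemd unit.
async function checkAndRestart(url: string, unit: string): Promise<void> {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(5_000) });
    if (res.ok) return; // worker is healthy, nothing to do
  } catch {
    // timeout or connection refused: fall through to the restart
  }
  exec(`systemctl restart ${unit}`, (err) => {
    if (err) console.error(`restart of ${unit} failed:`, err.message);
  });
}

// Poll once a minute; a real version would page if restarts keep failing.
setInterval(
  () => checkAndRestart("http://localhost:8080/healthz", "sync-worker"),
  60_000,
);
```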

**4. Alerts should be actionable**

We don't alert on 'CPU at 60%'. We alert on 'API response time over 2 seconds for 5 minutes'.

Every alert should answer: 'What's broken?' and 'What do I do about it?'
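
For concreteness, here's roughly what such a monitor looks like when created through Datadog's v1 monitor API. The metric name, tags, and runbook URL below are placeholders, not our actual configuration:

```typescript
// Placeholder metric/tag names; DD-API-KEY / DD-APPLICATION-KEY headers
// omitted for brevity.
const monitor = {
  name: "API latency over 2s for 5 minutes",
  type: "metric alert",
  // Fires only when latency stays high for 5 minutes, not on a single spike.
  query: "avg(last_5m):avg:api.request.duration{env:prod} > 2",
  message: [
    "What's broken: API responses have exceeded 2s for 5+ minutes.",
    "What to do: follow the runbook at https://wiki.example.com/runbooks/api-latency",
  ].join("\n"),
  options: { thresholds: { critical: 2 } },
};

await fetch("https://api.datadoghq.com/api/v1/monitor", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(monitor),
});
```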

Infrastructure

**Load balancers:** AWS ALB (2 instances, multi-AZ)

**API servers:** 5 EC2 instances (autoscale to 10)

**Workers:** 20 EC2 instances (autoscale to 50)

**Database:** RDS PostgreSQL (primary + 2 read replicas, multi-AZ)

**Cache:** ElastiCache Redis (3-node cluster)

**Storage:** S3 (for logs and backups)

**Monitoring:** Datadog

**On-call:** PagerDuty

**Total monthly cost:** ~$8,000 (we're profitable, so this is sustainable)

On-Call Rotation

We have 4 engineers. Each person is on-call for 1 week per month.

**On-call responsibilities:**

  • Respond to pages within 15 minutes
  • Fix issues or escalate
  • Document incidents
  • Post-mortem for anything that causes downtime

**Compensation:**

  • $500 bonus for on-call week
  • Extra if actually paged (rare)

Incident 1: The Redis OOM Kill (February 2024)

**What happened:**

Redis ran out of memory at 3am, and the Linux OOM killer terminated the process. The queue died. All syncs stopped.

**Impact:**

  • 45 minutes of downtime
  • 2,000 failed syncs
  • Angry customers

**Root cause:**

We were caching full product JSON in Redis. One customer synced 10,000 products with large image payloads, which filled Redis's available memory.

**Fix:**

  • Increased Redis memory (short-term)
  • Changed caching strategy to store product IDs only, not full JSON (long-term)
  • Added memory alerts

**What we learned:**

Don't cache things that can grow unbounded. IDs are fixed size. JSON is not.
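
Here's a sketch of the revised caching strategy with ioredis; the key scheme and TTL are illustrative:

```typescript
import Redis from "ioredis";

const redis = new Redis(); // defaults to localhost:6379

// Before: full product JSON in the cache, which grows without bound.
// After: only the fixed-size list of product IDs; bodies come from Postgres.
async function cacheSyncedProducts(storeId: string, productIds: string[]) {
  const key = `store:${storeId}:product-ids`;
  await redis
    .multi()
    .del(key)
    .sadd(key, ...productIds)
    .expire(key, 3600) // 1-hour TTL so stale entries age out
    .exec();
}

// Belt and braces: tell Redis to evict least-recently-used keys instead of
// letting the OOM killer take the process down. In redis.conf (or the
// ElastiCache parameter group):
//   maxmemory-policy allkeys-lru
```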

Incident 2: The Shopify API Timeout Storm (April 2024)

**What happened:**

Shopify's API started timing out at a high rate (20% of requests). Our workers retried aggressively. This made it worse. Shopify rate-limited us. Everything stopped.

**Impact:**

  • 2 hours of degraded service
  • 10,000 failed syncs
  • Shopify support ticket

**Root cause:**

Our retry logic was too aggressive. When Shopify slowed down, we hammered them harder.

**Fix:**

  • Implemented exponential backoff with jitter
  • Added circuit breakers (stop retrying if error rate > 50%)
  • Reduced concurrency during Shopify incidents

**What we learned:**

When external APIs fail, back off. Don't make it worse.
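
Here's a minimal sketch of both mechanisms; the window sizes, thresholds, and delays are illustrative:

```typescript
// Naive circuit breaker: a sliding window of recent call outcomes.
const outcomes: boolean[] = [];

function record(ok: boolean): void {
  outcomes.push(ok);
  if (outcomes.length > 20) outcomes.shift();
}

function circuitOpen(): boolean {
  const errors = outcomes.filter((ok) => !ok).length;
  // Open the circuit once more than half of a reasonable sample has failed.
  return outcomes.length >= 10 && errors / outcomes.length > 0.5;
}

// Exponential backoff with full jitter: each delay is uniform in
// [0, min(cap, base * 2^attempt)], so retries from many workers spread
// out instead of arriving in synchronized waves.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    if (circuitOpen()) throw new Error("circuit open: upstream is unhealthy");
    try {
      const result = await fn();
      record(true);
      return result;
    } catch (err) {
      record(false);
      if (attempt + 1 >= maxAttempts) throw err;
      const delay = Math.random() * Math.min(30_000, 500 * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```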

Incident 3: The Database Failover (September 2024)

**What happened:**

AWS performed maintenance on our primary database. Automatic failover to a replica took 30 seconds, but our app didn't handle it gracefully, and customers saw errors for 5 minutes.

**Impact:**

  • 5 minutes of errors
  • Some customers saw failed syncs
  • No data loss (failover worked correctly)

**Root cause:**

Our app assumed the database was always available and didn't handle connection errors gracefully.

**Fix:**

  • Added connection pooling with auto-reconnect
  • Graceful degradation (queue jobs locally during database downtime)
  • Better error messages to customers

**What we learned:**

Everything fails eventually. Plan for it.
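
A sketch of the connection handling with node-postgres; the local queue is simplified to an in-memory array:

```typescript
import { Pool } from "pg";

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20,
  connectionTimeoutMillis: 5_000,
});

// Idle clients can die mid-failover; without this handler the pool's
// 'error' event would crash the process. The pool replaces dead clients.
pool.on("error", (err) => {
  console.error("idle client error, pool will reconnect:", err.message);
});

// Simplified stand-in for our local job queue.
const localQueue: Array<{ sql: string; params: unknown[] }> = [];

async function queryOrQueue(sql: string, params: unknown[]) {
  try {
    return await pool.query(sql, params);
  } catch (err) {
    // During the ~30s failover window, park the job for a retry loop to
    // drain later instead of surfacing a hard error to the customer.
    localQueue.push({ sql, params });
    throw err;
  }
}
```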

Current Reliability Numbers

**Uptime (last 12 months):** 99.92%

**Incidents:** 8 total

**Downtime:** 7 hours total

**Mean time to detect:** 2 minutes

**Mean time to resolve:** 22 minutes

Tools We Rely On

**Monitoring:** Datadog (metrics, logs, APM)

**Alerting:** PagerDuty (on-call rotation)

**Logging:** CloudWatch + Datadog

**Error tracking:** Sentry

**Status page:** StatusPage.io (customers can check status)

**Backups:** Automated daily snapshots to S3

What We Don't Do

**We don't do 24/7 support**

On-call is for incidents, not support tickets. Support tickets wait until business hours.

**We don't do blue-green deployments**

Too complex for our needs. We deploy during low-traffic hours with a rollback plan.

**We don't do multi-region**

Everything runs in us-east-1. Multi-region adds complexity we don't need yet.

Future Plans

**Next 6 months:**

  • Implement canary deployments
  • Add more integration tests
  • Move to managed Kubernetes (current EC2 approach is getting complex)

**Next 12 months:**

  • Multi-region (EU for data residency)
  • 99.95% uptime target

The Bottom Line

99.9% uptime doesn't require a huge team or exotic infrastructure.

It requires: simple architecture, good monitoring, fast response, and learning from failures.

We're a team of eight. We've had 8 incidents in 12 months. We learned from each one. That's how you get to 99.9%.

Ready to sync your stores?

Start your free 14-day trial. No credit card required.
