Multi-Provider Failover Guide

Build resilient AI applications with automatic provider failover and redundancy


Overview

Multi-provider failover enables your application to automatically switch between AI providers when one fails, ensuring high availability and reliability. NeuroLink provides built-in failover capabilities with configurable priorities, conditions, and retry strategies.

Key Benefits

  • 🔒 99.9%+ Uptime: Automatic failover when providers are down
  • ⚡ Zero Downtime: Seamless switching between providers
  • 💰 Cost Optimization: Route to cheaper providers when available
  • 🌍 Geographic Redundancy: Distribute across regions
  • 🔄 Smart Retries: Exponential backoff with configurable limits
  • 📊 Failover Metrics: Track provider reliability

Use Cases

  • Production Applications: Ensure critical AI features never go down
  • Cost Optimization: Use expensive providers only when needed
  • Geographic Distribution: Serve users from nearest region
  • A/B Testing: Route traffic between providers for comparison
  • Compliance: Route EU traffic to GDPR-compliant providers

Quick Start

Basic Failover Configuration

import { NeuroLink } from "@juspay/neurolink";

const ai = new NeuroLink({
  providers: [
    {
      name: "openai",
      priority: 1, // Primary provider
      config: {
        apiKey: process.env.OPENAI_API_KEY,
      },
    },
    {
      name: "anthropic",
      priority: 2, // Fallback 1
      config: {
        apiKey: process.env.ANTHROPIC_API_KEY,
      },
    },
    {
      name: "google-ai",
      priority: 3, // Fallback 2
      config: {
        apiKey: process.env.GOOGLE_AI_API_KEY,
      },
    },
  ],
});

// Automatically tries OpenAI → Anthropic → Google AI
const result = await ai.generate({
  input: { text: "Hello world!" },
  // No provider specified - uses priority order
});

Test Failover

// If the primary provider fails (e.g. invalid API key or outage), the same call succeeds via a fallback
const result = await ai.generate({
  input: { text: "Test failover" },
});

console.log("Used provider:", result.provider); // Will show fallback provider
console.log("Attempts:", result.metadata.attempts);
console.log("Failed providers:", result.metadata.failedProviders);

Failover Strategies

1. Priority-Based Failover

Try providers in priority order until one succeeds.

const ai = new NeuroLink({
  providers: [
    { name: "openai", priority: 1 }, // Try first
    { name: "anthropic", priority: 2 }, // Try second
    { name: "google-ai", priority: 3 }, // Try third
  ],
  failoverConfig: {
    enabled: true,
    maxAttempts: 3, // Try up to 3 providers
    retryDelay: 1000, // Wait 1s between attempts
    exponentialBackoff: true, // 1s, 2s, 4s delays
  },
});

2. Condition-Based Routing

Route to specific providers based on request conditions.

const ai = new NeuroLink({
  providers: [
    {
      name: "mistral",
      priority: 1, // (1)!
      condition: (req) => req.userRegion === "EU", // (2)!
      config: { apiKey: process.env.MISTRAL_API_KEY },
    },
    {
      name: "openai",
      priority: 1,
      condition: (req) => req.userRegion !== "EU", // (3)!
      config: { apiKey: process.env.OPENAI_API_KEY },
    },
    {
      name: "google-ai",
      priority: 2, // (4)!
      config: { apiKey: process.env.GOOGLE_AI_API_KEY },
    },
  ],
});

// Usage
const result = await ai.generate({
  input: { text: "Your prompt" },
  metadata: { userRegion: "EU" }, // (5)!
});
  1. Same priority: Both Mistral and OpenAI have priority 1, but conditions determine which one is used.
  2. GDPR compliance: Route EU users to Mistral AI (a European provider) to help meet GDPR and data-residency requirements.
  3. Regional routing: Non-EU users go to OpenAI. Multiple providers can share a priority when their conditions are mutually exclusive (see the selection sketch after this list).
  4. Universal fallback: Google AI (priority 2) has no condition, so it's used if both priority 1 providers fail.
  5. Pass routing metadata: Include userRegion in metadata so conditions can access it for routing decisions.
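
Conceptually, the routing above reduces to two steps: keep the providers whose condition accepts the request (or that have no condition at all), then walk the survivors in priority order. The following is a rough sketch of that selection step, stated as an assumption about the behavior rather than NeuroLink internals:

interface ProviderEntry {
  name: string;
  priority: number;
  // Condition receives the request metadata (e.g. { userRegion: "EU" })
  condition?: (metadata: Record<string, unknown>) => boolean;
}

// Rough model of the selection order implied by the annotations above:
// filter out providers whose condition rejects the request, then sort by priority.
function candidateOrder(
  providers: ProviderEntry[],
  metadata: Record<string, unknown>,
): ProviderEntry[] {
  return providers
    .filter((p) => !p.condition || p.condition(metadata))
    .sort((a, b) => a.priority - b.priority);
}

// With the config above:
// candidateOrder(providers, { userRegion: "EU" })  => ["mistral", "google-ai"]
// candidateOrder(providers, { userRegion: "US" })  => ["openai", "google-ai"]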

3. Cost-Based Routing

Try cheaper providers first, fallback to premium providers.

const ai = new NeuroLink({
  providers: [
    {
      name: "google-ai",
      priority: 1, // Free tier first
      model: "gemini-2.0-flash",
      condition: (req) => !req.requiresPremium,
    },
    {
      name: "openai",
      priority: 2, // Paid tier fallback
      model: "gpt-4o-mini",
      condition: (req) => !req.requiresPremium,
    },
    {
      name: "anthropic",
      priority: 3, // Premium for complex tasks
      model: "claude-3-5-sonnet-20241022",
    },
  ],
});

// Cheap query (uses Google AI free tier)
const cheap = await ai.generate({
  input: { text: "Simple customer query" },
  metadata: { requiresPremium: false },
});

// Complex query (uses Anthropic)
const premium = await ai.generate({
  input: { text: "Complex business analysis requiring detailed reasoning..." },
  metadata: { requiresPremium: true },
});

4. Load-Balanced Failover

Combine load balancing with failover.

const ai = new NeuroLink({
  providers: [
    // Load balanced primary tier
    {
      name: "openai-1",
      priority: 1,
      config: { apiKey: process.env.OPENAI_KEY_1 },
    },
    {
      name: "openai-2",
      priority: 1,
      config: { apiKey: process.env.OPENAI_KEY_2 },
    },
    {
      name: "openai-3",
      priority: 1,
      config: { apiKey: process.env.OPENAI_KEY_3 },
    },

    // Fallback tier
    { name: "anthropic", priority: 2 },
    { name: "google-ai", priority: 3 },
  ],
  loadBalancing: "round-robin", // Balance across same-priority providers
  failoverConfig: { enabled: true },
});
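
Round-robin across a same-priority tier is conceptually just an index that advances with every request. A minimal, framework-independent sketch of that idea (illustrative only, not NeuroLink internals):

// Minimal round-robin selector over a tier of same-priority providers.
function makeRoundRobin<T>(tier: T[]): () => T {
  let next = 0;
  return () => {
    const pick = tier[next];
    next = (next + 1) % tier.length; // advance and wrap around
    return pick;
  };
}

// const pickPrimary = makeRoundRobin(["openai-1", "openai-2", "openai-3"]);
// pickPrimary(); // "openai-1", then "openai-2", then "openai-3", then wraps

If the chosen provider fails, failover then moves down to the priority 2 and 3 tiers as usual.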

Retry Configuration

Exponential Backoff

const ai = new NeuroLink({
  providers: [
    { name: "openai", priority: 1 },
    { name: "anthropic", priority: 2 },
  ],
  failoverConfig: {
    enabled: true,
    maxAttempts: 5,
    retryDelay: 1000, // Start with 1s
    exponentialBackoff: true, // 1s, 2s, 4s, 8s, 16s
    maxRetryDelay: 30000, // Cap at 30s
  },
});
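
To make the schedule concrete: the delay before retry attempt n is the base retryDelay doubled for each prior attempt, capped at maxRetryDelay. A small sketch of that arithmetic, assuming the simple doubling scheme the comments above describe:

// Illustrative only: the delay schedule implied by the settings above,
// assuming delay = retryDelay * 2^(attempt - 1), capped at maxRetryDelay.
function backoffDelay(attempt: number, retryDelay = 1000, maxRetryDelay = 30000): number {
  return Math.min(retryDelay * 2 ** (attempt - 1), maxRetryDelay);
}

// [1, 2, 3, 4, 5].map((n) => backoffDelay(n)); // [1000, 2000, 4000, 8000, 16000]
// A 6th attempt would be capped: backoffDelay(6) === 30000, not 32000.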

Selective Retry

const ai = new NeuroLink({
  providers: [
    { name: "openai", priority: 1 },
    { name: "anthropic", priority: 2 },
  ],
  failoverConfig: {
    enabled: true,
    retryOn: [
      // (1)!
      "ECONNREFUSED", // Connection errors
      "ETIMEDOUT", // Timeout
      "429", // Rate limit
      "500", // Server errors
      "502", // Bad gateway
      "503", // Service unavailable
      "504", // Gateway timeout
    ],
    doNotRetryOn: [
      // (2)!
      "400", // Bad request (client error)
      "401", // Invalid API key
      "403", // Forbidden
    ],
  },
});
  1. Retryable errors: Transient failures worth retrying. Network errors (ECONNREFUSED, ETIMEDOUT) and server issues (429, 5xx) often resolve on retry (see the classification sketch after this list).
  2. Non-retryable errors: Client-side errors that won't be fixed by retrying. Invalid requests (400), authentication failures (401), and authorization issues (403) require code changes.
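
The two lists above amount to an error classifier: the deny-list wins, then the allow-list decides. A minimal sketch of that idea as a standalone helper (hypothetical, not part of the NeuroLink API):

// Hypothetical helper mirroring the retryOn / doNotRetryOn lists above.
const RETRY_ON = ["ECONNREFUSED", "ETIMEDOUT", "429", "500", "502", "503", "504"];
const DO_NOT_RETRY_ON = ["400", "401", "403"];

function isRetryable(error: { code?: string; status?: number }): boolean {
  const key = error.code ?? String(error.status ?? "");
  if (DO_NOT_RETRY_ON.includes(key)) return false; // client errors: fix the request instead
  return RETRY_ON.includes(key); // transient network/server errors: worth another attempt
}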

Custom Retry Logic

const ai = new NeuroLink({
  providers: [
    { name: "openai", priority: 1 },
    { name: "anthropic", priority: 2 },
  ],
  failoverConfig: {
    enabled: true,
    shouldRetry: (error, attempt, provider) => {
      // Custom retry logic
      if (error.message.includes("rate limit")) {
        console.log(`Rate limited on ${provider}, waiting...`);
        return attempt < 3; // Retry up to 3 times
      }

      if (error.message.includes("timeout")) {
        console.log(`Timeout on ${provider}, trying next...`);
        return false; // Don't retry timeouts, failover immediately
      }

      // Default: retry on network errors
      return error.code?.startsWith("E");
    },
    getRetryDelay: (attempt, error, provider) => {
      // Custom delay calculation
      if (error.message.includes("rate limit")) {
        return 5000; // Wait 5s for rate limits
      }
      return Math.pow(2, attempt) * 1000; // Exponential for others
    },
  },
});

Provider Health Checks

Active Health Monitoring

const ai = new NeuroLink({
  providers: [
    { name: "openai", priority: 1 },
    { name: "anthropic", priority: 2 },
    { name: "google-ai", priority: 3 },
  ],
  healthCheck: {
    enabled: true,
    interval: 60000, // Check every 60s
    timeout: 5000, // 5s timeout per check
    unhealthyThreshold: 3, // Mark unhealthy after 3 failures
    healthyThreshold: 2, // Mark healthy after 2 successes
  },
});

// Get provider health status
const health = await ai.getProviderHealth();
console.log(health);
/*
{
  openai: { status: 'healthy', latency: 120, lastCheck: '2025-01-15T10:00:00Z' },
  anthropic: { status: 'unhealthy', latency: null, lastCheck: '2025-01-15T10:00:00Z' },
  'google-ai': { status: 'healthy', latency: 95, lastCheck: '2025-01-15T10:00:00Z' }
}
*/

// Only use healthy providers
const result = await ai.generate({
  input: { text: "Your prompt" },
  useOnlyHealthy: true, // Skip anthropic (unhealthy)
});

Circuit Breaker Pattern

const ai = new NeuroLink({
  providers: [
    { name: "openai", priority: 1 },
    { name: "anthropic", priority: 2 },
  ],
  circuitBreaker: {
    enabled: true,
    failureThreshold: 5, // Open circuit after 5 failures
    resetTimeout: 60000, // Try again after 60s
    halfOpenRequests: 3, // Test with 3 requests when half-open
  },
});

// Circuit breaker state machine:
// CLOSED (normal) → 5 failures → OPEN (block requests)
// → wait 60s → HALF_OPEN (test with 3 requests)
// → 3 successes → CLOSED | 1 failure → OPEN

// Get circuit state
const state = await ai.getCircuitState("openai");
console.log(state); // 'CLOSED' | 'OPEN' | 'HALF_OPEN'
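
The state machine in the comments above is the standard circuit breaker pattern. For clarity, here is a self-contained sketch of that logic, independent of NeuroLink and intended only to illustrate the transitions:

type CircuitState = "CLOSED" | "OPEN" | "HALF_OPEN";

class SimpleCircuitBreaker {
  private state: CircuitState = "CLOSED";
  private failures = 0;
  private halfOpenSuccesses = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5, // failures before opening
    private resetTimeout = 60_000, // how long to stay open (ms)
    private halfOpenRequests = 3, // successes needed to close again
  ) {}

  canRequest(): boolean {
    if (this.state === "OPEN" && Date.now() - this.openedAt >= this.resetTimeout) {
      this.state = "HALF_OPEN"; // start probing the provider again
      this.halfOpenSuccesses = 0;
    }
    return this.state !== "OPEN"; // OPEN fails fast instead of waiting on a dead provider
  }

  recordSuccess(): void {
    if (this.state === "HALF_OPEN") {
      if (++this.halfOpenSuccesses >= this.halfOpenRequests) {
        this.state = "CLOSED"; // provider has recovered
        this.failures = 0;
      }
    } else {
      this.failures = 0;
    }
  }

  recordFailure(): void {
    if (this.state === "HALF_OPEN" || ++this.failures >= this.failureThreshold) {
      this.state = "OPEN"; // stop sending traffic to this provider
      this.openedAt = Date.now();
      this.failures = 0;
    }
  }
}

The key property is that an OPEN circuit fails fast rather than spending retry budget on a provider that is already known to be down.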

Production Patterns

Pattern 1: High Availability Setup

// Production-ready HA configuration
const ai = new NeuroLink({
  providers: [
    // Tier 1: Load-balanced primary
    { name: "openai-us-east", priority: 1, region: "us-east-1" },
    { name: "openai-us-west", priority: 1, region: "us-west-2" },

    // Tier 2: Alternative provider
    { name: "anthropic-us", priority: 2 },

    // Tier 3: Emergency fallback
    { name: "google-ai", priority: 3 },
  ],
  loadBalancing: "latency-based", // Route to fastest provider
  failoverConfig: {
    enabled: true,
    maxAttempts: 6, // Try all providers
    retryDelay: 500,
    exponentialBackoff: true,
  },
  healthCheck: {
    enabled: true,
    interval: 30000, // Check every 30s
    timeout: 3000,
  },
  circuitBreaker: {
    enabled: true,
    failureThreshold: 10,
    resetTimeout: 120000, // 2 minutes
  },
});

Pattern 2: Cost-Optimized Failover

// Free tier first, paid tier fallback
const ai = new NeuroLink({
  providers: [
    {
      name: "google-ai",
      priority: 1,
      model: "gemini-2.0-flash",
      config: { apiKey: process.env.GOOGLE_AI_KEY },
      costPerToken: 0, // Free tier
    },
    {
      name: "openai",
      priority: 2,
      model: "gpt-4o-mini",
      config: { apiKey: process.env.OPENAI_KEY },
      costPerToken: 0.00015,
    },
    {
      name: "anthropic",
      priority: 3,
      model: "claude-3-5-sonnet-20241022",
      config: { apiKey: process.env.ANTHROPIC_KEY },
      costPerToken: 0.003,
    },
  ],
  failoverConfig: {
    enabled: true,
    // Skip rate-limited free tier immediately
    shouldFailover: (error, provider) => {
      if (provider.costPerToken === 0 && error.message.includes("quota")) {
        console.log("Free tier exhausted, failing over to paid tier");
        return true;
      }
      return error.code?.startsWith("E"); // Network errors
    },
  },
});

Pattern 3: Geographic Routing

// Route to nearest region with failover
const ai = new NeuroLink({
  providers: [
    // US East
    {
      name: "openai-us-east",
      priority: 1,
      condition: (req) => req.userRegion === "us-east",
    },

    // US West
    {
      name: "openai-us-west",
      priority: 1,
      condition: (req) => req.userRegion === "us-west",
    },

    // Europe
    {
      name: "mistral-eu",
      priority: 1,
      condition: (req) => req.userRegion === "eu",
    },

    // Asia Pacific
    {
      name: "vertex-asia",
      priority: 1,
      condition: (req) => req.userRegion === "asia",
    },

    // Global fallback
    { name: "openai-global", priority: 2 },
  ],
});

// Usage
const result = await ai.generate({
  input: { text: "Your prompt" },
  metadata: {
    userRegion: getUserRegion(req.ip), // Detect from IP
  },
});
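
getUserRegion above is your own code, not part of NeuroLink. As one hypothetical way to implement it, assuming you already have a country code for the caller (for example from a GeoIP lookup or a CDN header such as Cloudflare's CF-IPCountry), the mapping to the region keys used in the conditions above might look like this:

// Hypothetical helper: map a country code to the region keys the
// provider conditions above expect ("us-east", "us-west", "eu", "asia").
function regionFromCountry(countryCode: string, usState?: string): string {
  const EU = ["DE", "FR", "IT", "ES", "NL", "IE", "PL", "SE"];
  const APAC = ["IN", "SG", "JP", "KR", "AU"];

  if (EU.includes(countryCode)) return "eu";
  if (APAC.includes(countryCode)) return "asia";
  if (countryCode === "US") {
    // Rough split of US states between the two US regions
    return ["CA", "WA", "OR", "NV", "AZ"].includes(usState ?? "") ? "us-west" : "us-east";
  }
  return "us-east"; // default for unmapped countries; the priority 2 global fallback still applies
}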

Pattern 4: Model-Specific Failover

// Different models with same capability
const ai = new NeuroLink({
  providers: [
    // Primary: GPT-4
    {
      name: "openai",
      priority: 1,
      model: "gpt-4o",
      capability: "complex-reasoning",
    },

    // Fallback 1: Claude 3.5 Sonnet (similar capability)
    {
      name: "anthropic",
      priority: 2,
      model: "claude-3-5-sonnet-20241022",
      capability: "complex-reasoning",
    },

    // Fallback 2: Gemini Pro
    {
      name: "google-ai",
      priority: 3,
      model: "gemini-1.5-pro",
      capability: "complex-reasoning",
    },
  ],
  failoverConfig: {
    enabled: true,
    matchCapability: true, // Only failover to same capability
  },
});

Monitoring and Metrics

Track Failover Events

const ai = new NeuroLink({
  providers: [
    { name: "openai", priority: 1 },
    { name: "anthropic", priority: 2 },
  ],
  failoverConfig: {
    enabled: true,
    onFailover: (event) => {
      // Log failover event
      console.log({
        timestamp: new Date().toISOString(),
        from: event.failedProvider,
        to: event.successfulProvider,
        error: event.error.message,
        attempts: event.attempts,
        latency: event.totalLatency,
      });

      // Send to monitoring system
      metrics.increment("ai.failover.count", {
        from: event.failedProvider,
        to: event.successfulProvider,
      });
    },
    onSuccess: (event) => {
      // Log successful request
      metrics.histogram("ai.latency", event.latency, {
        provider: event.provider,
        model: event.model,
      });
    },
  },
});

Failover Metrics Dashboard

// Track provider reliability
class FailoverMetrics {
  private stats = new Map();

  recordAttempt(provider: string, success: boolean, latency: number) {
    if (!this.stats.has(provider)) {
      this.stats.set(provider, {
        total: 0,
        successes: 0,
        failures: 0,
        totalLatency: 0,
      });
    }

    const stat = this.stats.get(provider);
    stat.total++;
    stat.totalLatency += latency;

    if (success) {
      stat.successes++;
    } else {
      stat.failures++;
    }
  }

  getProviderStats() {
    const stats = [];

    for (const [provider, stat] of this.stats.entries()) {
      stats.push({
        provider,
        total: stat.total,
        successRate: (stat.successes / stat.total) * 100,
        avgLatency: stat.totalLatency / stat.total,
        failureCount: stat.failures,
      });
    }

    return stats.sort((a, b) => b.successRate - a.successRate);
  }
}

// Usage
const metrics = new FailoverMetrics();

const ai = new NeuroLink({
  providers: [
    /* ... */
  ],
  failoverConfig: {
    enabled: true,
    onSuccess: (event) => {
      metrics.recordAttempt(event.provider, true, event.latency);
    },
    onFailover: (event) => {
      metrics.recordAttempt(event.failedProvider, false, event.latency);
      metrics.recordAttempt(event.successfulProvider, true, event.latency);
    },
  },
});

// View stats
console.log(metrics.getProviderStats());
/*
[
  { provider: 'openai', total: 1000, successRate: 99.5, avgLatency: 120, failureCount: 5 },
  { provider: 'anthropic', total: 50, successRate: 98.0, avgLatency: 150, failureCount: 1 },
  { provider: 'google-ai', total: 10, successRate: 100, avgLatency: 95, failureCount: 0 }
]
*/

Best Practices

1. ✅ Always Configure Multiple Providers

// ❌ Bad: Single provider (no failover)
const ai = new NeuroLink({
  providers: [{ name: "openai" }],
});

// ✅ Good: Multiple providers with failover
const ai = new NeuroLink({
  providers: [
    { name: "openai", priority: 1 },
    { name: "anthropic", priority: 2 },
    { name: "google-ai", priority: 3 },
  ],
  failoverConfig: { enabled: true },
});

2. ✅ Use Health Checks in Production

// ✅ Good: Active health monitoring
const ai = new NeuroLink({
  providers: [
    /* ... */
  ],
  healthCheck: {
    enabled: true,
    interval: 60000, // 1 minute
    timeout: 5000, // 5 seconds
  },
});

3. ✅ Implement Circuit Breakers

// ✅ Good: Prevent cascading failures
const ai = new NeuroLink({
  providers: [
    /* ... */
  ],
  circuitBreaker: {
    enabled: true,
    failureThreshold: 5,
    resetTimeout: 60000,
  },
});

4. ✅ Monitor Failover Events

// ✅ Good: Track failures for debugging
failoverConfig: {
  enabled: true,
  onFailover: (event) => {
    logger.error('Provider failover', {
      from: event.failedProvider,
      to: event.successfulProvider,
      error: event.error
    });

    // Alert if too many failovers
    if (event.attempts > 3) {
      alerting.sendAlert('Multiple provider failures detected');
    }
  }
}

5. ✅ Test Failover Regularly

// ✅ Good: Test failover in CI/CD
describe("Failover", () => {
  it("should failover when primary provider fails", async () => {
    // Mock OpenAI failure
    mockOpenAI.mockRejectedValue(new Error("503 Service Unavailable"));

    const result = await ai.generate({
      input: { text: "test" },
    });

    // Verify failover occurred
    expect(result.provider).toBe("anthropic");
    expect(result.metadata.attempts).toBe(2);
  });
});

Troubleshooting

Issue 1: Failover Not Triggering

Problem: Requests fail without trying fallback providers.

Solution:

// Ensure failover is enabled
failoverConfig: {
  enabled: true,  // Must be true
  maxAttempts: 3  // Must be > 1
}

// Check provider priorities
providers: [
  { name: 'openai', priority: 1 },  // Different priorities
  { name: 'anthropic', priority: 2 }  // Not same priority
]

Issue 2: Too Many Retry Attempts

Problem: Requests take too long due to excessive retries.

Solution:

// Limit retry attempts
failoverConfig: {
  enabled: true,
  maxAttempts: 3,        // Limit attempts
  retryDelay: 1000,      // Reduce delay
  maxRetryDelay: 5000    // Cap max delay
}

Issue 3: Circuit Breaker Stuck Open

Problem: Provider marked as failed even when healthy.

Solution:

// Adjust circuit breaker settings
circuitBreaker: {
  enabled: true,
  failureThreshold: 10,    // Increase threshold
  resetTimeout: 30000,     // Reduce timeout
  halfOpenRequests: 5      // More test requests
}

// Manually reset circuit
await ai.resetCircuit('openai');

Additional Resources


Need Help? Join our GitHub Discussions or open an issue.