
Load Balancing Guide

Distribute AI requests across multiple providers, API keys, and regions for optimal performance


Overview

Load balancing distributes incoming AI requests across multiple providers, API keys, or model instances to optimize throughput, reduce latency, and prevent rate limiting. NeuroLink supports multiple load balancing strategies out of the box.

Key Benefits

  • ⚡ Higher Throughput: Parallel requests across multiple keys/providers
  • 🔒 Avoid Rate Limits: Distribute load to stay within quotas
  • 🌍 Lower Latency: Route to fastest/nearest provider
  • 💰 Cost Optimization: Balance between free and paid tiers
  • 📊 Fair Distribution: Ensure even usage across resources
  • 🔄 Dynamic Scaling: Add/remove providers on the fly

Use Cases

  • High-Volume Applications: Handle thousands of requests per second
  • Rate Limit Management: Stay within provider quotas
  • Multi-Region Deployment: Serve global users efficiently
  • Cost Management: Maximize free tier usage before paid
  • A/B Testing: Compare provider performance
  • Gradual Rollouts: Slowly migrate between providers

Quick Start

Basic Round-Robin Load Balancing

import { NeuroLink } from "@juspay/neurolink";

const ai = new NeuroLink({
  providers: [
    {
      name: "openai-key-1",
      config: { apiKey: process.env.OPENAI_KEY_1 },
    },
    {
      name: "openai-key-2",
      config: { apiKey: process.env.OPENAI_KEY_2 },
    },
    {
      name: "openai-key-3",
      config: { apiKey: process.env.OPENAI_KEY_3 },
    },
  ],
  loadBalancing: "round-robin",
});

// Requests distributed evenly:
// Request 1 → openai-key-1
// Request 2 → openai-key-2
// Request 3 → openai-key-3
// Request 4 → openai-key-1 (cycles back)

for (let i = 0; i < 10; i++) {
  const result = await ai.generate({
    input: { text: `Request ${i}` },
  });
  console.log(`Used: ${result.provider}`);
}

Load Balancing Strategies

1. Round-Robin (Default)

Distribute requests evenly in circular order.

const ai = new NeuroLink({
  providers: [
    { name: "provider-1" },
    { name: "provider-2" },
    { name: "provider-3" },
  ],
  loadBalancing: "round-robin",
});

// Distribution: P1 → P2 → P3 → P1 → P2 → P3 ...
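
Under the hood, round-robin is just a counter cycling over the provider list. A minimal sketch of the equivalent custom selector (illustrative only — the built-in strategy handles this for you):

// Round-robin expressed as a custom selector
let next = 0;

const ai = new NeuroLink({
  providers: [
    { name: "provider-1" },
    { name: "provider-2" },
    { name: "provider-3" },
  ],
  loadBalancing: {
    strategy: "custom",
    selector: (providers) => providers[next++ % providers.length],
  },
});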

Best for:

  • Providers with equal capacity
  • Even distribution needed
  • Simple setup

2. Weighted Round-Robin

Distribute requests in proportion to each provider's weight.

const ai = new NeuroLink({
  providers: [
    { name: "provider-1", weight: 3 }, // 60% of traffic
    { name: "provider-2", weight: 2 }, // 40% of traffic
  ],
  loadBalancing: "weighted-round-robin",
});

// Out of 5 requests:
// 3 → provider-1 (60%)
// 2 → provider-2 (40%)
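
A weight is simply a share of the total (weight ÷ sum of weights). One simple way to reproduce the same distribution is to cycle through a weight-expanded list — a sketch, not the library's actual implementation (which may interleave picks more smoothly):

// Sketch: expand providers by weight, then cycle (3 + 2 = 5 slots per cycle)
function weightedOrder<T extends { weight: number }>(providers: T[]): T[] {
  return providers.flatMap((p) => Array(p.weight).fill(p));
}

let slot = 0;
const order = weightedOrder([
  { name: "provider-1", weight: 3 },
  { name: "provider-2", weight: 2 },
]);
const pick = () => order[slot++ % order.length]; // P1, P1, P1, P2, P2, repeat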

Best for:

  • Different provider capacities
  • Gradual migrations
  • Free tier optimization

Example: Free Tier Prioritization

const ai = new NeuroLink({
  providers: [
    {
      name: "google-ai",
      weight: 5, // 71% (free tier)
      config: { apiKey: process.env.GOOGLE_AI_KEY },
    },
    {
      name: "openai",
      weight: 2, // 29% (paid tier, lower priority)
      config: { apiKey: process.env.OPENAI_KEY },
    },
  ],
  loadBalancing: "weighted-round-robin",
});

3. Least-Busy

Route to provider with fewest active requests.

const ai = new NeuroLink({
  providers: [
    { name: "provider-1" },
    { name: "provider-2" },
    { name: "provider-3" },
  ],
  loadBalancing: "least-busy",
});

// Automatically routes to least loaded provider
// Active requests: P1=5, P2=2, P3=8 → Routes to P2
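
Conceptually, least-busy keeps an in-flight counter per provider: increment when a request is dispatched, decrement when it settles, and always pick the minimum. A sketch of that bookkeeping with a custom selector and the onSuccess/onError hooks used elsewhere in this guide:

const inFlight = new Map<string, number>();
const count = (name: string) => inFlight.get(name) ?? 0;

const ai = new NeuroLink({
  providers: [
    { name: "provider-1" },
    { name: "provider-2" },
    { name: "provider-3" },
  ],
  loadBalancing: {
    strategy: "custom",
    selector: (providers) => {
      // Pick the provider with the fewest active requests
      const pick = providers.reduce((min, p) =>
        count(p.name) < count(min.name) ? p : min,
      );
      inFlight.set(pick.name, count(pick.name) + 1);
      return pick;
    },
  },
  onSuccess: (result) => {
    inFlight.set(result.provider, Math.max(0, count(result.provider) - 1));
  },
  onError: (_error, provider) => {
    inFlight.set(provider, Math.max(0, count(provider) - 1));
  },
});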

Best for:

  • Varying request durations
  • High concurrency
  • Real-time load adaptation

4. Latency-Based Routing

Route to fastest provider.

const ai = new NeuroLink({
  providers: [
    { name: "provider-1" },
    { name: "provider-2" },
    { name: "provider-3" },
  ],
  loadBalancing: "latency-based",
  healthCheck: {
    enabled: true,
    interval: 30000, // Update latency every 30s
  },
});

// Routes to provider with lowest average latency
// Latencies: P1=120ms, P2=95ms, P3=200ms → Routes to P2

Best for:

  • Geographic distribution
  • Performance-critical apps
  • Multi-region deployments

5. Hash-Based (Consistent Hashing)

Route requests with the same key (for example, a user ID) to the same provider.

const ai = new NeuroLink({
  providers: [
    { name: "provider-1" },
    { name: "provider-2" },
    { name: "provider-3" },
  ],
  loadBalancing: "hash",
  hashKey: (req) => req.userId, // Hash on user ID
});

// Same user always routed to same provider
// user123 → always provider-2
// user456 → always provider-1

Best for:

  • Session affinity
  • Conversation continuity
  • Caching optimization

Example: User-Based Routing

const result = await ai.generate({
  input: { text: "Your prompt" },
  metadata: { userId: "user-123" }, // Always routes to same provider
});
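
What the hash step does, in miniature: a stable string hash maps the key to a provider index, so the same userId lands on the same provider for as long as the provider list is unchanged. A sketch using FNV-1a with plain modulo (a production consistent-hashing ring additionally minimizes remapping when providers are added or removed):

// Sketch: stable string hash → provider index (FNV-1a)
function hashToIndex(key: string, buckets: number): number {
  let h = 2166136261;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) % buckets;
}

hashToIndex("user-123", 3); // always the same index for "user-123"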

6. Random

Randomly select provider.

const ai = new NeuroLink({
  providers: [
    { name: "provider-1" },
    { name: "provider-2" },
    { name: "provider-3" },
  ],
  loadBalancing: "random",
});

// Randomly selects any provider
// Good for simple load distribution

Best for:

  • Testing/development
  • Stateless requests
  • Equal provider capacity

Multi-Key Load Balancing

Managing Rate Limits

Distribute across multiple API keys to increase throughput.

// OpenAI: 500 RPM per key → 2500 RPM total with 5 keys
const ai = new NeuroLink({
  providers: [
    { name: "openai-1", config: { apiKey: process.env.OPENAI_KEY_1 } },
    { name: "openai-2", config: { apiKey: process.env.OPENAI_KEY_2 } },
    { name: "openai-3", config: { apiKey: process.env.OPENAI_KEY_3 } },
    { name: "openai-4", config: { apiKey: process.env.OPENAI_KEY_4 } },
    { name: "openai-5", config: { apiKey: process.env.OPENAI_KEY_5 } },
  ],
  loadBalancing: "round-robin",
  rateLimit: {
    requestsPerMinute: 500, // Per key limit
    strategy: "distributed", // Enforce across all keys
  },
});

// Total capacity: 2,500 RPM (5 keys × 500 RPM)

Quota Management

Track usage across multiple keys.

class QuotaManager {
  private usage = new Map<
    string,
    {
      requestsThisMinute: number;
      tokensThisMinute: number;
      minuteStart: number;
    }
  >();

  canUseProvider(providerName: string): boolean {
    const quota = this.usage.get(providerName);
    if (!quota) return true;

    const now = Date.now();

    // Reset if new minute
    if (now - quota.minuteStart > 60000) {
      quota.requestsThisMinute = 0;
      quota.tokensThisMinute = 0;
      quota.minuteStart = now;
      return true;
    }

    // Check limits (OpenAI Tier 1: 500 RPM, 30K TPM)
    return quota.requestsThisMinute < 500 && quota.tokensThisMinute < 30000;
  }

  recordUsage(providerName: string, tokens: number) {
    if (!this.usage.has(providerName)) {
      this.usage.set(providerName, {
        requestsThisMinute: 0,
        tokensThisMinute: 0,
        minuteStart: Date.now(),
      });
    }

    const quota = this.usage.get(providerName)!;
    quota.requestsThisMinute++;
    quota.tokensThisMinute += tokens;
  }
}

// Usage
const quotaManager = new QuotaManager();

const ai = new NeuroLink({
  providers: [
    { name: "openai-1", config: { apiKey: process.env.OPENAI_KEY_1 } },
    { name: "openai-2", config: { apiKey: process.env.OPENAI_KEY_2 } },
    { name: "openai-3", config: { apiKey: process.env.OPENAI_KEY_3 } },
  ],
  loadBalancing: {
    strategy: "custom",
    selector: (providers, req) => {
      // Select first provider below quota
      return (
        providers.find((p) => quotaManager.canUseProvider(p.name)) ||
        providers[0]
      );
    },
  },
  onSuccess: (result) => {
    quotaManager.recordUsage(result.provider, result.usage.totalTokens);
  },
});

Multi-Provider Load Balancing

Cross-Provider Distribution

Balance across different AI providers.

const ai = new NeuroLink({
  providers: [
    // 50% OpenAI
    { name: "openai", weight: 5, config: { apiKey: process.env.OPENAI_KEY } },

    // 30% Anthropic
    {
      name: "anthropic",
      weight: 3,
      config: { apiKey: process.env.ANTHROPIC_KEY },
    },

    // 20% Google AI
    {
      name: "google-ai",
      weight: 2,
      config: { apiKey: process.env.GOOGLE_AI_KEY },
    },
  ],
  loadBalancing: "weighted-round-robin",
});

// Distribution: 50% OpenAI, 30% Anthropic, 20% Google AI

A/B Testing

Compare provider performance.

const ai = new NeuroLink({
  providers: [
    {
      name: "openai",
      weight: 1,
      config: { apiKey: process.env.OPENAI_KEY },
      tags: ["experiment-a"],
    },
    {
      name: "anthropic",
      weight: 1,
      config: { apiKey: process.env.ANTHROPIC_KEY },
      tags: ["experiment-b"],
    },
  ],
  loadBalancing: "weighted-round-robin",
  onSuccess: (result) => {
    // Track metrics for each variant
    analytics.track("ai_request", {
      provider: result.provider,
      experiment: result.tags[0],
      latency: result.latency,
      tokens: result.usage.totalTokens,
      cost: result.cost,
    });
  },
});

// After collecting data, analyze which performs better
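
To decide a winner, aggregate the tracked events per experiment. A sketch, assuming the analytics.track calls above were also collected into an in-memory array (collectedEvents below is illustrative):

interface AbEvent {
  experiment: string;
  latency: number;
  cost: number;
}

function summarize(events: AbEvent[]) {
  const byExperiment = new Map<string, AbEvent[]>();
  for (const e of events) {
    const group = byExperiment.get(e.experiment) ?? [];
    group.push(e);
    byExperiment.set(e.experiment, group);
  }
  return Array.from(byExperiment.entries()).map(([experiment, group]) => ({
    experiment,
    requests: group.length,
    avgLatency: group.reduce((s, e) => s + e.latency, 0) / group.length,
    avgCost: group.reduce((s, e) => s + e.cost, 0) / group.length,
  }));
}

console.table(summarize(collectedEvents)); // compare avgLatency and avgCost per variant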

Geographic Load Balancing

Multi-Region Setup

Route users to nearest provider.

const ai = new NeuroLink({
  providers: [
    // US East
    {
      name: "openai-us-east",
      region: "us-east-1",
      priority: 1,
      condition: (req) => req.userRegion === "us-east",
    },

    // US West
    {
      name: "openai-us-west",
      region: "us-west-2",
      priority: 1,
      condition: (req) => req.userRegion === "us-west",
    },

    // Europe
    {
      name: "mistral-eu",
      region: "eu-west-1",
      priority: 1,
      condition: (req) => req.userRegion === "eu",
    },

    // Asia Pacific
    {
      name: "vertex-asia",
      region: "asia-southeast1",
      priority: 1,
      condition: (req) => req.userRegion === "asia",
    },
  ],
  loadBalancing: "latency-based",
});

// Usage
const result = await ai.generate({
  input: { text: "Your prompt" },
  metadata: {
    userRegion: detectRegion(req.ip), // us-east, us-west, eu, asia
  },
});
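
detectRegion is your own helper — NeuroLink only sees the resulting metadata. A hypothetical sketch that buckets a two-letter country code (obtained from a GeoIP lookup or CDN header, which is deployment-specific) into the four regions used above:

// Hypothetical helper: country code → region bucket
function detectRegionFromCountry(country: string): "us-east" | "us-west" | "eu" | "asia" {
  const eu = new Set(["DE", "FR", "GB", "NL", "ES", "IT"]);
  const asia = new Set(["IN", "SG", "JP", "KR"]);
  if (eu.has(country)) return "eu";
  if (asia.has(country)) return "asia";
  // Refine the US split with state or longitude data if you have it
  return country === "US" ? "us-east" : "us-west";
}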

Latency-Optimized Routing

// Track provider latencies
class LatencyTracker {
  private latencies = new Map<string, number[]>();

  recordLatency(provider: string, latency: number) {
    if (!this.latencies.has(provider)) {
      this.latencies.set(provider, []);
    }

    const arr = this.latencies.get(provider)!;
    arr.push(latency);

    // Keep last 100 measurements
    if (arr.length > 100) {
      arr.shift();
    }
  }

  getAverageLatency(provider: string): number {
    const arr = this.latencies.get(provider) || [];
    if (arr.length === 0) return Infinity; // no samples yet; note untried providers are never preferred

    return arr.reduce((a, b) => a + b, 0) / arr.length;
  }

  getFastestProvider(providers: string[]): string {
    let fastest = providers[0];
    let lowestLatency = this.getAverageLatency(fastest);

    for (const provider of providers) {
      const latency = this.getAverageLatency(provider);
      if (latency < lowestLatency) {
        lowestLatency = latency;
        fastest = provider;
      }
    }

    return fastest;
  }
}

// Usage
const tracker = new LatencyTracker();

const ai = new NeuroLink({
  providers: [
    { name: "provider-1" },
    { name: "provider-2" },
    { name: "provider-3" },
  ],
  loadBalancing: {
    strategy: "custom",
    selector: (providers) => {
      const fastest = tracker.getFastestProvider(providers.map((p) => p.name));
      return providers.find((p) => p.name === fastest)!;
    },
  },
  onSuccess: (result) => {
    tracker.recordLatency(result.provider, result.latency);
  },
});

Advanced Patterns

Pattern 1: Tiered Load Balancing

Combine multiple strategies across tiers.

const ai = new NeuroLink({
  providers: [
    // Tier 1: Free tier (round-robin within tier)
    { name: "google-ai-1", tier: 1, cost: 0 },
    { name: "google-ai-2", tier: 1, cost: 0 },
    { name: "google-ai-3", tier: 1, cost: 0 },

    // Tier 2: Cheap paid (round-robin within tier)
    { name: "openai-mini-1", tier: 2, cost: 0.15 },
    { name: "openai-mini-2", tier: 2, cost: 0.15 },

    // Tier 3: Premium (only when needed)
    { name: "anthropic-claude", tier: 3, cost: 3.0 },
  ],
  loadBalancing: {
    strategy: "tiered",
    tierStrategy: "round-robin", // Within each tier
    tierFallback: true, // Fall through tiers on failure
  },
});
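
If you need finer control than the built-in tiered strategy offers, the same behavior can be sketched as a custom selector: round-robin within the lowest tier that still has a usable provider (isHealthy is an assumed per-provider flag, e.g. fed by your health checks):

let rr = 0;

const tieredSelector = (
  providers: Array<{ name: string; tier: number; isHealthy?: boolean }>,
) => {
  // Walk tiers from cheapest to most expensive
  const tiers = [...new Set(providers.map((p) => p.tier))].sort((a, b) => a - b);
  for (const tier of tiers) {
    const candidates = providers.filter(
      (p) => p.tier === tier && p.isHealthy !== false,
    );
    if (candidates.length > 0) {
      return candidates[rr++ % candidates.length]; // round-robin within the tier
    }
  }
  return providers[0]; // last resort
};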

Pattern 2: Cost-Optimized Balancing

Balance based on cost and quota.

async function costOptimizedSelect(
  providers: Provider[],
  req: Request,
): Promise<Provider> {
  // Sort by cost (cheapest first) without mutating the caller's array
  const sorted = [...providers].sort((a, b) => a.cost - b.cost);

  // Try each provider in cost order
  for (const provider of sorted) {
    // Check if provider has quota available (e.g. via the QuotaManager shown earlier)
    if (await hasQuotaAvailable(provider)) {
      return provider;
    }
  }

  // All cheap providers exhausted, use expensive fallback
  return sorted[sorted.length - 1];
}

const ai = new NeuroLink({
  providers: [
    { name: "google-ai", cost: 0 }, // Free tier
    { name: "openai-mini", cost: 0.15 }, // Cheap paid
    { name: "gpt-4", cost: 3.0 }, // Premium
  ],
  loadBalancing: {
    strategy: "custom",
    selector: costOptimizedSelect,
  },
});

Pattern 3: Request-Type Based Routing

Route based on request characteristics.

const ai = new NeuroLink({
  providers: [
    // Fast, cheap model for simple queries
    {
      name: "gemini-flash",
      condition: (req) => req.complexity === "low",
      model: "gemini-2.0-flash",
    },

    // Balanced for medium complexity
    {
      name: "gpt-4o-mini",
      condition: (req) => req.complexity === "medium",
      model: "gpt-4o-mini",
    },

    // Premium for complex queries
    {
      name: "claude-sonnet",
      condition: (req) => req.complexity === "high",
      model: "claude-3-5-sonnet-20241022",
    },
  ],
});

// Usage
const simpleResult = await ai.generate({
  input: { text: "What is 2+2?" },
  metadata: { complexity: "low" }, // Routes to gemini-flash
});

const complexResult = await ai.generate({
  input: { text: "Analyze this complex business scenario..." },
  metadata: { complexity: "high" }, // Routes to claude-sonnet
});

Monitoring and Metrics

Load Distribution Dashboard

class LoadBalancerMetrics {
  private stats = new Map<
    string,
    {
      requests: number;
      errors: number;
      totalLatency: number;
      lastUsed: number;
    }
  >();

  recordRequest(provider: string, latency: number, error: boolean) {
    if (!this.stats.has(provider)) {
      this.stats.set(provider, {
        requests: 0,
        errors: 0,
        totalLatency: 0,
        lastUsed: Date.now(),
      });
    }

    const stat = this.stats.get(provider)!;
    stat.requests++;
    stat.totalLatency += latency;
    stat.lastUsed = Date.now();

    if (error) {
      stat.errors++;
    }
  }

  getStats() {
    const total = Array.from(this.stats.values()).reduce(
      (sum, stat) => sum + stat.requests,
      0,
    );

    return Array.from(this.stats.entries()).map(([provider, stat]) => ({
      provider,
      requests: stat.requests,
      percentage: (stat.requests / total) * 100,
      errorRate: (stat.errors / stat.requests) * 100,
      avgLatency: stat.totalLatency / stat.requests,
      lastUsed: new Date(stat.lastUsed).toISOString(),
    }));
  }
}

// Usage
const metrics = new LoadBalancerMetrics();

const ai = new NeuroLink({
  providers: [
    /* ... */
  ],
  onSuccess: (result) => {
    metrics.recordRequest(result.provider, result.latency, false);
  },
  onError: (error, provider) => {
    metrics.recordRequest(provider, 0, true);
  },
});

// View dashboard
console.table(metrics.getStats());
/*
┌─────────┬──────────────┬──────────┬────────────┬───────────┬────────────┬──────────────────────────┐
│ (index) │   provider   │ requests │ percentage │ errorRate │ avgLatency │        lastUsed          │
├─────────┼──────────────┼──────────┼────────────┼───────────┼────────────┼──────────────────────────┤
│    0    │  'openai-1'  │   342    │    34.2    │   0.29    │    125     │ 2025-01-15T10:30:45.123Z │
│    1    │  'openai-2'  │   338    │    33.8    │   0.00    │    118     │ 2025-01-15T10:30:46.456Z │
│    2    │  'openai-3'  │   320    │    32.0    │   0.31    │    132     │ 2025-01-15T10:30:44.789Z │
└─────────┴──────────────┴──────────┴────────────┴───────────┴────────────┴──────────────────────────┘
*/

Best Practices

1. ✅ Use Weighted Balancing for Migrations

// ✅ Good: Gradual migration from OpenAI to Anthropic
const ai = new NeuroLink({
  providers: [
    { name: "openai", weight: 7 }, // 70% (gradually decrease)
    { name: "anthropic", weight: 3 }, // 30% (gradually increase)
  ],
  loadBalancing: "weighted-round-robin",
});

// Week 1: 70/30 split
// Week 2: 50/50 split
// Week 3: 30/70 split
// Week 4: 0/100 split (fully migrated)
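
A small helper keeps the weekly ramp in one place — a sketch, assuming you rebuild the client with the new percentage at each step:

// Sketch: derive both weights from a single rollout percentage (0-100)
function migrationWeights(anthropicPct: number) {
  return [
    { name: "openai", weight: 100 - anthropicPct },
    { name: "anthropic", weight: anthropicPct },
  ];
}

const ai = new NeuroLink({
  providers: migrationWeights(30), // Week 1; bump to 50, 70, 100 in later weeks
  loadBalancing: "weighted-round-robin",
});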

2. ✅ Monitor Distribution Fairness

// ✅ Good: Alert if distribution becomes uneven
const expectedDistribution = {
  "provider-1": 33.3,
  "provider-2": 33.3,
  "provider-3": 33.3,
};

setInterval(() => {
  const stats = metrics.getStats();

  for (const stat of stats) {
    const expected = expectedDistribution[stat.provider];
    const deviation = Math.abs(stat.percentage - expected);

    if (deviation > 10) {
      // >10% deviation
      alerting.sendAlert(
        `Uneven distribution: ${stat.provider} at ${stat.percentage}% (expected ${expected}%)`,
      );
    }
  }
}, 60000); // Check every minute

3. ✅ Use Health Checks with Load Balancing

// ✅ Good: Don't route to unhealthy providers
const ai = new NeuroLink({
  providers: [
    /* ... */
  ],
  loadBalancing: "round-robin",
  healthCheck: {
    enabled: true,
    interval: 30000,
    excludeUnhealthy: true, // Skip unhealthy providers
  },
});

4. ✅ Implement Circuit Breakers

// ✅ Good: Prevent cascading failures
const ai = new NeuroLink({
  providers: [
    /* ... */
  ],
  loadBalancing: "round-robin",
  circuitBreaker: {
    enabled: true,
    failureThreshold: 5,
    resetTimeout: 60000,
  },
});

5. ✅ Test Load Distribution

// ✅ Good: Verify even distribution in tests
describe("Load Balancing", () => {
  it("should distribute requests evenly", async () => {
    const usage = new Map<string, number>();

    for (let i = 0; i < 300; i++) {
      const result = await ai.generate({
        input: { text: `Request ${i}` },
      });

      usage.set(result.provider, (usage.get(result.provider) || 0) + 1);
    }

    // Each provider should get ~100 requests (±10%)
    for (const [provider, count] of usage.entries()) {
      expect(count).toBeGreaterThan(90);
      expect(count).toBeLessThan(110);
    }
  });
});



Need Help? Join our GitHub Discussions or open an issue.