Cost Optimization Guide¶
Reduce AI costs by 80-95% through smart provider selection, caching, and optimization strategies
Overview¶
AI API costs can quickly escalate in production. This guide shows proven strategies to dramatically reduce AI spending while maintaining quality and performance. Learn how to leverage free tiers, choose cost-effective models, implement caching, and optimize token usage.
Potential Savings¶
| Strategy | Typical Savings | Complexity |
| --- | --- | --- |
| Free Tier First | 80-100% | Low |
| Model Selection | 50-90% | Low |
| Response Caching | 60-95% | Medium |
| Token Optimization | 20-40% | Medium |
| Prompt Compression | 15-30% | Medium |
| Smart Fallbacks | 30-60% | High |
| Batch Processing | ~50% | Medium |
Cost Comparison¶
Illustrative monthly cost comparison (1M requests, ~500 tokens average, blended input/output pricing):
Premium (GPT-4): $6,000/month
Smart Routing: $1,200/month (80% savings)
Free Tier First: $300/month (95% savings)
Full Optimization: $150/month (97.5% savings)
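These totals depend on the assumed blended price per token. A minimal cost model (the rates below are illustrative, not quoted prices) makes the arithmetic easy to re-run:

```typescript
// Rough monthly cost model: requests × avg tokens × price per 1M tokens.
function estimateMonthlyCost(
  requestsPerMonth: number,
  avgTokensPerRequest: number,
  pricePer1MTokens: number, // USD per 1M tokens (blended input + output)
): number {
  const totalTokens = requestsPerMonth * avgTokensPerRequest;
  return (totalTokens / 1_000_000) * pricePer1MTokens;
}

// 1M requests/month at 500 tokens each:
console.log(estimateMonthlyCost(1_000_000, 500, 3)); // → 1500 (premium, $3/1M)
console.log(estimateMonthlyCost(1_000_000, 500, 0.15)); // → 75 (gpt-4o-mini rate)
```

Plugging in your own volume and a blended rate for each tier gives a quick sanity check on any savings claim.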
Quick Wins¶
1. Use Free Tiers First¶
Maximize free tier usage before falling back to paid providers.
import { NeuroLink } from "@juspay/neurolink";
const ai = new NeuroLink({
providers: [
// Tier 1: Free providers (try these first)
{
name: "google-ai",
priority: 1,
model: "gemini-2.0-flash",
config: { apiKey: process.env.GOOGLE_AI_KEY },
quotas: {
daily: 1500, // 1,500 requests/day free
perMinute: 15, // 15 RPM free
},
},
// Tier 2: Cheap paid providers
{
name: "openai",
priority: 2,
model: "gpt-4o-mini",
config: { apiKey: process.env.OPENAI_KEY },
costPer1M: 150, // $0.15 per 1M tokens
},
// Tier 3: Premium (only when necessary)
{
name: "anthropic",
priority: 3,
model: "claude-3-5-sonnet-20241022",
config: { apiKey: process.env.ANTHROPIC_KEY },
costPer1M: 3000, // $3 per 1M tokens
},
],
failoverConfig: {
enabled: true,
fallbackOnQuota: true, // Auto-failover when quota exhausted
},
});
// Automatically uses cheapest available provider
const result = await ai.generate({
input: { text: "Your prompt" },
});
console.log(`Used: ${result.provider}, Cost: $${result.cost}`);
Estimated Monthly Savings:
Before: 1M requests × 500 tokens × $3 per 1M tokens = $1,500/month
After: 900K free + 100K paid (50M tokens × $0.15 per 1M) ≈ $7.50/month
Savings: ≈ $1,492/month (99.5% reduction)
2. Choose Cost-Effective Models¶
Use cheaper models for simple tasks, premium only when needed.
function selectModel(task: string): { provider: string; model: string } {
const complexity = analyzeComplexity(task);
if (complexity === "simple") {
return {
provider: "google-ai",
model: "gemini-2.0-flash", // Free
};
} else if (complexity === "medium") {
return {
provider: "openai",
model: "gpt-4o-mini", // $0.15 per 1M tokens
};
} else {
return {
provider: "anthropic",
model: "claude-3-5-sonnet-20241022", // $3 per 1M tokens
};
}
}
function analyzeComplexity(task: string): "simple" | "medium" | "complex" {
const length = task.length;
const keywords = /analyze|complex|detailed|comprehensive/i;
if (length < 100 && !keywords.test(task)) return "simple";
if (length < 500 && !keywords.test(task)) return "medium";
return "complex";
}
// Usage
const { provider, model } = selectModel("What is 2+2?"); // → google-ai (free)
const result = await ai.generate({
input: { text: "What is 2+2?" },
provider,
model,
});
Cost Comparison:
Simple query (100 tokens):
- GPT-4 (~$30/1M): ~$0.003
- GPT-4o-mini ($0.15/1M): ~$0.000015
- Gemini Flash: $0
Savings: 100% vs GPT-4
Complex query (2000 tokens):
- GPT-4 (~$30/1M): ~$0.06
- Claude Sonnet ($3/1M): ~$0.006
- GPT-4o-mini ($0.15/1M): ~$0.0003
Savings: ~95% for tasks where mini performs well
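Each figure above is just tokens consumed times the per-1M-token rate; a one-liner to check them (rates shown are examples):

```typescript
// Per-query cost in USD: tokens consumed × price per 1M tokens.
function queryCost(tokens: number, pricePer1MTokens: number): number {
  return (tokens / 1_000_000) * pricePer1MTokens;
}

console.log(queryCost(100, 0.15)); // simple query on gpt-4o-mini, ≈ $0.000015
console.log(queryCost(2000, 3)); // complex query on Claude Sonnet, ≈ $0.006
```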
3. Implement Response Caching¶
Cache common queries to avoid repeated API calls.
import { createHash } from "crypto";
class ResponseCache {
private cache = new Map<
string,
{
response: any;
timestamp: number;
cost: number;
}
>();
private TTL = 3600000; // 1 hour
private totalSavings = 0;
getCacheKey(input: any, provider: string, model: string): string {
const hash = createHash("sha256");
hash.update(JSON.stringify({ input, provider, model }));
return hash.digest("hex");
}
get(key: string): any | null {
const cached = this.cache.get(key);
if (!cached) return null;
// Check if expired
if (Date.now() - cached.timestamp > this.TTL) {
this.cache.delete(key);
return null;
}
// Track savings
this.totalSavings += cached.cost;
console.log(`Cache hit! Saved $${cached.cost.toFixed(4)}`);
return cached.response;
}
set(key: string, response: any, cost: number) {
this.cache.set(key, {
response,
timestamp: Date.now(),
cost,
});
}
getSavings(): number {
return this.totalSavings;
}
getStats() {
return {
entries: this.cache.size,
totalSavings: this.totalSavings,
avgCostPerEntry: this.totalSavings / this.cache.size,
};
}
}
// Usage
const cache = new ResponseCache();
async function cachedGenerate(prompt: string) {
const cacheKey = cache.getCacheKey({ text: prompt }, "openai", "gpt-4o-mini");
// Check cache first
const cached = cache.get(cacheKey);
if (cached) {
return cached;
}
// Generate fresh response
const result = await ai.generate({
input: { text: prompt },
provider: "openai",
model: "gpt-4o-mini",
enableAnalytics: true,
});
// Store in cache
cache.set(cacheKey, result, result.cost);
return result;
}
// Check savings
setInterval(() => {
console.log("Cache stats:", cache.getStats());
// { entries: 523, totalSavings: 45.67, avgCostPerEntry: 0.087 }
}, 60000);
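One caveat with the in-memory Map above: entries accumulate without bound between TTL expiries. A possible extension (not part of NeuroLink, just a sketch) caps the entry count and evicts the oldest insertion, relying on the fact that a JavaScript Map iterates keys in insertion order:

```typescript
// Hypothetical size-capped cache: evicts the oldest insertion when full.
// Map preserves insertion order, so the first key is the oldest entry.
class BoundedCache<V> {
  private store = new Map<string, V>();
  constructor(private maxEntries = 1000) {}

  get(key: string): V | undefined {
    return this.store.get(key);
  }

  set(key: string, value: V) {
    if (this.store.size >= this.maxEntries && !this.store.has(key)) {
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, value);
  }

  get size() {
    return this.store.size;
  }
}

const c = new BoundedCache<number>(2);
c.set("a", 1);
c.set("b", 2);
c.set("c", 3); // evicts "a"
console.log(c.size); // → 2
console.log(c.get("a")); // → undefined
```

For production traffic, a shared store such as Redis with its built-in eviction policies is usually a better fit than per-process memory.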
Estimated Savings:
Cache hit rate: 60% (common in production)
Monthly requests: 1M (~1,000 tokens each at gpt-4o-mini rates)
Cost without cache: $150
Cost with cache: $60 (only the 40% of requests that miss)
Savings: $90/month (60% reduction)
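The arithmetic behind this estimate is just the miss rate applied to the uncached bill:

```typescript
// Effective API cost with caching: only cache misses hit the provider.
function costWithCache(uncachedCost: number, hitRate: number): number {
  return uncachedCost * (1 - hitRate);
}

console.log(costWithCache(150, 0.6)); // → 60 (a 60% hit rate on a $150 bill)
```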
Free Tier Optimization¶
Google AI Studio (1,500 RPD Free)¶
class GoogleAIQuotaManager {
private requestsToday = 0;
private dayStart = Date.now();
async canUseFreeTier(): Promise<boolean> {
// Reset daily counter
if (Date.now() - this.dayStart > 86400000) {
this.requestsToday = 0;
this.dayStart = Date.now();
}
return this.requestsToday < 1450; // Buffer before limit
}
recordRequest() {
this.requestsToday++;
}
getRemainingQuota(): number {
return Math.max(0, 1500 - this.requestsToday);
}
}
// Usage
const googleQuota = new GoogleAIQuotaManager();
const ai = new NeuroLink({
providers: [
{
name: "google-ai",
priority: 1,
condition: async () => await googleQuota.canUseFreeTier(),
},
{
name: "openai",
priority: 2,
model: "gpt-4o-mini", // Cheap fallback
},
],
});
Monthly Savings:
1,500 requests/day × 30 days = 45,000 free requests
45,000 × 500 tokens × $0.15/1M = $3.37 saved/month
If 100% free tier: $0 cost
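Note that the rolling 24-hour window in `GoogleAIQuotaManager` drifts away from the provider's actual daily reset. A calendar-date variant is one alternative; the real reset time zone is provider-specific, so treat this UTC-based version as a sketch:

```typescript
// Hypothetical variant: reset the counter when the UTC date changes,
// rather than 24 hours after the first recorded request.
class DailyQuota {
  private count = 0;
  private day = new Date().toISOString().slice(0, 10); // "YYYY-MM-DD"

  private rollover() {
    const today = new Date().toISOString().slice(0, 10);
    if (today !== this.day) {
      this.day = today;
      this.count = 0;
    }
  }

  // Keep a safety buffer below the published limit.
  canUse(limit: number, buffer = 50): boolean {
    this.rollover();
    return this.count < limit - buffer;
  }

  record() {
    this.rollover();
    this.count++;
  }
}

const q = new DailyQuota();
console.log(q.canUse(1500)); // → true on a fresh day
q.record();
```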
Hugging Face (Free Inference API)¶
// Use Hugging Face for zero-cost inference
const ai = new NeuroLink({
providers: [
{
name: "huggingface",
priority: 1,
model: "mistralai/Mistral-7B-Instruct-v0.2",
config: { apiKey: process.env.HF_API_KEY }, // Free API key
costPer1M: 0, // Free tier (rate limits apply)
},
{
name: "openai",
priority: 2,
model: "gpt-4o-mini",
costPer1M: 150, // Fallback when HF quality insufficient
},
],
});
// For simple tasks, 100% free with Hugging Face
const simple = await ai.generate({
input: { text: "Summarize: AI is transforming industries..." },
// Uses Hugging Face (free)
});
Token Optimization¶
1. Reduce Output Tokens¶
Limit response length to only what's needed.
// ❌ Bad: No limit (can generate 1000s of tokens)
const wasteful = await ai.generate({
input: { text: "List AI providers" },
// Could generate 2000+ tokens
});
// ✅ Good: Set reasonable limit
const efficient = await ai.generate({
input: { text: "List AI providers" },
maxTokens: 200, // Only what's needed
});
// Savings per request:
// Before: 2000 tokens × $0.15/1M = $0.0003
// After: 200 tokens × $0.15/1M = $0.00003
// Savings: 90%
2. Optimize Prompts¶
Use concise prompts without sacrificing quality.
// ❌ Bad: Verbose prompt (300 tokens)
const verbose = await ai.generate({
input: {
text: `
I would like you to please help me understand what artificial intelligence
is all about. Please provide a comprehensive explanation that covers the
following topics in great detail: machine learning, deep learning, neural
networks, natural language processing, and computer vision. Make sure to
explain each concept thoroughly and provide examples where applicable.
`,
},
});
// ✅ Good: Concise prompt (50 tokens)
const concise = await ai.generate({
input: {
text: "Explain AI: ML, DL, neural networks, NLP, computer vision. Include examples.",
},
});
// Savings per request:
// Before: 300 input + 500 output = 800 tokens × $0.15/1M = $0.00012
// After: 50 input + 500 output = 550 tokens × $0.15/1M = $0.0000825
// Savings: ~31% overall (~83% on input tokens)
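To compare prompt sizes before sending, a rough character-based estimate is enough; the ~4 characters per token figure is a common approximation for English text, not an exact tokenizer:

```typescript
// Approximate token count: ~4 characters per token for English text.
// For exact counts, use the provider's tokenizer instead.
function approxTokens(text: string): number {
  return Math.ceil(text.trim().length / 4);
}

const verbose =
  "I would like you to please help me understand what artificial intelligence is all about.";
const concise = "Explain AI briefly.";
console.log(approxTokens(verbose), approxTokens(concise));
```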
3. Streaming Optimization¶
Stop generation early when answer is complete.
async function streamWithEarlyStop(prompt: string, stopWords: string[]) {
let fullResponse = "";
let stopped = false;
for await (const chunk of ai.stream({
input: { text: prompt },
provider: "openai",
model: "gpt-4o-mini",
})) {
fullResponse += chunk.content;
// Check for stop condition
if (stopWords.some((word) => fullResponse.includes(word))) {
await chunk.cancel(); // Stop generation
stopped = true;
break;
}
}
console.log(`Stopped early: ${stopped}`);
return fullResponse;
}
// Usage
const result = await streamWithEarlyStop(
"List 10 programming languages",
["10."], // Stop once the list reaches its 10th item
);
// Potential savings: 20-40% by not generating unnecessary content
Prompt Engineering for Cost¶
Use Structured Outputs¶
Request specific formats to reduce token waste.
// ❌ Bad: Unstructured (generates 500+ tokens)
const unstructured = await ai.generate({
input: { text: "Tell me about AI providers" },
});
// Output: "There are many AI providers available today. Let me tell you about them in detail..."
// ✅ Good: Structured (generates 200 tokens)
const structured = await ai.generate({
input: { text: "List AI providers in format: name|description|pricing" },
});
// Output: "OpenAI|GPT models|$0.002/1K\nAnthropic|Claude|$0.003/1K\n..."
// Savings: 60% fewer tokens
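Structured output is also cheaper to post-process. A sketch that parses the pipe-delimited format requested above back into objects (the field names follow the prompt, and malformed lines are skipped rather than thrown on):

```typescript
interface ProviderRow {
  name: string;
  description: string;
  pricing: string;
}

// Parse "name|description|pricing" lines, skipping malformed rows.
function parseProviderList(output: string): ProviderRow[] {
  return output
    .split("\n")
    .map((line) => line.split("|").map((field) => field.trim()))
    .filter((fields) => fields.length === 3)
    .map(([name, description, pricing]) => ({ name, description, pricing }));
}

const rows = parseProviderList(
  "OpenAI|GPT models|$0.15/1M\nAnthropic|Claude|$3/1M",
);
console.log(rows[0].name); // → "OpenAI"
```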
Request Summaries¶
Ask for brief responses when detail isn't needed.
// For detailed analysis
const detailed = await ai.generate({
input: { text: "Provide detailed analysis of AI market trends (500 words)" },
maxTokens: 700,
});
// Cost: $0.0001
// For quick insights
const summary = await ai.generate({
input: { text: "AI market trends: 3 bullet points" },
maxTokens: 100,
});
// Cost: $0.000015
// Savings: 85%
Batch Processing¶
Process multiple requests in single API call.
// ❌ Bad: 10 separate requests
const wasteful = await Promise.all([
ai.generate({ input: { text: "Translate to French: Hello" } }),
ai.generate({ input: { text: "Translate to French: Goodbye" } }),
// ... 8 more requests
]);
// Cost: 10 × overhead + 10 × processing = high overhead
// ✅ Good: Batch into single request
const batch = await ai.generate({
input: {
text: `
Translate to French:
1. Hello
2. Goodbye
3. Thank you
... (10 items)
`,
},
maxTokens: 200,
});
// Cost: 1 × overhead + batch processing = ~50% savings
Batch Processing Pattern:
class BatchProcessor {
private queue: Array<{
prompt: string;
resolve: (value: any) => void;
}> = [];
private batchSize = 10;
private batchTimeout = 1000; // 1 second
private timer: NodeJS.Timeout | null = null;
async add(prompt: string): Promise<any> {
return new Promise((resolve) => {
this.queue.push({ prompt, resolve });
if (this.queue.length >= this.batchSize) {
this.processBatch();
} else if (!this.timer) {
this.timer = setTimeout(() => this.processBatch(), this.batchTimeout);
}
});
}
private async processBatch() {
if (this.timer) {
clearTimeout(this.timer);
this.timer = null;
}
const batch = this.queue.splice(0, this.batchSize);
if (batch.length === 0) return;
// Combine prompts
const combinedPrompt = batch
.map((item, i) => `${i + 1}. ${item.prompt}`)
.join("\n");
// Single API call
const result = await ai.generate({
input: { text: `Answer each question:\n${combinedPrompt}` },
});
// Parse and distribute responses
const responses = result.content.split("\n");
batch.forEach((item, i) => {
item.resolve(responses[i]);
});
}
}
// Usage
const batcher = new BatchProcessor();
// These get batched into single request
const results = await Promise.all([
batcher.add("What is AI?"),
batcher.add("What is ML?"),
batcher.add("What is DL?"),
]);
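Note that `processBatch` splits responses on newlines, which breaks when one answer spans several lines. Matching the numbered prefixes is more robust; this sketch assumes the model preserves the `1.`, `2.`, … numbering from the combined prompt:

```typescript
// Split a numbered response ("1. ...\n2. ...") into per-item answers.
// Each answer runs from its number up to the next number (or end of text).
function splitNumberedAnswers(text: string, count: number): string[] {
  const answers: string[] = [];
  for (let i = 1; i <= count; i++) {
    const pattern = new RegExp(
      `(?:^|\\n)\\s*${i}\\.\\s*([\\s\\S]*?)(?=\\n\\s*${i + 1}\\.|$)`,
    );
    const match = text.match(pattern);
    answers.push(match ? match[1].trim() : "");
  }
  return answers;
}

const out = splitNumberedAnswers(
  "1. Artificial\nIntelligence\n2. Machine Learning",
  2,
);
console.log(out.length); // → 2; the first answer keeps its internal newline
```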
Smart Routing Patterns¶
Cost-Based Routing¶
const ai = new NeuroLink({
providers: [
// Route simple queries to free tier
{
name: "google-ai",
priority: 1,
model: "gemini-2.0-flash",
condition: (req) => req.complexity === "low",
costPer1M: 0,
},
// Medium complexity → cheap paid
{
name: "openai",
priority: 1,
model: "gpt-4o-mini",
condition: (req) => req.complexity === "medium",
costPer1M: 150,
},
// Complex → premium only when necessary
{
name: "anthropic",
priority: 1,
model: "claude-3-5-sonnet-20241022",
condition: (req) => req.complexity === "high",
costPer1M: 3000,
},
],
});
// Classify and route
function classifyComplexity(prompt: string): "low" | "medium" | "high" {
const length = prompt.length;
const complexWords = ["analyze", "detailed", "comprehensive", "complex"];
const hasComplexWords = complexWords.some((w) =>
prompt.toLowerCase().includes(w),
);
if (length < 100 && !hasComplexWords) return "low"; // Free tier
if (length < 500 && !hasComplexWords) return "medium"; // Cheap paid
return "high"; // Premium
}
// Usage
const result = await ai.generate({
input: { text: "What is 2+2?" },
metadata: { complexity: classifyComplexity("What is 2+2?") },
// Routes to google-ai (free) → $0 cost
});
Monthly Savings:
Request distribution (1M requests/month, ~1,000 tokens each):
- 70% simple (free tier): 700K requests × $0 = $0
- 20% medium (cheap): 200M tokens × $0.15 per 1M = $30
- 10% complex (premium): 100M tokens × $3 per 1M = $300
Total: $330/month
Without routing (all premium):
- 100% premium: 1B tokens × $3 per 1M = $3,000
Savings: $2,670/month (89% reduction)
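This distribution math can be checked with a small blended-cost model; the sketch assumes 1M requests/month at roughly 1,000 tokens each and per-1M-token prices:

```typescript
interface Tier {
  share: number; // fraction of traffic routed to this tier
  pricePer1M: number; // USD per 1M tokens
}

// Blended monthly cost across routing tiers.
function routedCost(
  requests: number,
  tokensPerRequest: number,
  tiers: Tier[],
): number {
  return tiers.reduce(
    (sum, t) =>
      sum + requests * t.share * tokensPerRequest * (t.pricePer1M / 1_000_000),
    0,
  );
}

const monthly = routedCost(1_000_000, 1000, [
  { share: 0.7, pricePer1M: 0 }, // free tier
  { share: 0.2, pricePer1M: 0.15 }, // gpt-4o-mini
  { share: 0.1, pricePer1M: 3 }, // premium
]);
console.log(monthly); // ≈ $330
```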
Monitoring and Budgets¶
Cost Tracking¶
class CostTracker {
private dailyCost = 0;
private monthlyCost = 0;
private dayStart = Date.now();
private monthStart = Date.now();
private budget = {
daily: 10, // $10/day
monthly: 250, // $250/month
};
recordCost(cost: number, provider: string, model: string) {
const now = Date.now();
// Reset daily
if (now - this.dayStart > 86400000) {
console.log(`Daily cost: $${this.dailyCost.toFixed(2)}`);
this.dailyCost = 0;
this.dayStart = now;
}
// Reset monthly
if (now - this.monthStart > 2592000000) {
// 30 days
console.log(`Monthly cost: $${this.monthlyCost.toFixed(2)}`);
this.monthlyCost = 0;
this.monthStart = now;
}
this.dailyCost += cost;
this.monthlyCost += cost;
// Check budgets
if (this.dailyCost > this.budget.daily) {
throw new Error(
`Daily budget exceeded: $${this.dailyCost.toFixed(2)} > $${this.budget.daily}`,
);
}
if (this.monthlyCost > this.budget.monthly) {
throw new Error(
`Monthly budget exceeded: $${this.monthlyCost.toFixed(2)} > $${this.budget.monthly}`,
);
}
console.log(
`Cost: $${cost.toFixed(4)} (${provider}/${model}), Daily: $${this.dailyCost.toFixed(2)}, Monthly: $${this.monthlyCost.toFixed(2)}`,
);
}
getStatus() {
return {
daily: {
spent: this.dailyCost,
budget: this.budget.daily,
remaining: this.budget.daily - this.dailyCost,
percentUsed: (this.dailyCost / this.budget.daily) * 100,
},
monthly: {
spent: this.monthlyCost,
budget: this.budget.monthly,
remaining: this.budget.monthly - this.monthlyCost,
percentUsed: (this.monthlyCost / this.budget.monthly) * 100,
},
};
}
}
// Usage
const costTracker = new CostTracker();
const result = await ai.generate({
input: { text: "Your prompt" },
enableAnalytics: true,
});
costTracker.recordCost(result.cost, result.provider, result.model);
// Check status
console.log(costTracker.getStatus());
/*
{
daily: { spent: 2.45, budget: 10, remaining: 7.55, percentUsed: 24.5 },
monthly: { spent: 45.23, budget: 250, remaining: 204.77, percentUsed: 18.09 }
}
*/
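CostTracker throws as soon as a budget is exceeded, which hard-stops all traffic. A softer pattern classifies spend and warns before the limit is hit; the 80% warning threshold here is an arbitrary choice:

```typescript
type BudgetLevel = "ok" | "warning" | "exceeded";

// Classify spend against a budget, warning before the hard limit.
function budgetLevel(
  spent: number,
  budget: number,
  warnAt = 0.8, // warn at 80% of budget (tune to taste)
): BudgetLevel {
  if (spent >= budget) return "exceeded";
  if (spent >= budget * warnAt) return "warning";
  return "ok";
}

console.log(budgetLevel(2.45, 10)); // → "ok"
console.log(budgetLevel(8.5, 10)); // → "warning"
console.log(budgetLevel(12, 10)); // → "exceeded"
```

At "warning", you might route new traffic to cheaper tiers instead of rejecting it outright, and only block at "exceeded".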
Best Practices¶
1. ✅ Free Tier First, Always¶
// ✅ Always try free tier before paid
const ai = new NeuroLink({
providers: [
{ name: "google-ai", priority: 1 }, // Free
{ name: "openai", priority: 2 }, // Paid fallback
],
});
2. ✅ Cache Aggressively¶
// ✅ Cache frequent queries
const cache = new ResponseCache();
const result = await cachedGenerate(prompt);
// 60%+ hit rate = 60%+ savings
3. ✅ Limit Output Tokens¶
// ✅ Always set maxTokens
const result = await ai.generate({
input: { text: prompt },
maxTokens: 200, // Only generate what's needed
});
4. ✅ Monitor Spending¶
// ✅ Track costs in real-time
const costTracker = new CostTracker();
// Alert when approaching budget
5. ✅ Use Appropriate Models¶
// ✅ Don't use GPT-4 for simple tasks
const simple = await ai.generate({
input: { text: "What is 2+2?" },
provider: "google-ai", // Free tier for simple query
model: "gemini-2.0-flash",
});
Complete Cost Optimization Stack¶
// Production-ready cost-optimized setup
import { NeuroLink } from "@juspay/neurolink";
import { ResponseCache } from "./cache";
import { CostTracker } from "./tracking";
import { QuotaManager } from "./quotas";
const cache = new ResponseCache();
const costTracker = new CostTracker();
const quotaManager = new QuotaManager();
const ai = new NeuroLink({
providers: [
// Tier 1: Free (Google AI)
{
name: "google-ai",
priority: 1,
model: "gemini-2.0-flash",
condition: async () => await quotaManager.canUseGoogleAI(),
costPer1M: 0,
},
// Tier 2: Cheap (OpenAI Mini)
{
name: "openai",
priority: 2,
model: "gpt-4o-mini",
costPer1M: 150,
},
// Tier 3: Premium (only when needed)
{
name: "anthropic",
priority: 3,
model: "claude-3-5-sonnet-20241022",
condition: (req) => req.requiresPremium,
costPer1M: 3000,
},
],
failoverConfig: { enabled: true },
onSuccess: (result) => {
costTracker.recordCost(result.cost, result.provider, result.model);
quotaManager.recordUsage(result.provider, result.usage.totalTokens);
},
});
// Main generation function with full optimization
async function optimizedGenerate(prompt: string, options: any = {}) {
// 1. Check cache first
const cacheKey = cache.getCacheKey(
{ text: prompt },
options.provider,
options.model,
);
const cached = cache.get(cacheKey);
if (cached) {
console.log("Cache hit - $0 cost");
return cached;
}
// 2. Optimize prompt
const optimizedPrompt = optimizePrompt(prompt);
// 3. Set reasonable max tokens
const maxTokens = options.maxTokens || estimateNeededTokens(prompt);
// 4. Generate with cost tracking
const result = await ai.generate({
input: { text: optimizedPrompt },
maxTokens,
enableAnalytics: true,
...options,
});
// 5. Cache result
cache.set(cacheKey, result, result.cost);
// 6. Log savings
console.log(`Cost: $${result.cost.toFixed(4)}, Provider: ${result.provider}`);
console.log(
`Daily spend: $${costTracker.getStatus().daily.spent.toFixed(2)}`,
);
return result;
}
function optimizePrompt(prompt: string): string {
// Remove excessive whitespace
return prompt.replace(/\s+/g, " ").trim();
}
function estimateNeededTokens(prompt: string): number {
// Simple heuristic: output ~2x input length
const estimatedInput = prompt.length / 4; // ~4 chars per token
return Math.min(estimatedInput * 2, 500); // Cap at 500
}
Estimated Monthly Savings:
Without optimization: $3,000/month
With full optimization: $150/month
Total savings: $2,850/month (95% reduction)
Related Documentation¶
- Multi-Provider Failover - Automatic failover
- Load Balancing - Distribution strategies
- Provider Setup - Provider configuration
- Google AI Guide - Free tier details
Additional Resources¶
- OpenAI Pricing - OpenAI costs
- Anthropic Pricing - Claude costs
- Google AI Pricing - Gemini pricing
- LiteLLM Cost Tracking - Cost management
Need Help? Join our GitHub Discussions or open an issue.