---
title: "Error Handling and Resilience"
description: "Production AI bots face rate limits, model failures, and API timeouts. Learn to implement exponential backoff, handle Slack API rate limits, and build resilient systems that degrade gracefully rather than failing completely."
canonical_url: "https://vercel.com/academy/slack-agents/error-handling-and-resilience"
md_url: "https://vercel.com/academy/slack-agents/error-handling-and-resilience.md"
docset_id: "vercel-academy"
doc_version: "1.0"
last_updated: "2026-04-11T07:49:32.263Z"
content_type: "lesson"
course: "slack-agents"
course_title: "Slack Agents on Vercel with the AI SDK"
prerequisites:  []
---

<agent-instructions>
Vercel Academy — structured learning, not reference docs.
Lessons are sequenced.
Adapt commands to the human's actual environment (OS, package manager, shell, editor) — detect from project context or ask, don't assume.
The lesson shows one path; if the human's project diverges, adapt concepts to their setup.
Preserve the learning goal over literal steps.
Quizzes are pedagogical — engage, don't spoil.
Quiz answers are included for your reference.
</agent-instructions>

# Error Handling and Resilience

*Build resilient AI that survives rate limits, timeouts, and model failures*

At 3 AM during an incident, your bot hits OpenAI rate limits. Without retries and fallbacks, it dies exactly when your team needs it most. While Vercel AI Gateway handles provider-level failures, your bot still needs application-level resilience for Slack APIs, network issues, and model-specific fallbacks. Production bots don't get to fail gracefully—they get to not fail.

## Outcome

Implement retry logic with exponential backoff, model fallbacks, and graceful degradation for production reliability.

## Core Concept

```
First Try → Fails (rate limit/timeout)
    ↓
Wait 1 second → Retry
    ↓
Still fails? → Wait 2 seconds → Retry
    ↓
Still fails? → Try backup model (gpt-3.5)
    ↓
All failed? → User-friendly error message
```
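The doubling delays in this flow reduce to a one-line formula. Here is a minimal sketch (the `backoffDelayMs` name and the 8-second cap are illustrative conventions, not part of the lesson's API):

```typescript
// Delay before retry `attempt` (1-based): doubles each time, with a cap
// so a long retry chain can't stall the bot indefinitely.
function backoffDelayMs(attempt: number, initialMs = 1000, maxMs = 8000): number {
  return Math.min(initialMs * 2 ** (attempt - 1), maxMs);
}

console.log([1, 2, 3, 4, 5].map((a) => backoffDelayMs(a)));
// → [ 1000, 2000, 4000, 8000, 8000 ]
```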

## Retry Flow Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                 Exponential Backoff & Fallback Flow            │
└─────────────────────────────────────────────────────────────────┘

Timeline: 0s ──── 2s ──── 5s ──── 10s ──── 13s

Request arrives (t=0)
    │
    ├─[Attempt 1: GPT-4o-mini]──X (429 rate limit @ 0.5s)
    │                            │
    │                      [Wait 1s]
    │                            │
    ├─[Attempt 2: GPT-4o-mini]──X (429 rate limit @ 2s)
    │                            │
    │                      [Wait 2s]
    │                            │
    ├─[Attempt 3: GPT-4o-mini]──X (429 rate limit @ 5s)
    │                            │
    │                      [Wait 4s]
    │                            │
    ├─[Attempt 4: GPT-4o-mini]──X (Still rate limited @ 10s)
    │                            │
    │                   [Fallback triggered]
    │                            │
    ├─[Attempt 1: GPT-3.5-turbo]─X (Network error @ 11s)
    │                            │
    │                      [Wait 1s]
    │                            │
    └─[Attempt 2: GPT-3.5-turbo]─✓ Success! (@ 13s)
                                 │
                           [Response sent]

Retry Strategy:
- Max attempts per model: 4
- Backoff multiplier: 2x
- Max wait time: 8 seconds
- Fallback chain: gpt-4o-mini → gpt-3.5-turbo → error message
```

## Fast Track

1. Create basic retry wrapper with exponential backoff
2. Add model fallback chain to AI responses
3. Test with `/test-resilience` command

## Hands-On Exercise 4.4

Build a retry wrapper that makes your bot resilient to API failures:

**Requirements:**

1. Create `/slack-agent/server/lib/ai/retry-wrapper.ts` with basic retry logic
2. Implement exponential backoff (1s → 2s → 4s → 8s)
3. Add model fallback in `respond-to-message.ts` (gpt-4o-mini → gpt-3.5-turbo)
4. Return a friendly error message if all retries fail

**Implementation hints:**

- Start simple: just count attempts and increase delay
- Check if error status is 429 (rate limit) to know when to retry
- Use `setTimeout` wrapped in a Promise for delays
- Keep the existing system prompt when falling back to another model
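For the delay hint above, a promise-wrapped `setTimeout` is all you need. A `sleep` helper like this is a common convention (the name is illustrative):

```typescript
// Promise-wrapped setTimeout: lets you `await` a backoff delay.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Inside a retry loop you can then write:
async function example() {
  await sleep(1000); // back off for 1 second before the next attempt
}
```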

**Manifest update for test command:**

```json
{
  "slash_commands": [
    {
      "command": "/test-resilience",
      "url": "https://your-ngrok-url/api/slack/events",
      "description": "Test bot resilience with simulated failures",
      "should_escape": false
    }
  ]
}
```

## Try It

1. **Test the resilience command:**
   ```
   /test-resilience
   ```

2. **Watch the logs to see retry behavior with correlation-style tracking:**
   ```
   [INFO] Simulating error failure (attempt 1) { correlationId: 'retry-1757720494-a3b2c1' }
   [INFO] Attempt 1 failed, retrying in 1000ms { correlationId: 'retry-1757720494-a3b2c1' }
   [INFO] Simulating error failure (attempt 2) { correlationId: 'retry-1757720494-a3b2c1' }
   [INFO] Attempt 2 failed, retrying in 2000ms { correlationId: 'retry-1757720494-a3b2c1' }
   [INFO] Attempting with model: openai/gpt-4o-mini { correlationId: 'retry-1757720494-a3b2c1' }
   ```

3. **If first model fails completely, see fallback:**
   ```
   [ERROR] Model openai/gpt-4o-mini failed after retries { correlationId: 'retry-1757720494-a3b2c1' }
   [INFO] Attempting with model: openai/gpt-3.5-turbo { correlationId: 'retry-1757720494-a3b2c1' }
   ✅ Resilience test completed
   ```

These logs use an operation-level `correlationId` generated inside the retry wrapper. In a full implementation, you'd also include `...context.correlation` from [Bolt Middleware](./bolt-nitro-middleware-and-logging) at the call site so retries can be tied back to the original Slack event.

## Commit

```bash
git add -A
git commit -m "feat(ai): add retry logic with exponential backoff and model fallbacks"
```

## Done-When

- [ ] Failed API calls retry with exponential backoff
- [ ] Rate limits respect `retry-after` header
- [ ] Model fallbacks activate on primary failure
- [ ] Users receive helpful messages during degradation
- [ ] All retries logged with correlation IDs

## Solution

Create `/slack-agent/server/lib/ai/retry-wrapper.ts`:

```typescript title="/slack-agent/server/lib/ai/retry-wrapper.ts"
import { app } from "~/app";

interface RetryOptions {
  maxRetries?: number;
  initialDelayMs?: number;
}

// Type guard for HTTP errors
function isHttpError(error: unknown): error is {
  status: number;
  headers?: Record<string, string>;
  retryAfter?: number;
} {
  return error instanceof Error && 'status' in error;
}

export async function withRetry<T>(
  fn: () => Promise<T>,
  options: RetryOptions = {}
): Promise<T> {
  const {
    maxRetries = 3,
    initialDelayMs = 1000,
  } = options;

  let lastError: unknown;
  const correlationId = `retry-${Date.now()}-${Math.random().toString(36).slice(2, 9)}`;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;

      // CONTROL DECISION: Don't retry client errors (except rate limits)
      // Rationale: Bad requests won't succeed on retry, fail fast to save cost and time
      if (isHttpError(error)) {
        if (error.status >= 400 && error.status < 500 && error.status !== 429) {
          app.logger.warn('Client error detected, failing fast (no retry)', {
            correlationId,
            status: error.status,
            reason: 'Client errors are permanent, retrying wastes time and money'
          });
          throw error; // Fail fast - no retry will fix this
        }
      }

      // CONTROL DECISION: Last attempt? Stop retrying
      // Rationale: We've exhausted retries, propagate failure to caller for graceful degradation
      if (attempt === maxRetries) {
        app.logger.error(`All ${maxRetries} attempts failed`, {
          correlationId,
          attempts: maxRetries,
          error: error instanceof Error ? error.message : String(error),
          outcome: 'Switching to fallback model or graceful degradation'
        });
        throw error; // Exhausted retries, let caller handle graceful degradation
      }

      // CONTROL DECISION: Calculate backoff with rate limit awareness
      // Rationale: Respect service's retry-after directive to avoid ban
      let delayMs = initialDelayMs * Math.pow(2, attempt - 1);

      // Check for explicit retry-after directive from service
      if (isHttpError(error) && error.status === 429) {
        // `retry-after` is expressed in seconds; ignore it if absent or unparseable
        const retryAfterSec = error.retryAfter ?? Number(error.headers?.['retry-after']);
        const serviceRequestedDelay = Number.isFinite(retryAfterSec) ? retryAfterSec * 1000 : 0;
        delayMs = serviceRequestedDelay || delayMs;

        app.logger.info('Rate limited, using service-requested delay', {
          correlationId,
          requestedDelayMs: serviceRequestedDelay,
          reason: 'Service told us exactly when to retry - respect it to avoid ban'
        });
      }

      app.logger.info(`Attempt ${attempt} failed, retrying in ${delayMs}ms`, {
        correlationId,
        nextAttempt: attempt + 1,
        strategy: 'exponential_backoff'
      });

      // Execute backoff delay before next attempt
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }

  throw lastError;
}

// Test helper for simulating failures
export function simulateFailure(type: 'error'): void {
  if (process.env.SIMULATE_FAILURES !== 'true') return;
  
  const attemptKey = `__test_attempts_${type}`;
  const attempts = (globalThis as any)[attemptKey] || 0;
  (globalThis as any)[attemptKey] = attempts + 1;
  
  // Fail first 2 attempts, succeed on 3rd
  if (attempts < 2) {
    app.logger.info(`Simulating ${type} failure (attempt ${attempts + 1})`);
    throw new Error('Simulated service error');
  }
  
  // Reset counter after success
  delete (globalThis as any)[attemptKey];
}
```
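To see the control flow in isolation, here is a stripped-down sketch of the same pattern with logging and rate-limit handling removed so it runs standalone; `flaky` is a hypothetical stand-in for an AI call that fails twice before succeeding:

```typescript
// Simplified version of the withRetry pattern above: retry with doubling
// delays, rethrow after the last attempt. Not the production wrapper.
async function retrySketch<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  initialDelayMs = 10
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) throw error; // exhausted retries
      await new Promise((r) => setTimeout(r, initialDelayMs * 2 ** (attempt - 1)));
    }
  }
}

// Hypothetical flaky call: fails twice, then succeeds on the third attempt.
let calls = 0;
const flaky = async () => {
  calls += 1;
  if (calls < 3) throw new Error("simulated 429");
  return "ok";
};

retrySketch(flaky).then((result) => console.log(result, "after", calls, "attempts"));
// → ok after 3 attempts
```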

Create `/slack-agent/server/listeners/commands/test-resilience.ts`:

```typescript title="/slack-agent/server/listeners/commands/test-resilience.ts"
import type { AllMiddlewareArgs, SlackCommandMiddlewareArgs } from "@slack/bolt";
import { respondToMessage } from "~/lib/ai/respond-to-message";

export const testResilienceCallback = async ({
  ack,
  command,
  client,
  logger,
}: AllMiddlewareArgs & SlackCommandMiddlewareArgs) => {
  await ack();
  
  const { user_id, channel_id } = command;
  
  try {
    // Enable failure simulation
    process.env.SIMULATE_FAILURES = 'true';
    
    const response = await client.chat.postMessage({
      channel: channel_id,
      text: `🧪 Testing resilience...`,
    });
    
    // Test the AI response with simulated failures
    const aiResponse = await respondToMessage({
      messages: [{ 
        role: 'user', 
        content: 'Test message for resilience' 
      }],
      event: {
        type: 'message',
        text: 'Test message',
        user: user_id,
        ts: response.ts!,
        channel: channel_id,
        channel_type: 'channel',
      } as any,
      channel: channel_id,
      thread_ts: response.ts,
      botId: undefined,
    });
    
    await client.chat.postMessage({
      channel: channel_id,
      thread_ts: response.ts,
      text: `✅ Resilience test completed:\n${aiResponse}`,
    });
    
  } catch (error) {
    logger.error('Test resilience failed:', error);
    await client.chat.postEphemeral({
      channel: channel_id,
      user: command.user_id,
      text: `❌ Test failed: ${error}`,
    });
  } finally {
    // Disable failure simulation
    delete process.env.SIMULATE_FAILURES;
  }
};
```

**Note: About the `as any` cast**

For the `event` in this test command we use `as any` to avoid dragging a full set of Slack event types into the example. In your own code, prefer reusing the typed helpers and payload types from the Bolt middleware lesson instead of broad casts—this keeps your handlers fully type-safe while following the same retry patterns.

Register in `/slack-agent/server/listeners/commands/index.ts`:

```typescript title="/slack-agent/server/listeners/commands/index.ts" {5,11}
import type { App } from "@slack/bolt";
import { echoCallback } from "./echo";
import { sampleCommandCallback } from "./sample-command";
import { testResilienceCallback } from "./test-resilience";

const register = (app: App) => {
  app.command("/sample-command", sampleCommandCallback);
  app.command("/echo", echoCallback);
  app.command("/test-resilience", testResilienceCallback);
};

export default { register };
```

> If you already have additional commands (like `/compare-context`) from other lessons, register them here as well. This snippet only shows the commands relevant to the resilience test.

Update `/slack-agent/server/lib/ai/respond-to-message.ts`:

```typescript title="/slack-agent/server/lib/ai/respond-to-message.ts" {3,36-90}
import type { KnownEventFromType } from "@slack/bolt";
import { generateText, type ModelMessage, stepCountIs } from "ai";
import { withRetry, simulateFailure } from "./retry-wrapper";
import { app } from "~/app";
// ... existing imports ...

// Share a single system prompt builder between createTextStream and respondToMessage.
// In your code, copy the ENTIRE system prompt string from createTextStream into this
// function (including the channel_type-specific prefix) so both flows stay in sync.
const buildSystemPrompt = (
  event: KnownEventFromType<"message"> | KnownEventFromType<"app_mention">
) => `You are Slack Agent, a helpful assistant in Slack.
// ... same full system prompt as createTextStream above ...
`;

export const respondToMessage = async ({
  messages,
  event,
  channel,
  thread_ts,
  botId,
}: RespondToMessageOptions) => {
  // CONTROL STRATEGY: Explicit model fallback chain
  // Primary: gpt-4o-mini (fast, cheap, good quality)
  // Fallback: gpt-3.5-turbo (even cheaper, more reliable availability)
  // Rationale: If primary fails, degrade to cheaper/simpler model rather than total failure
  const models = [
    "openai/gpt-4o-mini",
    "openai/gpt-3.5-turbo",
  ];

  let lastError: unknown;

  // CONTROL FLOW: Try each model in sequence until one succeeds
  // Strategy: Fail forward through cheaper models, only fail completely as last resort
  for (const model of models) {
    try {
      // Wrap AI call with retry logic (inner control layer)
      // Outer loop: model fallback, Inner loop: network retries
      const { text, usage } = await withRetry(
        async () => {
          // Test helper: simulate failures on first model
          if (process.env.SIMULATE_FAILURES === 'true' && model === models[0]) {
            simulateFailure('error');
          }

          app.logger.info(`Attempting with model: ${model}`, {
            position: `${models.indexOf(model) + 1}/${models.length}`,
            reason: model === models[0] ? 'Primary model (optimal quality)' : 'Fallback model (degraded but reliable)'
          });

          return await generateText({
            model,
            // Reuse the SAME full system prompt you implemented in createTextStream.
            // We truncate buildSystemPrompt in this snippet for brevity, but in your
            // code both createTextStream and respondToMessage should import/use it.
            system: buildSystemPrompt(event),
            messages,
            stopWhen: stepCountIs(5),
            tools: {
              updateChatTitleTool,
              getThreadMessagesTool,
              getChannelMessagesTool,
              updateAgentStatusTool,
              reactToMessageTool,
            },
            experimental_context: {
              channel,
              thread_ts: thread_ts || event.ts,
              botId,
            } as ExperimentalContext,
            prepareStep: () => ({
              activeTools: getActiveTools(event),
            }),
            onStepFinish: ({ toolCalls }) => {
              if (toolCalls?.length) {
                app.logger.debug("tool calls:", toolCalls.map((c) => c.input));
              }
            },
          });
        },
        {
          maxRetries: 3,
          initialDelayMs: 1000,
        }
      );

      // CONTROL DECISION: Success - return immediately
      // Rationale: No need to try remaining models, we got a good response
      app.logger.info('AI request succeeded', {
        model,
        usage,
        outcome: 'Returning response to user'
      });

      return text;
    } catch (error) {
      lastError = error;
      app.logger.error(`Model ${model} failed after retries`, {
        model,
        error: error instanceof Error ? error.message : String(error),
        remainingModels: models.length - models.indexOf(model) - 1
      });

      // CONTROL DECISION: Last model in chain?
      // Rationale: All models exhausted - degrade gracefully with user-friendly message
      if (model === models[models.length - 1]) {
        app.logger.error('All models exhausted, returning graceful degradation message', {
          attemptedModels: models,
          outcome: 'User-friendly error message instead of raw exception'
        });

        // Graceful degradation: helpful message instead of stack trace
        return "I'm experiencing high demand right now. Please try again in a few moments.";
      }

      // Not last model - continue to next in fallback chain
      app.logger.info('Falling back to next model', {
        failed: model,
        next: models[models.indexOf(model) + 1],
        strategy: 'degraded_quality_over_no_response'
      });
    }
  }

  // Should never reach here due to graceful degradation above
  throw lastError;
};
```
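The outer model loop above can also be seen as a reusable shape: try each model in order, return the first success, and fall back to a canned degradation value. A generic sketch under that framing (the `withModelFallback` name is illustrative, not part of the lesson's API):

```typescript
// Generic fallback chain: run each model in order, return the first
// success, or the graceful-degradation value if every model fails.
async function withModelFallback<T>(
  models: string[],
  run: (model: string) => Promise<T>,
  degraded: T
): Promise<T> {
  for (const model of models) {
    try {
      return await run(model);
    } catch {
      // swallow and continue to the next model in the chain
    }
  }
  return degraded;
}

// Usage sketch: the primary model is down, the fallback answers.
withModelFallback(
  ["openai/gpt-4o-mini", "openai/gpt-3.5-turbo"],
  async (model) => {
    if (model === "openai/gpt-4o-mini") throw new Error("rate limited");
    return `answered by ${model}`;
  },
  "I'm experiencing high demand right now."
).then(console.log);
// → answered by openai/gpt-3.5-turbo
```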


## Building on Previous Lessons

This lesson leverages stateless architecture for resilient operations:

- **From [Bolt Middleware](./bolt-nitro-middleware-and-logging)**: Correlation IDs track retry attempts and model fallbacks across the full operation chain
- **From [Repository Flyover](./repository-flyover)**: Context utilities (`getThreadMessages`, `getChannelMessages`) benefit from retry protection against transient Slack API failures
- **From [system prompts](./system-prompts-shape-behavior)**, **[AI tools](./ai-tools-and-functions)**, and **[status communication](./status-communication)**: AI components all flow through retry wrappers
- **Production reasoning**: Stateless handlers enable safe retries - each attempt is idempotent because we don't hold mutable state
- **Graceful degradation**: Fallback to cheaper models (gpt-4o-mini → gpt-3.5-turbo) or cached context when primary systems fail
- **Sets up [Deploy to Vercel](./deploy-to-vercel)**: Production deployment relies on this resilience to handle real-world rate limits and network issues

**Note: Vercel AI Gateway Integration**

If using Vercel AI Gateway, you get provider-level fallback automatically (e.g., OpenAI → Anthropic). This lesson's patterns still apply for:

- Model-level fallbacks within a provider (gpt-4o → gpt-3.5)
- Slack API resilience (not covered by Gateway)
- Application-specific retry logic and testing
- Correlation tracking for debugging


---

[Full course index](/academy/llms.txt) · [Sitemap](/academy/sitemap.md)
