# Retry Mechanism
The LLM module implements a robust retry mechanism to handle transient failures when interacting with LLM providers. This system ensures reliable operation even when facing temporary network issues or API rate limits.
## Key Features
- Exponential Backoff: Automatically increases the delay between retry attempts (default schedule: 1s, 2s, 4s, 8s; see the sketch after this list)
- Error Classification: Differentiates between retryable and non-retryable errors
- Configurable Attempts: Maximum retry attempts can be configured (default: 3)
- Circuit Breaker: Prevents cascading failures by temporarily disabling calls to failing providers
- Jitter: Adds random variation to retry delays to prevent thundering herd problems
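The first and last bullets combine into a simple delay schedule. A minimal worked example of the backoff computation with the default settings (illustrative only, not aicore code):

```python
import random

INITIAL_DELAY = 1.0   # seconds
BACKOFF_FACTOR = 2.0
MAX_DELAY = 10.0

# Base schedule for four attempts: [1.0, 2.0, 4.0, 8.0]
base = [min(INITIAL_DELAY * BACKOFF_FACTOR ** n, MAX_DELAY) for n in range(4)]

# Jitter spreads retries out so concurrent clients don't retry in lockstep
jittered = [d * random.uniform(0.5, 1.5) for d in base]
```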
## Implementation
The retry mechanism is implemented via the `@retry_on_failure` decorator from `aicore.utils`:
```python
from aicore.utils import retry_on_failure

@retry_on_failure(
    max_attempts=3,
    initial_delay=1.0,
    max_delay=10.0,
    backoff_factor=2.0
)
async def complete(self, prompt: str) -> str:
    """LLM completion with automatic retries."""
    response = await self._client.acreate(prompt)
    return response
```
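For intuition, here is a minimal sketch of how a decorator of this shape can be implemented. It is illustrative only, not the actual `aicore.utils` source:

```python
import asyncio
import functools
import random

def retry_on_failure(max_attempts=3, initial_delay=1.0,
                     max_delay=10.0, backoff_factor=2.0):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return await func(*args, **kwargs)
                # A real implementation would classify errors here
                # (see Error Handling below) instead of catching everything
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the original error
                    # Sleep with jitter, then grow the delay up to the cap
                    await asyncio.sleep(delay * random.uniform(0.5, 1.5))
                    delay = min(delay * backoff_factor, max_delay)
        return wrapper
    return decorator
```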
## Configuration Options
Retry behavior can be configured either through environment variables or directly in code:
### Environment Variables
```bash
# Maximum retry attempts
export LLM_MAX_RETRIES=5

# Initial retry delay in seconds
export LLM_RETRY_INITIAL_DELAY=1.5

# Maximum delay between retries in seconds
export LLM_RETRY_MAX_DELAY=15.0

# Backoff multiplier factor
export LLM_RETRY_BACKOFF_FACTOR=2.0
```
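A sketch of how these variables would typically be consumed, falling back to the defaults documented above (illustrative; the actual parsing inside aicore may differ):

```python
import os

# Fall back to the documented defaults when a variable is unset
MAX_RETRIES = int(os.getenv("LLM_MAX_RETRIES", "3"))
INITIAL_DELAY = float(os.getenv("LLM_RETRY_INITIAL_DELAY", "1.0"))
MAX_DELAY = float(os.getenv("LLM_RETRY_MAX_DELAY", "10.0"))
BACKOFF_FACTOR = float(os.getenv("LLM_RETRY_BACKOFF_FACTOR", "2.0"))
```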
### Programmatic Configuration
```python
from aicore.llm.config import LlmConfig

config = LlmConfig(
    provider="openai",
    api_key="your_api_key",
    model="gpt-4o"
)
```
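Note that the `LlmConfig` example above covers provider settings only; the retry tuning shown in this document is expressed through the environment variables above or through `@retry_on_failure` arguments (see Customizing Retry Logic below). One hypothetical way to pair programmatic setup with retry tuning, assuming the environment variables are read when the provider is constructed:

```python
import os

from aicore.llm.config import LlmConfig

# Assumed: set the documented retry variables in-process,
# before the provider that reads them is constructed
os.environ["LLM_MAX_RETRIES"] = "5"
os.environ["LLM_RETRY_INITIAL_DELAY"] = "1.5"

config = LlmConfig(
    provider="openai",
    api_key="your_api_key",
    model="gpt-4o"
)
```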
## Error Handling
### Retryable Errors (automatic retry)
- Network timeouts (408)
- Rate limit errors (429)
- Server errors (5xx)
- Temporary unavailability (503)
- Connection errors
### Non-Retryable Errors (fail immediately)
- Authentication failures (401)
- Invalid requests (400, 404)
- Permission errors (403)
- Model not found errors
- Validation errors
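These two lists amount to a simple status-code classifier. An illustrative helper (not part of the aicore API) that encodes the same rules; connection errors carry no HTTP status and would be matched by exception type instead:

```python
RETRYABLE_STATUS_CODES = {408, 429}  # timeouts and rate limits

def is_retryable(status_code: int) -> bool:
    """Return True for errors worth retrying, per the lists above."""
    # All 5xx server errors (including 503) are treated as transient
    return status_code in RETRYABLE_STATUS_CODES or 500 <= status_code < 600
```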
## Customizing Retry Logic
You can customize the retry behavior by subclassing the base provider:
```python
from openai import AuthenticationError, RateLimitError  # or your provider SDK's equivalents

from aicore.llm.providers.base_provider import LlmBaseProvider
from aicore.utils import retry_on_failure

class CustomProvider(LlmBaseProvider):
    @retry_on_failure(
        max_attempts=5,
        retry_on=(RateLimitError, TimeoutError),
        give_up_on=(AuthenticationError,)
    )
    async def acomplete(self, prompt: str) -> str:
        # Custom implementation with specialized retry logic
        pass
```
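As the parameter names suggest, `retry_on` lists exception types that should trigger a retry, while `give_up_on` lists exceptions that fail immediately. Built-in exceptions such as `TimeoutError` can be mixed with provider SDK errors in either tuple.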