Monitoring
Monitoring and observability for JARVIS.
Monitoring Stack
graph LR
    A[FastAPI] -->|/metrics| P[Prometheus]
    P --> G[Grafana]
    A -->|logs| L[Loki]
    L --> G
Prometheus
Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'jarvis-api'
    static_configs:
      - targets: ['api:8000']
    metrics_path: /metrics
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
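Once Prometheus runs with a configuration like this, its HTTP API (`/api/v1/targets`) reports whether each scrape target is reachable. The sketch below filters such a response for targets that are not up. The helper name `unhealthy_targets` and the sample response are illustrative assumptions, not part of JARVIS, and the response is trimmed to the fields the helper reads.

```python
import json


def unhealthy_targets(payload: dict) -> list:
    """Return the active targets whose last scrape did not succeed."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        {
            "job": t.get("labels", {}).get("job", "unknown"),
            "url": t.get("scrapeUrl", ""),
            "error": t.get("lastError", ""),
        }
        for t in targets
        if t.get("health") != "up"
    ]


# Hypothetical response body, trimmed to the fields read above.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "jarvis-api"},
       "scrapeUrl": "http://api:8000/metrics",
       "health": "up", "lastError": ""},
      {"labels": {"job": "redis"},
       "scrapeUrl": "http://redis-exporter:9121/metrics",
       "health": "down", "lastError": "connection refused"}
    ]
  }
}
""")

if __name__ == "__main__":
    for target in unhealthy_targets(SAMPLE_RESPONSE):
        print(f"{target['job']}: {target['url']} ({target['error']})")
```

In practice the payload would come from an HTTP GET against the Prometheus server; the request is left out here so the sketch runs without one.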
Exposed Metrics
| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total number of requests |
| http_request_duration_seconds | Histogram | Request latency |
| llm_requests_total | Counter | LLM requests |
| llm_tokens_total | Counter | Tokens consumed |
| active_users | Gauge | Active users |
| documents_processed | Counter | Documents processed |
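To make the table concrete, the sketch below renders a counter and a gauge in the Prometheus text exposition format using only the standard library. A real FastAPI service would normally rely on a Prometheus client library for this. The `render_metric` helper, the label names, and the sample values are assumptions chosen for illustration.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels, value) pairs, where labels is a dict.
    Label values are assumed to contain no quotes, backslashes, or
    newlines, so no escaping is applied in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{val}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values; real ones come from the running application.
    body = render_metric(
        "http_requests_total",
        "counter",
        "Total number of requests",
        [
            ({"method": "GET", "status": "200"}, 1027),
            ({"method": "POST", "status": "500"}, 3),
        ],
    )
    body += render_metric("active_users", "gauge", "Active users", [({}, 12)])
    print(body, end="")
```

Each family starts with `# HELP` and `# TYPE` comment lines, followed by one line per labeled sample.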
Grafana
Available Dashboards
- Overview - General metrics
- API Performance - Latency, errors, throughput
- LLM Usage - Tokens, costs, routing
- Database - Connections, queries, cache
Docker Configuration
grafana:
  image: grafana/grafana:latest
  container_name: jarvis-grafana
  restart: unless-stopped
  environment:
    GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD}
    GF_INSTALL_PLUGINS: grafana-clock-panel
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/provisioning:/etc/grafana/provisioning
  ports:
    - "3000:3000"
Alerts
Prometheus Rules
# alerts.yml
groups:
  - name: jarvis
    rules:
      # Fires when a series sustains more than 0.1 5xx responses per second
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
      # The quantile is computed over per-bucket rates, aggregated by le
      - alert: SlowAPI
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow API (p95 > 2s)"
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL down"
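To see what the SlowAPI rule evaluates, the sketch below reproduces in plain Python the linear interpolation that `histogram_quantile` applies to cumulative buckets. The function name `estimate_quantile` and the bucket counts are invented for illustration. This is a simplified version that skips several edge cases Prometheus itself handles.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket, mirroring a *_bucket
    series and its le label.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Index of the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if math.isinf(upper):
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[idx - 1][0]
    lower, below = buckets[idx - 1] if idx > 0 else (0.0, 0.0)
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request counts per latency bucket (seconds).
LATENCY_BUCKETS = [(0.5, 80), (1.0, 90), (2.0, 96), (5.0, 100), (math.inf, 100)]

if __name__ == "__main__":
    p95 = estimate_quantile(0.95, LATENCY_BUCKETS)
    print(f"p95 = {p95:.2f}s")  # about 1.83s, below the 2s threshold
```

Because the estimate is interpolated inside a bucket, its precision depends on how finely the bucket boundaries are chosen around the 2s threshold.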
Logs
Structured Format
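# app/middleware/logging.py
import time

import structlog
from fastapi import FastAPI, Request

app = FastAPI()  # assumption: in the real project, use the existing app instance
logger = structlog.get_logger()


@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.perf_counter()
    logger.info(
        "request_started",
        method=request.method,
        path=request.url.path,
        # user_id is only present once authentication has set it
        user_id=getattr(request.state, "user_id", None),
    )

    response = await call_next(request)

    # Elapsed wall-clock time in milliseconds
    elapsed = (time.perf_counter() - start) * 1000
    logger.info(
        "request_completed",
        status=response.status_code,
        duration_ms=round(elapsed, 2),
    )
    return response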
Loki Configuration
loki:
  image: grafana/loki:latest
  container_name: jarvis-loki
  command: -config.file=/etc/loki/local-config.yaml
  volumes:
    - loki_data:/loki
  ports:
    - "3100:3100"
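The diagram shows logs flowing from the API to Loki but does not say how they are shipped. One option, sketched here as an assumption rather than the project's actual setup, is to POST JSON to Loki's push endpoint (`/loki/api/v1/push`). The helper `build_push_payload` and the label values are hypothetical, and sending the request is left out so the sketch runs without a Loki instance.

```python
import json
import time
from typing import Dict, List, Optional


def build_push_payload(
    labels: Dict[str, str],
    lines: List[str],
    now_ns: Optional[int] = None,
) -> str:
    """Build a JSON body for Loki's push endpoint.

    Each entry is a [timestamp, line] pair, where the timestamp is a
    string holding nanoseconds since the Unix epoch.
    """
    timestamp = str(time.time_ns() if now_ns is None else now_ns)
    payload = {
        "streams": [
            {
                "stream": labels,
                "values": [[timestamp, line] for line in lines],
            }
        ]
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # A structured log line, serialized as JSON before shipping.
    log_line = json.dumps({"event": "request_completed", "status": 200})
    print(build_push_payload({"job": "jarvis-api"}, [log_line]))
```

The body would be sent with a `Content-Type: application/json` header to port 3100 of the container defined above.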
Health Checks
Endpoint /health
{
  "status": "healthy",
  "version": "0.13.0",
  "uptime_seconds": 86400,
  "checks": {
    "database": "ok",
    "redis": "ok",
    "minio": "ok",
    "ollama": "ok"
  }
}
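A response of this shape could be assembled from individual dependency probes. The sketch below is an assumption about how that aggregation might look, not the JARVIS implementation: the `build_health` helper is hypothetical, and so are the `"unhealthy"` and `"error"` values, since the example above only shows the all-OK case.

```python
import time
from typing import Dict, Optional


def build_health(
    checks: Dict[str, bool],
    version: str,
    started_at: float,
    now: Optional[float] = None,
) -> dict:
    """Aggregate per-dependency probe results into a /health payload.

    checks maps a dependency name to True (probe passed) or False.
    started_at and now are Unix timestamps in seconds.
    """
    current = time.time() if now is None else now
    return {
        # Overall status is healthy only if every probe passed.
        "status": "healthy" if all(checks.values()) else "unhealthy",
        "version": version,
        "uptime_seconds": int(current - started_at),
        "checks": {
            name: "ok" if passed else "error"
            for name, passed in checks.items()
        },
    }


if __name__ == "__main__":
    probes = {"database": True, "redis": True, "minio": True, "ollama": False}
    print(build_health(probes, "0.13.0", started_at=time.time() - 60))
```

Passing `now` explicitly keeps the function deterministic, which makes it easier to unit test.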
Endpoint /metrics
Standard Prometheus format: