Troubleshooting Guide
Common issues, diagnostic commands, and resolution steps for Eloquent platform operations.
Common Issues
Services Won't Start (CrashLoopBackOff)
Symptoms: Pods show CrashLoopBackOff status, restart count is increasing.
Diagnosis:
# Check pod status
kubectl get pods -n eloquent
# View logs for the crashing service
kubectl logs -n eloquent deployment/<service-name> --previous
# Get detailed pod info
kubectl describe pod -n eloquent -l app=<service-name>
Common causes:
- Database connection failure — verify PostgreSQL/ClickHouse/Redis are running
- Missing environment variables — check Helm values for required secrets
- Port conflict — verify no other service is using the same port
- Insufficient resources — check if the node has available CPU/memory
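If the cause is missing configuration or resource pressure, the fix usually lands in Helm values. A minimal sketch — the per-service key names below are illustrative, so check your chart's values.yaml for the real ones:

```yaml
# Hypothetical per-service values -- adjust key names to match the chart.
chat-service:
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
  env:
    DATABASE_URL: postgres://eloquent:<password>@postgresql:5432/eloquent
```

After editing values, apply them with `helm upgrade` and watch the pod restart cleanly.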
JWT Authentication Failures
Symptoms: API requests return 401 Unauthorized across all endpoints.
Resolution:
The jwtSecret must be identical across the API Gateway and all backend services. Verify in your Helm values:
secrets:
jwtSecret: "<must-be-the-same-everywhere>"
If you recently changed the secret, all active sessions are invalidated. Users must re-login.
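To see what a rejected token actually contains (issuer, expiry, subject), you can decode its payload locally. A sketch using only standard shell tools — the token below is a stand-in, not a real Eloquent token:

```shell
# Decode the payload (second dot-separated segment) of a JWT.
# base64url may lack padding, so pad to a multiple of 4 first.
token="eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJkZW1vIn0.c2ln"
payload=$(printf '%s' "$token" | cut -d. -f2)
while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
printf '%s' "$payload" | tr -- '-_' '+/' | base64 -d; echo
```

If the decoded `exp` claim is in the past or the issuer is unexpected, the problem is the token itself rather than the shared secret.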
Organization Provisioning Fails
Symptoms: Creating a new org in the Admin App shows errors or hangs.
Diagnosis:
# Check admin service logs
kubectl logs -n eloquent deployment/admin-service --tail=200
# Verify database connectivity
kubectl exec -n eloquent deployment/admin-service -- \
curl -s http://localhost/health
Common causes:
- PostgreSQL connection limit reached
- Message queue not ready (services may start before it is fully available)
- Insufficient disk space for new schemas
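To check the first cause, compare active connections against the configured maximum — a sketch using the statefulset and user names from the diagnostics section below:

```shell
# Count active PostgreSQL connections vs. the configured limit.
kubectl exec -n eloquent statefulset/postgresql -- \
  psql -U eloquent -t -c \
  "SELECT count(*) AS used, current_setting('max_connections') AS max_conn FROM pg_stat_activity;"
```

If `used` is at or near `max_conn`, raise `max_connections` or add connection pooling before retrying the provisioning.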
Chat Not Connecting
Symptoms: Chat widget or UI shows "disconnected" or messages don't appear.
Diagnosis:
# Check chat service health
kubectl logs -n eloquent deployment/chat-service --tail=100
# Verify NATS connectivity
kubectl exec -n eloquent deployment/chat-service -- \
curl -s http://localhost/health
Common causes:
- Chat ingress not configured for WebSocket/SSE (missing proxy-buffering: off)
- Sticky sessions not enabled for WebSocket transport
- Message queue storage not provisioned for the organization
- Read timeout too short (needs 3600s for long-lived connections)
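For an NGINX ingress controller, the ingress-related causes above map to annotations like these — a sketch; adapt to your controller and to wherever the chart exposes ingress annotations:

```yaml
# Ingress annotations for the chat service (nginx-ingress controller).
annotations:
  nginx.ingress.kubernetes.io/proxy-buffering: "off"      # required for SSE
  nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"  # long-lived connections
  nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
  nginx.ingress.kubernetes.io/affinity: "cookie"          # sticky sessions for WebSocket
```

Other ingress controllers expose equivalent settings under different names; the key requirements are unbuffered responses, long read timeouts, and session affinity.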
Products Fail to Install
Symptoms: Product installation in Admin App shows failed artifacts.
Diagnosis:
- Check the artifact log in the product management dialog for specific error messages
- Check service logs for the failing service:
kubectl logs -n eloquent deployment/admin-service --tail=200
Common causes:
- Target service not running (check pod status)
- Database schema creation failed (permissions or space)
- Internal timeout during artifact provisioning
Diagnostic Commands
PostgreSQL
# Connect to PostgreSQL
kubectl exec -it -n eloquent statefulset/postgresql -- psql -U eloquent
# List databases
\l
ClickHouse
# Connect to ClickHouse
kubectl exec -it -n eloquent statefulset/clickhouse -- clickhouse-client
# List databases
SHOW DATABASES;
Redis
# Check Redis health
kubectl exec -n eloquent statefulset/redis -- redis-cli ping
# Check memory usage
kubectl exec -n eloquent statefulset/redis -- redis-cli info memory
# Check key count
kubectl exec -n eloquent statefulset/redis -- redis-cli dbsize
General Kubernetes
# All pods status
kubectl get pods -n eloquent
# All deployments
kubectl get deployments -n eloquent
# Resource usage
kubectl top pods -n eloquent
# Events (useful for scheduling/resource issues)
kubectl get events -n eloquent --sort-by='.lastTimestamp'
Service Health Checks
Every service exposes an HTTP health endpoint. Use these to verify individual service health:
# Port-forward to a service
kubectl port-forward -n eloquent deployment/<service-name> 8080:80
# Check health
curl http://localhost:8080/health
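When a pod is still starting, a single curl can fail even though the service is fine. A small retry loop helps — `wait_healthy` is a hypothetical helper, assuming the port-forward above is running:

```shell
# Poll a health endpoint until it responds, up to a retry limit.
wait_healthy() {
  url=$1; tries=${2:-10}
  i=1
  while [ "$i" -le "$tries" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo healthy; return 0
    fi
    sleep 1; i=$((i + 1))
  done
  echo unhealthy; return 1
}
# usage: wait_healthy http://localhost:8080/health 30
```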
Graceful Shutdown
All services are configured with a 30-second graceful shutdown timeout. During shutdown:
- Service stops accepting new requests
- In-flight requests complete (up to 30 seconds)
- Database and messaging connections are closed
- Pod terminates
If pods take longer than 30 seconds to stop, Kubernetes will force-kill them.
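If a service legitimately needs more drain time, the pod's grace period can be raised via the standard Kubernetes field — shown here as a raw pod spec fragment; your chart may expose it as a per-service value instead:

```yaml
# Pod spec field behind the 30-second default.
spec:
  terminationGracePeriodSeconds: 60
```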
Deployment Rollback
If a new deployment introduces issues:
# Rollback a single service
kubectl rollout undo deployment/<service-name> -n eloquent
# Rollback the entire Helm release
helm rollback eloquent <previous-revision> -n eloquent
# Check Helm revision history
helm history eloquent -n eloquent
Scaling Issues
Symptoms: Slow responses, timeouts, or OOM kills.
Diagnosis:
# Check resource usage
kubectl top pods -n eloquent
# Check HPA status (if enabled)
kubectl get hpa -n eloquent
# Check for OOMKilled pods
kubectl get pods -n eloquent -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled
Resolution:
- Increase resource limits for the affected service in Helm values
- Enable autoscaling if not already active
- Check if the node has available capacity (kubectl top nodes)
- For persistent high load, increase replica count
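The last two resolutions can be expressed together in Helm values. A sketch — the key names are illustrative of a typical per-service autoscaling block, not confirmed chart keys:

```yaml
# Hypothetical per-service autoscaling values.
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 6
  targetCPUUtilizationPercentage: 70
```

With an HPA enabled, leave `replicaCount` alone and let the autoscaler manage the range between min and max replicas.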