Troubleshooting Guide
Common issues, diagnostic commands, and resolution steps for Eloquent platform operations.
Common Issues
Services Won't Start (CrashLoopBackOff)
Symptoms: Pods show CrashLoopBackOff status, restart count is increasing.
Diagnosis:
# Check pod status
kubectl get pods -n eloquent
# View logs for the crashing service
kubectl logs -n eloquent deployment/<service-name> --previous
# Get detailed pod info
kubectl describe pod -n eloquent -l app=<service-name>
Common causes:
- Database connection failure — verify PostgreSQL/ClickHouse/Redis are running
- Missing environment variables — check Helm values for required secrets
- Port conflict — verify no other service is using the same port
- Insufficient resources — check if the node has available CPU/memory
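If the cause is missing configuration or resource pressure, the fix usually lands in Helm values. A minimal sketch — the per-service key names below are illustrative, so check your chart's values.yaml for the real ones:

```yaml
# Hypothetical per-service values -- adjust key names to match the chart.
chat-service:
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
  env:
    DATABASE_URL: postgres://eloquent:<password>@postgresql:5432/eloquent
```

After editing values, apply them with `helm upgrade` and watch the pod restart cleanly.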
JWT Authentication Failures
Symptoms: API requests return 401 Unauthorized across all endpoints.
Resolution:
The jwtSecret must be identical across the API Gateway and all backend services. Verify in your Helm values:
secrets:
jwtSecret: "<must-be-the-same-everywhere>"
If you recently changed the secret, all active sessions are invalidated. Users must re-login.
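To see what a rejected token actually contains (issuer, expiry, subject), you can decode its payload locally. A sketch using only standard shell tools — the token below is a stand-in, not a real Eloquent token:

```shell
# Decode the payload (second dot-separated segment) of a JWT.
# base64url may lack padding, so pad to a multiple of 4 first.
token="eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJkZW1vIn0.c2ln"
payload=$(printf '%s' "$token" | cut -d. -f2)
while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
printf '%s' "$payload" | tr -- '-_' '+/' | base64 -d; echo
```

If the decoded `exp` claim is in the past or the issuer is unexpected, the problem is the token itself rather than the shared secret.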
Organization Provisioning Fails
Symptoms: Creating a new org in the Admin App shows errors or hangs.
Diagnosis:
# Check admin service logs
kubectl logs -n eloquent deployment/admin-service --tail=200
# Verify database connectivity
kubectl exec -n eloquent deployment/admin-service -- \
curl -s http://localhost/health
Common causes:
- PostgreSQL connection limit reached
- Message queue not ready (services may start before it is fully available)
- Insufficient disk space for new schemas
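To check the first cause, compare active connections against the configured maximum — a sketch using the statefulset and user names from the diagnostics section below:

```shell
# Count active PostgreSQL connections vs. the configured limit.
kubectl exec -n eloquent statefulset/postgresql -- \
  psql -U eloquent -t -c \
  "SELECT count(*) AS used, current_setting('max_connections') AS max_conn FROM pg_stat_activity;"
```

If `used` is at or near `max_conn`, raise `max_connections` or add connection pooling before retrying the provisioning.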
Chat Not Connecting
Symptoms: Chat widget or UI shows "disconnected" or messages don't appear.
Diagnosis:
# Check chat service health
kubectl logs -n eloquent deployment/chat-service --tail=100
# Verify NATS connectivity
kubectl exec -n eloquent deployment/chat-service -- \
curl -s http://localhost/health
Common causes:
- Chat ingress not configured for WebSocket/SSE (missing proxy-buffering: off)
- Sticky sessions not enabled for WebSocket transport
- Message queue storage not provisioned for the organization
- Read timeout too short (needs 3600s for long-lived connections)
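For an NGINX ingress controller, the ingress-related causes above map to annotations like these — a sketch; adapt to your controller and to wherever the chart exposes ingress annotations:

```yaml
# Ingress annotations for the chat service (nginx-ingress controller).
annotations:
  nginx.ingress.kubernetes.io/proxy-buffering: "off"      # required for SSE
  nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"  # long-lived connections
  nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
  nginx.ingress.kubernetes.io/affinity: "cookie"          # sticky sessions for WebSocket
```

Other ingress controllers expose equivalent settings under different names; the key requirements are unbuffered responses, long read timeouts, and session affinity.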
Products Fail to Install
Symptoms: Product installation in Admin App shows failed artifacts.
Diagnosis:
- Check the artifact log in the product management dialog for specific error messages
- Check service logs for the failing service:
kubectl logs -n eloquent deployment/admin-service --tail=200
Common causes:
- Target service not running (check pod status)
- Database schema creation failed (permissions or space)
- Internal timeout during artifact provisioning
Diagnostic Commands
PostgreSQL
# Connect to PostgreSQL
kubectl exec -it -n eloquent statefulset/postgresql -- psql -U eloquent
# List databases
\l
ClickHouse
# Connect to ClickHouse
kubectl exec -it -n eloquent statefulset/clickhouse -- clickhouse-client
# List databases
SHOW DATABASES;
Redis
# Check Redis health
kubectl exec -n eloquent statefulset/redis -- redis-cli ping
# Check memory usage
kubectl exec -n eloquent statefulset/redis -- redis-cli info memory
# Check key count
kubectl exec -n eloquent statefulset/redis -- redis-cli dbsize
General Kubernetes
# All pods status
kubectl get pods -n eloquent
# All deployments
kubectl get deployments -n eloquent
# Resource usage
kubectl top pods -n eloquent
# Events (useful for scheduling/resource issues)
kubectl get events -n eloquent --sort-by='.lastTimestamp'
Service Health Checks
Every service exposes an HTTP health endpoint. Use these to verify individual service health:
# Port-forward to a service
kubectl port-forward -n eloquent deployment/<service-name> 8080:80
# Check health
curl http://localhost:8080/health
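When a pod is still starting, a single curl can fail even though the service is fine. A small retry loop helps — `wait_healthy` is a hypothetical helper, assuming the port-forward above is running:

```shell
# Poll a health endpoint until it responds, up to a retry limit.
wait_healthy() {
  url=$1; tries=${2:-10}
  i=1
  while [ "$i" -le "$tries" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo healthy; return 0
    fi
    sleep 1; i=$((i + 1))
  done
  echo unhealthy; return 1
}
# usage: wait_healthy http://localhost:8080/health 30
```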
Graceful Shutdown
All services are configured with a 30-second graceful shutdown timeout. During shutdown:
- Service stops accepting new requests
- In-flight requests complete (up to 30 seconds)
- Database and messaging connections are closed
- Pod terminates
If pods take longer than 30 seconds to stop, Kubernetes will force-kill them.
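If a service legitimately needs more drain time, the pod's grace period can be raised via the standard Kubernetes field — shown here as a raw pod spec fragment; your chart may expose it as a per-service value instead:

```yaml
# Pod spec field behind the 30-second default.
spec:
  terminationGracePeriodSeconds: 60
```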
Deployment Rollback
If a new deployment introduces issues:
# Rollback a single service
kubectl rollout undo deployment/<service-name> -n eloquent
# Rollback the entire Helm release
helm rollback eloquent <previous-revision> -n eloquent
# Check Helm revision history
helm history eloquent -n eloquent
Scaling Issues
Symptoms: Slow responses, timeouts, or OOM kills.
Diagnosis:
# Check resource usage
kubectl top pods -n eloquent
# Check HPA status (if enabled)
kubectl get hpa -n eloquent
# Check for OOMKilled pods
kubectl get pods -n eloquent -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled
Resolution:
- Increase resource limits for the affected service in Helm values
- Enable autoscaling if not already active
- Check if the node has available capacity (kubectl top nodes)
- For persistent high load, increase replica count
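The last two resolutions can be expressed together in Helm values. A sketch — the key names are illustrative of a typical per-service autoscaling block, not confirmed chart keys:

```yaml
# Hypothetical per-service autoscaling values.
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 6
  targetCPUUtilizationPercentage: 70
```

With an HPA enabled, leave `replicaCount` alone and let the autoscaler manage the range between min and max replicas.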