Troubleshooting Guide

Common issues, diagnostic commands, and resolution steps for Eloquent platform operations.

Common Issues

Services Won't Start (CrashLoopBackOff)

Symptoms: Pods show CrashLoopBackOff status, restart count is increasing.

Diagnosis:

# Check pod status
kubectl get pods -n eloquent

# View logs for the crashing service
kubectl logs -n eloquent deployment/<service-name> --previous

# Get detailed pod info
kubectl describe pod -n eloquent -l app=<service-name>

Common causes:

  • Database connection failure — verify PostgreSQL/ClickHouse/Redis are running
  • Missing environment variables — check Helm values for required secrets
  • Port conflict — verify no other service is using the same port
  • Insufficient resources — check if the node has available CPU/memory
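The last two causes are typically addressed in Helm values. A minimal sketch of where resources and secrets might live (the key names here are assumptions, not the chart's actual schema; check your values.yaml for the real structure):

```yaml
# Hypothetical Helm values sketch -- key and service names are assumptions.
services:
  chat-service:
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
secrets:
  jwtSecret: "<shared-secret>"   # a missing required secret is a common crash cause
```

If a pod crashes immediately at startup, the `--previous` logs above usually name the missing variable or unreachable dependency explicitly.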

JWT Authentication Failures

Symptoms: API requests return 401 Unauthorized across all endpoints.

Resolution: The jwtSecret must be identical across the API Gateway and all backend services. Verify in your Helm values:

secrets:
  jwtSecret: "<must-be-the-same-everywhere>"

If you recently changed the secret, all active sessions are invalidated and users must log in again.
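Why a mismatch produces a blanket 401: HMAC-based signatures are a function of the secret, so a token signed by the API Gateway fails verification on any backend holding a different value. A purely local illustration of the signature step (this is not the platform's actual token format, just the HMAC dependence on the key):

```shell
# Illustration only: HMAC-SHA256 of the same payload under two different
# secrets never matches, so the verifying side rejects the token.
payload='{"sub":"user-1","exp":1700000000}'
sig_gateway=$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "secret-a" | awk '{print $NF}')
sig_backend=$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "secret-b" | awk '{print $NF}')
if [ "$sig_gateway" != "$sig_backend" ]; then
  echo "signature mismatch -> 401 Unauthorized"
fi
# prints: signature mismatch -> 401 Unauthorized
```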

Organization Provisioning Fails

Symptoms: Creating a new org in the Admin App shows errors or hangs.

Diagnosis:

# Check admin service logs
kubectl logs -n eloquent deployment/admin-service --tail=200

# Verify database connectivity
kubectl exec -n eloquent deployment/admin-service -- \
  curl -s http://localhost/health

Common causes:

  • PostgreSQL connection limit reached
  • Message queue not ready (services may start before it is fully available)
  • Insufficient disk space for new schemas

Chat Not Connecting

Symptoms: Chat widget or UI shows "disconnected" or messages don't appear.

Diagnosis:

# Check chat service health
kubectl logs -n eloquent deployment/chat-service --tail=100

# Verify NATS connectivity
kubectl exec -n eloquent deployment/chat-service -- \
  curl -s http://localhost/health

Common causes:

  • Chat ingress not configured for WebSocket/SSE (missing proxy-buffering: off)
  • Sticky sessions not enabled for WebSocket transport
  • Message queue storage not provisioned for the organization
  • Read timeout too short (needs 3600s for long-lived connections)
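If you run an ingress-nginx controller, the first, second, and fourth causes map to annotations along these lines (a sketch only; the ingress name, hostname, and service port are assumptions to adapt):

```yaml
# Sketch for ingress-nginx; annotation values follow the causes above.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: chat-ingress                # name assumed
  namespace: eloquent
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffering: "off"      # required for SSE
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"  # long-lived connections
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/affinity: "cookie"          # sticky sessions for WebSocket
spec:
  rules:
    - host: chat.example.com        # hostname assumed
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: chat-service
                port:
                  number: 80
```

Other ingress controllers have equivalent settings under different names; the key requirements are unbuffered responses, sticky routing, and a read timeout of at least 3600s.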

Products Fail to Install

Symptoms: Product installation in Admin App shows failed artifacts.

Diagnosis:

  1. Check the artifact log in the product management dialog for specific error messages
  2. Check service logs for the failing service:

kubectl logs -n eloquent deployment/admin-service --tail=200

Common causes:

  • Target service not running (check pod status)
  • Database schema creation failed (permissions or space)
  • Internal timeout during artifact provisioning

Diagnostic Commands

PostgreSQL

# Connect to PostgreSQL
kubectl exec -it -n eloquent statefulset/postgresql -- psql -U eloquent

# List databases
\l

ClickHouse

# Connect to ClickHouse
kubectl exec -it -n eloquent statefulset/clickhouse -- clickhouse-client

# List databases
SHOW DATABASES;

Redis

# Check Redis health
kubectl exec -n eloquent statefulset/redis -- redis-cli ping

# Check memory usage
kubectl exec -n eloquent statefulset/redis -- redis-cli info memory

# Check key count
kubectl exec -n eloquent statefulset/redis -- redis-cli dbsize

General Kubernetes

# All pods status
kubectl get pods -n eloquent

# All deployments
kubectl get deployments -n eloquent

# Resource usage
kubectl top pods -n eloquent

# Events (useful for scheduling/resource issues)
kubectl get events -n eloquent --sort-by='.lastTimestamp'

Service Health Checks

Every service exposes an HTTP health endpoint. Use these to verify individual service health:

# Port-forward to a service
kubectl port-forward -n eloquent deployment/<service-name> 8080:80

# Check health
curl http://localhost:8080/health
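The same /health endpoint can also back Kubernetes probes. A sketch of how that typically looks in a container spec (the port and timings are assumptions to tune per service):

```yaml
# Probe sketch -- the path matches the /health endpoint above;
# port and timings are assumptions.
livenessProbe:
  httpGet:
    path: /health
    port: 80
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 5
```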

Graceful Shutdown

All services are configured with a 30-second graceful shutdown timeout. During shutdown:

  1. Service stops accepting new requests
  2. In-flight requests complete (up to 30 seconds)
  3. Database and messaging connections are closed
  4. Pod terminates

If pods take longer than 30 seconds to stop, Kubernetes will force-kill them.
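The 30-second window corresponds to the pod's terminationGracePeriodSeconds, which Kubernetes also defaults to 30. If a service legitimately needs longer to drain, raise that value in the pod spec rather than accepting force-kills:

```yaml
# Pod spec fragment: give slow-draining services more time before SIGKILL.
spec:
  terminationGracePeriodSeconds: 60   # Kubernetes default is 30
```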

Deployment Rollback

If a new deployment introduces issues:

# Rollback a single service
kubectl rollout undo deployment/<service-name> -n eloquent

# Rollback the entire Helm release
helm rollback eloquent <previous-revision> -n eloquent

# Check Helm revision history
helm history eloquent -n eloquent

Scaling Issues

Symptoms: Slow responses, timeouts, or OOM kills.

Diagnosis:

# Check resource usage
kubectl top pods -n eloquent

# Check HPA status (if enabled)
kubectl get hpa -n eloquent

# Check for OOMKilled containers (lists pod name alongside the termination reason)
kubectl get pods -n eloquent -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled

Resolution:

  • Increase resource limits for the affected service in Helm values
  • Enable autoscaling if not already active
  • Check if the node has available capacity (kubectl top nodes)
  • For persistent high load, increase replica count
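If the chart does not manage autoscaling for you, a standalone HorizontalPodAutoscaler is one way to enable it. A sketch (the target deployment, replica bounds, and CPU threshold are assumptions):

```yaml
# HPA sketch -- deployment name and thresholds are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: chat-service
  namespace: eloquent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: chat-service
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```

Note that CPU-based HPA requires the metrics server to be running (the same dependency behind kubectl top).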