# Alert System Configuration Guide

## Overview
The CoinKrazy alert system provides real-time monitoring, automatic alert triggering, and multi-channel delivery (Slack, PagerDuty, Email).

## Database Setup

### 1. Execute Migration SQL
Run the alert system migration to create all required tables:

```bash
# Execute the migration SQL
mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASSWORD" "$DB_NAME" < drizzle/migrations/alert_system.sql
```

This creates:
- `alert_logs` - Alert history and status tracking
- `alert_templates` - Notification message templates
- `alert_delivery_logs` - Delivery tracking for each channel
- `alert_template_usage` - Template effectiveness metrics

### 2. Verify Tables
```sql
SHOW TABLES LIKE 'alert%';
SELECT COUNT(*) FROM alert_templates;
```

## Environment Configuration

### Required Environment Variables

Add these to your `.env` file or Settings → Secrets:

#### Slack Integration
```
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
```

Get your webhook URL from: https://api.slack.com/apps → Your App → Incoming Webhooks

#### PagerDuty Integration
```
PAGERDUTY_INTEGRATION_KEY=your-integration-key-here
```

Get your integration key from: https://pagerduty.com → Services → Your Service → Integrations

#### Email Configuration
```
ALERT_EMAIL_RECIPIENTS=admin@example.com,ops@example.com,support@example.com
BREVO_API_KEY=your-brevo-api-key
```

Get your Brevo API key from: https://app.brevo.com → Settings → API Keys

### Optional Environment Variables

```
# Alert monitoring interval (milliseconds, default: 60000)
ALERT_CHECK_INTERVAL=60000

# Maximum retry attempts for failed deliveries
ALERT_MAX_RETRIES=3

# Alert retention period (days, default: 90)
ALERT_RETENTION_DAYS=90
```
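A missing or malformed variable is easier to diagnose at startup than at delivery time. The following is a minimal sketch of a startup check, not the application's actual loader: `AlertConfig`, `loadAlertConfig`, and the helper names are hypothetical, and the fallback of 3 for `ALERT_MAX_RETRIES` is an assumption taken from the example value above, since no default is documented for it.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface AlertConfig {
  slackWebhookUrl: string;
  pagerDutyIntegrationKey: string;
  brevoApiKey: string;
  emailRecipients: string[];
  checkIntervalMs: number;
  maxRetries: number;
  retentionDays: number;
}

type Env = Record<string, string | undefined>;

// Returns the trimmed value or throws if the variable is unset or empty.
function requireVar(env: Env, name: string): string {
  const value = env[name]?.trim();
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Parses a non-negative integer, falling back when the variable is unset.
function intVar(env: Env, name: string, fallback: number): number {
  const raw = env[name]?.trim();
  if (!raw) return fallback;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return parsed;
}

function loadAlertConfig(env: Env): AlertConfig {
  return {
    slackWebhookUrl: requireVar(env, "SLACK_WEBHOOK_URL"),
    pagerDutyIntegrationKey: requireVar(env, "PAGERDUTY_INTEGRATION_KEY"),
    brevoApiKey: requireVar(env, "BREVO_API_KEY"),
    // Comma-separated list; whitespace around entries is ignored.
    emailRecipients: requireVar(env, "ALERT_EMAIL_RECIPIENTS")
      .split(",")
      .map((entry) => entry.trim())
      .filter((entry) => entry.length > 0),
    checkIntervalMs: intVar(env, "ALERT_CHECK_INTERVAL", 60000),
    maxRetries: intVar(env, "ALERT_MAX_RETRIES", 3), // assumed default
    retentionDays: intVar(env, "ALERT_RETENTION_DAYS", 90),
  };
}
```

In a Node.js process this could be called once at boot as `loadAlertConfig(process.env)`, so that a bad value stops the server before monitoring starts.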

## Configuration via Admin Dashboard

### 1. Webhook Configuration
Navigate to: Admin → Webhook Configuration

- Test Slack webhook connectivity
- Test PagerDuty integration
- Test email delivery
- Configure recipient lists

### 2. Alert Thresholds
Navigate to: Admin → Monitoring → Alert Thresholds

Configure thresholds for:
- **Latency Critical**: WebSocket latency > 500ms
- **Latency Warning**: WebSocket latency > 300ms
- **Delivery Failure**: Success rate < 90%
- **Forecast Accuracy**: Accuracy < 85%
- **Connection Drop**: Active connections < 100
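To make the direction of each comparison explicit, the thresholds above can be written as data and evaluated against a metrics snapshot. This is a sketch under stated assumptions: `latency_critical`, `websocket_latency`, and the `gt` operator appear in the API reference of this guide, while the `lt` operator, the other alert type and metric names, and the `evaluate` helper are hypothetical. Severities follow the default templates.

```typescript
type Operator = "gt" | "lt"; // "lt" is an assumed counterpart to "gt"
type Severity = "critical" | "warning" | "info";

interface Threshold {
  alertType: string;
  metric: string;
  operator: Operator;
  value: number;
  severity: Severity;
}

// Default values from the list above; names other than the first row's are illustrative.
const defaultThresholds: Threshold[] = [
  { alertType: "latency_critical", metric: "websocket_latency", operator: "gt", value: 500, severity: "critical" },
  { alertType: "latency_warning", metric: "websocket_latency", operator: "gt", value: 300, severity: "warning" },
  { alertType: "delivery_failure", metric: "delivery_success_rate", operator: "lt", value: 90, severity: "warning" },
  { alertType: "forecast_accuracy", metric: "forecast_accuracy", operator: "lt", value: 85, severity: "warning" },
  { alertType: "connection_drop", metric: "active_connections", operator: "lt", value: 100, severity: "info" },
];

// True when the observed value is on the alerting side of the threshold.
function isBreached(threshold: Threshold, observed: number): boolean {
  return threshold.operator === "gt"
    ? observed > threshold.value
    : observed < threshold.value;
}

// Returns every threshold breached by the snapshot; metrics absent from it are skipped.
function evaluate(
  metrics: Record<string, number | undefined>,
  thresholds: Threshold[],
): Threshold[] {
  return thresholds.filter((threshold) => {
    const observed = metrics[threshold.metric];
    return observed !== undefined && isBreached(threshold, observed);
  });
}
```

For example, a snapshot with 350 ms latency, a 95% success rate, and 80 active connections would breach `latency_warning` and `connection_drop` only. Note that loosening an upper-bound (`gt`) threshold means raising its value, while loosening a lower-bound (`lt`) threshold means lowering it.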

### 3. Escalation Policies
Navigate to: Admin → Escalation Policies

Define escalation chains:
- Initial notification (Slack)
- 5 min escalation (PagerDuty)
- 15 min escalation (Email + SMS)
- 30 min escalation (On-call manager)
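The chain above can be modeled as an ordered list of steps keyed by elapsed time. The sketch below assumes the alert stays unacknowledged and that every step whose delay has passed should have fired; `EscalationStep`, `stepsDue`, and the channel identifiers are hypothetical names, not the dashboard's actual data model.

```typescript
interface EscalationStep {
  afterMinutes: number; // delay since the alert was triggered
  channels: string[];   // illustrative channel identifiers
}

// The example chain from the list above.
const escalationChain: EscalationStep[] = [
  { afterMinutes: 0, channels: ["slack"] },
  { afterMinutes: 5, channels: ["pagerduty"] },
  { afterMinutes: 15, channels: ["email", "sms"] },
  { afterMinutes: 30, channels: ["on_call_manager"] },
];

// Returns the steps that are due for an alert that has been open
// (and unacknowledged) for the given number of minutes.
function stepsDue(chain: EscalationStep[], elapsedMinutes: number): EscalationStep[] {
  return chain.filter((step) => step.afterMinutes <= elapsedMinutes);
}
```

An alert that has been open for 16 minutes would have reached the Slack, PagerDuty, and Email + SMS steps, but not yet the on-call manager.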

## Alert Templates

Default templates are automatically created:

### 1. Critical Latency Alert
- **Slack**: 🚨 Critical Latency Alert with metrics
- **Email**: HTML formatted with context
- **PagerDuty**: Severity: critical

### 2. Delivery Failure Alert
- **Slack**: ⚠️ Delivery Failure Alert with channel info
- **Email**: Formatted with success rate
- **PagerDuty**: Severity: warning

### 3. Forecast Accuracy Alert
- **Slack**: 📊 Forecast Accuracy Alert
- **Email**: Model performance details
- **PagerDuty**: Severity: warning

### 4. Connection Drop Alert
- **Slack**: 🔌 Connection Drop Alert
- **Email**: Connection recovery details
- **PagerDuty**: Severity: info
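The PagerDuty severities listed for each template map naturally onto an event body. The sketch below assumes the integration uses the PagerDuty Events API v2, where the integration key is sent as `routing_key`; this guide does not confirm which PagerDuty API the system calls. The alert type keys, the `coinkrazy-alerts` source string, and `buildPagerDutyEvent` are hypothetical.

```typescript
// Severity levels accepted by the PagerDuty Events API v2.
type PagerDutySeverity = "critical" | "error" | "warning" | "info";

// Severity per default template, as listed above; keys are illustrative.
const templateSeverity: Record<string, PagerDutySeverity | undefined> = {
  latency_critical: "critical",
  delivery_failure: "warning",
  forecast_accuracy: "warning",
  connection_drop: "info",
};

interface PagerDutyEvent {
  routing_key: string;
  event_action: "trigger";
  dedup_key: string;
  payload: {
    summary: string;
    source: string;
    severity: PagerDutySeverity;
  };
}

// Builds (but does not send) a trigger event for the given alert.
function buildPagerDutyEvent(
  integrationKey: string,
  alertId: string,
  alertType: string,
  summary: string,
): PagerDutyEvent {
  const severity = templateSeverity[alertType];
  if (severity === undefined) {
    throw new Error(`No default template for alert type: ${alertType}`);
  }
  return {
    routing_key: integrationKey,
    event_action: "trigger",
    dedup_key: alertId, // repeated triggers for one alert collapse into one incident
    payload: { summary, source: "coinkrazy-alerts", severity },
  };
}
```

Under that assumption, the resulting object would be sent as JSON in a POST request to `https://events.pagerduty.com/v2/enqueue`.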

## Testing

### 1. Test Alert Delivery
```bash
# Via tRPC
curl -X POST http://localhost:3000/api/trpc/alertTriggers.checkMetrics \
  -H "Content-Type: application/json" \
  -d '{}'
```

### 2. Manual Alert Trigger
Create a test alert from the admin dashboard: Monitoring → Check Metrics Now

### 3. Verify Deliveries
- Check Slack channel for alert message
- Check PagerDuty incident creation
- Check email inbox for alert notification

## Monitoring

### Real-time Dashboard
Navigate to: Admin → Monitoring Dashboard

View:
- Active alerts count
- Alert severity breakdown
- Delivery success rates
- Average resolution time
- Threshold status

### Alert History
Navigate to: Admin → Alert History

Search and filter:
- By severity (critical, warning, info)
- By status (active, acknowledged, resolved)
- By date range
- By alert type

### Metrics
- Total alerts triggered
- Average resolution time
- Delivery success rate by channel
- Most common alert types
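As an illustration of how the per-channel delivery metric can be derived, the sketch below computes a success percentage from delivery records. The `DeliveryRecord` shape is hypothetical and is not the actual schema of `alert_delivery_logs`.

```typescript
type Channel = "slack" | "pagerduty" | "email";

// Hypothetical record shape, one entry per delivery attempt.
interface DeliveryRecord {
  channel: Channel;
  delivered: boolean;
}

// Returns the success rate (0 to 100) for each channel that has at least one record.
function successRateByChannel(records: DeliveryRecord[]): Map<Channel, number> {
  const totals = new Map<Channel, { ok: number; all: number }>();
  for (const record of records) {
    const tally = totals.get(record.channel) ?? { ok: 0, all: 0 };
    tally.all += 1;
    if (record.delivered) tally.ok += 1;
    totals.set(record.channel, tally);
  }

  const rates = new Map<Channel, number>();
  totals.forEach((tally, channel) => {
    rates.set(channel, (tally.ok / tally.all) * 100);
  });
  return rates;
}
```

Channels with no attempts are absent from the result rather than reported as 0%, which avoids triggering a Delivery Failure alert for a channel that was simply never used.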

## Troubleshooting

### Alerts Not Triggering
1. Check monitoring status: Admin → Monitoring → Monitoring Status
2. Verify thresholds: Admin → Monitoring → Alert Thresholds
3. Check logs: `.manus-logs/devserver.log`

### Delivery Failures
1. Verify credentials in Settings → Secrets
2. Test webhook: Admin → Webhook Configuration → Test
3. Check delivery logs: Admin → Alert History → Delivery Logs

### High False Positive Rate
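1. Adjust thresholds: Admin → Monitoring → Alert Thresholds
2. Review alert history for patterns
3. Loosen thresholds incrementally: raise upper-bound values (e.g. latency) and lower lower-bound values (e.g. success rate, accuracy, active connections)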

## Best Practices

1. **Start Conservative**: Begin with loose thresholds that fire rarely, then tighten them gradually
2. **Test Thoroughly**: Use "Check Metrics Now" before enabling auto-monitoring
3. **Monitor Deliveries**: Check delivery success rates weekly
4. **Review Alerts**: Archive resolved alerts monthly for compliance
5. **Update Templates**: Customize templates for your team's needs
6. **Set Escalations**: Define clear escalation paths for critical alerts

## API Reference

### Get Active Alerts
```typescript
const alerts = await trpc.alertTriggers.getActiveAlerts.query();
```

### Update Threshold
```typescript
await trpc.alertTriggers.updateThreshold.mutate({
  alertType: 'latency_critical',
  metric: 'websocket_latency',
  operator: 'gt',
  value: 500,
  severity: 'critical'
});
```

### Resolve Alert
```typescript
await trpc.alertTriggers.resolveAlert.mutate({
  alertId: 'alert-001'
});
```

### Get Alert Statistics
```typescript
const stats = await trpc.alertTriggers.getAlertStats.query();
```

## Support

For issues or questions:
1. Check `.manus-logs/` for error details
2. Review alert history for patterns
3. Test webhook connectivity
4. Verify environment variables are set correctly