A SIEM that generates thousands of alerts per day with a 90% false positive rate isn't a security tool; it's an alert fatigue engine. Analysts who can't investigate everything stop investigating carefully, serious incidents hide in the noise, and meanwhile the platform costs $500K+ per year. If you're evaluating a SIEM implementation or looking to mature an existing deployment, this guide covers what high-performing SOC teams do differently.
The organisations getting real value from their SIEMs are doing a small number of things very differently from those with expensive, underperforming deployments. This guide is about those things.
The SIEM Performance Gap
The Ponemon Institute's SOC studies consistently find:
- Analysts spend 27% of their time on false positives
- Only 56% of SIEM alerts are investigated
- Mean dwell time (time from compromise to detection) remains over 200 days in many industries
Why? Most SIEM deployments fall into the same traps:
- Logging everything by default → terabytes of low-signal data
- Enabling all vendor rules out of the box → rule count optimised for marketing, not detection quality
- No tuning process → false positives accumulate and stay
- Detection that isn't tested → rules that haven't fired in months, and nobody knows if they work
- No feedback loop → analysts triaging alerts have no way to influence rule quality
Foundation: What to Log (and What Not To)
The most expensive mistake in SIEM operations is logging everything and assuming that more data means better detection. It doesn't. Low-signal data increases query costs, storage costs, and the background noise that hides real threats.
High-Value Log Sources (Log These First)
| Log Source | Why It Matters | Key Events |
|---|---|---|
| Identity provider (Azure AD, Okta) | Authentication is the #1 attack vector | Sign-ins, MFA events, role changes, token issuance |
| Endpoint (EDR) | Where most attacks begin and execute | Process execution, network connections, file modifications |
| Cloud platform (CloudTrail, Activity Log) | Where most sensitive data lives | API calls, IAM changes, resource creation/deletion |
| VPN / remote access | External entry points | Successful and failed authentications, geolocation |
| DNS | C2 detection, data exfiltration detection | All DNS queries (especially from endpoints) |
| Firewall / proxy | Network visibility | Allowed and denied outbound connections |
| Email security | Initial access via phishing | Delivered threats, blocked threats, link clicks |
| Key Vault / Secrets Manager | Credential theft detection | Access to secrets, especially out of hours |
| PAM | Privileged access monitoring | Session creation, commands run, approvals |
Lower-Value Sources (Think Before Logging)
| Source | Issue | Recommendation |
|---|---|---|
| Full packet capture | Petabytes of data, tiny signal | Log metadata (NetFlow/flow logs), not payloads |
| Verbose application logs | Millions of DEBUG entries daily | Log only WARN+ and security-relevant events |
| CDN access logs (all traffic) | Mostly legitimate users | Log WAF blocks and anomalies only |
| Performance monitoring | Not security-relevant | Route to observability platform, not SIEM |
| Complete S3 data event logs | Billions of events, mostly legitimate | Log GetObject only for specific sensitive buckets |
Tagging and Enrichment
Raw logs are hard to work with. Enrich at ingestion:
- Asset classification: Tag each log with asset tier (production/staging/dev) and asset type
- User context: Enrich authentication events with department, manager, employee type (FTE, contractor, vendor)
- Geolocation: IP → country, ASN, known VPN/proxy classification
- Threat intelligence: Enrich IP addresses and domains against TI feeds on ingestion
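As a rough sketch of what ingestion-time enrichment looks like in practice, the function below attaches all four context types to a raw event. The lookup tables and field names are invented for illustration; in a real pipeline they would be fed by your CMDB, HR system, GeoIP database, and TI feeds.

```python
# Hypothetical ingestion-time enrichment. All lookup tables and field
# names are illustrative, not tied to any specific SIEM or data source.
ASSET_TIERS = {"web-prod-01": "production", "build-03": "dev"}
USER_CONTEXT = {"alice": {"department": "Finance", "employee_type": "FTE"}}
GEO_BY_IP = {"203.0.113.7": {"country": "NL", "asn": "AS64500", "is_vpn": True}}
TI_BAD_IPS = {"198.51.100.9"}  # would come from threat intelligence feeds

def enrich(event: dict) -> dict:
    """Attach asset, user, geo, and threat-intel context to a raw log event."""
    enriched = dict(event)
    enriched["asset_tier"] = ASSET_TIERS.get(event.get("host"), "unknown")
    enriched["user_context"] = USER_CONTEXT.get(event.get("user"), {})
    enriched["geo"] = GEO_BY_IP.get(event.get("src_ip"), {})
    enriched["ti_match"] = event.get("src_ip") in TI_BAD_IPS
    return enriched

event = {"host": "web-prod-01", "user": "alice", "src_ip": "203.0.113.7"}
result = enrich(event)
```

Doing this once at ingestion, rather than at query time, means every detection rule and every analyst sees the same context for free.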
Detection Engineering: Quality Over Quantity
The single most impactful improvement most SIEM deployments can make is reducing detection quantity and improving detection quality.
The Detection Quality Framework
A high-quality detection has:
- A documented threat hypothesis: What attacker technique are we detecting? (MITRE ATT&CK mapping)
- Signal specificity: Does this event pattern indicate malicious behaviour with reasonable confidence?
- A tested FP rate: Has this rule been run against 30 days of historical data? What's the false positive volume?
- Defined triage steps: When this alert fires, what does an analyst do to validate or dismiss it?
- A known kill rate: How many True Positives has this rule generated in the past 90 days?
Rules with high FP rates and zero True Positives should be tuned or disabled; they're consuming analyst time without security value.
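One way to make that tune-or-disable decision mechanical is to keep per-rule TP/FP counters and review them on a schedule. The sketch below uses made-up thresholds and field names; the point is the shape of the decision, not the exact numbers.

```python
from dataclasses import dataclass

@dataclass
class RuleStats:
    """90-day counters for one detection rule (illustrative schema)."""
    name: str
    mitre_technique: str  # e.g. "T1059.001" for encoded PowerShell
    true_positives: int
    false_positives: int

def review_action(stats: RuleStats, max_fp_per_tp: float = 20.0) -> str:
    """Suggest what to do with a rule based on its TP/FP history."""
    if stats.true_positives == 0 and stats.false_positives > 0:
        return "disable-or-tune"  # all noise, no demonstrated signal
    if stats.false_positives > max_fp_per_tp * max(stats.true_positives, 1):
        return "tune"             # signal exists, but the FP cost is too high
    return "keep"

noisy = RuleStats("encoded-powershell", "T1059.001", 0, 140)
healthy = RuleStats("impossible-travel", "T1078", 6, 30)
```

The `max_fp_per_tp` ratio is an assumption to calibrate to your own SOC; the useful property is that the review becomes a query over data you already have, not a debate.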
Detection Rule Lifecycle
Hypothesis → Draft → Historical Validation → Staging (monitor only) → Production (alert) → Tune → Retire
┌───────────────────────────────────────────────────┐
│ Hypothesis: Attackers using living-off-the-land   │
│ techniques will run encoded PowerShell            │
└───────────────────────────────────────────────────┘
                         ↓
┌───────────────────────────────────────────────────┐
│ Draft detection (KQL/SPL):                        │
│   ProcessEvents                                   │
│   | where ProcessName == "powershell.exe"         │
│   | where CommandLine contains "-EncodedCommand"  │
│   | where ParentProcess !in ("expected_parents")  │
└───────────────────────────────────────────────────┘
                         ↓
┌───────────────────────────────────────────────────┐
│ Historical validation:                            │
│ - Run against 30 days of data                     │
│ - Identify false positives                        │
│ - Tune exclusions (known-good parents, users)     │
└───────────────────────────────────────────────────┘
                         ↓
┌───────────────────────────────────────────────────┐
│ Staging: Run in Log mode for 2 weeks              │
│ Analyst reviews output daily                      │
│ Target FP rate: < 5 FPs per day                   │
└───────────────────────────────────────────────────┘
                         ↓
┌───────────────────────────────────────────────────┐
│ Production: Alert mode. Document:                 │
│ - Expected FP rate                                │
│ - Triage playbook                                 │
│ - Exclusion management process                    │
└───────────────────────────────────────────────────┘
Sample High-Value Detection Rules
Suspicious sign-in pattern (Microsoft Sentinel / KQL):
// Sign-in succeeded after many failures from same IP
let failed_threshold = 10;
let time_window = 30m;
SigninLogs
| where TimeGenerated > ago(24h)
| where ResultType != "0" // Failed sign-ins (ResultType is a string in SigninLogs)
| summarize FailedCount = count() by IPAddress, UserPrincipalName, bin(TimeGenerated, time_window)
| where FailedCount >= failed_threshold
| join kind=inner (
SigninLogs
| where ResultType == "0" // Successful sign-ins
| where TimeGenerated > ago(24h)
) on IPAddress, UserPrincipalName
| where TimeGenerated1 > TimeGenerated // Success after failures
| project TimeGenerated, UserPrincipalName, IPAddress, FailedCount, SuccessTime = TimeGenerated1
Unusual admin action (Splunk SPL):
index=cloudtrail eventName IN (CreateUser, AttachUserPolicy, CreateAccessKey)
userIdentity.type=AssumedRole
| stats count by userIdentity.arn, sourceIPAddress, eventName
| where count > 3
| eval risk = case(
eventName=="CreateUser", "High",
eventName=="AttachUserPolicy", "Critical",
eventName=="CreateAccessKey", "High",
true(), "Medium"
)
| sort -risk, -count
Lateral movement detection:
// SMB/RDP connections from unusual sources (Windows Event Log)
SecurityEvent
| where EventID in (4624, 4625)
| where LogonType in (3, 10) // Network, RemoteInteractive
| where TargetUserName !endswith "$" // Exclude machine accounts; 4624/4625 record the logging-on account in TargetUserName
| summarize AttemptCount = count(), TargetHosts = dcount(Computer)
    by TargetUserName, IpAddress, bin(TimeGenerated, 1h)
| where TargetHosts > 5 // Accessing many hosts = lateral movement
| order by TargetHosts desc
Alert Triage Process
A well-defined triage process is what separates a functional SOC from alert chaos.
Triage Principles
Every alert gets a disposition:
- True Positive (TP): Real malicious activity → escalate to incident
- False Positive (FP): Known-good activity matching the rule → tune the rule, close alert
- Inconclusive: Suspicious but not confirmed malicious → monitor, gather context
- FP → exception: Known-good behaviour specific to this entity → add an exclusion to the rule
Document every decision. If an analyst closes an alert as FP, they should note why; this creates institutional knowledge and drives tuning.
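These dispositions map naturally onto fixed follow-up actions, which makes the "document every decision" rule enforceable in a case-management tool. A minimal sketch, with invented action and disposition names:

```python
# Map each triage disposition to its follow-up actions (illustrative;
# disposition keys and action names are invented for this sketch).
DISPOSITION_ACTIONS = {
    "TP": ["create_incident", "escalate"],
    "FP": ["close_alert", "flag_rule_for_tuning", "record_reason"],
    "inconclusive": ["monitor", "gather_context", "record_reason"],
    "fp_exception": ["close_alert", "add_entity_exclusion", "record_reason"],
}

def actions_for(disposition: str) -> list:
    """Look up the follow-up actions for a triage disposition."""
    # Unknown dispositions default to human review rather than silent closure.
    return DISPOSITION_ACTIONS.get(disposition, ["escalate_for_review"])
```

Note that every closure path includes `record_reason`; that recorded reason is the raw material for the tuning feedback loop described below.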
Tier 1 Triage Playbook (per alert)
1. Read the alert summary β what's the rule detecting?
2. Enrich the principal (user/device/IP):
- Is this a known IT admin? Contractor? Recently offboarded?
- Has this principal triggered similar alerts before?
- Any recent HR events (termination, role change)?
3. Examine the event in context:
- What happened before and after this event?
- Is this behaviour consistent with the principal's normal pattern?
- What time of day? From what location?
4. Disposition:
   - Clear FP → close, add exclusion if appropriate
   - Suspicious → escalate to Tier 2 with context
   - Confirmed TP → create incident ticket, escalate
5. Feedback to detection team:
   - If high FP rate on this rule → flag for tuning
Alert SLAs
| Alert Severity | Triage SLA | Escalation SLA |
|---|---|---|
| Critical | 15 minutes | 30 minutes |
| High | 1 hour | 4 hours |
| Medium | 4 hours | 24 hours |
| Low | 24 hours | 72 hours |
Track SLA compliance monthly. If analysts consistently miss SLAs, either increase staffing, reduce alert volume, or both.
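SLA compliance falls straight out of alert timestamps. A hedged sketch using the triage SLAs from the table above (the alert record fields are illustrative):

```python
from datetime import datetime, timedelta

# Triage SLAs from the table above.
TRIAGE_SLA = {
    "Critical": timedelta(minutes=15),
    "High": timedelta(hours=1),
    "Medium": timedelta(hours=4),
    "Low": timedelta(hours=24),
}

def sla_compliance(alerts: list) -> float:
    """Fraction of alerts triaged within their severity's SLA."""
    if not alerts:
        return 1.0  # vacuously compliant when there is nothing to triage
    met = sum(
        1 for a in alerts
        if a["triaged_at"] - a["created_at"] <= TRIAGE_SLA[a["severity"]]
    )
    return met / len(alerts)

t0 = datetime(2024, 1, 1, 9, 0)
alerts = [
    {"severity": "Critical", "created_at": t0, "triaged_at": t0 + timedelta(minutes=10)},
    {"severity": "High", "created_at": t0, "triaged_at": t0 + timedelta(hours=3)},
]
```

Running this monthly per severity tier shows whether misses are concentrated in one band (a volume problem) or spread evenly (a staffing problem).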
Threat Hunting
Threat hunting is proactive: analysts searching for threats that haven't triggered rules. It's different from alert triage:
- Alert triage: Reactive → respond to what the system flags
- Threat hunting: Proactive → search for evidence of techniques the system isn't detecting
Hunt workflow:
- Hypothesis: "Groups targeting our industry use Cobalt Strike with specific beacon intervals. Do we have unexplained beaconing traffic?"
- Data query: Search for hosts making repeated outbound connections at regular intervals to unfamiliar destinations
- Investigation: Examine each potential match: is this a known update service, monitoring agent, or something unexplained?
- Disposition: Benign → document and exclude. Suspicious → escalate. Confirmed → incident.
- Detection creation: If the hunt finds a real technique being used against you, create a rule to detect it automatically going forward.
Example hunt query (periodic beaconing detection):
// Detect hosts making periodic outbound connections (potential C2 beaconing)
// Low jitter between connections to an external IP suggests a timer, not a human.
// Note: prev() requires a serialized (sorted) row order and cannot be used
// inside summarize, so intervals are computed per row first, then aggregated.
NetworkFlow
| where TimeGenerated > ago(7d)
| where DestinationPort in (80, 443, 8080, 8443)
| where not(ipv4_is_private(DestinationIP)) // External only
| sort by SourceIP asc, DestinationIP asc, TimeGenerated asc
| extend PrevTime = prev(TimeGenerated), PrevSource = prev(SourceIP), PrevDest = prev(DestinationIP)
| where SourceIP == PrevSource and DestinationIP == PrevDest
| extend IntervalSeconds = datetime_diff("second", TimeGenerated, PrevTime)
| summarize ConnectionCount = count(), IntervalStdDev = stdev(IntervalSeconds)
    by SourceIP, DestinationIP, DestinationPort
| where ConnectionCount > 20 and IntervalStdDev < 300 // Regular beaconing (under 5 min of jitter)
| join kind=leftanti (
    // Exclude known-good destinations (CDN, monitoring, update services)
    ExternalWhitelist | where Type == "CDN" or Type == "UpdateService"
) on DestinationIP
SIEM Operations Metrics
Track these metrics to measure SIEM effectiveness:
| Metric | Target | Why It Matters |
|---|---|---|
| True Positive Rate (TPR) | > 10% of alerts | If < 5%, detection quality is poor |
| False Positive Rate (FPR) | < 50% | If > 80%, analysts disengage |
| Mean Time to Detect (MTTD) | < 1 hour for Critical | How fast does SIEM catch incidents? |
| Mean Time to Respond (MTTR) | < 4 hours for Critical | How fast does team act? |
| Alert volume per analyst | < 20/day per analyst | > 50/day causes burnout |
| Hunting hours per week | > 20% of analyst time | Proactive hunting finds what rules miss |
| Detection coverage (MITRE ATT&CK) | > 60% of TTPs | Are major technique families covered? |
Review these metrics monthly. Declining TPR or increasing alert volume per analyst are early warning signs.
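The first two rows of the table, plus alert volume per analyst, reduce to simple ratios over dispositioned alerts. A sketch with invented numbers, assuming dispositions are recorded as strings:

```python
def siem_metrics(dispositions: list, analysts: int, days: int) -> dict:
    """Compute TPR, FPR, and alert volume per analyst per day (illustrative)."""
    total = len(dispositions)
    tp = dispositions.count("TP")
    fp = dispositions.count("FP")
    return {
        "tpr": tp / total if total else 0.0,
        "fpr": fp / total if total else 0.0,
        "alerts_per_analyst_per_day": total / (analysts * days),
    }

# Hypothetical month: 4 analysts, 30 days, 1200 alerts:
# 120 TPs, 900 FPs, the rest inconclusive.
m = siem_metrics(
    ["TP"] * 120 + ["FP"] * 900 + ["inconclusive"] * 180,
    analysts=4,
    days=30,
)
```

This hypothetical SOC sits exactly at the 10% TPR target, within the 50-80% FPR warning band, and at a sustainable 10 alerts per analyst per day, so the monthly review would focus on FP tuning rather than staffing.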
SIEM Architecture for Scale
Log Routing and Tiering
Not all logs need to be searchable in real-time. Design a tiered architecture:
Hot tier (30-90 days): Real-time indexing in SIEM
  → Incident investigation and real-time detection
  → High cost per GB, fast query
Warm tier (90-365 days): Compressed, slower retrieval
  → Longer-window investigations, compliance queries
  → Lower cost, slower query
Cold tier (1-7 years): Archive storage (S3 Glacier, Azure Archive)
  → Regulatory retention, legal discovery
  → Very low cost, restore takes hours/days
Microsoft Sentinel's archive tier, Splunk SmartStore, and Elastic's frozen tier all implement this model.
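However your platform implements it, the routing decision itself is just an age threshold per tier. A trivial sketch with the boundaries from the tiers above (the function and tier names are illustrative, not a platform API):

```python
def storage_tier(age_days: int) -> str:
    """Pick a storage tier from log age, using the boundaries above (illustrative)."""
    if age_days <= 90:
        return "hot"    # real-time indexing, fast query, high cost per GB
    if age_days <= 365:
        return "warm"   # compressed, slower retrieval, lower cost
    return "cold"       # archive: regulatory retention, restore takes hours/days
```

In practice the boundaries should come from your retention policy and compliance obligations, not from platform defaults.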
Automation to Reduce Manual Load
SOAR (Security Orchestration, Automation, and Response) automates repetitive Tier 1 actions:
Automate fully (no analyst needed):
- Phishing email quarantine: EDR detects a malicious attachment → auto-quarantine the mailbox item
- Known-bad IP blocking: TI match → auto-block on firewall/WAF
- Automated password reset for accounts showing impossible travel
Automate with approval:
- Account disable: Suspicious account activity → analyst reviews → one-click disable
- Endpoint isolation: Suspicious malware activity → analyst reviews → one-click isolate
Keep manual:
- Incident declaration and escalation
- Customer notification decisions
- Evidence collection for legal
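These three automation tiers can be expressed as an explicit policy that every SOAR playbook consults before acting. A minimal sketch; the action names are invented for illustration:

```python
# Automation policy per action type (illustrative; action names invented).
FULL_AUTO = {
    "quarantine_phishing_email",
    "block_known_bad_ip",
    "reset_password_impossible_travel",
}
APPROVAL_REQUIRED = {"disable_account", "isolate_endpoint"}
MANUAL_ONLY = {"declare_incident", "notify_customer", "collect_legal_evidence"}

def execution_mode(action: str) -> str:
    """Return how a SOAR playbook should run this action."""
    if action in FULL_AUTO:
        return "auto"
    if action in APPROVAL_REQUIRED:
        return "await_analyst_approval"
    # Anything not explicitly allow-listed for automation stays manual:
    # defaulting to the safest mode prevents a new playbook action from
    # silently becoming fully automated.
    return "manual"
```

Keeping the policy in one place, rather than scattered across playbooks, makes the automation boundary auditable and easy to tighten after a bad auto-action.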
CyberneticsPlus helps organisations deploy, tune, and mature their SIEM programmes on Microsoft Sentinel and Splunk. Our SIEM implementation service and 24/7 security monitoring capabilities help you get real value from your security data investment. Contact us to improve your SIEM ROI.