A SIEM that generates thousands of alerts per day with a 90% false positive rate isn't a security tool; it's an alert fatigue engine. Analysts who can't investigate everything stop investigating carefully, serious incidents hide in the noise, and meanwhile the platform costs $500K+ per year. If you're evaluating a SIEM implementation or looking to mature an existing deployment, this guide covers what high-performing SOC teams do differently.
The organisations getting real value from their SIEMs are doing a small number of things very differently from those with expensive, underperforming deployments. This guide is about those things.
The SIEM Performance Gap
The Ponemon Institute's SOC studies consistently find:
- Analysts spend 27% of their time on false positives
- Only 56% of SIEM alerts are investigated
- Mean dwell time (time from compromise to detection) remains over 200 days in many industries
Why? Most SIEM deployments fall into the same traps:
- Logging everything by default → terabytes of low-signal data
- Enabling all vendor rules out of the box → rule count optimised for marketing, not detection quality
- No tuning process → false positives accumulate and stay
- Detection that isn't tested → rules that haven't fired in months, and nobody knows if they work
- No feedback loop → analysts triaging alerts have no way to influence rule quality
Foundation: What to Log (and What Not To)
The most expensive mistake in SIEM operations is logging everything and assuming that more data means better detection. It doesn't. Low-signal data increases query costs, storage costs, and the background noise that hides real threats.
High-Value Log Sources (Log These First)
| Log Source | Why It Matters | Key Events |
|---|---|---|
| Identity provider (Azure AD, Okta) | Authentication is the #1 attack vector | Sign-ins, MFA events, role changes, token issuance |
| Endpoint (EDR) | Where most attacks begin and execute | Process execution, network connections, file modifications |
| Cloud platform (CloudTrail, Activity Log) | Where most sensitive data lives | API calls, IAM changes, resource creation/deletion |
| VPN / remote access | External entry points | Successful and failed authentications, geolocation |
| DNS | C2 detection, data exfiltration detection | All DNS queries (especially from endpoints) |
| Firewall / proxy | Network visibility | Allowed and denied outbound connections |
| Email security | Initial access via phishing | Delivered threats, blocked threats, link clicks |
| Key Vault / Secrets Manager | Credential theft detection | Access to secrets, especially out of hours |
| PAM | Privileged access monitoring | Session creation, commands run, approvals |
Lower-Value Sources (Think Before Logging)
| Source | Issue | Recommendation |
|---|---|---|
| Full packet capture | Petabytes of data, tiny signal | Log metadata (NetFlow/flow logs), not payloads |
| Verbose application logs | Millions of DEBUG entries daily | Log only WARN+ and security-relevant events |
| CDN access logs (all traffic) | Mostly legitimate users | Log WAF blocks and anomalies only |
| Performance monitoring | Not security-relevant | Route to observability platform, not SIEM |
| Complete S3 data event logs | Billions of events, mostly legitimate | Log GetObject only for specific sensitive buckets |
Tagging and Enrichment
Raw logs are hard to work with. Enrich at ingestion:
- Asset classification: Tag each log with asset tier (production/staging/dev) and asset type
- User context: Enrich authentication events with department, manager, employee type (FTE, contractor, vendor)
- Geolocation: IP → country, ASN, known VPN/proxy classification
- Threat intelligence: Enrich IP addresses and domains against TI feeds on ingestion
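As a rough sketch of what ingestion-time enrichment looks like in practice, the function below attaches all four context types to a raw event. The lookup tables and field names are invented for illustration; in a real pipeline they would be fed by your CMDB, HR system, GeoIP database, and TI feeds.

```python
# Hypothetical ingestion-time enrichment. All lookup tables and field
# names are illustrative, not tied to any specific SIEM or data source.
ASSET_TIERS = {"web-prod-01": "production", "build-03": "dev"}
USER_CONTEXT = {"alice": {"department": "Finance", "employee_type": "FTE"}}
GEO_BY_IP = {"203.0.113.7": {"country": "NL", "asn": "AS64500", "is_vpn": True}}
TI_BAD_IPS = {"198.51.100.9"}  # would come from threat intelligence feeds

def enrich(event: dict) -> dict:
    """Attach asset, user, geo, and threat-intel context to a raw log event."""
    enriched = dict(event)
    enriched["asset_tier"] = ASSET_TIERS.get(event.get("host"), "unknown")
    enriched["user_context"] = USER_CONTEXT.get(event.get("user"), {})
    enriched["geo"] = GEO_BY_IP.get(event.get("src_ip"), {})
    enriched["ti_match"] = event.get("src_ip") in TI_BAD_IPS
    return enriched

event = {"host": "web-prod-01", "user": "alice", "src_ip": "203.0.113.7"}
result = enrich(event)
```

Doing this once at ingestion, rather than at query time, means every detection rule and every analyst sees the same context for free.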
Detection Engineering: Quality Over Quantity
The single most impactful improvement most SIEM deployments can make is reducing detection quantity and improving detection quality.
The Detection Quality Framework
A high-quality detection has:
- A documented threat hypothesis: What attacker technique are we detecting? (MITRE ATT&CK mapping)
- Signal specificity: Does this event pattern indicate malicious behaviour with reasonable confidence?
- A tested FP rate: Has this rule been run against 30 days of historical data? What's the false positive volume?
- Defined triage steps: When this alert fires, what does an analyst do to validate or dismiss it?
- A known kill rate: How many True Positives has this rule generated in the past 90 days?
Rules with high FP rates and zero True Positives should be tuned or disabled; they're consuming analyst time without security value.
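One way to make that tune-or-disable decision mechanical is to keep per-rule TP/FP counters and review them on a schedule. The sketch below uses made-up thresholds and field names; the point is the shape of the decision, not the exact numbers.

```python
from dataclasses import dataclass

@dataclass
class RuleStats:
    """90-day counters for one detection rule (illustrative schema)."""
    name: str
    mitre_technique: str  # e.g. "T1059.001" for encoded PowerShell
    true_positives: int
    false_positives: int

def review_action(stats: RuleStats, max_fp_per_tp: float = 20.0) -> str:
    """Suggest what to do with a rule based on its TP/FP history."""
    if stats.true_positives == 0 and stats.false_positives > 0:
        return "disable-or-tune"  # all noise, no demonstrated signal
    if stats.false_positives > max_fp_per_tp * max(stats.true_positives, 1):
        return "tune"             # signal exists, but the FP cost is too high
    return "keep"

noisy = RuleStats("encoded-powershell", "T1059.001", 0, 140)
healthy = RuleStats("impossible-travel", "T1078", 6, 30)
```

The `max_fp_per_tp` ratio is an assumption to calibrate to your own SOC; the useful property is that the review becomes a query over data you already have, not a debate.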
Detection Rule Lifecycle
Hypothesis → Draft → Historical Validation → Staging (monitor only) → Production (alert) → Tune → Retire
┌───────────────────────────────────────────────────┐
│ Hypothesis: Attackers using living-off-the-land   │
│ techniques will run encoded PowerShell            │
└───────────────────────────────────────────────────┘
                         ↓
┌───────────────────────────────────────────────────┐
│ Draft detection (KQL/SPL):                        │
│   ProcessEvents                                   │
│   | where ProcessName == "powershell.exe"         │
│   | where CommandLine contains "-EncodedCommand"  │
│   | where ParentProcess !in ("expected_parents")  │
└───────────────────────────────────────────────────┘
                         ↓
┌───────────────────────────────────────────────────┐
│ Historical validation:                            │
│ - Run against 30 days of data                     │
│ - Identify false positives                        │
│ - Tune exclusions (known-good parents, users)     │
└───────────────────────────────────────────────────┘
                         ↓
┌───────────────────────────────────────────────────┐
│ Staging: Run in Log mode for 2 weeks              │
│ Analyst reviews output daily                      │
│ Target FP rate: < 5 FPs per day                   │
└───────────────────────────────────────────────────┘
                         ↓
┌───────────────────────────────────────────────────┐
│ Production: Alert mode. Document:                 │
│ - Expected FP rate                                │
│ - Triage playbook                                 │
│ - Exclusion management process                    │
└───────────────────────────────────────────────────┘
Sample High-Value Detection Rules
Suspicious sign-in pattern (Microsoft Sentinel / KQL):
// Sign-in succeeded after many failures from same IP
let failed_threshold = 10;
let time_window = 30m;
SigninLogs
| where TimeGenerated > ago(24h)
| where ResultType != "0" // Failed sign-ins (ResultType is a string in SigninLogs)
| summarize FailedCount = count() by IPAddress, UserPrincipalName, bin(TimeGenerated, time_window)
| where FailedCount >= failed_threshold
| join kind=inner (
SigninLogs
| where ResultType == "0" // Successful sign-ins
| where TimeGenerated > ago(24h)
) on IPAddress, UserPrincipalName
| where TimeGenerated1 > TimeGenerated // Success after failures
| project TimeGenerated, UserPrincipalName, IPAddress, FailedCount, SuccessTime = TimeGenerated1
Unusual admin action (Splunk SPL):
index=cloudtrail eventName IN (CreateUser, AttachUserPolicy, CreateAccessKey)
userIdentity.type=AssumedRole
| stats count by userIdentity.arn, sourceIPAddress, eventName
| where count > 3
| eval risk = case(
eventName=="CreateUser", "High",
eventName=="AttachUserPolicy", "Critical",
eventName=="CreateAccessKey", "High",
true(), "Medium"
)
| sort -risk, -count
Lateral movement detection:
// SMB/RDP connections from unusual sources (Windows Event Log)
SecurityEvent
| where EventID in (4624, 4625)
| where LogonType in (3, 10) // Network, RemoteInteractive
| where TargetUserName !endswith "$" // Exclude machine accounts; 4624/4625 record the logging-on account in TargetUserName
| summarize AttemptCount = count(), TargetHosts = dcount(Computer)
    by TargetUserName, IpAddress, bin(TimeGenerated, 1h)
| where TargetHosts > 5 // Accessing many hosts = lateral movement
| order by TargetHosts desc
Alert Triage Process
A well-defined triage process is what separates a functional SOC from alert chaos.
Triage Principles
Every alert gets a disposition:
- True Positive (TP): Real malicious activity → escalate to incident
- False Positive (FP): Known-good activity matching the rule → tune the rule, close alert
- Inconclusive: Suspicious but not confirmed malicious → monitor, gather context
- FP → exception: Known-good behaviour specific to this entity → add an exclusion to the rule
Document every decision. If an analyst closes an alert as FP, they should note why; this creates institutional knowledge and drives tuning.
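These dispositions map naturally onto fixed follow-up actions, which makes the "document every decision" rule enforceable in a case-management tool. A minimal sketch, with invented action and disposition names:

```python
# Map each triage disposition to its follow-up actions (illustrative;
# disposition keys and action names are invented for this sketch).
DISPOSITION_ACTIONS = {
    "TP": ["create_incident", "escalate"],
    "FP": ["close_alert", "flag_rule_for_tuning", "record_reason"],
    "inconclusive": ["monitor", "gather_context", "record_reason"],
    "fp_exception": ["close_alert", "add_entity_exclusion", "record_reason"],
}

def actions_for(disposition: str) -> list:
    """Look up the follow-up actions for a triage disposition."""
    # Unknown dispositions default to human review rather than silent closure.
    return DISPOSITION_ACTIONS.get(disposition, ["escalate_for_review"])
```

Note that every closure path includes `record_reason`; that recorded reason is the raw material for the tuning feedback loop described below.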
Tier 1 Triage Playbook (per alert)
1. Read the alert summary β what's the rule detecting?
2. Enrich the principal (user/device/IP):
- Is this a known IT admin? Contractor? Recently offboarded?
- Has this principal triggered similar alerts before?
- Any recent HR events (termination, role change)?
3. Examine the event in context:
- What happened before and after this event?
- Is this behaviour consistent with the principal's normal pattern?
- What time of day? From what location?
4. Disposition:
   - Clear FP → close, add exclusion if appropriate
   - Suspicious → escalate to Tier 2 with context
   - Confirmed TP → create incident ticket, escalate
5. Feedback to detection team:
   - If high FP rate on this rule → flag for tuning
Alert SLAs
| Alert Severity | Triage SLA | Escalation SLA |
|---|---|---|
| Critical | 15 minutes | 30 minutes |
| High | 1 hour | 4 hours |
| Medium | 4 hours | 24 hours |
| Low | 24 hours | 72 hours |
Track SLA compliance monthly. If analysts consistently miss SLAs, either increase staffing, reduce alert volume, or both.
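SLA compliance falls straight out of alert timestamps. A hedged sketch using the triage SLAs from the table above (the alert record fields are illustrative):

```python
from datetime import datetime, timedelta

# Triage SLAs from the table above.
TRIAGE_SLA = {
    "Critical": timedelta(minutes=15),
    "High": timedelta(hours=1),
    "Medium": timedelta(hours=4),
    "Low": timedelta(hours=24),
}

def sla_compliance(alerts: list) -> float:
    """Fraction of alerts triaged within their severity's SLA."""
    if not alerts:
        return 1.0  # vacuously compliant when there is nothing to triage
    met = sum(
        1 for a in alerts
        if a["triaged_at"] - a["created_at"] <= TRIAGE_SLA[a["severity"]]
    )
    return met / len(alerts)

t0 = datetime(2024, 1, 1, 9, 0)
alerts = [
    {"severity": "Critical", "created_at": t0, "triaged_at": t0 + timedelta(minutes=10)},
    {"severity": "High", "created_at": t0, "triaged_at": t0 + timedelta(hours=3)},
]
```

Running this monthly per severity tier shows whether misses are concentrated in one band (a volume problem) or spread evenly (a staffing problem).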
Threat Hunting
Threat hunting is proactive: analysts searching for threats that haven't triggered rules. It's different from alert triage:
- Alert triage: Reactive → respond to what the system flags
- Threat hunting: Proactive → search for evidence of techniques the system isn't detecting
Hunt workflow:
- Hypothesis: "Groups targeting our industry use Cobalt Strike with specific beacon intervals. Do we have unexplained beaconing traffic?"
- Data query: Search for hosts making repeated outbound connections at regular intervals to unfamiliar destinations
- Investigation: Examine each potential match: is this a known update service, monitoring agent, or something unexplained?
- Disposition: Benign → document and exclude. Suspicious → escalate. Confirmed → incident.
- Detection creation: If the hunt finds a real technique being used against you, create a rule to detect it automatically going forward.
Example hunt query (periodic beaconing detection):
// Detect hosts making periodic outbound connections (potential C2 beaconing)
// Low jitter between connections to an external IP suggests a timer, not a human.
// Note: prev() requires a serialized (sorted) row order and cannot be used
// inside summarize, so intervals are computed per row first, then aggregated.
NetworkFlow
| where TimeGenerated > ago(7d)
| where DestinationPort in (80, 443, 8080, 8443)
| where not(ipv4_is_private(DestinationIP)) // External only
| sort by SourceIP asc, DestinationIP asc, TimeGenerated asc
| extend PrevTime = prev(TimeGenerated), PrevSource = prev(SourceIP), PrevDest = prev(DestinationIP)
| where SourceIP == PrevSource and DestinationIP == PrevDest
| extend IntervalSeconds = datetime_diff("second", TimeGenerated, PrevTime)
| summarize ConnectionCount = count(), IntervalStdDev = stdev(IntervalSeconds)
    by SourceIP, DestinationIP, DestinationPort
| where ConnectionCount > 20 and IntervalStdDev < 300 // Regular beaconing (under 5 min of jitter)
| join kind=leftanti (
    // Exclude known-good destinations (CDN, monitoring, update services)
    ExternalWhitelist | where Type == "CDN" or Type == "UpdateService"
) on DestinationIP
SIEM Operations Metrics
Track these metrics to measure SIEM effectiveness:
| Metric | Target | Why It Matters |
|---|---|---|
| True Positive Rate (TPR) | > 10% of alerts | If < 5%, detection quality is poor |
| False Positive Rate (FPR) | < 50% | If > 80%, analysts disengage |
| Mean Time to Detect (MTTD) | < 1 hour for Critical | How fast does SIEM catch incidents? |
| Mean Time to Respond (MTTR) | < 4 hours for Critical | How fast does team act? |
| Alert volume per analyst | < 20/day per analyst | > 50/day causes burnout |
| Hunting hours per week | > 20% of analyst time | Proactive hunting finds what rules miss |
| Detection coverage (MITRE ATT&CK) | > 60% of TTPs | Are major technique families covered? |
Review these metrics monthly. Declining TPR or increasing alert volume per analyst are early warning signs.
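The first two rows of the table, plus alert volume per analyst, reduce to simple ratios over dispositioned alerts. A sketch with invented numbers, assuming dispositions are recorded as strings:

```python
def siem_metrics(dispositions: list, analysts: int, days: int) -> dict:
    """Compute TPR, FPR, and alert volume per analyst per day (illustrative)."""
    total = len(dispositions)
    tp = dispositions.count("TP")
    fp = dispositions.count("FP")
    return {
        "tpr": tp / total if total else 0.0,
        "fpr": fp / total if total else 0.0,
        "alerts_per_analyst_per_day": total / (analysts * days),
    }

# Hypothetical month: 4 analysts, 30 days, 1200 alerts:
# 120 TPs, 900 FPs, the rest inconclusive.
m = siem_metrics(
    ["TP"] * 120 + ["FP"] * 900 + ["inconclusive"] * 180,
    analysts=4,
    days=30,
)
```

This hypothetical SOC sits exactly at the 10% TPR target, within the 50-80% FPR warning band, and at a sustainable 10 alerts per analyst per day, so the monthly review would focus on FP tuning rather than staffing.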
SIEM Architecture for Scale
Log Routing and Tiering
Not all logs need to be searchable in real-time. Design a tiered architecture:
Hot tier (30-90 days): Real-time indexing in SIEM
  → Incident investigation and real-time detection
  → High cost per GB, fast query
Warm tier (90-365 days): Compressed, slower retrieval
  → Longer-window investigations, compliance queries
  → Lower cost, slower query
Cold tier (1-7 years): Archive storage (S3 Glacier, Azure Archive)
  → Regulatory retention, legal discovery
  → Very low cost, restore takes hours/days
Microsoft Sentinel's archive tier, Splunk SmartStore, and Elastic's frozen tier all implement this model.
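However your platform implements it, the routing decision itself is just an age threshold per tier. A trivial sketch with the boundaries from the tiers above (the function and tier names are illustrative, not a platform API):

```python
def storage_tier(age_days: int) -> str:
    """Pick a storage tier from log age, using the boundaries above (illustrative)."""
    if age_days <= 90:
        return "hot"    # real-time indexing, fast query, high cost per GB
    if age_days <= 365:
        return "warm"   # compressed, slower retrieval, lower cost
    return "cold"       # archive: regulatory retention, restore takes hours/days
```

In practice the boundaries should come from your retention policy and compliance obligations, not from platform defaults.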
Automation to Reduce Manual Load
SOAR (Security Orchestration, Automation, and Response) automates repetitive Tier 1 actions:
Automate fully (no analyst needed):
- Phishing email quarantine: EDR detects a malicious attachment → auto-quarantine the mailbox item
- Known-bad IP blocking: TI match → auto-block on firewall/WAF
- Automated password reset for accounts showing impossible travel
Automate with approval:
- Account disable: Suspicious account activity → analyst reviews → one-click disable
- Endpoint isolation: Suspicious malware activity → analyst reviews → one-click isolate
Keep manual:
- Incident declaration and escalation
- Customer notification decisions
- Evidence collection for legal
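These three automation tiers can be expressed as an explicit policy that every SOAR playbook consults before acting. A minimal sketch; the action names are invented for illustration:

```python
# Automation policy per action type (illustrative; action names invented).
FULL_AUTO = {
    "quarantine_phishing_email",
    "block_known_bad_ip",
    "reset_password_impossible_travel",
}
APPROVAL_REQUIRED = {"disable_account", "isolate_endpoint"}
MANUAL_ONLY = {"declare_incident", "notify_customer", "collect_legal_evidence"}

def execution_mode(action: str) -> str:
    """Return how a SOAR playbook should run this action."""
    if action in FULL_AUTO:
        return "auto"
    if action in APPROVAL_REQUIRED:
        return "await_analyst_approval"
    # Anything not explicitly allow-listed for automation stays manual:
    # defaulting to the safest mode prevents a new playbook action from
    # silently becoming fully automated.
    return "manual"
```

Keeping the policy in one place, rather than scattered across playbooks, makes the automation boundary auditable and easy to tighten after a bad auto-action.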
CyberneticsPlus helps organisations deploy, tune, and mature their SIEM programmes on Microsoft Sentinel and Splunk. Our SIEM implementation service and 24/7 security monitoring capabilities help you get real value from your security data investment. Contact us to improve your SIEM ROI.