Albert-Ludwigs-Universität Freiburg, Fakultät für Angewandte Wissenschaften, Dec 5, 2006
Alert classification to reduce false positives in intrusion detection
Tadeusz Pietraszek /ˈtʌ·deuʃ pɪe·ˈtrʌ·ʃek/ [email protected]
PhD Defense Presentation
PhD Defense, 13.08.07
Thesis Statement
Thesis at the intersection of machine learning and computer security.
1. Using machine learning, it is possible to train classifiers in the form of human-readable classification rules by observing the human analyst.
2. Abstaining classifiers can significantly reduce the number of misclassified alerts with an acceptable abstention rate and are useful in intrusion detection.
3. Combining supervised and unsupervised learning in a two-stage alert-processing system forms a robust framework for alert processing.
Outline
Background and problem statement.
1. Adaptive learning for alert classification.
2. Abstaining classifiers.
3. Combining supervised and unsupervised learning.
Summary and conclusions.
Intrusion Detection Background

Intrusion Detection Systems (IDSs) [And80, Den87] detect intrusions, i.e., sets of actions that attempt to compromise the integrity, confidentiality, or availability of a computer resource [HLMS90].

IDSs have to be effective (detect as many intrusions as possible) and keep false positives at an acceptable level; however, in real environments 95-99% of alerts are false positives [Axe99, Jul01, Jul03].
Eliminating false positives is a difficult problem:
– an intrusion may differ only slightly from normal actions (IDSs have limited context-processing capabilities),
– writing a good signature is a difficult task (specific vs. general),
– actions considered intrusive in one system may be normal in others,
– viewed as a statistical problem, it suffers from the base-rate fallacy.
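The base-rate fallacy can be made concrete with Bayes' rule: even an accurate detector produces mostly false alarms when intrusions are rare. A small illustration (the numbers are invented for the example, not taken from the thesis):

```python
# Base-rate fallacy: P(intrusion | alert) via Bayes' rule.
# Illustrative numbers: 99% detection rate, 1% false-alarm rate,
# but only 1 in 10,000 monitored events is an intrusion.
def p_intrusion_given_alert(tp_rate, fp_rate, base_rate):
    p_alert = tp_rate * base_rate + fp_rate * (1.0 - base_rate)
    return tp_rate * base_rate / p_alert

posterior = p_intrusion_given_alert(0.99, 0.01, 1e-4)
print(f"P(intrusion | alert) = {posterior:.3f}")  # ~0.010: ~99% of alerts are false positives
```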
Global picture – IDS monitoring
Manual knowledge acquisition is not used for classifying alerts.
– Fact 1: A large database of historical alerts exists.
– Fact 2: The analyst typically analyzes alerts in real time.
Problem statement
Given
– A sequence of alerts (A1, A2, …, Ai, …) in an alert log L
– A set of classes C = {C1, C2, …, Cn}
– An intrusion detection analyst O sequentially and in real-time assigning classes to alerts
– A utility function U describing the value of a classifier to the analyst O
Find
– A system classifying alerts that maximizes the utility function U, which accounts for:
• misclassified alerts,
• the analyst's workload,
• abstentions.
Outline
Background and problem statement.
1. Adaptive learning for alert classification.
2. Abstaining classifiers.
3. Combining supervised and unsupervised learning.
Summary and conclusions.
ALAC (Adaptive Learner for Alert Classification)
Automatically learn an alert classifier based on the analyst's feedback, using machine learning techniques.
[Architecture diagram, recommender mode: IDS alerts enter the alert classifier (built from background knowledge, rules, and parameters); the ID analyst reviews the classified alerts and gives feedback, which yields training examples for machine learning and model/rule updates. Recommender mode optimizes for misclassifications.]
ALAC (Adaptive Learner for Alert Classification)
[Architecture diagram, agent mode: IDS alerts enter the alert classifier; alerts classified with sufficient confidence are processed automatically, while the rest go to the ID analyst, whose feedback yields training examples for machine learning and rule updates. Agent mode optimizes for misclassifications and the analyst's workload.]
Why does learning work and why can it be difficult?
The approach hinges on two assumptions:
– analysts are able to classify most alerts correctly,
– it is possible to learn a classifier based on historical alerts.

A difficult learning problem; the learner must:
1. use the analyst's feedback (learning from training examples),
2. generate the rules in a human-readable form (so that correctness can be verified),
3. be efficient on large data files,
4. use background knowledge,
5. assess the confidence of classification,
6. work with skewed class distributions / misclassification costs,
7. adapt to environment changes.
Requirements - revisited
1. Core algorithm: RIPPER.
2. Rules in readable form.
3. Efficient on large datasets.
4. Background knowledge represented in attribute-value form.
5. Confidence: rule performance on testing data, with Laplace correction.
6. Cost sensitivity: weighted examples.
7. Incremental learning: a "batch-incremental" approach; the batch size depends on the current classification accuracy.
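Requirement 5 relies on the Laplace correction, which smooths a rule's empirical accuracy towards 1/2 so that rules covering few examples do not receive overconfident scores. A brief illustration of the standard two-class form (the function name is ours):

```python
def laplace_confidence(covered_correct, covered_total, n_classes=2):
    # Laplace-corrected accuracy of a rule: (p + 1) / (n + k),
    # where p = correctly covered, n = all covered, k = number of classes.
    return (covered_correct + 1) / (covered_total + n_classes)

print(laplace_confidence(9, 10))  # 10/12 ~ 0.833, not the raw 0.9
print(laplace_confidence(1, 1))   # 2/3  ~ 0.667, not the raw 1.0
```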
Results - Thesis Statement (1)
Adaptive Learner for Alert Classification (ALAC)
• Human feedback, background knowledge, ML techniques.
– Recommender mode (focusing on the misclassifications in the utility function U).
• Good performance: fn = 0.025, fp = 0.038 (DARPA); fn = 0.003, fp = 0.12 (Data Set B).
– Agent mode (focusing on the misclassifications and the workload in the utility function U).
• Similar number of misclassifications, and more than 66% of false positives are automatically discarded.
– Many rules are interpretable.
Outline
Background and problem statement.
1. Adaptive learning for alert classification.
2. Abstaining classifiers.
3. Combining supervised and unsupervised learning.
Summary and conclusions.
Metaclassifier A_α,β

An abstaining binary classifier A is a classifier that in certain cases can refrain from classification. We construct it from two binary classifiers C_α and C_β:

  A_α,β(x) = +  if C_α(x) = +
             ?  if C_α(x) = − and C_β(x) = +
             −  if C_β(x) = −

where C_α, C_β are such that:

  ∀x: (C_α(x) = + ⇒ C_β(x) = +) ∧ (C_β(x) = − ⇒ C_α(x) = −)

  C_α  C_β  Result
   +    +     +
   −    +     ?
   +    −     impossible
   −    −     −

(These conditions were used by Flach & Wu [FW05] in their work on repairing concavities of ROC curves; they are met in particular if C_α, C_β are constructed from a single scoring classifier R.)

Can we optimally select C_α, C_β?
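If C_α and C_β come from a single scoring classifier R, the metaclassifier reduces to two thresholds on R's score. A minimal sketch (function and threshold names are ours; C_α is the stricter classifier, so t_alpha ≥ t_beta):

```python
def abstain_classify(score, t_alpha, t_beta):
    """Tri-state decision A_{alpha,beta} built from one scoring classifier.

    t_alpha >= t_beta guarantees C_alpha(x)=+ implies C_beta(x)=+,
    so the 'impossible' row of the decision table never occurs.
    """
    assert t_alpha >= t_beta
    if score >= t_alpha:   # both classifiers say +
        return "+"
    if score >= t_beta:    # C_alpha says -, C_beta says +: abstain
        return "?"
    return "-"             # both classifiers say -

print([abstain_classify(s, 0.8, 0.3) for s in (0.9, 0.5, 0.1)])  # ['+', '?', '-']
```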
“Optimal” Metaclassifier Aα,β
How do we compare binary classifiers and abstaining classifiers? How do we select an optimal one?

There is no single answer:
– use a cost-based model (an extension of [Tor04]),
– use boundary conditions:
• a maximum fraction of instances classified as "?" (bounded-abstention model),
• a maximum misclassification cost (bounded-improvement model).
Cost-based model – a simulated example

[Figure: ROC curve with two optimal classifiers A and B (TP vs. FP), and surface plots of the misclassification cost for different combinations of A and B over FP(a) × FP(b).]

The optimal pair satisfies:

  f'_ROC(fp_α) = ((c21 − c23) / c13) · (N/P)
  f'_ROC(fp_β) = (c23 / (c12 − c13)) · (N/P)
Bounded models
Problem: a 2×3 cost matrix is not always given and would have to be estimated; moreover, the classifier is very sensitive to c13 and c23.

Find other optimization criteria for an abstaining classifier using a standard 2×2 cost matrix:
– calculate the misclassification cost per classified instance,
– follow the same reasoning to find the optimal classifier.
PhD Defense18 13.08.07
Bounded models equation
Obtained the following equation, determining the relationship between k (the abstention rate) and rc (the misclassification cost per classified instance) as a function of the classifiers C_α, C_β:

  rc = (FP_α·c21 + FN_β·c12) / ((N + P)·(1 − k))
  k  = ((FP_β − FP_α) + (FN_α − FN_β)) / (N + P)

– Constrain k, minimize rc → bounded-abstention model.
– Constrain rc, minimize k → bounded-improvement model.

There is no algebraic solution; however, for a convex ROCCH we can give an efficient algorithm.
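These two quantities can be computed directly from the confusion counts of C_α and C_β. A sketch under the conventions above (variable names are ours):

```python
def abstention_rate_and_cost(FP_a, FN_a, FP_b, FN_b, N, P, c12, c21):
    # k: fraction of all instances on which C_alpha and C_beta disagree
    # (abstained negatives FP_b - FP_a, abstained positives FN_a - FN_b).
    k = ((FP_b - FP_a) + (FN_a - FN_b)) / (N + P)
    # rc: misclassification cost per *classified* instance
    # (false positives of C_alpha, false negatives of C_beta).
    rc = (FP_a * c21 + FN_b * c12) / ((N + P) * (1.0 - k))
    return k, rc

k, rc = abstention_rate_and_cost(FP_a=10, FN_a=30, FP_b=40, FN_b=5,
                                 N=500, P=100, c12=50, c21=1)
print(k, rc)  # k = 55/600, rc = 260/545
```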
Bounded-abstention model

Among classifiers abstaining on no more than a fraction k_MAX of instances, find the one that minimizes rc.

A useful application is real-time processing, where the non-classified instances will be processed by another classifier with a limited processing speed.

Algorithm: three-step derivation.
– Step 1: show an (impractical) solution for a smooth ROCCH and the equality k = k_MAX.
– Step 2: extend to the inequality k ≤ k_MAX.
– Step 3: derive an algorithm for a piecewise-linear ROCCH.
Bounded-abstention model – Step 1 and 2
Using the Lagrange method (constrained optimization under equality conditions), ∇rc × ∇k = 0, we obtain

  N·c21·(N + P·f'_ROC(fp_β)) = P·c12·f'_ROC(fp_β)·(N + P·f'_ROC(fp_α))

Starting from a known optimal classifier for a given k, we can construct an optimal-classifier path for k + δk.
– Known starting points are the optimal binary classifier and the all-abstaining classifier.
– Such a solution is impractical.

We can show that, except for a very special boundary case, this classifier is also optimal for k ≤ k_MAX.
Bounded-abstention model – Step 3
The ROCCH consists of line segments connecting points P_i, P_i+1, with coefficients A_i and B_i such that tp = A_i·fp + B_i.

Using reasoning similar to Step 1, we obtain that:
– either C_α or C_β is located at a vertex P_i or P_j,
– which case holds depends on the sign of

  X = N·c21·(N + P·A_j) − P·c12·A_j·(N + P·A_i)

[Diagram: Case 1 (X > 0) with vertices P_i, P_i+1, P_j; Case 2 (X < 0) with vertices P_i, P_j−1, P_j.]

This yields an O(n) algorithm for finding the optimal classifier.
Bounded-abstention model – a simulated example
[Figure: surface plots of the misclassification cost over FP(a) × FP(b), with the optimal classifier path for the bounded-abstention model.]
How can we use it in ALAC? ALAC+

The ALAC architecture naturally fits a tri-state classifier.

[Architecture diagram: IDS alerts enter the tri-state alert classifier; alerts with an assigned class (+/−) are processed automatically, while abstained (?) alerts go to the ID analyst, whose feedback yields training examples for learning the tri-state classifier.]
Results - Thesis Statement (2)

Applied abstaining classifiers to alert classification (ALAC+).
– Recommender mode:
• DARPA: comparable fn, significantly lower fp (up to 97%), cost reduction by 15-20%.
• Data Set B: lowered fn (76%) and fp (97%), cost reduction by 87%.
– Agent mode:
• DARPA: comparable fn, much lower fp, comparable cost.
• Data Set B: lowered fn (60%) and fp (96%), cost reduction by 72%.
– ALAC+ reduced the overall number of misclassifications (in particular fp) and, in most cases, misclassification costs.
– Higher precision is better for human analysts [Axe99].
Outline
Background and problem statement.
1. Adaptive learning for alert classification.
2. Abstaining classifiers.
3. Combining supervised and unsupervised learning.
Summary, conclusions and contributions.
Clustering (CLARAty)

Julisch [Jul03] observed that a great number of alerts can be attributed to a small number of root causes, which are persistent over time.
– Julisch used a modified AOI (attribute-oriented induction) [Jul03] to generate human-readable cluster descriptions.
– Root causes can thus be identified and removed.

Inputs:
– alerts,
– generalization hierarchies (mostly for IP addresses).
Outputs:
– clusters (in the form of generalized alerts).
Two-stage alert classification system
[System diagram, two stages. Stage 1 – alert clustering: IDS alerts are clustered; the analyst interprets the clusters and finds root causes, leading to IDS filtering rules, modified signatures, or fixes to network and configuration problems in the environment. Stage 2 – adaptive alert classification: the remaining alerts pass through an alert filter to the alert classifier (ALAC in agent mode, with background knowledge, rules, and parameters); confident classifications are processed automatically, and the rest go to the ID analyst, who investigates intrusions and provides feedback for rule updates.]
CLARAty is used for filtering and labeling alerts:
– filtering mode (FI),
– feature-construction mode (FC).
Alerts are subsequently passed on to ALAC.
Results - Thesis Statement (3)
Proposed a two-stage alert-classification system based on CLARAty [Jul03]:
• using clusters for retrospective alert analysis,
• an automated cluster-processing system,
• a two-stage alert-processing system.

Feature construction (FC) does not yield big improvements. Filtering (FI) performs better (in terms of FN) and comparably (in terms of FP), most likely because the "easy" alerts have already been removed. Thanks to the first stage, the number of alerts to be processed in the second stage (the analyst's workload) is reduced by 63%.
Outline
Background and problem statement.
1. Adaptive learning for alert classification.
2. Abstaining classifiers.
3. Combining supervised and unsupervised learning.
Summary and conclusions.
Conclusions
Evolution of IDSs:
– Level 1: improving IDSs themselves,
– Level 2: leveraging the environment,
– Level 3: alert postprocessing,
– Level 4: analyst's involvement.

Used ML techniques for IDS alert classification and verified the three-part thesis statement. The system works, but inherently there is a risk that some attacks might be missed. A step towards a more efficient and reliable alert-management system.
Thank you!
Future Work
Combining with existing multi-stage alert correlation systems.
Other learning algorithms: SVM, Bayesian, predictive clustering rules?
Multi-class classification.
Link mining.
Dynamic ROC evaluation in incremental settings.
HCI issues.
Can Machine Learning be secure? [NKS06], [BNSJ+06]
ML does not deal with active attackers [CB06]:
– "The mutagenesis dataset never tried to evade your classifier."

All automated classification systems bear a certain risk (it is a matter of trade-offs!):
– an attacker may try to hide their activities among background alerts, hoping to evade detection,
– BUT they do that anyway, because such attacks already have a lower chance of being caught!
• By removing irrelevant alerts, the system can highlight the important ones, but there is no guarantee.
• It is also possible that this effect is amplified by ALAC.
Can Machine Learning be secure? [NKS06], [BNSJ+06]

Good news:
– ALAC does not provide immediate feedback.
– The interaction with background knowledge is complex.
– There are only so many attacks the attacker can try.
– Attack attempts might be treated as noise.

Bad news:
– There is no guarantee.
– Once such systems are common, they may turn into an "arms race" (cf. spam). But for this to happen, IDSs would have to be much better than they are now. Let's see how spam filters and automated signature generation deal with it first ;-)
Publication List

Tadeusz Pietraszek. On the use of ROC analysis for the optimization of abstaining classifiers. Machine Learning Journal, (accepted with minor revisions, to appear), 2007.
Tadeusz Pietraszek. Classification of intrusion detection alerts using abstaining classifiers. Intelligent Data Analysis Journal, 11(3), (to appear), 2007.
Tadeusz Pietraszek and Axel Tanner. Data Mining and Machine Learning: Towards Reducing False Positives in Intrusion Detection. Information Security Technical Report Journal, 10(3):169-183, 2005.
Tadeusz Pietraszek and Chris Vanden Berghe. Defending against Injection Attacks through Context-Sensitive String Evaluation. In Recent Advances in Intrusion Detection (RAID 2005), volume 3858 of Lecture Notes in Computer Science, pages 124-145, Seattle, WA, 2005.
Tadeusz Pietraszek. Optimizing Abstaining Classifiers using ROC Analysis. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), pages 665-672, Bonn, Germany, 2005.
Tadeusz Pietraszek. Using Adaptive Alert Classification to Reduce False Positives in Intrusion Detection. In Recent Advances in Intrusion Detection (RAID 2004), volume 3324 of Lecture Notes in Computer Science, pages 102-124, Sophia Antipolis, France, 2004.
References (1)

[And80] James P. Anderson. Computer Security Threat Monitoring and Surveillance. Technical report, James P. Anderson Co., 1980.
[Axe05] Stefan Axelsson. Understanding Intrusion Detection Through Visualization. PhD thesis, Chalmers University of Technology, 2005.
[Axe99] Stefan Axelsson. The base-rate fallacy and its implications for the difficulty of intrusion detection. In Proceedings of the 6th ACM Conference on Computer and Communications Security, pages 1-7, Kent Ridge Digital Labs, Singapore, 1999.
[BNSJ+06] Marco Barreno, Blaine Nelson, Russell Sears, Anthony D. Joseph, and J. D. Tygar. Can machine learning be secure? In Proceedings of the 2006 ACM Symposium on Information, Computer and Communications Security, pages 16-25, Taipei, Taiwan, 2006.
[Bugtraq03] SecurityFocus. BugTraq. Web page at http://www.securityfocus.com/bid, 1998-2004.
[CAMB02] Frédéric Cuppens, Fabien Autrel, Alexandre Miège, and Salem Benferhat. Correlation in an intrusion detection process. In Proceedings of Sécurité des Communications sur Internet (SECI02), pages 153-171, 2002.
[CB06] Alvaro A. Cárdenas and John S. Baras. Evaluation of classifiers and learning rules: Considerations for security applications. In Proceedings of the AAAI-06 Workshop on Evaluation Methods for Machine Learning, Boston, MA, 2006.
[DC01] Olivier Dain and Robert K. Cunningham. Fusing a heterogeneous alert stream into scenarios. In Proceedings of the 2001 ACM Workshop on Data Mining for Security Applications, pages 1-13, Philadelphia, PA, 2001.
[Den87] Dorothy E. Denning. An intrusion-detection model. IEEE Transactions on Software Engineering, SE-13(2):222-232, 1987.
References (2)

[Der03] Renaud Deraison. The Nessus Project. Web page at http://www.nessus.org, 2000-2003.
[DW01] Hervé Debar and Andreas Wespi. Aggregation and correlation of intrusion-detection alerts. In Recent Advances in Intrusion Detection (RAID 2001), volume 2212 of Lecture Notes in Computer Science, pages 85-103. Springer-Verlag, 2001.
[FW05] P. A. Flach and S. Wu. Repairing concavities in ROC curves. In Proceedings of the 2003 UK Workshop on Computational Intelligence, pages 38-44, Bristol, UK, 2003.
[HLMS90] Richard Heady, George Luger, Arthur Maccabe, and Mark Servilla. The architecture of a network level intrusion detection system. Technical report, University of New Mexico, 1990.
[How97] John D. Howard. An Analysis of Security Incidents on the Internet 1989-1995. PhD thesis, Carnegie Mellon University, 1997.
[IBM03] IBM. IBM Tivoli Risk Manager: Tivoli Risk Manager User's Guide, Version 4.1, 2002.
[Jul01] Klaus Julisch. Mining alarm clusters to improve alarm handling efficiency. In Proceedings of the 17th Annual Computer Security Applications Conference, pages 12-21, New Orleans, LA, December 2001.
[Jul03a] Klaus Julisch. Clustering intrusion detection alarms to support root cause analysis. ACM Transactions on Information and System Security (TISSEC), 6(4):443-471, 2003.
[Jul03b] Klaus Julisch. Using Root Cause Analysis to Handle Intrusion Detection Alarms. PhD thesis, University of Dortmund, Germany, 2003.
[Krs98] Ivan Victor Krsul. Software Vulnerability Analysis. PhD thesis, Purdue University, 1998.
References (3)

[LBMC94] Carl E. Landwehr, Alan R. Bull, John P. McDermott, and William S. Choi. A taxonomy of computer program security flaws. ACM Computing Surveys (CSUR), 26(3):211-254, 1994.
[LWS02] Richard Lippmann, Seth Webster, and Douglas Stetson. The effect of identifying vulnerabilities and patching software on the utility of network intrusion detection. In Recent Advances in Intrusion Detection (RAID 2002), volume 2516 of Lecture Notes in Computer Science, pages 307-326. Springer-Verlag, 2002.
[MC03] Matthew V. Mahoney and Philip K. Chan. An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection. In Recent Advances in Intrusion Detection (RAID 2003), volume 2820 of Lecture Notes in Computer Science, pages 220-237. Springer-Verlag, 2003.
[McH00] John McHugh. The 1998 Lincoln Laboratory IDS evaluation: A critique. In Recent Advances in Intrusion Detection (RAID 2000), volume 1907 of Lecture Notes in Computer Science, pages 145-161. Springer-Verlag, 2000.
[MIT03] MITRE. Common Vulnerabilities and Exposures. Web page at http://cve.mitre.org, 1999-2004.
[MHL94] Biswanath Mukherjee, Todd L. Heberlein, and Karl N. Levitt. Network intrusion detection. IEEE Network, 8(3):26-41, 1994.
[NKS06] James Newsome, Brad Karp, and Dawn Song. Paragraph: Thwarting signature learning by training maliciously. In Recent Advances in Intrusion Detection (RAID 2006), Hamburg, Germany, 2006.
[PB88] Mark Paradies and David Busch. Root cause analysis at Savannah River Plant. In Proceedings of the IEEE Conference on Human Factors and Power Plants, 1988.
[PF98] Foster Provost and Tom Fawcett. Robust classification systems for imprecise environments. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pages 706-713, Madison, WI, 1998.
References (4)

[PV05] Tadeusz Pietraszek and Chris Vanden Berghe. Defending against injection attacks through context-sensitive string evaluation. In Recent Advances in Intrusion Detection (RAID 2005), volume 3858 of Lecture Notes in Computer Science, pages 124-145, Seattle, WA, 2005. Springer-Verlag.
[RZD05] James Riordan, Diego Zamboni, and Yann Duponchel. Billy Goat, an accurate worm-detection system (revised version) (RZ 3609). Technical report, IBM Zurich Research Laboratory, 2005.
[SP01] Umesh Shankar and Vern Paxson. Active mapping: Resisting NIDS evasion without altering traffic. In Proceedings of the 2003 IEEE Symposium on Security and Privacy, pages 44-62, Oakland, CA, 2003.
[SP03] Robin Sommer and Vern Paxson. Enhancing byte-level network intrusion detection signatures with context. In Proceedings of the 10th ACM Conference on Computer and Communications Security, pages 262-271, Washington, DC, 2003.
[VS01] Alfonso Valdes and Keith Skinner. Probabilistic alert correlation. In Recent Advances in Intrusion Detection (RAID 2001), volume 2212 of Lecture Notes in Computer Science, pages 54-68. Springer-Verlag, 2001.
[VVCK04] F. Valeur, G. Vigna, C. Kruegel, and R. Kemmerer. A comprehensive approach to intrusion detection alert correlation. IEEE Transactions on Dependable and Secure Computing, 1(3):146-169, 2004.
[WF00] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA, 2000.
Security – supporting slides
Computer Security
Confidentiality – prevention of (un)intentional unauthorized disclosure of data.

Integrity – prevention of (un)intentional unauthorized modification of data.

Availability – prevention of unauthorized withholding of computing resources.

Intrusion – any set of actions that attempt to compromise the confidentiality, integrity, or availability of a computing resource.
Intrusion detection
Traditional approach to security: build a "protective shield" around systems [MHL94].
– Trade-off: security vs. usability.
– Open systems are more productive.
– "Secure systems" are vulnerable to attacks exploiting internal errors (e.g., buffer overflows, injection attacks, race conditions).
– Systems are vulnerable to insider attacks (intentional or unintentional).

Intrusion detection [Den87]: retrofit systems with security by detecting attacks and alerting a Site Security Officer (SSO).
Intrusion Detection Systems (IDS)
An Intrusion Detection System is an automated system detecting and alarming on any situation in which an intrusion has taken or is about to take place [Axe05].

[Architecture diagram: the monitored system feeds audit collection and audit storage; a processing component, driven by configuration data and consulting reference data and active/processing data, raises alerts to the SSO/ID analyst, leading to automated or manual responses.]
Intrusion Detection Systems: anomaly vs. misuse
Anomaly-based model.

Misuse-based model – detects only known attacks. Why not just prevent them if they are known?
– window of vulnerability,
– detecting failed attacks,
– detecting policy violations,
– additional layer of protection,
– generalized and intent-guessing signatures.
Snort – an open source IDS
Snort – signature examples
Striving to reduce false positives

Level 1: Improving IDSs themselves
– More sophisticated protocol analyzers, state keeping [Roe05, Pax99]
– Highly specialized IDSs: Billy Goat [RZD05], CSSE [PV05]

Level 2: Leveraging the environment
– Active mapping [SP01], context signatures [SP03]
– Vulnerability correlation [LWS02, VVCK04]

Level 3: Alert postprocessing
– Data mining [Jul03b]
– Alert-correlation systems [CAMB02, DW01, VS01, VVCK04]

Level 4: Analyst's involvement
– The idea pursued in this thesis, mostly orthogonal to the other approaches.

[Diagram: alerts flow from the IDS (Levels 1-2) through postprocessing (Level 3) to the ID analyst (Level 4).]
Binary vs. multi-class classification
Analysts analyze:
– the root cause of alerts [PB88],
– the impact on the environment,
– the actions that need to be taken.

Taxonomizing root causes is a difficult task [How97, Jul03b, Krs98, LBMC94].

Ad-hoc classifications exist, for example:
– intentional/malicious (e.g., scanning, unauthorized access, privilege escalation, policy violation, DoS attacks),
– inadvertent/non-malicious (e.g., network misconfiguration, normal activities).

The main distinction for the analyst is: "Is the alert actionable or not?"
– Determined by the combination of the root cause and the impact on the environment.
– For our purposes we assume that this is equivalent to our two classes: true positives and false positives.
ALAC – supporting slides
Evaluation problem – two datasets [Pie04, PT05]

DARPA1999 Data Set:
– used network traces, run through the Snort IDS,
– used attack truth tables to label the alerts.

Data Set B:
– real network traces collected in a mid-sized corporate network,
– used the Snort IDS to generate alerts,
– manually labeled (bias!).
Evaluation Problem – new dataset
MSSD datasets:
– real datasets from MSSD, with different commercial NIDSs; some companies have more than one,
– looked at some 20 companies over a time period of 6 months,
– some alerts belong to incidents, labeled by security analysts.
Evaluation problem

Lack of publicly available data sources for the evaluation of IDSs:
– no common reference for evaluation,
– everybody can install an IDS in their own network:
• yes, but this data often cannot be shared (sensitive information),
• and it has no labels,
– honeypot data [PDP05]:
• all data is by definition suspicious,
• more useful for detecting automated attacks than real attackers.

DARPA1998 and DARPA1999 efforts:
– MIT Lincoln Labs simulated environment,
– many flaws have been identified [McH01, MC03],
– still used in many papers (e.g., UCI dataset and KDD Cup 1999).

Recent effort presented at ETRICS06 (Qian et al.).

Proprietary:
– Data Set B: undisclosed customer, collected with Snort, classified by the author,
– MSSD data sets: data from IBM's SOC, implicitly classified by real security analysts.
Evaluation Problem – Summary
These datasets are quite different!
– DARPA1999 Data Set & Data Set B:
• on average 1472 alerts per company per day, of which 359 are true positives (24%).
– MSSD dataset:
• on average 3250 alerts per company per day, of which 11 are true positives (0.34%),
• most alerts are clustered in incidents, on average 1 incident every 9 days.
– Moreover, we are not sure all the labels are correct:
• some incidents could have been missed,
• some incidents may have turned out to be false positives.

We should probably handle them differently.
Background Knowledge
Network topology:
– classification of IP addresses,
– create rules using generalized concepts.

Installed software.

Alert semantics:
– How do we understand the attack? CVE [MIT03], BugTraq [Bugtraq03].
– Was the attack successful? IDD [IBM03], Nessus [Der03].

Alert context, i.e., alerts related to the current one (correlation in intrusion detection, e.g., [DW01, DC01, VS01]):
– set or sequence of alerts related to the current one,
– additional features (aggregates, alert summaries, alert categories) expressing domain knowledge in intrusion detection.
Background Knowledge
Alerts with their classification are written to a relational database.

Scripts generate background knowledge in attribute-value (A-V) form:
– IP-address classification,
– OS classification,
– aggregates (all in three different time windows: 1 min, 5 min, 30 min):
• number of alerts coming from the same IP addresses (src, dst, src dst, src dst),
• number of alerts of the same type,
• number of alerts with similar classification.
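Such window aggregates amount to counting recent alerts that share an attribute value. A simplified sketch (the data layout and field names are ours):

```python
def count_in_window(alerts, now, window, key, value):
    """Number of alerts within the last `window` seconds whose `key` equals `value`."""
    return sum(1 for t, a in alerts
               if now - window <= t <= now and a.get(key) == value)

# Alerts as (timestamp_seconds, attributes) pairs, time-ordered.
log = [(0, {"src": "10.0.0.1", "sign": "NMAP"}),
       (30, {"src": "10.0.0.1", "sign": "NMAP"}),
       (400, {"src": "10.0.0.2", "sign": "NMAP"})]

# Feature for an alert at t=400: same-signature alerts in the last 5 minutes.
print(count_in_window(log[:2], 400, 300, "sign", "NMAP"))  # 0 (both earlier alerts are too old)
```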
Misclassification, Statistics (DARPA1999)
[Figures: false-negative rate (fn) and false-positive rate (fp) vs. number of alerts processed by the system, for agent mode with sampling 0.1/0.25/0.5, recommender mode, and batch classification.]
Misclassification, Statistics (Data Set B)
[Figures: false-negative rate (fn) and false-positive rate (fp) vs. number of alerts processed by the system, for agent mode with sampling 0.1/0.25/0.5, recommender mode, and batch learning.]
Automatic Processing
[Figures: discarded false-positive rate vs. number of alerts processed by the system, for the DARPA1999 Data Set (left) and Data Set B (right), agent mode with sampling 0.1/0.25/0.5.]
Understanding the Rules
Rules are quite understandable. They use attributes generated by the background knowledge.

(cnt_intr_w1 <= 0) and (cnt_sign_w3 >= 1) and (cnt_sign_w1 >= 1) and (cnt_dstIP_w1 >= 1) => class=FALSE
(cnt_srcIP_w3 <= 6) and (cnt_int_w2 <= 0) and (cnt_ip_w2 >= 2) and (sign = ICMP PING NMAP) => class=FALSE

"If there have been similar alerts recently and they are all false alarms (no intrusions), then the current alert is a false alert."
"If the number of NMAP pings is small and there are no intrusions, the alert is a false alert."
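Because the learned rules are plain conjunctions of attribute tests, each one can be transcribed directly into code and inspected. A sketch of the first rule above (attribute names from the slide; the dictionary layout is ours):

```python
def rule_false_alert(a):
    # (cnt_intr_w1 <= 0) and (cnt_sign_w3 >= 1) and
    # (cnt_sign_w1 >= 1) and (cnt_dstIP_w1 >= 1) => class=FALSE
    return (a["cnt_intr_w1"] <= 0 and a["cnt_sign_w3"] >= 1
            and a["cnt_sign_w1"] >= 1 and a["cnt_dstIP_w1"] >= 1)

alert = {"cnt_intr_w1": 0, "cnt_sign_w3": 3, "cnt_sign_w1": 1, "cnt_dstIP_w1": 2}
print(rule_false_alert(alert))  # True: classified as a false alert
```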
Experiments - Setting ALAC Parameters
Using the ROC curve, one can choose the optimal classifier.
– Requires knowing the target class distributions and misclassification costs.
• We did not have such data, so we selected the value ad hoc: CR = 50.

Classification accuracy:
– when to retrain the model,
• selected a value based on performance in ROC space.

Automatic processing – confidence:
– currently ad hoc; looking for something better (more on this later!).
ML – supporting slides
Evaluating classifiers – confusion and cost matrices
Cost matrix (A = actual, C = classified as):

  A \ C    +     −
    +      0     c12
    −      c21   0

Confusion matrix:

  A \ C    +     −
    +      TP    FN    P
    −      FP    TN    N

  tp = TP / (TP + FN)    fn = FN / (TP + FN)
  fp = FP / (FP + TN)    tn = TN / (FP + TN)

Cost ratio: CR = c21 / c12.
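The rates follow directly from the confusion-matrix entries; a small sketch:

```python
def rates(TP, FN, FP, TN):
    P, N = TP + FN, FP + TN  # actual positives / actual negatives
    return {"tp": TP / P, "fn": FN / P, "fp": FP / N, "tn": TN / N}

print(rates(TP=80, FN=20, FP=50, TN=450))
# {'tp': 0.8, 'fn': 0.2, 'fp': 0.1, 'tn': 0.9}
```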
ROC Background
ROC (Receiver Operating Characteristic) analysis is used for model evaluation and model selection for binary classifiers.
– Multiple-class extensions are not used in practice.

It allows evaluating model performance under all class and cost distributions:
– a 2D plot of fp × tp (X axis: false-positive rate; Y axis: true-positive rate),
– one point corresponds to one classifier.
ROC Background
A classifier C produces a single point (fp, tp) in ROC space.

A classifier C_τ (or a machine-learning method L_τ) has a parameter τ; varying it produces multiple points.

Therefore we consider a ROC curve to be a function f : τ ↦ (fp_τ, tp_τ).

We can find an inverse function f⁻¹ : (fp_τ, tp_τ) ↦ τ and approximate it with f̂⁻¹.
ROC Background
ROC Convex Hull (ROCCH):
– a piecewise-linear, convex-down curve f_R with the following properties:
• f_R(0) = 0, f_R(1) = 1,
• the slope of f_R is monotonically non-increasing.
– Assume that for any value m there exists a point where f_R has slope m [PF98]:
• vertices have "slopes" assuming values between the slopes of the adjacent edges,
• assume sentinel edges: a 0th edge with slope ∞ and an (n+1)th edge with slope 0.
– We will use the ROCCH instead of the ROC curve.
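The ROCCH of a finite set of classifiers can be computed with a standard convex-hull scan over the ROC points, keeping only the upper hull from (0,0) to (1,1). A sketch (our own implementation, not from the thesis):

```python
def rocch(points):
    """Upper convex hull of ROC points, from (0,0) to (1,1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last kept point while it lies on or below the chord
        # from its predecessor to the new point (cross product >= 0).
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

# (0.3, 0.6) is dominated by the hull and gets removed.
print(rocch([(0.1, 0.5), (0.3, 0.6), (0.4, 0.9)]))
# [(0.0, 0.0), (0.1, 0.5), (0.4, 0.9), (1.0, 1.0)]
```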
Abstaining Classifiers – supporting slides
Selecting the Optimal Classifier
Criteria – minimize the misclassification cost
rc = 1/(N+P) · (FP·c21 + FN·c12)
   = 1/(N+P) · (FP·c21 + (P − TP)·c12)
   = 1/(N+P) · (N·fp·c21 + P·(1 − fROC(fp))·c12)        [tp = fROC(fp), TP = P·tp, FP = N·fp]

d(rc)/d(fp) = 1/(N+P) · (N·c21 − P·c12·f′ROC(fp)) = 0
Cost Minimizing Criteria for One Classifier

f′ROC(fp) = CR · N/P        (CR = c21/c12)

These are the known iso-performance lines [PF98].
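Together with the sentinel slopes, this criterion says the cost-optimal operating point is the ROCCH vertex whose adjacent edge slopes bracket m = CR·N/P; a sketch (the hull and the parameter values are illustrative):

```python
def optimal_vertex(hull, CR, N, P):
    """ROCCH vertex whose adjacent edge slopes bracket m = CR * N / P."""
    def slope(p, q):
        return float('inf') if q[0] == p[0] else (q[1] - p[1]) / (q[0] - p[0])
    m = CR * N / P
    # sentinel edges: slope inf before the first vertex, slope 0 after the last
    slopes = [float('inf')] + [slope(p, q) for p, q in zip(hull, hull[1:])] + [0.0]
    for i, v in enumerate(hull):
        if slopes[i] >= m >= slopes[i + 1]:
            return v
    return hull[-1]

hull = [(0.0, 0.0), (0.1, 0.5), (0.4, 0.9), (1.0, 1.0)]
best = optimal_vertex(hull, CR=1.0, N=100, P=100)   # m = 1 falls between slopes 4/3 and 1/6
```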
Cost-based model - selecting the optimal classifier
Similar criteria – minimize the cost
rc = 1/(N+P) · [ FPα·c21 + FNβ·c12                    (misclassification)
               + (FPβ − FPα)·c23 + (FNα − FNβ)·c13 ]  (disagreement)

Setting ∂rc/∂fpα = 0 ∧ ∂rc/∂fpβ = 0 gives:

f′ROC(fpα) = ((c21 − c23)/c13) · (N/P)
f′ROC(fpβ) = (c23/(c12 − c13)) · (N/P)
Depends only on the slopes of ROC (similar to a single-classifier case)
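Since only slopes matter, the two component classifiers Cα and Cβ can be read off the ROCCH by the same slope-bracketing idea as in the single-classifier case; a sketch (`vertex_for_slope`, the hull, and the example costs are illustrative, not from the thesis):

```python
def vertex_for_slope(hull, m):
    """ROCCH vertex whose adjacent edge slopes bracket m (sentinel slopes inf and 0)."""
    def slope(p, q):
        return float('inf') if q[0] == p[0] else (q[1] - p[1]) / (q[0] - p[0])
    slopes = [float('inf')] + [slope(p, q) for p, q in zip(hull, hull[1:])] + [0.0]
    for i, v in enumerate(hull):
        if slopes[i] >= m >= slopes[i + 1]:
            return v
    return hull[-1]

def abstaining_pair(hull, c12, c21, c13, c23, N, P):
    """Operating points for C_alpha and C_beta from the two slope criteria."""
    m_alpha = (c21 - c23) * N / (c13 * P)
    m_beta = c23 * N / ((c12 - c13) * P)
    return vertex_for_slope(hull, m_alpha), vertex_for_slope(hull, m_beta)

hull = [(0.0, 0.0), (0.1, 0.5), (0.4, 0.9), (1.0, 1.0)]
a, b = abstaining_pair(hull, c12=1, c21=1, c13=0.25, c23=0.25, N=100, P=100)
# m_alpha = 3, m_beta = 1/3, so C_alpha sits left of C_beta (fp_alpha <= fp_beta)
```

Note that m_alpha ≥ m_beta (and hence fpα ≤ fpβ) reduces to c12·c21 ≥ c13·c21 + c23·c12, which is the condition appearing on the next slides.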
Cost-based model - understanding cost matrices

The 2x2 cost matrix is well known. The 2x3 cost matrix has some interesting properties, e.g., under which conditions the optimal classifier is an abstaining classifier.

Our derivation is valid for

(c21 ≥ c23) ∧ (c12 > c13) ∧ (c12·c21 ≥ c13·c21 + c23·c12)        (*)

We can prove that if this condition is not met, the optimal classifier is a trivial binary classifier.
Cost-based model - understanding cost matrices

Theorem. If (*) is not met, the optimal classifier is a trivial binary classifier, where
(*): (c21 ≥ c23) ∧ (c12 > c13) ∧ (c12·c21 ≥ c13·c21 + c23·c12)

Proof (sketch)
– Show that for an optimal classifier f′R(fp*α) ≥ f′R(fp*) ≥ f′R(fp*β), where fp* corresponds to the optimal binary classifier.
– Show that if (*) is not met, ∂rc/∂fpα is positive for fp*α < fp*, and ∂rc/∂fpβ is positive for fp*β > fp*.
– Therefore fp*α = fp* = fp*β.
Cost-based model - interesting cases

How to set c13, c23 so that the classifier is a non-trivial abstaining classifier?

Two interesting cases:
– Symmetric case (c13 = c23):  c13 = c23 ≤ (c12·c21)/(c12 + c21)
– Proportional case (c13/c12 = c23/c21):  c13 ≤ c12/2 ⇔ c23 ≤ c21/2
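The condition (*) and the two special cases are easy to check numerically; a small sketch (all cost values are illustrative):

```python
def nontrivial_abstaining(c12, c21, c13, c23):
    """Condition (*): when the optimal classifier can be a non-trivial abstaining one."""
    return c21 >= c23 and c12 > c13 and c12 * c21 >= c13 * c21 + c23 * c12

# symmetric case: c13 = c23 <= c12*c21 / (c12 + c21)
ok_symmetric = nontrivial_abstaining(1, 1, 0.25, 0.25)      # 0.25 <= 0.5 -> True
bad_symmetric = nontrivial_abstaining(1, 1, 0.6, 0.6)       # 0.6 > 0.5 -> False (trivial)
# proportional case: c13/c12 = c23/c21 <= 1/2
ok_proportional = nontrivial_abstaining(50, 1, 20, 0.4)     # ratio 0.4 <= 0.5 -> True
```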
Bounded-abstention model – Algorithm
Experiments
Tested with 15 UCI KDD datasets, using averaged cross-validation.
In each model we used one independent parameter: c13 = c23, k, or f.
Classifier – Bayesian classifier from Weka [WF00].
In many cases we obtained a large cost reduction even with a small abstention rate (k = 0.1).
Now applying it to alert classification.
Results with abstaining classifiers
CLARAty – supporting slides
CLARAty algorithm [Jul03b]
CLARAty & cluster labeling

Running CLARAty with no labels and trying to label the clusters as:
– containing only false positives,
– containing only true positives,
– mixed.

Two main purposes:
– Retroactive alert analysis
• By looking at cluster descriptions again, the analyst may spot previously missed incidents or large groups of alerts indicating problems.
• Can write rules recognizing some incidents.
– Predictive value
[Flowchart: Alerts and Historical Alert Data feed Alert Clustering; Incident Correlation uses Historical Incident Data; results are used to Create Trigger Rules (TP) and Create Filtering Rules (FP) for the Alert Filter, or to Split Clusters / Investigate missed TPs.]
Clustering Conclusions

Clusters are persistent:
– average clustering coverage 90%,
– average filtering coverage 63%.
Most of the clusters are FP-only clusters (avg. 95%).
– These clusters tend to be persistent.
There are only very few TP-only clusters (avg. 0.2%).
– These clusters are ephemeral.
Mixed clusters (avg. 5%):
– These clusters need to be investigated more carefully.
Filtering works well, although if clusters are not reviewed some attacks can be missed.
– Investigated; mostly due to incorrect labeling.
Automated Clustering & Analysis Framework

Only FP clusters are filtered out.
– Evaluation verifies that no true positives are filtered out.
– Applied to the DARPA1999 Data Set and Data Set B.
– Applied 26 times to each company on MSSD data (weekly clustering).
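A simplified sketch of this filtering idea (illustrative only, not the CLARAty implementation; `is_attack` and `match_cluster` are hypothetical names):

```python
def label_clusters(clusters):
    """Label each cluster from its labeled historical alerts: FP-only, TP-only, or mixed."""
    labels = {}
    for cid, alerts in clusters.items():
        kinds = {a["is_attack"] for a in alerts}
        labels[cid] = "TP" if kinds == {True} else ("FP" if kinds == {False} else "mixed")
    return labels

def filter_alerts(alerts, match_cluster, labels):
    """Pass every alert through, except those matching a cluster labeled FP-only."""
    return [a for a in alerts if labels.get(match_cluster(a)) != "FP"]

history = {1: [{"is_attack": False}, {"is_attack": False}],   # FP-only -> filter
           2: [{"is_attack": True}, {"is_attack": False}]}    # mixed -> keep for review
labels = label_clusters(history)
kept = filter_alerts([{"sig": "a"}, {"sig": "b"}],
                     match_cluster=lambda a: 1 if a["sig"] == "a" else 2,
                     labels=labels)
```

Only alerts matching FP-only clusters are dropped; TP-only and mixed clusters stay in the analyst's queue, which mirrors the evaluation goal above.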
Cluster filtering (DARPA1999 and Data Set B)

[Figure: #alerts and fraction of alerts per 1-week clustering period; left panels: DARPA1999 Data (Feb 28 – Mar 30), right panels: Data Set B (Nov 14 – Dec 09); series: #alerts, #positives, clusters.]
Cluster persistency (DARPA1999 and Data Set B)

[Figure: "Filtering using Clustering" – #alerts and fraction 1/(N+P) per 1-week clustering period for DARPA 1999 Data (Feb 28 – Mar 30) and Data Set B (Nov 14 – Dec 09); series: #alerts, #covered (clustering), #covered (FP clusters), #filtered, #positives, #positives missed (FN/P).]
Cluster Accuracy & Coverage (DARPA1999)

[Figure: four bar charts of #alerts over clusters/incidents on DARPA 1999 data – Clustering Accuracy (CLUSTERING STAGE), Clustering Accuracy (FILTERING STAGE), Clustering Coverage (CLUSTERING STAGE), Clustering Coverage (FILTERING STAGE).]
Cluster Accuracy & Coverage (Data Set B)

[Figure: four bar charts of #alerts over clusters/incidents on Data Set B – Clustering Accuracy (CLUSTERING STAGE), Clustering Accuracy (FILTERING STAGE), Clustering Coverage (CLUSTERING STAGE), Clustering Coverage (FILTERING STAGE).]
Two-stage alert classification – ROC analysis
Feature construction performs only marginally better
Filtering performs much better for DARPA and comparably for Data Set B.
[Figure: ROC curves (fp vs. tp, tp in 0.90–1.00) for the two-staged system on DARPA and Data Set B; curves: Original, Feature construction (2FC), Filtering (2FI), Filtering (2FI) (rescaled), with points labeled by parameter values.]
Misclassifications (two-stage) (DARPA)
[Figure: False Negative rate (fn) and False Positive rate (fp) vs. number of alerts processed by the system (0–60000); series: Recommender, Recommender (2FC), Recommender (2FI), Agent, Agent (2FC), Agent (2FI).]
Misclassifications (two-stage) (Data Set B)
[Figure: False Negative rate (fn) and False Positive rate (fp) vs. number of alerts processed by the system (0–50000); series: Recommender, Recommender (2FC), Recommender (2FI), Agent, Agent (2FC), Agent (2FI).]
Automatic Processing (two-stage)
[Figure: Discarded False Positive rate vs. number of alerts processed by the system; left: DARPA1999, right: Data Set B; series: Agent, Agent (2FC), Agent (2FI).]
END!