7. Real Test Scenarios

Now that the workflow is set up and explained, let us look at a few real-world scenarios to see how it works.

Test 1 – High CPU Usage

I simulated this by running the following stress command inside one LXC container:

stress --vm 1 --vm-bytes $(awk '/MemTotal/{printf "%d\n", $2 * 0.8}' /proc/meminfo)k --timeout 600
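The size passed to --vm-bytes above is 80% of MemTotal, which /proc/meminfo reports in kB. According to the stress man page, --vm workers spin on malloc()/free(), so this run loads memory as well as the CPU. The following is a minimal sketch that splits the calculation into its own step so the value can be checked before anything is started. The helper name vm_bytes_kib is hypothetical and not part of stress.

```shell
# Sketch only: vm_bytes_kib is a hypothetical helper, not part of stress.
# Usage: vm_bytes_kib [FRACTION] [MEMINFO_FILE]
# Prints FRACTION of MemTotal in kB, truncated to an integer.
vm_bytes_kib() {
  local fraction="${1:-0.8}"
  local meminfo="${2:-/proc/meminfo}"
  awk -v f="$fraction" '/^MemTotal:/ { printf "%d\n", $2 * f }' "$meminfo"
}

# Print the resulting command instead of running it, so the size can be
# reviewed first. Nothing is stressed by this snippet.
if [ -r /proc/meminfo ]; then
  echo "stress --vm 1 --vm-bytes $(vm_bytes_kib 0.8)k --timeout 600"
fi
```

Lowering the fraction (for example `vm_bytes_kib 0.5`) gives a gentler test on containers with little memory headroom.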

The CPU started maxing out straight away, as visible in Proxmox:

Screenshot from Proxmox showing the CPU maxed out at 99.54%.

After a couple of minutes, the workflow in n8n was executed and the result was clear:

Screenshot from Discord showing a medium-risk action that proposes a kill-process job, based on the Loki logs that the AI processed to determine the best course of action.

I approved the remediation to go ahead and a job in AWX was triggered:

Screenshot from AWX showing the job that executed the kill-process template to terminate the process that was hogging the CPU.

Since I had the command running in a terminal, I saw it get terminated:

Screenshot from the terminal (an SSH session) showing that the running process was 'Killed'.

The Discord message confirmed the findings:

Screenshot from Discord showing how the AI analyzed both the results of the AWX job and the system logs from Loki.

Test 2 – An Inactive Critical Service

In this scenario, the fail2ban service on the web1 VM is stopped by running sudo systemctl stop fail2ban.
The workflow picks this up on its next run and offers to restart the service automatically:

Screenshot from Discord showing the fail2ban service on web1 flagged as inactive.

The workflow triggers an AWX job and confirms that the service is back up:

Screenshot from AWX showing the result of the job that successfully started the previously inactive service.

The result on Discord confirms the findings:

Screenshot from Discord showing how the AI formulates the results based on the AWX (or Semaphore) job results and the logs from Loki.
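To double-check the service state by hand before and after the remediation, systemctl is-active can be used on the VM itself. Below is a minimal sketch that wraps it in a small function. The name report_service is hypothetical, and this is not how the workflow itself detects the problem.

```shell
# Sketch only: report_service is a hypothetical helper for manual checks.
# Usage: report_service UNIT
report_service() {
  local unit="$1" state
  # "systemctl is-active" prints the unit state (active, inactive, failed, ...)
  # and exits non-zero unless the unit is active.
  state="$(systemctl is-active "$unit" 2>/dev/null || true)"
  if [ "$state" = "active" ]; then
    echo "$unit: active"
  else
    echo "$unit: ${state:-unknown} (needs attention)"
    return 1
  fi
}

# Example (run on the VM): report_service fail2ban
```

Running it right after sudo systemctl stop fail2ban should report the unit as inactive, and running it again after the AWX job should report it as active.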

To spice things up a bit, I also introduced a config error in the /etc/fail2ban/jail.local file and then stopped the service. I wanted to see how the AI would handle that. As you can see, the analysis clearly explains what went wrong:

Screenshot from Discord showing a follow-up test with a deliberately broken jail.local file, in which the AI correctly identified what went wrong.
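A broken jail.local like the one in this follow-up test can also be caught before the service is started. The sketch below assumes a fail2ban version whose client supports the -t (test configuration) option, and it is not part of the workflow or the AWX template described here. The function name start_if_config_ok and the F2B_CLIENT override variable are hypothetical.

```shell
# Sketch only: start_if_config_ok and F2B_CLIENT are hypothetical names.
# Assumes fail2ban-client supports -t (test configuration).
start_if_config_ok() {
  local client="${F2B_CLIENT:-fail2ban-client}"
  if "$client" -t; then
    # Configuration parsed cleanly, so starting the service is worth trying.
    sudo systemctl start fail2ban
  else
    echo "fail2ban config test failed; fix jail.local before starting" >&2
    return 1
  fi
}

# Example (run on the VM): start_if_config_ok
```

A check like this fails fast on a syntax error instead of leaving the diagnosis to the logs after a failed start.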

Test 3 – Low Disk Space On Proxmox3

For Test 3, I temporarily lowered the disk space threshold so that a Proxmox host would be flagged by the workflow. Because the workflow defines any operation on a Type 1 hypervisor as HIGH risk, the AI classified the remediation correctly. If it is not approved within an hour, the remediation template is not executed. I approved it to see what would happen.

Screenshot from Discord showing a template proposed for a Proxmox host. The operation itself would otherwise be low risk, but it is treated as high risk by default because the host is a Type 1 hypervisor.
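The two rules at play in this test (hypervisor operations are always high risk, and an unanswered approval lapses after an hour) can be illustrated with a small sketch. In the workflow these rules live in the model prompt and the n8n approval step, not in a shell script, so the function names, the role label "hypervisor", and the exact boundary at 3600 seconds are all assumptions made for illustration.

```shell
# Sketch only: effective_risk and approval_expired are hypothetical helpers
# that mirror the rules described above. They are not taken from the workflow.

# Usage: effective_risk HOST_ROLE PROPOSED_RISK
# Any operation on a Type 1 hypervisor is escalated to HIGH.
effective_risk() {
  local host_role="$1" proposed="$2"
  if [ "$host_role" = "hypervisor" ]; then
    echo "HIGH"
  else
    echo "$proposed"
  fi
}

# Usage: approval_expired REQUESTED_EPOCH NOW_EPOCH [TIMEOUT_SECONDS]
# Succeeds (exit 0) once the approval window has passed (default: one hour).
approval_expired() {
  local requested="$1" now="$2" timeout="${3:-3600}"
  [ $(( now - requested )) -ge "$timeout" ]
}

# Example: effective_risk hypervisor LOW
# Example: approval_expired "$requested_at" "$(date +%s)" && echo "skip remediation"
```

Keeping the escalation rule separate from the model's own assessment makes it easier to validate that a hypervisor can never slip through as low risk.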

The result in AWX revealed that not much could be removed:

Screenshot from AWX showing the results of the disk-cleanup playbook execution.

Our playbook is designed to remove only residual and temporary files, so if the host is filled with ISO images and backups, it will not free up much space, as the summary below shows.

Screenshot from Discord showing the AI's detailed summary of the results, with additional recommendations.
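When the cleanup frees little space, it helps to see what is actually filling the disk. The following is a minimal sketch for a manual check on the host. The function name largest_entries is hypothetical, and /var/lib/vz is assumed to be the storage path in question (it is the default location of Proxmox's "local" storage, but your layout may differ).

```shell
# Sketch only: largest_entries is a hypothetical helper for manual inspection.
# Usage: largest_entries DIR [COUNT]
# Lists DIR and its immediate subdirectories by size in KiB, largest first.
# The first line is the total for DIR itself.
largest_entries() {
  local dir="$1" count="${2:-5}"
  # -x stays on one filesystem, -k reports KiB so the sort is purely numeric.
  du -xk -d 1 -- "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example (run on the Proxmox host): largest_entries /var/lib/vz 10
```

If ISO images or backups dominate the output, removing or relocating them is a manual decision and is deliberately left outside the playbook's scope.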

💡 You can easily modify the ‘Message a Model’ node to fit your needs, define exceptions, or even remove certain metrics from the Alloy agent monitoring to ensure that a mission-critical host is never affected by the workflow. I have put a placeholder instruction in that node stating that any change on a Type 1 hypervisor is flagged as high risk. This worked during my testing phase but would benefit from more specific guidance and validation.

More tests could be conducted, and I did run many in my environment, but the three above demonstrate the functionality sufficiently. Now let’s look at what this tutorial has not covered from a security perspective and which other desirable features could be implemented.
