Due to the length of this article and its focus on delivering functionality, we have not looked in detail at the security aspects of the setup, especially when preparing such a workflow for production. Authentication and HTTPS access have not been covered, yet they are essential for production-ready setups. If you are considering running a workflow like this in production, consider the following standard security measures:
- Reverse proxy with TLS for n8n (Caddy/nginx)
- Basic auth or OAuth2 proxy for Prometheus/Loki
- If you need to use external packages and thus the `NODE_FUNCTION_ALLOW_EXTERNAL` directive, restrict it to only the required packages
- Set `allowUnauthorizedCerts` to `false` in production and provide proper certificates (even if sourced from OPNSense’s ACME service)
- Deploy Fail2ban on the host running the Docker containers
- Audit trail enhancement: The Discord messages provide a human-readable audit log, but for compliance, consider writing remediation events to a structured log (ELK/Loki) with timestamps, approvers, and outcomes.
- JSON Validation: While Claude’s JSON responses are generally reliable, in the `Parse Claude Response` node we rely on the AI output without validation. For production, consider adding a schema validation step (e.g., JSON Schema or Zod) before acting on AI output – see the sketch after this list.
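To illustrate the validation idea, below is a minimal sketch of what such a check could look like in an n8n Code node placed between `Parse Claude Response` and the nodes that act on the result. The field names and allowed actions are assumptions for illustration only – adjust them to whatever JSON structure your prompt actually requests.

```javascript
// Minimal validation sketch for the parsed Claude output (hypothetical fields).
// Adjust REQUIRED_FIELDS and ALLOWED_ACTIONS to match your actual prompt/schema.
const REQUIRED_FIELDS = ['host', 'action', 'reason'];
const ALLOWED_ACTIONS = ['clean_disk', 'restart_service', 'kill_process', 'reboot_host', 'none'];

const validated = [];
for (const item of $input.all()) {
  const data = item.json;

  const missing = REQUIRED_FIELDS.filter((field) => !(field in data));
  if (missing.length > 0) {
    // Failing the execution is safer than acting on an incomplete response
    throw new Error(`Claude response is missing fields: ${missing.join(', ')}`);
  }

  if (typeof data.action !== 'string' || !ALLOWED_ACTIONS.includes(data.action)) {
    throw new Error(`Unexpected action in Claude response: ${JSON.stringify(data.action)}`);
  }

  validated.push(item);
}

return validated;
```

A library such as Zod (loaded via `NODE_FUNCTION_ALLOW_EXTERNAL`) would let you express the same rules as a reusable schema, but even a hand-rolled check like this stops the workflow from acting on malformed or unexpected output.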
Other Functional Considerations
- Hostname interpretation: In the ‘Process Metrics’ node, the `stripDomain` helper splits on `.`, which will truncate hostnames like `web1.internal` to just `web1`. This is fine as long as short hostnames are unique across the fleet – see the sketch after this list for a more conservative variant.
- Cache register: The advisory workflow uses `alert-cache-advisory.json`, while the remediation workflow uses `alert-cache.json`. This separation is intentional. If you want to re-run a workflow and have previously flagged hosts reported on again, simply remove the file with a command such as `sudo docker exec n8n rm /home/node/.n8n/alert-cache.json`.
- Alloy requirement (covered in Part 1): The process-level metrics (`top_cpu_processes`, `top_memory_processes`) require the `process-exporter`/`namedprocess` exporter. We covered this in Part 1 of this series – if you skipped straight to Part 2, take it into account.
- Execution timeout – In the past, n8n had a default timeout of 5 minutes, which would not be enough for some workflows. During my testing of version 2.6.3, a workflow ran for several hours and was not terminated.
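To illustrate the hostname point above, here is a hedged sketch of an alternative `stripDomain` helper that removes only suffixes you explicitly trust, instead of cutting at the first dot. The suffix list is an assumption – replace it with the domains used in your environment.

```javascript
// Hypothetical alternative to the stripDomain helper in the 'Process Metrics' node.
// Only known suffixes are removed, so hosts that differ only by domain
// (e.g. web1.internal vs web1.dmz) are not collapsed into the same short name.
const KNOWN_SUFFIXES = ['.internal', '.lan', '.home.arpa']; // assumption: adjust to your domains

function stripDomain(hostname) {
  for (const suffix of KNOWN_SUFFIXES) {
    if (hostname.endsWith(suffix)) {
      return hostname.slice(0, -suffix.length);
    }
  }
  return hostname; // unknown domain: keep the full name to avoid collisions
}

// stripDomain('web1.internal')    -> 'web1'
// stripDomain('web1.example.com') -> 'web1.example.com' (left untouched)
```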
Limitations / Out of scope
- No High Availability. This could be covered in the future if there is interest. If the VM host that runs these Docker containers goes down, you will not learn about issues. For this reason, it is always good to pair the setup with a separate host running uptime monitoring, such as UptimeKuma or Zabbix.
- n8n supports PostgreSQL as a database backend and can run with multiple workers. For the monitoring stack, it is possible to run Prometheus as an HA pair, with Thanos or Mimir for long-term storage.
- No rollback functionality? For those who are more cautious, especially in production, you could always run a ‘take a snapshot’ playbook before changes like reboots and disk space cleaning are implemented (unless those are bare-metal hosts). The playbooks suggested in this tutorial are non-destructive:
- The disk cleaning job does not delete user data, only cache/temp/logs
- The process termination job uses the TERM signal to allow a graceful shutdown
- The reboot host job captures pre- and post-reboot event logs to keep an audit trail
- Rate limiting & local LLMs – in case you run the workflow several times in an hour and each time issues are found, you might incur additional charges when using public models like Claude. You might wish to consider a self-hosted LLMs like Qwen3-235B, Llama 4, Mistral Large 3, Phi-4 or even just an RPI 5 with a 5Stack LLM-8850 HAT (as the link reveals, you can install it with Qwen3:1.7b, Whisper and MeloTTS).
- Distro Limitations: Since I wrote this tutorial with Debian/Ubuntu in mind, other Linux distros like RHEL and UNIX systems like FreeBSD are not covered. You will need to adapt the jobs to your needs.
- Definition of exceptions: In the AI prompt, you could define a list of hosts that you simply do not want to touch. This could be your Type 1 hypervisors.
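If you prefer to enforce such exceptions in code rather than relying on the prompt alone, a small filter step before the AI call could drop the protected hosts entirely. The sketch below assumes an n8n Code node receiving one item per host with a `hostname` field – both the field name and the exclusion list are illustrative.

```javascript
// Hypothetical guard: drop hosts that must never be touched before the data
// reaches the AI prompt. The hostnames and the field name are examples only.
const PROTECTED_HOSTS = ['pve01', 'esxi01', 'truenas01'];

return $input.all().filter((item) => {
  const host = item.json.hostname;
  return !PROTECTED_HOSTS.includes(host);
});
```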
What’s Next – Part 3
- Fully Autonomous Workflow? In Part 3, I would like to look into giving the AI even more autonomy to make informed decisions on infrastructure changes. We will need to employ some caution.
- Docker container remediation? One big topic that has been intentionally left out of Part 2 is the remediation of Docker containers. For Docker environments, a **restart-container.yml** playbook paired with cAdvisor metrics could extend this workflow to container-level remediation – something we will explore in Part 3.
- Support for other distros? In addition, based on your comments below, we could look into automated jobs for other platforms like FreeBSD and into pulling logs from other systems like OPNSense. Everything is possible!
- Multi-step remediation approach? Lastly, what if more than one job needs to be executed? For example, metrics reveal an issue and you want to run a deeper diagnostic playbook before deciding which action to take. A multi-step remediation approach may be required for such a situation. The reason for not going there at this point is that I did not want to overload readers with overly complicated workflows before the concept of AI-assisted remediation has been properly laid out with the basic four templates.
Let me know what else you would like covered or what is missing in this series from your perspective. You can very much influence where things go next!