AWS Infrastructure Management

Overview

My role: Product concept to deployed system.

Challenge: Efficiently manage the infrastructure and cloud application for an SME client, including deployment, backup/restore, and monitoring, from Windows and macOS workstations.

Solution: Multi-platform desktop application integrating infrastructure-as-code, status monitoring, and log analytics.

Impact: Enabled safe day-to-day operations (deploy/backup/restore/log review) without requiring AWS console access or broad administrative permissions.

Technologies: Avalonia, C#/.NET, AWS SDK

Infrastructure As Code

In infrastructure as code, the key question is how much of the system is declarative configuration and how much is behavioural automation. For this client, a suite of Python automation already existed to manage the cloud infrastructure: it did exactly what we needed, but it was command-line driven and required deep operational knowledge to use safely.

Porting the automation from Python to C#, still built on the AWS SDK, kept the proven automation model and added a modern Avalonia UI on top for safer day-to-day operations.

Standard Processes

Running and monitoring deployments was already part of the Python tool suite. With the new UI, deployment and recovery workflows are easier to initiate and monitor. This reduces the risk of operator error (wrong command-line calls or parameters) and makes day-to-day operations accessible to non-specialists who previously had to delegate these tasks to technical team members.

Log Analytics

Logs are essential, and they are large. Various analytics tools are available, and the AWS Console offers CloudWatch log review. But when an existing tool needs significant tailoring, or the workflow relies on a basic UI such as the AWS Console, interpreting logs consistently still takes substantial effort.

The solution enabled local analytics with semantic parsing of log formats and supporting visualisations (e.g., histogram charts) to understand when key events occur.
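As an illustration of what semantic parsing means here, the following is a minimal Python sketch rather than the application's C# implementation. The line layout (timestamp, level, component, message) and the names LogEvent and parse_line are assumptions made for the example; the real log formats are not described in this write-up.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional

# Hypothetical layout: "<ISO-8601 timestamp> <LEVEL> <component>: <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+"
    r"(?P<component>[\w.-]+):\s+(?P<message>.*)$"
)


class LogEvent(NamedTuple):
    timestamp: datetime
    level: str
    component: str
    message: str


def parse_line(line: str) -> Optional[LogEvent]:
    """Turn one raw log line into a structured event, or None if it does not fit."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    try:
        timestamp = datetime.fromisoformat(match["timestamp"])
    except ValueError:
        return None  # first token was not a parseable timestamp
    if timestamp.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        timestamp = timestamp.replace(tzinfo=timezone.utc)
    return LogEvent(timestamp, match["level"], match["component"], match["message"])
```

Once lines are structured like this, filtering by level or component and charting events over time become simple operations on fields instead of free-text searches.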

Delivery and Technical Approach

The production environment is intentionally self-contained. Core services (e.g., RDS and the ECS/Fargate cluster) are not directly administered through day-to-day console usage. Instead, operational actions are wrapped into predetermined, externally triggerable workflows inside the environment:

  • AWS Lambda functions and ECS tasks for build/deploy, backup/restore, and rollback operations
  • read-only status visibility into what is running and what artefacts/backups are available

The desktop UI allows authorised users to execute these workflows without logging into the AWS console and without requiring broad administrative permissions.
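Triggering one of these workflows from a client can be sketched as follows. This is Python for brevity, not the application's C# code. It assumes a boto3-style Lambda client is passed in (for example one created with boto3.client("lambda")); the function name "backup-database" and the parameter names are hypothetical.

```python
import json


def trigger_workflow(lambda_client, function_name, parameters):
    """Invoke a workflow Lambda synchronously and return its decoded JSON result.

    lambda_client is expected to behave like a boto3 Lambda client.
    """
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to finish
        Payload=json.dumps(parameters).encode("utf-8"),
    )
    body = response["Payload"].read().decode("utf-8")
    # Lambda sets FunctionError when the function itself raised an error.
    if response.get("FunctionError"):
        raise RuntimeError(f"{function_name} failed: {body}")
    return json.loads(body) if body else None


# Usage (hypothetical names):
#   client = boto3.client("lambda")
#   trigger_workflow(client, "backup-database", {"label": "pre-release"})
```

Passing the client in keeps the call site small and lets the same function be exercised against a stub during testing.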

Least-Privilege IAM Model

The application uses a constrained IAM role:

  • read-only access to selected resources (e.g., available backups, available container images, running ECS tasks)
  • execution permission for specific Lambdas
  • permission to start specific ECS tasks that are wired to perform a single operational action

This reduces the blast radius and makes operational intent explicit.
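A policy of this shape could look roughly like the sketch below, written as a Python function that builds the policy document. The exact actions and resources used in the project are not listed here, so the choices are assumptions: RDS snapshots stand in for "available backups", ECR for "available container images", and an iam:PassRole statement is included because starting ECS tasks that carry task roles generally requires it. The function and parameter names are hypothetical.

```python
import json


def build_operator_policy(lambda_arns, task_definition_arns, task_role_arns):
    """Build an IAM policy document for the desktop application's role (sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read-only visibility. Resource is "*" here for simplicity;
                # narrow it where the individual action supports it.
                "Sid": "ReadOnlyStatus",
                "Effect": "Allow",
                "Action": [
                    "ecs:ListTasks",
                    "ecs:DescribeTasks",
                    "ecr:DescribeImages",
                    "rds:DescribeDBSnapshots",
                ],
                "Resource": "*",
            },
            {
                # Only the named workflow Lambdas may be invoked.
                "Sid": "InvokeWorkflowLambdas",
                "Effect": "Allow",
                "Action": "lambda:InvokeFunction",
                "Resource": list(lambda_arns),
            },
            {
                # Only the named single-purpose task definitions may be started.
                "Sid": "RunWorkflowTasks",
                "Effect": "Allow",
                "Action": "ecs:RunTask",
                "Resource": list(task_definition_arns),
            },
            {
                # Assumption: the tasks use IAM roles that must be passed to ECS.
                "Sid": "PassTaskRoles",
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": list(task_role_arns),
                "Condition": {
                    "StringEquals": {"iam:PassedToService": "ecs-tasks.amazonaws.com"}
                },
            },
        ],
    }


def to_json(policy):
    """Serialise the policy for attaching to a role or for review."""
    return json.dumps(policy, indent=2)
```

Nothing in the document grants write access to RDS, ECS services, or IAM itself, so the role can only observe the environment and start the predefined workflows.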

Build and Deploy Integrity

Container builds run inside the production environment from source code in a protected GitHub main branch. The build workflow reports the Git commit hash used to build the images, allowing operators to validate that production artefacts correspond to an expected, reviewed commit before deployment.
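The validation step amounts to comparing the reported hash with the commit that was reviewed. A minimal Python sketch, assuming both values are full 40-character SHA-1 commit hashes (repositories using SHA-256 object names would need a 64-character check); the function name is hypothetical:

```python
import re

# Full SHA-1 commit hash: 40 hexadecimal characters.
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def build_matches_expected(reported_commit, expected_commit):
    """Return True only if the build's reported commit is exactly the reviewed one."""
    reported = reported_commit.strip().lower()
    expected = expected_commit.strip().lower()
    # Refuse abbreviated or malformed hashes rather than guessing.
    if not FULL_SHA.match(reported) or not FULL_SHA.match(expected):
        raise ValueError("expected full 40-character Git commit hashes")
    return reported == expected
```

Requiring full hashes avoids approving a build on the strength of an ambiguous abbreviated prefix.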

Deployment is a separately triggered step after validation, updating the running set of containers to the new version. Backup, restore, and rollback to earlier versions follow the same controlled, explicit workflow pattern.

Local Log Analytics

CloudWatch logs are accessible, but interactive investigation through the console can be slow and limited without a dedicated log management tool. The solution therefore provides local log analytics:

  • Lucene-based deep search and semantic parsing for known log formats
  • distribution charts across time slices (e.g., day/hour/15-minute buckets)

This reduces time-to-diagnosis for operational issues from hours to minutes.
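The time-slice distribution behind such charts can be sketched in a few lines of Python. This illustrates the bucketing idea only, not the application's C# code; the function names are hypothetical, and buckets are aligned to UTC, which is an assumption (day buckets therefore start at midnight UTC).

```python
from collections import Counter
from datetime import datetime, timezone


def bucket_start(timestamp, minutes=15):
    """Round a timezone-aware timestamp down to the start of its bucket (UTC)."""
    width = minutes * 60
    epoch_seconds = int(timestamp.timestamp())
    return datetime.fromtimestamp(epoch_seconds - epoch_seconds % width, tz=timezone.utc)


def histogram(timestamps, minutes=15):
    """Count events per bucket; returns a dict ordered by bucket start time."""
    counts = Counter(bucket_start(ts, minutes) for ts in timestamps)
    return dict(sorted(counts.items()))


# Usage: minutes=15, 60, or 1440 gives 15-minute, hourly, or daily buckets.
```

The resulting mapping of bucket start to event count is what a histogram chart plots directly.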