Multi-Tenant Cloud Infrastructure Architecture: Design and Technical Decisions

Documentation of the network architecture I designed for a multi-tenant infrastructure on Hetzner Cloud, focusing on design decisions and the technical motivations behind each choice.

Project Context and Goals

I designed and implemented a cloud infrastructure to manage multiple instances of e-commerce applications (primarily Magento) in a multi-tenant model. The main requirements were:

  • Tenant isolation: Each customer must be isolated from others, with different isolation levels based on tier (shared, business, enterprise)
  • Scalability: The architecture must support growth from a few customers to hundreds without a redesign
  • Security by design: Least privilege principle and defense in depth applied at all levels
  • Cost efficiency: Limited budget, need to optimize costs without compromising security
  • Manageable operational complexity: Small team, need a maintainable architecture

I chose Hetzner Cloud as the provider for its cost/performance ratio and European datacenter location (GDPR compliance). The entire infrastructure is managed as code with Terraform and Ansible.

1. Fundamental Architectural Decisions

1.1 VPC-Based Architecture vs Flat Network

The first fundamental decision was adopting a Virtual Private Cloud (VPC) based architecture instead of a flat network with all servers publicly exposed. I evaluated three approaches:

Approach                  | Advantages                                                 | Disadvantages                                                      | Evaluation
--------------------------|------------------------------------------------------------|--------------------------------------------------------------------|------------------------------------
Flat network (public IPs) | Simple setup, no network overhead                          | Maximum attack surface, firewalls hard to manage across N servers  | ❌ Rejected for security reasons
Always-on VPN             | Maximum security, no public servers                        | Every admin must configure the VPN; single point of failure        | ⚠️ Too complex for distributed teams
VPC + bastion host        | Balances security and usability; well-established pattern  | More complex initial setup                                         | ✅ Selected

I opted for the VPC + Bastion architecture because it offers the best compromise between security and operability. The bastion host becomes the single point of entry, simplifying audit and monitoring. I still integrated WireGuard VPN on the bastion for direct private network access when needed.

1.2 NAT Gateway: Managed vs Self-Hosted

Private VMs need internet access for updates, Docker image downloads, and external API calls. I evaluated two options for the NAT Gateway:

Aspect        | AWS NAT Gateway (managed)  | Self-hosted (iptables on a VM)
--------------|----------------------------|-------------------------------------------
Monthly cost  | ~$32 + $0.045/GB transfer  | ~€4.50 (cost of a CPX11 VM)
Maintenance   | None                       | Security updates, monitoring
Control       | Limited                    | Complete (custom iptables rules, logging)
Lock-in       | Vendor-specific            | Portable across providers

Decision: I implemented a self-hosted NAT gateway on the bastion host. The main motivations are:

  • 85% savings: €4.50/month vs $32/month + transfer
  • Complete control: I can implement custom rules, detailed logging, traffic shaping
  • Portability: The solution works on any cloud provider with the same code
  • Acceptable overhead: For a technical team, maintaining iptables rules is not a problem

The trade-off is operational overhead (security patches, monitoring) and a single point of failure. The latter can be mitigated with a second bastion in an HA pair, which can be added when necessary.
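As a sketch of how this decision translates into code, the Terraform fragment below uses the official hcloud provider; the resource names and cloud-init payload are illustrative, and persistence of the iptables rules across reboots (e.g., via netfilter-persistent) is omitted for brevity:

# Minimal sketch of the self-hosted NAT gateway; assumes the management
# network and its subnets are defined elsewhere (see the sketches in 2.1).
resource "hcloud_server" "bastion" {
  name        = "bastion"
  server_type = "cpx11"
  image       = "ubuntu-22.04"
  location    = "fsn1"

  network {
    network_id = hcloud_network.tier["management"].id
    ip         = "10.0.0.2"
  }

  # cloud-init: enable IP forwarding and masquerade VPC traffic out of eth0.
  user_data = <<-EOT
    #cloud-config
    runcmd:
      - sysctl -w net.ipv4.ip_forward=1
      - iptables -t nat -A POSTROUTING -s 10.0.0.0/16 -o eth0 -j MASQUERADE
  EOT
}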

1.3 Multi-Tenant Segmentation

To support different customer tiers (shared, business, enterprise) I designed a 4-level network segmentation. The key is balancing isolation and costs: not all customers need (or can afford) dedicated infrastructure.

Implemented Segmentation Model

Tier       | Network                          | Isolation                                                                  | Use case
-----------|----------------------------------|----------------------------------------------------------------------------|---------------------------------------------
Management | 10.0.0.0/16                      | Private network for core infrastructure (Bastion, Rancher, Vault, ArgoCD)  | Centralized management
Shared     | 10.10.0.0/16                     | Kubernetes namespaces + Network Policies                                   | Standard-tier customers (limited budget)
Business   | 10.20.0.0/16 (/24 per customer)  | Dedicated Kubernetes nodes per customer                                    | Business customers (guaranteed performance)
Enterprise | 10.100+.0.0/16 (dedicated /16)   | Completely isolated network, dedicated bastion                             | Enterprise customers (compliance, auditing)

This structure allows me to offer three isolation levels with increasing costs. The shared customer pays little but shares resources, business has dedicated nodes, enterprise has an entire separate infrastructure.
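A sketch of how this tier plan can be captured in Terraform with the hcloud provider (the map keys and resource names are illustrative, not the production code):

locals {
  tier_networks = {
    management = "10.0.0.0/16"
    shared     = "10.10.0.0/16"
    business   = "10.20.0.0/16"
    # Enterprise /16s (10.100.0.0/16, 10.101.0.0/16, ...) are created
    # per customer, not from this shared map.
  }
}

resource "hcloud_network" "tier" {
  for_each = local.tier_networks
  name     = "net-${each.key}"
  ip_range = each.value
}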

2. Network Topology Design

2.1 Management Network (10.0.0.0/16)

The management network hosts the core infrastructure. I designed IP allocation with room for future growth:

Network: 10.0.0.0/16 (65,534 available hosts)

Subnet allocation:
- 10.0.0.0/24    Infrastructure core (254 hosts)
  ├─ 10.0.0.1    Gateway (reserved)
  ├─ 10.0.0.2    Bastion host (NAT + Jump + VPN)
  ├─ 10.0.0.3    Reserved (future HA bastion)
  ├─ 10.0.0.4    Rancher management cluster
  ├─ 10.0.0.5    Vault server (secrets management)
  ├─ 10.0.0.6    ArgoCD (GitOps)
  └─ 10.0.0.7-10 Reserved for future services

- 10.0.1.0/24    Rancher worker nodes
- 10.0.2.0/24    Monitoring stack (Prometheus, Grafana, Loki)
- 10.0.3.0/24    CI/CD infrastructure
- 10.0.10.0/24+  Reserved for expansion (room for ~240 subnets)

Rationale for the /16 choice: Even though I currently use only a few dozen IPs, I chose a /16 to avoid renumbering in the future. Private IPs cost nothing on Hetzner, so I can afford to be generous with the allocation.
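The allocation above maps one-to-one onto hcloud_network_subnet resources; a minimal sketch for the first two subnets (network zone and resource names assumed):

resource "hcloud_network_subnet" "mgmt_core" {
  network_id   = hcloud_network.tier["management"].id
  type         = "cloud"
  network_zone = "eu-central"
  ip_range     = "10.0.0.0/24" # infrastructure core
}

resource "hcloud_network_subnet" "mgmt_workers" {
  network_id   = hcloud_network.tier["management"].id
  type         = "cloud"
  network_zone = "eu-central"
  ip_range     = "10.0.1.0/24" # Rancher worker nodes
}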

2.2 Customer Networks

Shared Network (10.10.0.0/16)

For standard tier customers I implemented Kubernetes-level isolation:

  • Dedicated namespace per customer
  • Kubernetes Network Policies to isolate pod-to-pod traffic
  • Resource quotas to prevent noisy-neighbor effects
  • Separate service accounts and RBAC

This approach is economical (many customers on the same nodes) but requires well-configured Kubernetes. Isolation is strong but not total: all pods run on the same kernel.
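To illustrate the namespace-level isolation, here is a default-deny ingress policy per customer namespace, written with Terraform's kubernetes provider to stay consistent with the IaC approach (the namespace name is hypothetical; explicit allow rules for legitimate traffic would be layered on top):

resource "kubernetes_network_policy" "default_deny_ingress" {
  metadata {
    name      = "default-deny-ingress"
    namespace = "customer-a" # one policy per customer namespace
  }

  spec {
    pod_selector {}            # empty selector = all pods in the namespace
    policy_types = ["Ingress"] # no ingress rules listed => all ingress denied
  }
}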

Business Network (10.20.0.0/16)

Business customers get a dedicated /24 subnet (254 hosts) and dedicated Kubernetes nodes:

10.20.0.0/24     Reserved (base subnet)
10.20.1.0/24     Business Customer A (up to 254 hosts)
10.20.2.0/24     Business Customer B
10.20.3.0/24     Business Customer C
...
10.20.255.0/24   Business Customer 255

Capacity: 255 business customers (10.20.0.0/24 remains reserved)

Each customer has their own worker nodes, which guarantees performance and strengthens isolation. The cost is proportionally higher (dedicated VMs).
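The per-customer /24s can be derived from the 10.20.0.0/16 base with Terraform's cidrsubnet() function; a sketch with a hypothetical customer list:

variable "business_customers" {
  type    = list(string)
  default = ["customer-a", "customer-b", "customer-c"] # illustrative
}

resource "hcloud_network_subnet" "business" {
  count        = length(var.business_customers)
  network_id   = hcloud_network.tier["business"].id
  type         = "cloud"
  network_zone = "eu-central"
  # index 0 -> 10.20.1.0/24, index 1 -> 10.20.2.0/24, ...
  # (10.20.0.0/24 stays reserved as the base subnet)
  ip_range     = cidrsubnet("10.20.0.0/16", 8, count.index + 1)
}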

Enterprise Networks (10.100+.0.0/16)

Enterprise customers get a completely separate /16 network:

10.100.0.0/16    Enterprise Customer A (65,534 hosts)
10.101.0.0/16    Enterprise Customer B
10.102.0.0/16    Enterprise Customer C
...

Capacity: 56 enterprise customers (10.100-10.155)

Each enterprise network has its own bastion host, its own NAT gateway, and shares nothing with others. This is necessary for compliance (e.g., PCI-DSS) or for customers with specific auditing and security requirements.
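The same pattern, one level up, gives each enterprise customer a dedicated network; a sketch (the customer list is hypothetical):

variable "enterprise_customers" {
  type    = list(string)
  default = ["enterprise-a", "enterprise-b"] # illustrative
}

resource "hcloud_network" "enterprise" {
  count    = length(var.enterprise_customers)
  name     = "net-${var.enterprise_customers[count.index]}"
  # index 0 -> 10.100.0.0/16, index 1 -> 10.101.0.0/16, ...
  ip_range = "10.${100 + count.index}.0.0/16"
}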

2.3 Routing Table Design

I configured routing tables to ensure traffic always follows the intended paths. The design is based on two principles:

  1. Intra-VPC traffic stays local: Must never exit and re-enter
  2. Internet traffic always goes through NAT gateway: Centralized control

Routing Table: Management Network

Destination         Next Hop              Priority    Note
10.0.0.0/16        Local                 1           Intra-VPC (higher priority)
0.0.0.0/0          10.0.0.2 (Bastion)    2           Default route via NAT

Priority is fundamental: by longest-prefix match, the more specific route (10.0.0.0/16) takes precedence over the default (0.0.0.0/0). This ensures that a VM talking to another VM in the same VPC never transits the bastion.
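On Hetzner, traffic between hosts in the same network is local by design, so only the default route has to be declared explicitly; a sketch of the route that sends 0.0.0.0/0 to the bastion's private IP:

resource "hcloud_network_route" "default_via_bastion" {
  network_id  = hcloud_network.tier["management"].id
  destination = "0.0.0.0/0"
  gateway     = "10.0.0.2" # the bastion / NAT gateway
}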

Bastion Host Configuration

The bastion is configured as a dual-homed host (two network interfaces):

eth0 (Public interface):
  - Hetzner public IP
  - Default gateway to internet
  - Exposed to internet (SSH + WireGuard only)

eth1 (Private interface):
  - IP: 10.0.0.2
  - Connected to management VPC
  - Not reachable from internet

Kernel configuration:
  net.ipv4.ip_forward = 1

iptables configuration:
  # NAT for traffic from VPC to internet
  iptables -t nat -A POSTROUTING -s 10.0.0.0/16 -o eth0 -j MASQUERADE

  # Allow forwarding from VPC to internet
  iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
  iptables -A FORWARD -i eth0 -o eth1 -m state --state RELATED,ESTABLISHED -j ACCEPT

  # Block unsolicited connections from internet to VPC
  iptables -A FORWARD -i eth0 -o eth1 -j DROP

How NAT works: When a private VM (e.g., 10.0.0.4) wants to reach the internet (e.g., 8.8.8.8), the packet arrives at the bastion, which applies SNAT (source NAT) and replaces the source IP with its own public IP. The bastion keeps a connection-tracking table so it knows where to send replies back. The process is completely transparent to the VMs.

2.4 Firewall Rules

I implemented firewall rules based on the least privilege principle: everything is blocked by default, I only allow strictly necessary traffic. Hetzner offers cloud-level firewall (before the VM), which I combined with local iptables on each host for defense in depth.

Bastion Host Firewall

INBOUND (Hetzner Cloud Firewall):
  ✅ TCP 22 (SSH) from MY_OFFICE_IP/32
  ✅ UDP 51820 (WireGuard) from 0.0.0.0/0
  ❌ Everything else: DROP

OUTBOUND:
  ✅ Allow all (required for NAT function)

FORWARD:
  ✅ From 10.0.0.0/16 to 0.0.0.0/0 (NAT traffic)
  ✅ ESTABLISHED,RELATED connections
  ❌ From internet to 10.0.0.0/16: DROP

Note on SSH access: I restricted SSH to my office IP where possible. For remote access I use the WireGuard VPN, which provides strong authentication via cryptographic keys.
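The bastion's inbound policy translates directly into an hcloud_firewall resource; a sketch (the office IP below is a documentation placeholder):

resource "hcloud_firewall" "bastion" {
  name = "fw-bastion"

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = ["203.0.113.10/32"] # MY_OFFICE_IP placeholder
  }

  rule {
    direction  = "in"
    protocol   = "udp"
    port       = "51820"
    source_ips = ["0.0.0.0/0", "::/0"] # WireGuard from anywhere
  }

  # No other inbound rules: Hetzner Cloud Firewalls drop unmatched
  # inbound traffic by default.
}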

Private VMs Firewall (Rancher, Vault, etc.)

INBOUND:
  ✅ TCP 22 (SSH) from 10.0.0.2/32 (bastion only)
  ✅ TCP 6443 (K8s API) from 10.0.0.0/16 (management network)
  ✅ TCP 443 (HTTPS) from 10.0.0.2/32 (via reverse proxy)
  ✅ All from 10.0.0.0/16 (intra-VPC communication)
  ❌ Everything else: DROP

OUTBOUND:
  ✅ To 10.0.0.0/16 (intra-VPC)
  ✅ To 0.0.0.0/0 (internet via NAT)

No private VM accepts direct connections from the internet. The only way to reach them is:

  1. SSH jump through bastion: ssh -J bastion private-vm
  2. WireGuard VPN connected to private network
  3. Reverse proxy on bastion for web services (Nginx Proxy Manager)

3. Security Considerations

3.1 Threat Model

I analyzed the main attack vectors for this architecture and implemented specific mitigations:

Threat                        | Likelihood | Impact   | Mitigation
------------------------------|------------|----------|---------------------------------------------------------------
SSH brute force on bastion    | High       | Medium   | fail2ban, key-only auth, IP whitelisting, rate limiting
Bastion compromise            | Low        | Critical | Hardening, monitoring, session recording, 2FA for sudo
Lateral movement post-breach  | Medium     | High     | Network segmentation, K8s Network Policies, strict RBAC
DDoS on bastion               | Medium     | High     | Rate limiting, connection limits, Cloudflare for web services
Data exfiltration             | Low        | Critical | Egress filtering, anomaly detection, audit logging

3.2 Defense in Depth

I implemented security across 5 layers. Compromise of a single layer should not compromise the entire system:

Layer 1: Network

  • VPC isolation between management and customer networks
  • Cloud firewall (Hetzner) + local iptables on each host
  • NAT gateway for outbound traffic control
  • Single point of entry (bastion) that is easy to monitor

Layer 2: Host

  • SSH hardening: disable password auth, custom port, key-only
  • Automatic security updates (unattended-upgrades on Ubuntu)
  • fail2ban for intrusion prevention
  • Minimal attack surface: disable unnecessary services

Layer 3: Application (Kubernetes)

  • Network Policies for pod-to-pod traffic control
  • Pod Security Standards (no privileged containers for workloads)
  • Granular RBAC for service accounts
  • Image scanning (Trivy) in CI/CD pipeline

Layer 4: Data

  • Encryption at rest (LUKS for critical volumes)
  • Encryption in transit (TLS 1.3 everywhere)
  • Secrets management with HashiCorp Vault (no secrets in code)
  • Database encryption for PII data

Layer 5: Operations

  • Centralized logging (Loki + Promtail)
  • Monitoring and alerting (Prometheus + Grafana + Alertmanager)
  • Daily automated backups with retention policy
  • Documented and tested incident response plan

3.3 Known Limitations

It's important to be honest about the limits of this architecture. It does not protect against:

  • Complete bastion compromise: If an attacker gets root on the bastion, they potentially have access to the entire private network. Partial mitigation: rigorous monitoring, session recording, 2FA for critical operations.
  • Insider threats: An administrator with legitimate access can cause damage. Requires separation of duties and audit logging.
  • Application-layer attacks: SQL injection, XSS, etc. are not mitigated by network architecture. Requires secure coding and WAF.
  • Supply chain attacks: Compromised dependencies or Docker images. Requires image scanning, SBOM, and signature verification.

4. Cost Analysis

4.1 Infrastructure Base Cost

I did a comparative cost analysis to validate the choice of Hetzner vs more expensive alternatives:

Component                          | Hetzner                      | AWS equivalent               | Savings
-----------------------------------|------------------------------|------------------------------|--------
Bastion (CPX11: 2 vCPU, 2 GB RAM)  | €4.51/month                  | t3.small: ~$15/month         | ~70%
NAT gateway                        | €0 (self-hosted on bastion)  | ~$32/month + $0.045/GB       | ~100%
Rancher (CPX21: 3 vCPU, 4 GB)      | €8.21/month                  | t3.medium: ~$30/month        | ~73%
Vault (CPX11)                      | €4.51/month                  | t3.small: ~$15/month         | ~70%
ArgoCD (CPX11)                     | €4.51/month                  | t3.small: ~$15/month         | ~70%
Traffic (20 TB included)           | €0                           | ~$50/month (1 TB estimated)  | 100%
TOTAL                              | €21.74/month                 | ~$157/month                  | ~85%

Annual savings: €1,440 (~$1,600) for base infrastructure alone. At scale with N customer worker nodes, savings become even more significant.

4.2 Scaling Economics

For each business customer with dedicated nodes:

  • CPX31 (4 vCPU, 8GB RAM): €14.28/month → recommended for Magento
  • CPX41 (8 vCPU, 16GB RAM): €26.64/month → for heavy workloads

Example: 10 business customers with 1x CPX31 = 10 × €14.28 = €142.80/month additional. On AWS the same setup would cost ~$500-600/month.

Conclusions

The architecture I designed allowed me to build a secure, scalable, and cost-effective multi-tenant infrastructure on Hetzner Cloud. Key decisions were:

  • 4-level network segmentation for different isolation tiers
  • Self-hosted NAT gateway to reduce costs by 85%
  • Bastion host as single point of entry with integrated WireGuard VPN
  • Defense in depth across 5 layers
  • Infrastructure as Code for reproducibility

In the next article I'll show the practical implementation with Terraform: how to transform this design into reproducible code, with automatic Ansible inventory generation and complete bastion host configuration via cloud-init.

Resources

Next article: Infrastructure as Code with Terraform on Hetzner Cloud