Multi-Tenant Cloud Infrastructure Architecture: Design and Technical Decisions

Documentation of the network architecture I designed for a multi-tenant infrastructure on Hetzner Cloud, focusing on design decisions and the technical motivations behind each choice.

Project Context and Goals

I designed and implemented a cloud infrastructure to manage multiple instances of e-commerce applications (primarily Magento) in a multi-tenant model. The main requirements were:

  • Tenant isolation: Each customer must be isolated from others, with different isolation levels based on tier (shared, business, enterprise)
  • Scalability: The architecture must support growth from a few customers to hundreds without a redesign
  • Security by design: Least privilege principle and defense in depth applied at all levels
  • Cost efficiency: Limited budget, need to optimize costs without compromising security
  • Manageable operational complexity: Small team, need a maintainable architecture

I chose Hetzner Cloud as the provider for its cost/performance ratio and European datacenter location (GDPR compliance). The entire infrastructure is managed as code with Terraform and Ansible.

1. Fundamental Architectural Decisions

1.1 VPC-Based Architecture vs Flat Network

The first fundamental decision was adopting a Virtual Private Cloud (VPC) based architecture instead of a flat network with all servers publicly exposed. I evaluated three approaches:

Approach                  | Advantages                                                 | Disadvantages                                                      | Evaluation
--------------------------|------------------------------------------------------------|--------------------------------------------------------------------|------------------------------------
Flat network (public IPs) | Simple setup, no network overhead                          | Maximum attack surface, firewalls hard to manage across N servers  | ❌ Rejected for security reasons
Always-on VPN             | Maximum security, no public servers                        | Every admin must configure the VPN; single point of failure        | ⚠️ Too complex for distributed teams
VPC + bastion host        | Balances security and usability; well-established pattern  | More complex initial setup                                         | ✅ Selected

I opted for the VPC + Bastion architecture because it offers the best compromise between security and operability. The bastion host becomes the single point of entry, simplifying audit and monitoring. I still integrated WireGuard VPN on the bastion for direct private network access when needed.

1.2 NAT Gateway: Managed vs Self-Hosted

Private VMs need internet access for updates, Docker image downloads, and external API calls. I evaluated two options for the NAT Gateway:

Aspect        | AWS NAT Gateway (managed)  | Self-hosted (iptables on a VM)
--------------|----------------------------|-------------------------------------------
Monthly cost  | ~$32 + $0.045/GB transfer  | ~€4.50 (cost of a CPX11 VM)
Maintenance   | None                       | Security updates, monitoring
Control       | Limited                    | Complete (custom iptables rules, logging)
Lock-in       | Vendor-specific            | Portable across providers

Decision: I implemented a self-hosted NAT gateway on the bastion host. The main motivations are:

  • 85% savings: €4.50/month vs $32/month + transfer
  • Complete control: I can implement custom rules, detailed logging, traffic shaping
  • Portability: The solution works on any cloud provider with the same code
  • Acceptable overhead: For a technical team, maintaining iptables rules is not a problem

The trade-off is operational overhead (security patches, monitoring) and a single point of failure. The latter can be mitigated with a second bastion in an HA pair, which can be added when necessary.
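As a sketch of how this decision translates into code, the Terraform fragment below uses the official hcloud provider; the resource names and cloud-init payload are illustrative, and persistence of the iptables rules across reboots (e.g., via netfilter-persistent) is omitted for brevity:

# Minimal sketch of the self-hosted NAT gateway; assumes the management
# network and its subnets are defined elsewhere (see the sketches in 2.1).
resource "hcloud_server" "bastion" {
  name        = "bastion"
  server_type = "cpx11"
  image       = "ubuntu-22.04"
  location    = "fsn1"

  network {
    network_id = hcloud_network.tier["management"].id
    ip         = "10.0.0.2"
  }

  # cloud-init: enable IP forwarding and masquerade VPC traffic out of eth0.
  user_data = <<-EOT
    #cloud-config
    runcmd:
      - sysctl -w net.ipv4.ip_forward=1
      - iptables -t nat -A POSTROUTING -s 10.0.0.0/16 -o eth0 -j MASQUERADE
  EOT
}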

1.3 Multi-Tenant Segmentation

To support different customer tiers (shared, business, enterprise) I designed a 4-level network segmentation. The key is balancing isolation and costs: not all customers need (or can afford) dedicated infrastructure.

Implemented Segmentation Model

Tier       | Network                          | Isolation                                                                  | Use case
-----------|----------------------------------|----------------------------------------------------------------------------|---------------------------------------------
Management | 10.0.0.0/16                      | Private network for core infrastructure (Bastion, Rancher, Vault, ArgoCD)  | Centralized management
Shared     | 10.10.0.0/16                     | Kubernetes namespaces + Network Policies                                   | Standard-tier customers (limited budget)
Business   | 10.20.0.0/16 (/24 per customer)  | Dedicated Kubernetes nodes per customer                                    | Business customers (guaranteed performance)
Enterprise | 10.100+.0.0/16 (dedicated /16)   | Completely isolated network, dedicated bastion                             | Enterprise customers (compliance, auditing)

This structure allows me to offer three isolation levels with increasing costs. The shared customer pays little but shares resources, business has dedicated nodes, enterprise has an entire separate infrastructure.
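A sketch of how this tier plan can be captured in Terraform with the hcloud provider (the map keys and resource names are illustrative, not the production code):

locals {
  tier_networks = {
    management = "10.0.0.0/16"
    shared     = "10.10.0.0/16"
    business   = "10.20.0.0/16"
    # Enterprise /16s (10.100.0.0/16, 10.101.0.0/16, ...) are created
    # per customer, not from this shared map.
  }
}

resource "hcloud_network" "tier" {
  for_each = local.tier_networks
  name     = "net-${each.key}"
  ip_range = each.value
}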

2. Network Topology Design

2.1 Management Network (10.0.0.0/16)

The management network hosts the core infrastructure. I designed IP allocation with room for future growth:

Network: 10.0.0.0/16 (65,534 available hosts)

Subnet allocation:
- 10.0.0.0/24    Infrastructure core (254 hosts)
  ├─ 10.0.0.1    Gateway (reserved)
  ├─ 10.0.0.2    Bastion host (NAT + Jump + VPN)
  ├─ 10.0.0.3    Reserved (future HA bastion)
  ├─ 10.0.0.4    Rancher management cluster
  ├─ 10.0.0.5    Vault server (secrets management)
  ├─ 10.0.0.6    ArgoCD (GitOps)
  └─ 10.0.0.7-10 Reserved for future services

- 10.0.1.0/24    Rancher worker nodes
- 10.0.2.0/24    Monitoring stack (Prometheus, Grafana, Loki)
- 10.0.3.0/24    CI/CD infrastructure
- 10.0.10.0/24+  Reserved for expansion (room for ~240 subnets)

Rationale for the /16 choice: Even though I currently use only a few dozen IPs, I chose a /16 to avoid renumbering in the future. Private IPs cost nothing on Hetzner, so I can afford to be generous with the allocation.
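The allocation above maps one-to-one onto hcloud_network_subnet resources; a minimal sketch for the first two subnets (network zone and resource names assumed):

resource "hcloud_network_subnet" "mgmt_core" {
  network_id   = hcloud_network.tier["management"].id
  type         = "cloud"
  network_zone = "eu-central"
  ip_range     = "10.0.0.0/24" # infrastructure core
}

resource "hcloud_network_subnet" "mgmt_workers" {
  network_id   = hcloud_network.tier["management"].id
  type         = "cloud"
  network_zone = "eu-central"
  ip_range     = "10.0.1.0/24" # Rancher worker nodes
}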

2.2 Customer Networks

Shared Network (10.10.0.0/16)

For standard tier customers I implemented Kubernetes-level isolation:

  • Dedicated namespace per customer
  • Kubernetes Network Policies to isolate pod-to-pod traffic
  • Resource quotas to prevent noisy-neighbor effects
  • Separate service accounts and RBAC

This approach is economical (many customers on the same nodes) but requires well-configured Kubernetes. Isolation is strong but not total: all pods run on the same kernel.
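To illustrate the namespace-level isolation, here is a default-deny ingress policy per customer namespace, written with Terraform's kubernetes provider to stay consistent with the IaC approach (the namespace name is hypothetical; explicit allow rules for legitimate traffic would be layered on top):

resource "kubernetes_network_policy" "default_deny_ingress" {
  metadata {
    name      = "default-deny-ingress"
    namespace = "customer-a" # one policy per customer namespace
  }

  spec {
    pod_selector {}            # empty selector = all pods in the namespace
    policy_types = ["Ingress"] # no ingress rules listed => all ingress denied
  }
}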

Business Network (10.20.0.0/16)

Business customers get a dedicated /24 subnet (254 hosts) and dedicated Kubernetes nodes:

10.20.0.0/24     Reserved (base subnet)
10.20.1.0/24     Business Customer A (up to 254 hosts)
10.20.2.0/24     Business Customer B
10.20.3.0/24     Business Customer C
...
10.20.255.0/24   Business Customer 255

Capacity: 255 business customers (10.20.0.0/24 remains reserved)

Each customer has their own worker nodes, which guarantees performance and strengthens isolation. The cost is proportionally higher (dedicated VMs).
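The per-customer /24s can be derived from the 10.20.0.0/16 base with Terraform's cidrsubnet() function; a sketch with a hypothetical customer list:

variable "business_customers" {
  type    = list(string)
  default = ["customer-a", "customer-b", "customer-c"] # illustrative
}

resource "hcloud_network_subnet" "business" {
  count        = length(var.business_customers)
  network_id   = hcloud_network.tier["business"].id
  type         = "cloud"
  network_zone = "eu-central"
  # index 0 -> 10.20.1.0/24, index 1 -> 10.20.2.0/24, ...
  # (10.20.0.0/24 stays reserved as the base subnet)
  ip_range     = cidrsubnet("10.20.0.0/16", 8, count.index + 1)
}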

Enterprise Networks (10.100+.0.0/16)

Enterprise customers get a completely separate /16 network:

10.100.0.0/16    Enterprise Customer A (65,534 hosts)
10.101.0.0/16    Enterprise Customer B
10.102.0.0/16    Enterprise Customer C
...

Capacity: 56 enterprise customers (10.100-10.155)

Each enterprise network has its own bastion host, its own NAT gateway, and shares nothing with others. This is necessary for compliance (e.g., PCI-DSS) or for customers with specific auditing and security requirements.
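The same pattern, one level up, gives each enterprise customer a dedicated network; a sketch (the customer list is hypothetical):

variable "enterprise_customers" {
  type    = list(string)
  default = ["enterprise-a", "enterprise-b"] # illustrative
}

resource "hcloud_network" "enterprise" {
  count    = length(var.enterprise_customers)
  name     = "net-${var.enterprise_customers[count.index]}"
  # index 0 -> 10.100.0.0/16, index 1 -> 10.101.0.0/16, ...
  ip_range = "10.${100 + count.index}.0.0/16"
}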

2.3 Routing Table Design

I configured routing tables to ensure traffic always follows the intended paths. The design is based on two principles:

  1. Intra-VPC traffic stays local: Must never exit and re-enter
  2. Internet traffic always goes through NAT gateway: Centralized control

Routing Table: Management Network

Destination         Next Hop              Priority    Note
10.0.0.0/16        Local                 1           Intra-VPC (higher priority)
0.0.0.0/0          10.0.0.2 (Bastion)    2           Default route via NAT

Priority is fundamental: by longest-prefix match, the more specific route (10.0.0.0/16) takes precedence over the default (0.0.0.0/0). This ensures that a VM talking to another VM in the same VPC never transits the bastion.
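On Hetzner, traffic between hosts in the same network is local by design, so only the default route has to be declared explicitly; a sketch of the route that sends 0.0.0.0/0 to the bastion's private IP:

resource "hcloud_network_route" "default_via_bastion" {
  network_id  = hcloud_network.tier["management"].id
  destination = "0.0.0.0/0"
  gateway     = "10.0.0.2" # the bastion / NAT gateway
}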

Bastion Host Configuration

The bastion is configured as a dual-homed host (two network interfaces):

eth0 (Public interface):
  - Hetzner public IP
  - Default gateway to internet
  - Exposed to internet (SSH + WireGuard only)

eth1 (Private interface):
  - IP: 10.0.0.2
  - Connected to management VPC
  - Not reachable from internet

Kernel configuration:
  net.ipv4.ip_forward = 1

iptables configuration:
  # NAT for traffic from VPC to internet
  iptables -t nat -A POSTROUTING -s 10.0.0.0/16 -o eth0 -j MASQUERADE

  # Allow forwarding from VPC to internet
  iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
  iptables -A FORWARD -i eth0 -o eth1 -m state --state RELATED,ESTABLISHED -j ACCEPT

  # Block unsolicited connections from internet to VPC
  iptables -A FORWARD -i eth0 -o eth1 -j DROP

How NAT works: When a private VM (e.g., 10.0.0.4) wants to reach the internet (e.g., 8.8.8.8), the packet arrives at the bastion, which applies SNAT (source NAT) and replaces the source IP with its own public IP. The bastion keeps a connection-tracking table so it knows where to send replies back. The process is completely transparent to the VMs.

2.4 Firewall Rules

I implemented firewall rules based on the least privilege principle: everything is blocked by default, I only allow strictly necessary traffic. Hetzner offers cloud-level firewall (before the VM), which I combined with local iptables on each host for defense in depth.

Bastion Host Firewall

INBOUND (Hetzner Cloud Firewall):
  ✅ TCP 22 (SSH) from MY_OFFICE_IP/32
  ✅ UDP 51820 (WireGuard) from 0.0.0.0/0
  ❌ Everything else: DROP

OUTBOUND:
  ✅ Allow all (required for NAT function)

FORWARD:
  ✅ From 10.0.0.0/16 to 0.0.0.0/0 (NAT traffic)
  ✅ ESTABLISHED,RELATED connections
  ❌ From internet to 10.0.0.0/16: DROP

Note on SSH access: I restricted SSH to my office IP where possible. For remote access I use the WireGuard VPN, which provides strong authentication via cryptographic keys.
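The bastion's inbound policy translates directly into an hcloud_firewall resource; a sketch (the office IP below is a documentation placeholder):

resource "hcloud_firewall" "bastion" {
  name = "fw-bastion"

  rule {
    direction  = "in"
    protocol   = "tcp"
    port       = "22"
    source_ips = ["203.0.113.10/32"] # MY_OFFICE_IP placeholder
  }

  rule {
    direction  = "in"
    protocol   = "udp"
    port       = "51820"
    source_ips = ["0.0.0.0/0", "::/0"] # WireGuard from anywhere
  }

  # No other inbound rules: Hetzner Cloud Firewalls drop unmatched
  # inbound traffic by default.
}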

Private VMs Firewall (Rancher, Vault, etc.)

INBOUND:
  ✅ TCP 22 (SSH) from 10.0.0.2/32 (bastion only)
  ✅ TCP 6443 (K8s API) from 10.0.0.0/16 (management network)
  ✅ TCP 443 (HTTPS) from 10.0.0.2/32 (via reverse proxy)
  ✅ All from 10.0.0.0/16 (intra-VPC communication)
  ❌ Everything else: DROP

OUTBOUND:
  ✅ To 10.0.0.0/16 (intra-VPC)
  ✅ To 0.0.0.0/0 (internet via NAT)

No private VM accepts direct connections from the internet. The only way to reach them is:

  1. SSH jump through bastion: ssh -J bastion private-vm
  2. WireGuard VPN connected to private network
  3. Reverse proxy on bastion for web services (Nginx Proxy Manager)

3. Security Considerations

3.1 Threat Model

I analyzed the main attack vectors for this architecture and implemented specific mitigations:

Threat                        | Likelihood | Impact   | Mitigation
------------------------------|------------|----------|---------------------------------------------------------------
SSH brute force on bastion    | High       | Medium   | fail2ban, key-only auth, IP whitelisting, rate limiting
Bastion compromise            | Low        | Critical | Hardening, monitoring, session recording, 2FA for sudo
Lateral movement post-breach  | Medium     | High     | Network segmentation, K8s Network Policies, strict RBAC
DDoS on bastion               | Medium     | High     | Rate limiting, connection limits, Cloudflare for web services
Data exfiltration             | Low        | Critical | Egress filtering, anomaly detection, audit logging

3.2 Defense in Depth

I implemented security across 5 layers. Compromise of a single layer should not compromise the entire system:

Layer 1: Network

  • VPC isolation between management and customer networks
  • Cloud firewall (Hetzner) + local iptables on each host
  • NAT gateway for outbound traffic control
  • Single point of entry (bastion) that is easy to monitor

Layer 2: Host

  • SSH hardening: disable password auth, custom port, key-only
  • Automatic security updates (unattended-upgrades on Ubuntu)
  • fail2ban for intrusion prevention
  • Minimal attack surface: disable unnecessary services

Layer 3: Application (Kubernetes)

  • Network Policies for pod-to-pod traffic control
  • Pod Security Standards (no privileged containers for workloads)
  • Granular RBAC for service accounts
  • Image scanning (Trivy) in CI/CD pipeline

Layer 4: Data

  • Encryption at rest (LUKS for critical volumes)
  • Encryption in transit (TLS 1.3 everywhere)
  • Secrets management with HashiCorp Vault (no secrets in code)
  • Database encryption for PII data

Layer 5: Operations

  • Centralized logging (Loki + Promtail)
  • Monitoring and alerting (Prometheus + Grafana + Alertmanager)
  • Daily automated backups with retention policy
  • Documented and tested incident response plan

3.3 Known Limitations

It's important to be honest about the limits of this architecture. It does not protect against:

  • Complete bastion compromise: If an attacker gets root on the bastion, they potentially have access to the entire private network. Partial mitigation: rigorous monitoring, session recording, 2FA for critical operations.
  • Insider threats: An administrator with legitimate access can cause damage. Requires separation of duties and audit logging.
  • Application-layer attacks: SQL injection, XSS, etc. are not mitigated by network architecture. Requires secure coding and WAF.
  • Supply chain attacks: Compromised dependencies or Docker images. Requires image scanning, SBOM, and signature verification.

4. Cost Analysis

4.1 Infrastructure Base Cost

I did a comparative cost analysis to validate the choice of Hetzner vs more expensive alternatives:

Component                          | Hetzner                      | AWS equivalent               | Savings
-----------------------------------|------------------------------|------------------------------|--------
Bastion (CPX11: 2 vCPU, 2 GB RAM)  | €4.51/month                  | t3.small: ~$15/month         | ~70%
NAT gateway                        | €0 (self-hosted on bastion)  | ~$32/month + $0.045/GB       | ~100%
Rancher (CPX21: 3 vCPU, 4 GB)      | €8.21/month                  | t3.medium: ~$30/month        | ~73%
Vault (CPX11)                      | €4.51/month                  | t3.small: ~$15/month         | ~70%
ArgoCD (CPX11)                     | €4.51/month                  | t3.small: ~$15/month         | ~70%
Traffic (20 TB included)           | €0                           | ~$50/month (1 TB estimated)  | 100%
TOTAL                              | €21.74/month                 | ~$157/month                  | ~85%

Annual savings: €1,440 (~$1,600) for base infrastructure alone. At scale with N customer worker nodes, savings become even more significant.

4.2 Scaling Economics

For each business customer with dedicated nodes:

  • CPX31 (4 vCPU, 8GB RAM): €14.28/month → recommended for Magento
  • CPX41 (8 vCPU, 16GB RAM): €26.64/month → for heavy workloads

Example: 10 business customers with 1x CPX31 = 10 × €14.28 = €142.80/month additional. On AWS the same setup would cost ~$500-600/month.

Conclusions

The architecture I designed allowed me to build a secure, scalable, and cost-effective multi-tenant infrastructure on Hetzner Cloud. Key decisions were:

  • 4-level network segmentation for different isolation tiers
  • Self-hosted NAT gateway to reduce costs by 85%
  • Bastion host as single point of entry with integrated WireGuard VPN
  • Defense in depth across 5 layers
  • Infrastructure as Code for reproducibility

In the next article I'll show the practical implementation with Terraform: how to transform this design into reproducible code, with automatic Ansible inventory generation and complete bastion host configuration via cloud-init.

Resources

Next article: Infrastructure as Code with Terraform on Hetzner Cloud