Multi-Tenant Cloud Infrastructure Architecture: Design and Technical Decisions
Documentation of the network architecture I designed for a multi-tenant infrastructure on Hetzner Cloud, focusing on design decisions and the technical motivations behind each choice.
Project Context and Goals
I designed and implemented a cloud infrastructure to manage multiple instances of e-commerce applications (primarily Magento) in a multi-tenant model. The main requirements were:
- Tenant isolation: Each customer must be isolated from others, with different isolation levels based on tier (shared, business, enterprise)
- Scalability: The architecture must support growth from a few customers to hundreds without re-design
- Security by design: Least privilege principle and defense in depth applied at all levels
- Cost efficiency: Limited budget, need to optimize costs without compromising security
- Manageable operational complexity: Small team, need a maintainable architecture
I chose Hetzner Cloud as the provider for its cost/performance ratio and European datacenter location (GDPR compliance). The entire infrastructure is managed as code with Terraform and Ansible.
1. Fundamental Architectural Decisions
1.1 VPC-Based Architecture vs Flat Network
The first fundamental decision was adopting a Virtual Private Cloud (VPC) based architecture instead of a flat network with all servers publicly exposed. I evaluated three approaches:
| Approach | Advantages | Disadvantages | Evaluation |
|---|---|---|---|
| Flat Network (public IPs) | Simple setup, no network overhead | Maximum attack surface, difficult to manage firewalls on N servers | ❌ Rejected for security reasons |
| VPN Always-On | Maximum security, no public servers | Every admin must configure VPN, single point of failure | ⚠️ Too complex for distributed teams |
| VPC + Bastion Host | Security/usability balance, consolidated pattern | More complex initial setup | ✅ Selected |
I opted for the VPC + Bastion architecture because it offers the best compromise between security and operability. The bastion host becomes the single point of entry, simplifying audit and monitoring. I still integrated WireGuard VPN on the bastion for direct private network access when needed.
1.2 NAT Gateway: Managed vs Self-Hosted
Private VMs need internet access for updates, Docker image downloads, and external API calls. Hetzner offers no managed NAT service, so I used AWS pricing as the managed reference when evaluating two options for the NAT gateway:
| Aspect | AWS NAT Gateway (Managed) | Self-Hosted (iptables on VM) |
|---|---|---|
| Monthly cost | ~$32 + $0.045/GB transfer | ~€4.50 (CPX11 VM cost) |
| Maintenance | Zero | Security updates, monitoring |
| Control | Limited | Complete (custom iptables rules, logging) |
| Lock-in | Vendor-specific | Portable across providers |
Decision: I implemented a self-hosted NAT gateway on the bastion host. The main motivations are:
- 85% savings: €4.50/month vs $32/month + transfer
- Complete control: I can implement custom rules, detailed logging, traffic shaping
- Portability: The solution works on any cloud provider with the same code
- Acceptable overhead: For a technical team, maintaining iptables rules is not a problem
The trade-off is operational overhead (security patches, monitoring) and a single point of failure. The latter can be mitigated with a second bastion in HA, which can be configured when necessary.
1.3 Multi-Tenant Segmentation
To support different customer tiers (shared, business, enterprise) I designed a 4-level network segmentation. The key is balancing isolation and costs: not all customers need (or can afford) dedicated infrastructure.
Implemented Segmentation Model
| Tier | Network | Isolation | Use Case |
|---|---|---|---|
| Management | 10.0.0.0/16 | Private network for core infrastructure (Bastion, Rancher, Vault, ArgoCD) | Centralized management |
| Shared | 10.10.0.0/16 | Kubernetes namespaces + Network Policies | Standard tier customers (limited budget) |
| Business | 10.20.0.0/16 (one /24 subnet per customer) | Dedicated Kubernetes nodes per customer | Business customers (guaranteed performance) |
| Enterprise | 10.100+.0.0/16 (dedicated /16 network) | Completely isolated network, dedicated bastion | Enterprise customers (compliance, auditing) |
This structure allows me to offer three isolation levels with increasing costs. The shared customer pays little but shares resources, business has dedicated nodes, enterprise has an entire separate infrastructure.
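To make the model concrete, here is a minimal sketch of how the tiered networks can be created with the hcloud CLI (network names are illustrative; the Terraform equivalent is the subject of the next article):

```bash
# Management network with its core-infrastructure subnet
hcloud network create --name mgmt --ip-range 10.0.0.0/16
hcloud network add-subnet mgmt --type cloud --network-zone eu-central --ip-range 10.0.0.0/24

# Shared tier: one network, isolation happens inside Kubernetes
hcloud network create --name shared --ip-range 10.10.0.0/16

# Business tier: one /24 subnet per customer carved out of 10.20.0.0/16
hcloud network create --name business --ip-range 10.20.0.0/16
hcloud network add-subnet business --type cloud --network-zone eu-central --ip-range 10.20.1.0/24

# Enterprise tier: a fully dedicated /16 per customer
hcloud network create --name enterprise-a --ip-range 10.100.0.0/16
```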
2. Network Topology Design
2.1 Management Network (10.0.0.0/16)
The management network hosts the core infrastructure. I designed IP allocation with room for future growth:
Network: 10.0.0.0/16 (65,534 available hosts)
Subnet allocation:
- 10.0.0.0/24 Infrastructure core (254 hosts)
├─ 10.0.0.1 Gateway (reserved)
├─ 10.0.0.2 Bastion host (NAT + Jump + VPN)
├─ 10.0.0.3 Reserved (future HA bastion)
├─ 10.0.0.4 Rancher management cluster
├─ 10.0.0.5 Vault server (secrets management)
├─ 10.0.0.6 ArgoCD (GitOps)
└─ 10.0.0.7-10 Reserved for future services
- 10.0.1.0/24 Rancher worker nodes
- 10.0.2.0/24 Monitoring stack (Prometheus, Grafana, Loki)
- 10.0.3.0/24 CI/CD infrastructure
- 10.0.10.0/24+ Reserved for expansion (room for ~240 subnets)
Rationale for /16 choice: Even though I currently use only a few dozen IPs, I chose a /16 to avoid re-numbering in the future. The cost of private IPs on Hetzner is zero, so I can afford to be generous with allocation.
2.2 Customer Networks
Shared Network (10.10.0.0/16)
For standard tier customers I implemented Kubernetes-level isolation:
- Dedicated namespace per customer
- Kubernetes Network Policies to isolate pod-to-pod traffic
- Resource quotas to avoid noisy neighbor
- Separate service accounts and RBAC
This approach is economical (many customers on the same nodes) but requires well-configured Kubernetes. Isolation is strong but not total: all pods run on the same kernel.
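As a minimal sketch of the Network Policies mentioned above, this default-deny policy (the tenant namespace name is hypothetical) blocks all pod traffic that isn't explicitly allowed by later, more specific policies:

```bash
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: tenant-a        # hypothetical tenant namespace
spec:
  podSelector: {}            # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
EOF
```

In practice this is paired with allow policies for DNS and for the tenant's own services, so that only the intended flows survive.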
Business Network (10.20.0.0/16)
Business customers get a dedicated /24 subnet (254 hosts) and dedicated Kubernetes nodes:
10.20.0.0/24 Reserved (base subnet)
10.20.1.0/24 Business Customer A (up to 254 hosts)
10.20.2.0/24 Business Customer B
10.20.3.0/24 Business Customer C
...
10.20.255.0/24 Business Customer 255
Capacity: 255 business customers (10.20.1.0/24 through 10.20.255.0/24; 10.20.0.0/24 is reserved)
Each customer gets their own worker nodes, which means guaranteed performance and better isolation. The cost is proportionally higher (dedicated VMs).
Enterprise Networks (10.100+.0.0/16)
Enterprise customers get a completely separate /16 network:
10.100.0.0/16 Enterprise Customer A (65,534 hosts)
10.101.0.0/16 Enterprise Customer B
10.102.0.0/16 Enterprise Customer C
...
Capacity: 56 enterprise customers (10.100-10.155)
Each enterprise network has its own bastion host, its own NAT gateway, and shares nothing with others. This is necessary for compliance (e.g., PCI-DSS) or for customers with specific auditing and security requirements.
2.3 Routing Table Design
I configured routing tables to ensure traffic always follows the intended paths. The design is based on two principles:
- Intra-VPC traffic stays local: Must never exit and re-enter
- Internet traffic always goes through NAT gateway: Centralized control
Routing Table: Management Network
| Destination | Next Hop | Priority | Note |
|---|---|---|---|
| 10.0.0.0/16 | Local | 1 | Intra-VPC (higher priority) |
| 0.0.0.0/0 | 10.0.0.2 (Bastion) | 2 | Default route via NAT |
Route ordering is fundamental: by longest-prefix match, the more specific route (10.0.0.0/16) always wins over the default (0.0.0.0/0). This ensures that a VM talking to another VM in the same VPC never goes through the bastion.
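In practice this maps to one Hetzner network route plus a default route on each VM; a sketch with the hcloud CLI and iproute2 (the interface name varies by image):

```bash
# Tell the Hetzner network to send internet-bound traffic to the bastion
hcloud network add-route mgmt --destination 0.0.0.0/0 --gateway 10.0.0.2

# On each private VM: point the default route at the network gateway
# (10.0.0.1, which then applies the route above); intra-VPC traffic
# still matches the more specific 10.0.0.0/16 route and stays local.
ip route replace default via 10.0.0.1 dev enp7s0   # adjust interface name
```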
Bastion Host Configuration
The bastion is configured as a dual-homed host (two network interfaces):
eth0 (Public interface):
- Hetzner public IP
- Default gateway to internet
- Exposed to internet (SSH + WireGuard only)
eth1 (Private interface):
- IP: 10.0.0.2
- Connected to management VPC
- Not reachable from internet
Kernel configuration (persisted in /etc/sysctl.d/ so it survives reboots):
net.ipv4.ip_forward = 1
iptables configuration (persist the rules, e.g. with iptables-persistent, so they survive reboots):
# NAT for traffic from VPC to internet: rewrite the source IP to the bastion's public IP
iptables -t nat -A POSTROUTING -s 10.0.0.0/16 -o eth0 -j MASQUERADE
# Allow forwarding from VPC to internet
iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
# Allow return traffic for connections initiated from inside the VPC
iptables -A FORWARD -i eth0 -o eth1 -m state --state RELATED,ESTABLISHED -j ACCEPT
# Block unsolicited connections from internet to VPC (must come after the ESTABLISHED rule)
iptables -A FORWARD -i eth0 -o eth1 -j DROP
How NAT works: When a private VM (e.g., 10.0.0.4) wants to reach the internet (e.g., 8.8.8.8), the packet arrives at the bastion, which applies SNAT (Source NAT) and replaces the source IP with its own public IP. The kernel's connection-tracking table records each flow so that replies can be routed back to the originating VM. The whole process is transparent to the VMs.
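A couple of ad-hoc checks I find useful to verify the NAT path (the conntrack CLI comes from the conntrack package and may need to be installed first):

```bash
# From a private VM: the visible source IP should be the bastion's
# public IP, not the VM's 10.0.0.x address.
curl -s https://ifconfig.me; echo

# On the bastion: list tracked flows that originate inside the VPC.
conntrack -L 2>/dev/null | grep '10\.0\.0\.' | head -n 5
```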
2.4 Firewall Rules
I implemented firewall rules based on the least privilege principle: everything is blocked by default, I only allow strictly necessary traffic. Hetzner offers cloud-level firewall (before the VM), which I combined with local iptables on each host for defense in depth.
Bastion Host Firewall
INBOUND (Hetzner Cloud Firewall):
✅ TCP 22 (SSH) from MY_OFFICE_IP/32
✅ UDP 51820 (WireGuard) from 0.0.0.0/0
❌ Everything else: DROP
OUTBOUND:
✅ Allow all (required for NAT function)
FORWARD:
✅ From 10.0.0.0/16 to 0.0.0.0/0 (NAT traffic)
✅ ESTABLISHED,RELATED connections
❌ From internet to 10.0.0.0/16: DROP
Note on SSH access: I restricted SSH to my office IP only, where possible. For remote access I use the WireGuard VPN, which offers strong authentication via cryptographic keys.
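A minimal sketch of the WireGuard server side on the bastion (the keys are placeholders to be generated with wg genkey, and the 10.0.200.0/24 VPN subnet is a hypothetical choice):

```bash
cat > /etc/wireguard/wg0.conf <<'EOF'
[Interface]
Address = 10.0.200.1/24          # VPN subnet for admins (hypothetical)
ListenPort = 51820               # matches the UDP rule in the cloud firewall
PrivateKey = <server-private-key>

[Peer]                           # one block per administrator
PublicKey = <admin-public-key>
AllowedIPs = 10.0.200.2/32       # the admin's VPN address
EOF

systemctl enable --now wg-quick@wg0
```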
Private VMs Firewall (Rancher, Vault, etc.)
INBOUND:
✅ TCP 22 (SSH) from 10.0.0.2/32 (bastion only)
✅ TCP 6443 (K8s API) from 10.0.0.0/16 (management network)
✅ TCP 443 (HTTPS) from 10.0.0.2/32 (via reverse proxy)
✅ All other traffic from 10.0.0.0/16 (intra-VPC communication)
❌ Everything else: DROP
OUTBOUND:
✅ To 10.0.0.0/16 (intra-VPC)
✅ To 0.0.0.0/0 (internet via NAT)
No private VM accepts direct connections from the internet. The only ways to reach them are:
- SSH jump through the bastion: `ssh -J bastion private-vm`
- WireGuard VPN connected to the private network
- Reverse proxy on the bastion for web services (Nginx Proxy Manager)
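To make the jump transparent, a ~/.ssh/config sketch (host alias, user, and the placeholder public IP are hypothetical):

```bash
cat >> ~/.ssh/config <<'EOF'
Host bastion
    HostName 203.0.113.10        # bastion public IP (placeholder)
    User admin
    IdentityFile ~/.ssh/id_ed25519

Host 10.0.*
    ProxyJump bastion            # every private VM is reached via the bastion
    User admin
EOF

ssh 10.0.0.5    # lands on the Vault server, tunnelled through the bastion
```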
3. Security Considerations
3.1 Threat Model
I analyzed the main attack vectors for this architecture and implemented specific mitigations:
| Threat | Likelihood | Impact | Mitigation |
|---|---|---|---|
| SSH brute-force on bastion | High | Medium | fail2ban (sketch after the table), key-only auth, IP whitelisting, rate limiting |
| Bastion compromise | Low | Critical | Hardening, monitoring, session recording, 2FA for sudo |
| Lateral movement post-breach | Medium | High | Network segmentation, K8s Network Policies, strict RBAC |
| DDoS on bastion | Medium | High | Rate limiting, connection limits, Cloudflare for web services |
| Data exfiltration | Low | Critical | Egress filtering, anomaly detection, audit logging |
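For the SSH brute-force row, a sketch of the fail2ban jail (the thresholds are illustrative, not the exact values of this deployment):

```bash
cat > /etc/fail2ban/jail.d/sshd.local <<'EOF'
[sshd]
enabled  = true
maxretry = 3       # ban after 3 failed logins
findtime = 10m     # ...within a 10-minute window
bantime  = 1h
EOF

systemctl restart fail2ban
```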
3.2 Defense in Depth
I implemented security across 5 layers, so that the breach of a single layer does not expose the entire system:
Layer 1: Network
- VPC isolation between management and customer networks
- Cloud firewall (Hetzner) + local iptables on each host
- NAT gateway for outbound traffic control
- Single point of entry (bastion) easily monitorable
Layer 2: Host
- SSH hardening: disable password auth, custom port, key-only (see the sketch after this list)
- Automatic security updates (unattended-upgrades on Ubuntu)
- fail2ban for intrusion prevention
- Minimal attack surface: disable unnecessary services
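A sketch of the SSH hardening above as a drop-in file (assumes an sshd_config that includes /etc/ssh/sshd_config.d, the Ubuntu default):

```bash
cat > /etc/ssh/sshd_config.d/99-hardening.conf <<'EOF'
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
MaxAuthTries 3
EOF

sshd -t && systemctl reload ssh   # validate the config before reloading
```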
Layer 3: Application (Kubernetes)
- Network Policies for pod-to-pod traffic control
- Pod Security Standards (no privileged containers for workloads)
- Granular RBAC for service accounts
- Image scanning (Trivy) in CI/CD pipeline
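The image-scanning step reduces to a single gate in the pipeline; an illustrative invocation (the image name is a placeholder):

```bash
# Fail the CI job when HIGH or CRITICAL vulnerabilities are found
trivy image --severity HIGH,CRITICAL --exit-code 1 registry.example.com/shop/magento:latest
```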
Layer 4: Data
- Encryption at rest (LUKS for critical volumes)
- Encryption in transit (TLS 1.3 everywhere)
- Secrets management with HashiCorp Vault (no secrets in code)
- Database encryption for PII data
Layer 5: Operations
- Centralized logging (Loki + Promtail)
- Monitoring and alerting (Prometheus + Grafana + Alertmanager)
- Daily automated backups with retention policy (illustrative sketch after this list)
- Documented and tested incident response plan
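The article doesn't name the backup tool, so purely as an illustration, here is what a daily job with retention could look like assuming restic (repository path, backed-up paths, and schedule are placeholders):

```bash
# /etc/cron.d/backup — daily at 03:00 (placeholder schedule)
# 0 3 * * * root /usr/local/bin/nightly-backup.sh

# nightly-backup.sh: back up and enforce the retention policy
# (assumes RESTIC_PASSWORD_FILE is set for non-interactive runs)
restic -r /mnt/backup-volume backup /var/lib/vault /etc
restic -r /mnt/backup-volume forget --keep-daily 7 --keep-weekly 4 --prune
```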
3.3 Known Limitations
It's important to be honest about the limits of this architecture. It does not protect against:
- Complete bastion compromise: If an attacker gets root on the bastion, they potentially have access to the entire private network. Partial mitigation: rigorous monitoring, session recording, 2FA for critical operations.
- Insider threats: An administrator with legitimate access can cause damage. Requires separation of duties and audit logging.
- Application-layer attacks: SQL injection, XSS, etc. are not mitigated by network architecture. Requires secure coding and WAF.
- Supply chain attacks: Compromised dependencies or Docker images. Requires image scanning, SBOM, and signature verification.
4. Cost Analysis
4.1 Infrastructure Base Cost
I did a comparative cost analysis to validate the choice of Hetzner vs more expensive alternatives:
| Component | Hetzner | AWS Equivalent | Savings |
|---|---|---|---|
| Bastion (CPX11: 2vCPU, 2GB RAM) | €4.51/month | t3.small: ~$15/month | ~70% |
| NAT Gateway | €0 (self-hosted) | ~$32/month + $0.045/GB | ~100% |
| Rancher (CPX21: 3vCPU, 4GB) | €8.21/month | t3.medium: ~$30/month | ~73% |
| Vault (CPX11) | €4.51/month | t3.small: ~$15/month | ~70% |
| ArgoCD (CPX11) | €4.51/month | t3.small: ~$15/month | ~70% |
| Traffic (20TB included) | €0 | ~$50/month (1TB estimated) | 100% |
| TOTAL | €21.74/month | ~$157/month | 85% |
Annual savings: €1,440 (~$1,600) for base infrastructure alone. At scale with N customer worker nodes, savings become even more significant.
4.2 Scaling Economics
For each business customer with dedicated nodes:
- CPX31 (4 vCPU, 8GB RAM): €14.28/month → recommended for Magento
- CPX41 (8 vCPU, 16GB RAM): €26.64/month → for heavy workloads
Example: 10 business customers with 1x CPX31 = 10 × €14.28 = €142.80/month additional. On AWS the same setup would cost ~$500-600/month.
Conclusions
The architecture I designed allowed me to build a secure, scalable, and cost-effective multi-tenant infrastructure on Hetzner Cloud. Key decisions were:
- 4-level network segmentation for different isolation tiers
- Self-hosted NAT gateway to reduce costs by 85%
- Bastion host as single point of entry with integrated WireGuard VPN
- Defense in depth across 5 layers
- Infrastructure as Code for reproducibility
In the next article I'll show the practical implementation with Terraform: how to transform this design into reproducible code, with automatic Ansible inventory generation and complete bastion host configuration via cloud-init.
Resources
Next article: Infrastructure as Code with Terraform on Hetzner Cloud