20 changes: 20 additions & 0 deletions .github/workflows/egress-test.yaml.yml
Original file line number Diff line number Diff line change
@@ -50,3 +50,23 @@ jobs:
run: |
chmod +x tests/smoke-nft.sh
./tests/smoke-nft.sh
- name: Run dynamic ip test
working-directory: components/egress
run: |
chmod +x tests/smoke-dynamic-ip.sh
./tests/smoke-dynamic-ip.sh
bench:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4

- name: Run bench test
working-directory: components/egress
run: |
chmod +x tests/bench-dns-nft.sh
./tests/bench-dns-nft.sh
env:
BENCH_SAMPLE_SIZE: "20"
30 changes: 26 additions & 4 deletions components/egress/README.md
@@ -10,6 +10,7 @@ The **Egress Sidecar** is a core component of OpenSandbox that provides **FQDN-b
- **FQDN-based Allowlist**: Control outbound traffic by domain name (e.g., `api.github.com`).
- **Wildcard Support**: Allow subdomains using wildcards (e.g., `*.pypi.org`).
- **Transparent Interception**: Uses transparent DNS proxying; no application configuration required.
- **Dynamic DNS (dns+nft mode)**: When a domain is allowed and the proxy resolves it, the resolved A/AAAA IPs are added to nftables with TTL so that default-deny + domain-allow is enforced at the network layer.
- **Privilege Isolation**: Requires `CAP_NET_ADMIN` only for the sidecar; the application container runs unprivileged.
- **Graceful Degradation**: If `CAP_NET_ADMIN` is missing, it warns and disables enforcement instead of crashing.
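
The allowlist semantics above (exact names plus `*.` wildcards) can be illustrated with a minimal sketch. The function names `matchDomain` and `allowed` are hypothetical and not taken from `pkg/dnsproxy`; the sketch also assumes that `*.pypi.org` covers subdomains but not `pypi.org` itself, which the README does not state explicitly.

```go
package main

import (
	"fmt"
	"strings"
)

// matchDomain reports whether name is covered by pattern.
// Assumption: "*.example.org" matches any subdomain of example.org
// but not example.org itself; the sidecar's real rule may differ.
func matchDomain(pattern, name string) bool {
	// Normalize case and the trailing dot used by fully qualified DNS names.
	pattern = strings.ToLower(strings.TrimSuffix(pattern, "."))
	name = strings.ToLower(strings.TrimSuffix(name, "."))
	if strings.HasPrefix(pattern, "*.") {
		suffix := pattern[1:] // e.g. ".pypi.org"
		return len(name) > len(suffix) && strings.HasSuffix(name, suffix)
	}
	return pattern == name
}

// allowed reports whether any pattern in the allowlist matches name.
func allowed(allowlist []string, name string) bool {
	for _, p := range allowlist {
		if matchDomain(p, name) {
			return true
		}
	}
	return false
}

func main() {
	allowlist := []string{"api.github.com", "*.pypi.org"}
	for _, q := range []string{"api.github.com.", "files.pypi.org", "pypi.org", "evil.example"} {
		fmt.Printf("%-18s allowed=%v\n", q, allowed(allowlist, q))
	}
}
```

A query that does not match would be answered with `NXDOMAIN` by the DNS proxy layer described in the architecture section.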

@@ -23,8 +24,9 @@ The egress control is implemented as a **Sidecar** that shares the network names
- Filters queries based on the allowlist.
- Returns `NXDOMAIN` for denied domains.

2. **Network Filter (Layer 2)** (Roadmap):
- Will use `nftables` to enforce IP-level restrictions based on resolved domains.
2. **Network Filter (Layer 2)** (when `OPENSANDBOX_EGRESS_MODE=dns+nft`):
- Uses `nftables` to enforce IP-level allow/deny. Resolved IPs for allowed domains are added to dynamic allow sets with TTL (dynamic DNS).
- At startup, the sidecar whitelists **127.0.0.1** (redirect target for the proxy) and **nameserver IPs** from `/etc/resolv.conf` so DNS resolution and proxy upstream work (including private DNS). Nameserver count is capped and invalid IPs are filtered; see [Configuration](#configuration).
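
The "allow set with TTL" behavior described above can be modeled in memory as follows. This is only a sketch of the semantics (an entry is allowed until its TTL elapses and is refreshed on re-resolution); it does not talk to nftables, and the type `ttlSet` and its methods are hypothetical names, not part of `pkg/nftables`.

```go
package main

import (
	"fmt"
	"net/netip"
	"sync"
	"time"
)

// ttlSet is an in-memory model of a set whose elements expire.
type ttlSet struct {
	mu      sync.Mutex
	expires map[netip.Addr]time.Time
}

func newTTLSet() *ttlSet {
	return &ttlSet{expires: make(map[netip.Addr]time.Time)}
}

// add inserts or refreshes ip so that it stays allowed until now+ttl.
func (s *ttlSet) add(ip netip.Addr, ttl time.Duration, now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.expires[ip] = now.Add(ttl)
}

// allowed reports whether ip is present and not yet expired at time now.
// Expired entries are dropped lazily on lookup.
func (s *ttlSet) allowed(ip netip.Addr, now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	exp, ok := s.expires[ip]
	if !ok {
		return false
	}
	if !now.Before(exp) {
		delete(s.expires, ip)
		return false
	}
	return true
}

// demo adds one resolved IP with a 30s TTL and checks it before and after expiry.
func demo() (fresh, expired bool) {
	set := newTTLSet()
	ip := netip.MustParseAddr("192.0.2.10") // documentation address
	t0 := time.Unix(0, 0)
	set.add(ip, 30*time.Second, t0)
	return set.allowed(ip, t0.Add(10*time.Second)), set.allowed(ip, t0.Add(31*time.Second))
}

func main() {
	fresh, expired := demo()
	fmt.Println("allowed at +10s:", fresh)
	fmt.Println("allowed at +31s:", expired)
}
```

Passing `now` explicitly keeps the model deterministic; in the real sidecar the kernel would presumably handle expiry of set elements.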

## Requirements

@@ -43,6 +45,11 @@ The egress control is implemented as a **Sidecar** that shares the network names
- Mode (`OPENSANDBOX_EGRESS_MODE`, default `dns`):
- `dns`: DNS proxy only, no nftables (IP/CIDR rules have no effect at L2).
- `dns+nft`: enable nftables; if nft apply fails, fallback to `dns`. IP/CIDR enforcement and DoH/DoT blocking require this mode.
- **DNS and nft mode (nameserver whitelist)**
In `dns+nft` mode, the sidecar automatically allows:
- **127.0.0.1** — so packets redirected by iptables to the proxy (127.0.0.1:15353) are accepted by nft.
- **Nameserver IPs** from `/etc/resolv.conf` — so client DNS and proxy upstream work (e.g. private DNS).
Nameserver IPs are validated (unspecified and loopback addresses are skipped) and capped. Use `OPENSANDBOX_EGRESS_MAX_NS` (default `3`; `0` = no cap; `1`–`10` = cap at that value; values above `10` are clamped to `10`). See [SECURITY-RISKS.md](SECURITY-RISKS.md) for the trust model and scope of this whitelist.
- DoH/DoT blocking:
- DoT (tcp/udp 853) blocked by default.
- Optional DoH over 443: `OPENSANDBOX_EGRESS_BLOCK_DOH_443=true`. If enabled without blocklist, all 443 is dropped.
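
The mode selection described above (`dns` by default, `dns+nft` falling back to `dns` when the nft apply fails) can be sketched as below. The helper names `parseModeValue` and `effectiveMode` and the error value are hypothetical; treating unknown values as `dns` is an assumption, since the README only states the default.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const (
	modeDNS    = "dns"
	modeDNSNft = "dns+nft"
)

// errNftDenied is a stand-in for whatever error a failed nft apply returns.
var errNftDenied = errors.New("nft: operation not permitted")

// parseModeValue normalizes the raw OPENSANDBOX_EGRESS_MODE value.
// Assumption: empty or unrecognized values select the default, dns.
func parseModeValue(raw string) string {
	if strings.ToLower(strings.TrimSpace(raw)) == modeDNSNft {
		return modeDNSNft
	}
	return modeDNS
}

// effectiveMode returns the mode actually enforced: dns+nft only when it was
// requested and applyNft succeeds; otherwise the sidecar stays in dns mode.
func effectiveMode(requested string, applyNft func() error) string {
	if requested != modeDNSNft {
		return modeDNS
	}
	if err := applyNft(); err != nil {
		return modeDNS
	}
	return modeDNSNft
}

func main() {
	ok := func() error { return nil }
	fail := func() error { return errNftDenied }
	fmt.Println(effectiveMode(parseModeValue("dns+nft"), ok))
	fmt.Println(effectiveMode(parseModeValue("dns+nft"), fail))
	fmt.Println(effectiveMode(parseModeValue(""), ok))
}
```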
@@ -139,15 +146,30 @@ To test the sidecar with a sandbox application:
- **Key Packages**:
- `pkg/dnsproxy`: DNS server and policy matching logic.
- `pkg/iptables`: `iptables` rule management.
- `pkg/nftables`: nftables static/dynamic rules and DNS-resolved IP sets.
- `pkg/policy`: Policy parsing and definition.
- **Main (egress)**:
- `nameserver.go`: Builds the list of IPs to whitelist for DNS in nft mode (127.0.0.1 + validated/capped nameservers from resolv.conf).
```bash
# Run tests
go test ./...
```
### E2E benchmark: dns vs dns+nft (sync dynamic IP write)
An end-to-end benchmark compares **dns** (pass-through, no nft write) and **dns+nft** (sync `AddResolvedIPs` before each DNS reply) under real conditions: sidecar in Docker, iptables redirect, real DNS + HTTPS from a client container.
```bash
./tests/bench-dns-nft.sh
```
More details in [docs/benchmark.md](docs/benchmark.md).
## Troubleshooting
- **"iptables setup failed"**: Ensure the sidecar container has `--cap-add=NET_ADMIN`.
- **DNS resolution fails for all domains**: Check if the upstream DNS (from `/etc/resolv.conf`) is reachable.
- **Traffic not blocked**: If applying nftables fails, the sidecar falls back to DNS-only; check the logs, `nft list table inet opensandbox`, and the `CAP_NET_ADMIN` capability.
- **DNS resolution fails for all domains**:
- Check if the upstream DNS (from `/etc/resolv.conf`) is reachable.
- In `dns+nft` mode, the sidecar whitelists nameserver IPs from resolv.conf at startup; check the logs for `[dns] whitelisting proxy listen + N nameserver(s)` and ensure `/etc/resolv.conf` is readable and contains valid, reachable nameservers. The proxy prefers the first non-loopback nameserver from resolv.conf; if only a loopback nameserver exists (e.g. Docker's 127.0.0.11), it is used (proxy upstream traffic bypasses the redirect). The proxy falls back to 8.8.8.8 only when resolv.conf is empty or unreadable.
- **Traffic not blocked**: If nftables apply fails, the sidecar falls back to dns; check logs, `nft list table inet opensandbox`, and `CAP_NET_ADMIN`.
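
The upstream selection order described in the troubleshooting notes (first non-loopback nameserver, else a loopback one, else 8.8.8.8) can be sketched like this. `pickUpstream` is a hypothetical helper, not the proxy's actual function, and skipping unparsable entries is an assumption.

```go
package main

import (
	"fmt"
	"net/netip"
)

// fallbackUpstream is used only when no usable nameserver was found.
const fallbackUpstream = "8.8.8.8"

// pickUpstream chooses the proxy upstream from resolv.conf nameserver entries:
// the first non-loopback address wins, then the first loopback address,
// then the fallback.
func pickUpstream(nameservers []string) string {
	firstLoopback := ""
	for _, raw := range nameservers {
		addr, err := netip.ParseAddr(raw)
		if err != nil {
			continue // assumption: unparsable entries are ignored
		}
		if !addr.IsLoopback() {
			return addr.String()
		}
		if firstLoopback == "" {
			firstLoopback = addr.String()
		}
	}
	if firstLoopback != "" {
		return firstLoopback
	}
	return fallbackUpstream
}

func main() {
	fmt.Println(pickUpstream([]string{"127.0.0.11", "10.0.0.2"}))
	fmt.Println(pickUpstream([]string{"127.0.0.11"}))
	fmt.Println(pickUpstream(nil))
}
```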
4 changes: 2 additions & 2 deletions components/egress/TODO.md
@@ -1,7 +1,7 @@
# Egress Sidecar TODO (Linux MVP → Full OSEP-0001)

- Layer 2 still partial: static IP/CIDR now pushed to nftables, DoH/DoT blocking added (853 + optional 443 blocklist). DNS-learned IPs/dynamic isolation intentionally NOT targeted (see No goals).
- Policy surface: IP/CIDR parsing/validation done; `require_full_isolation` and richer validation messages are out of scope (see No goals). Dynamic IP learn/apply is out of scope (see No goals).
- Layer 2 still partial: static IP/CIDR now pushed to nftables, DoH/DoT blocking added (853 + optional 443 blocklist). DNS-learned IPs/dynamic isolation planned (see Short-term priorities).
- Policy surface: IP/CIDR parsing/validation done; `require_full_isolation` and richer validation messages are out of scope (see No goals).
- Observability missing: no violation logs.
- Capability probing missing: no CAP_NET_ADMIN/nftables detection; hostNetwork is already blocked on the server side. Capability detection + mode exposure moved to No goals.
- Platform integration completed: specs/SDK/server wiring done; NET_ADMIN only on sidecar.
83 changes: 83 additions & 0 deletions components/egress/docs/benchmark.md
@@ -0,0 +1,83 @@
# Egress Benchmark

This document describes the **Egress Sidecar** end-to-end benchmark: it compares **dns** and **dns+nft** modes under real conditions for latency and throughput.

## Purpose

- **dns**: DNS proxy only (pass-through), no nftables writes; used as the baseline.
- **dns+nft**: DNS proxy plus synchronous `AddResolvedIPs` before each DNS reply, writing resolved IPs into nftables for
L2 egress enforcement.

The benchmark runs the same workload in both modes and reports end-to-end latency (P50, P99) and throughput (Req/s) to
measure the overhead of the synchronous nft write path.

## Environment and Flow

- **Environment**: The Egress sidecar runs in a Docker container on the host. The container includes the sidecar (DNS
proxy and optional nft), iptables redirect of port 53 to the proxy, and the policy server on port 18080. The workload
runs **inside the same container**: DNS and HTTPS traffic go through the proxy.
- **Flow** (per phase):
1. Start the sidecar with the chosen mode (`dns` or `dns+nft`).
2. Wait for health checks, then push the allow list to `/policy` (see domain list below).
3. Write the domain list into the container as `/tmp/bench-domains.txt` (one `https://<domain>` per line).
4. **Warm-up**: One request to each of the first 10 domains (10 concurrent), 1 round.
5. **Timed run**: One request per domain for all domains (N concurrent per round), for 10 rounds; each request
records `time_namelookup` and `time_total`.
6. Copy results from the container and compute P50, P99, average latency, and Req/s.
- **Execution order**: **dns+nft** runs first, then **dns**; the comparison table is printed at the end.
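
Step 6 of the flow (computing P50, P99, average latency, and Req/s from the recorded `time_total` values) could look like the sketch below. The benchmark script's exact percentile method is not shown in this document, so the nearest-rank definition used here is an assumption, and `summarize` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100)
// of an ascending-sorted sample slice.
func percentile(sorted []float64, p float64) float64 {
	if len(sorted) == 0 {
		return 0
	}
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// summary holds the metrics reported per mode.
type summary struct {
	Avg, P50, P99, ReqPerSec float64
}

// summarize computes the metrics from per-request latencies (seconds)
// and the wall-clock duration of the timed run (seconds).
func summarize(latencies []float64, wallSeconds float64) summary {
	s := append([]float64(nil), latencies...) // copy so the caller's slice is not reordered
	sort.Float64s(s)
	var sum float64
	for _, v := range s {
		sum += v
	}
	out := summary{P50: percentile(s, 50), P99: percentile(s, 99)}
	if len(s) > 0 {
		out.Avg = sum / float64(len(s))
	}
	if wallSeconds > 0 {
		out.ReqPerSec = float64(len(s)) / wallSeconds
	}
	return out
}

func main() {
	r := summarize([]float64{0.5, 0.25, 1.0, 0.75}, 2.0)
	fmt.Printf("Req/s=%.2f Avg=%.3f P50=%.3f P99=%.3f\n", r.ReqPerSec, r.Avg, r.P50, r.P99)
}
```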

## Workload

- **Domain list**: Read from `components/egress/tests/hostname.txt`, one domain per line (lines starting with `#` and
empty lines are ignored). Default is about 100 resolvable domains.
- **Rounds and concurrency**: The script uses `ROUNDS=10`. Each round issues one HTTPS request per domain in
`hostname.txt`, with all requests in that round concurrent; 10 rounds total.
- **Total requests**: `TOTAL_REQUESTS = ROUNDS × NUM_DOMAINS` (e.g. 10 × 100 = 1000).
- **Per request**: Inside the container, `curl -o /dev/null -s -w "%{time_namelookup}\t%{time_total}\n"` is used against
`https://<domain>`, with a 10s timeout per request; the whole benchmark run has a 300s wall-clock timeout.
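
The domain-list rules above (one domain per line, `#` comments and empty lines ignored) and the total request count can be sketched in Go. The real logic lives in the shell script; `parseDomains` is a hypothetical name, and the sample list is illustrative only.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseDomains returns the domains from hostname.txt-style content,
// skipping blank lines and lines that start with '#'.
func parseDomains(content string) []string {
	var domains []string
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		domains = append(domains, line)
	}
	return domains
}

func main() {
	const rounds = 10 // matches ROUNDS=10 in the benchmark description
	content := "# sample list\nexample.com\n\nexample.org\n"
	domains := parseDomains(content)
	// TOTAL_REQUESTS = ROUNDS x NUM_DOMAINS
	fmt.Printf("%d domains, %d total requests\n", len(domains), rounds*len(domains))
}
```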

## Policy

- Policy is default-deny with explicit allow rules: one `{"action":"allow","target":"<domain>"}` per domain in
`hostname.txt` is sent via `POST /policy`, so every domain used in the benchmark is allowed.
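
As an illustration of the rule objects described above, the following sketch serializes one allow rule per domain. Only the per-rule shape `{"action":"allow","target":"<domain>"}` comes from this document; how the rules are wrapped in the `POST /policy` request body is not specified here, so the sketch stops at individual rules. The type `rule` and the helper `allowRuleJSON` are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rule mirrors the per-domain object shown in the Policy section.
type rule struct {
	Action string `json:"action"`
	Target string `json:"target"`
}

// allowRuleJSON returns the JSON encoding of an allow rule for domain.
func allowRuleJSON(domain string) (string, error) {
	b, err := json.Marshal(rule{Action: "allow", Target: domain})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	for _, d := range []string{"example.com", "example.org"} {
		s, err := allowRuleJSON(d)
		if err != nil {
			panic(err)
		}
		fmt.Println(s)
	}
}
```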

## How to Run

**Script**: `components/egress/tests/bench-dns-nft.sh`

**Requirements**: Docker and `curl` on the host (for pushing policy); the Egress image includes `curl` for the workload.

**Commands** (from `components/egress`; from the repo root, run `./components/egress/tests/bench-dns-nft.sh` instead):

```bash
./tests/bench-dns-nft.sh
```

The script resolves `tests/hostname.txt` relative to its own path, so the working directory does not need to be changed.

## Configuration

| Item | Location / variable | Default / notes |
|---------------------|----------------------------------------|------------------------------------------------|
| Domain list | `components/egress/tests/hostname.txt` | One domain per line; `#` comments allowed |
| Rounds | `ROUNDS` in script | 10 |
| Per-request timeout | `CURL_TIMEOUT` in script | 10 seconds |
| Benchmark timeout | `BENCH_EXEC_TIMEOUT` in script | 300 seconds (max wall time for the timed run) |
| Image | `IMG` in script | See script; override for a locally built image |

Changing the number of domains or rounds updates the total request count; the report shows “N rounds × M domains” for
the current config.

## Output and Metrics

- **Terminal**: A table with **Req/s**, **Avg(s)**, **P50(s)**, **P99(s)** for both modes, plus short notes (dns vs
dns+nft, warm-up, first-resolution cost).
- **Artifacts** (on the host under `/tmp`): `bench-e2e-dns-total.txt`, `bench-e2e-dns+nft-total.txt` (one
`time_total` per line), and `-namelookup.txt`, `-wall.txt`, etc., for further analysis or plotting.

## Notes

- The first resolution of a domain in dns+nft triggers a DNS lookup and an nft write, so cost is higher; later requests
for the same domain hit the set and are cheaper. The multi-round, multi-domain design mixes cold and warm resolution.
- In CI (e.g. GitHub Actions), the script wraps the timed-run `docker exec` with `timeout` inside the shell function so
`timeout` runs a real command, not a function name, avoiding “No such file or directory” errors.
74 changes: 13 additions & 61 deletions components/egress/main.go
@@ -17,7 +17,6 @@ package main
import (
"context"
"log"
"net/netip"
"os"
"os/signal"
"strings"
@@ -26,33 +25,22 @@ import (
"github.com/alibaba/opensandbox/egress/pkg/constants"
"github.com/alibaba/opensandbox/egress/pkg/dnsproxy"
"github.com/alibaba/opensandbox/egress/pkg/iptables"
"github.com/alibaba/opensandbox/egress/pkg/nftables"
)

// Linux MVP: DNS proxy + iptables REDIRECT; nftables enforcement is optional (dns+nft mode).
func main() {
ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
defer cancel()

// Optional bootstrap via env; still allow runtime HTTP updates.
initialPolicy, err := dnsproxy.LoadPolicyFromEnvVar(constants.EnvEgressRules)
initialRules, err := dnsproxy.LoadPolicyFromEnvVar(constants.EnvEgressRules)
if err != nil {
log.Fatalf("failed to parse %s: %v", constants.EnvEgressRules, err)
}
if initialPolicy != nil {
log.Printf("loaded initial egress policy from %s", constants.EnvEgressRules)
}

requestedMode := parseMode()
enforcementMode := requestedMode

var nftMgr nftApplier
if requestedMode == constants.PolicyDnsNft {
nftOpts := parseNftOptions()
nftMgr = nftables.NewManagerWithOptions(nftOpts)
}
allowIPs := AllowIPsForNft("/etc/resolv.conf")

proxy, err := dnsproxy.New(initialPolicy, "")
mode := parseMode()
nftMgr := createNftManager(mode)
proxy, err := dnsproxy.New(initialRules, "")
if err != nil {
log.Fatalf("failed to init dns proxy: %v", err)
}
@@ -66,20 +54,11 @@ func main() {
}
log.Printf("iptables redirect configured (OUTPUT 53 -> 15353) with SO_MARK bypass for proxy upstream traffic")

if nftMgr != nil {
if err := nftMgr.ApplyStatic(ctx, initialPolicy); err != nil {
log.Fatalf("nftables static apply failed; please check logs): %v", err)
} else {
log.Printf("nftables static policy applied (table inet opensandbox)")
}
}
setupNft(ctx, nftMgr, initialRules, proxy, allowIPs)

httpAddr := os.Getenv(constants.EnvEgressHTTPAddr)
if httpAddr == "" {
httpAddr = constants.DefaultEgressServerAddr
}
token := os.Getenv(constants.EnvEgressToken)
if err := startPolicyServer(ctx, proxy, nftMgr, enforcementMode, httpAddr, token); err != nil {
// start policy server
httpAddr := envOrDefault(constants.EnvEgressHTTPAddr, constants.DefaultEgressServerAddr)
if err = startPolicyServer(ctx, proxy, nftMgr, mode, httpAddr, os.Getenv(constants.EnvEgressToken), allowIPs); err != nil {
log.Fatalf("failed to start policy server: %v", err)
}
log.Printf("policy server listening on %s (POST /policy)", httpAddr)
@@ -89,38 +68,11 @@ func main() {
_ = os.Stderr.Sync()
}

func parseNftOptions() nftables.Options {
opts := nftables.Options{BlockDoT: true}
if isTruthy(os.Getenv(constants.EnvBlockDoH443)) {
opts.BlockDoH443 = true
}
if raw := os.Getenv(constants.EnvDoHBlocklist); strings.TrimSpace(raw) != "" {
parts := strings.Split(raw, ",")
for _, p := range parts {
target := strings.TrimSpace(p)
if target == "" {
continue
}
if addr, err := netip.ParseAddr(target); err == nil {
if addr.Is4() {
opts.DoHBlocklistV4 = append(opts.DoHBlocklistV4, target)
} else if addr.Is6() {
opts.DoHBlocklistV6 = append(opts.DoHBlocklistV6, target)
}
continue
}
if prefix, err := netip.ParsePrefix(target); err == nil {
if prefix.Addr().Is4() {
opts.DoHBlocklistV4 = append(opts.DoHBlocklistV4, target)
} else if prefix.Addr().Is6() {
opts.DoHBlocklistV6 = append(opts.DoHBlocklistV6, target)
}
continue
}
log.Printf("ignoring invalid DoH blocklist entry: %s", target)
}
func envOrDefault(key, defaultVal string) string {
if v := strings.TrimSpace(os.Getenv(key)); v != "" {
return v
}
return opts
return defaultVal
}

func isTruthy(v string) bool {
91 changes: 91 additions & 0 deletions components/egress/nameserver.go
@@ -0,0 +1,91 @@
// Copyright 2026 Alibaba Group Holding Ltd.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package main

import (
"log"
"net/netip"
"os"
"strconv"

"github.com/alibaba/opensandbox/egress/pkg/constants"
"github.com/alibaba/opensandbox/egress/pkg/dnsproxy"
)

// AllowIPsForNft returns the list of IPs to merge into the nft allow set for DNS in dns+nft mode:
// 127.0.0.1 (proxy listen / iptables redirect target) plus validated, capped nameserver IPs from resolvPath.
// Validation: skips unspecified (0.0.0.0, ::) and loopback (127.x, ::1).
// Cap: at most max nameservers (default 3; set the env var named by constants.EnvMaxNameservers to 0 for no cap, or 1–10; values above 10 are clamped to 10).
func AllowIPsForNft(resolvPath string) []netip.Addr {
raw, _ := dnsproxy.ResolvNameserverIPs(resolvPath)
maxNsCount := maxNameserversFromEnv()

var validated []netip.Addr
for _, ip := range raw {
if maxNsCount > 0 && len(validated) >= maxNsCount {
break
}
if !isValidNameserverIP(ip) {
continue
}
validated = append(validated, ip)
}

// 127.0.0.1 first so packets redirected to proxy are accepted by nft.
out := make([]netip.Addr, 0, 1+len(validated))
out = append(out, netip.MustParseAddr("127.0.0.1"))
out = append(out, validated...)

if len(out) > 1 {
log.Printf("[dns] whitelisting proxy listen + %d nameserver(s) for nft: %v", len(validated), formatIPs(out))
} else {
log.Printf("[dns] whitelisting proxy listen (127.0.0.1); no valid nameserver IPs from %s", resolvPath)
}
return out
}

func maxNameserversFromEnv() int {
s := os.Getenv(constants.EnvMaxNameservers)
if s == "" {
return constants.DefaultMaxNameservers
}
n, err := strconv.Atoi(s)
if err != nil || n < 0 {
return constants.DefaultMaxNameservers
}
if n > 10 {
return 10
}
// 0 = no cap
return n
}

func isValidNameserverIP(ip netip.Addr) bool {
if ip.IsUnspecified() {
return false
}
if ip.IsLoopback() {
return false
}
return true
}

func formatIPs(ips []netip.Addr) []string {
out := make([]string, len(ips))
for i, ip := range ips {
out[i] = ip.String()
}
return out
}
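
`AllowIPsForNft` relies on `dnsproxy.ResolvNameserverIPs`, which is not part of this diff. As a rough idea of what such a helper might do, here is a hypothetical sketch that extracts addresses from `nameserver` lines in resolv.conf content. The name `parseNameservers` and its behavior (silently skipping unparsable entries, taking content as a string instead of a path) are assumptions, not the package's actual API.

```go
package main

import (
	"bufio"
	"fmt"
	"net/netip"
	"strings"
)

// parseNameservers extracts IPs from "nameserver <ip>" lines of
// resolv.conf-style content. Other directives, comments, and
// entries that do not parse as an IP address are skipped.
func parseNameservers(content string) []netip.Addr {
	var out []netip.Addr
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 || fields[0] != "nameserver" {
			continue
		}
		addr, err := netip.ParseAddr(fields[1])
		if err != nil {
			continue
		}
		out = append(out, addr)
	}
	return out
}

func main() {
	content := "# generated\nnameserver 10.0.0.2\nnameserver not-an-ip\nnameserver 127.0.0.11\noptions ndots:5\n"
	for _, ns := range parseNameservers(content) {
		// Loopback entries are returned here; filtering them out is the
		// job of a validation step such as isValidNameserverIP.
		fmt.Println(ns, "loopback:", ns.IsLoopback())
	}
}
```

Keeping parsing separate from validation mirrors the split in `nameserver.go`, where the raw list is filtered and capped after it is read.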