fix: resolve trojan-go startup failure from TLS permissions and missing capability

Copy TLS certs to /etc/trojan-go/tls/ owned by the trojan user,
update certbot renewal hook to propagate certs on renewal, and
add CapabilityBoundingSet to the systemd unit

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
kotoyuuko 3 weeks ago
parent
commit
b671cbfcbf

+ 2 - 0
openspec/changes/archive/2026-04-22-fix-trojan-go-startup-failure/.openspec.yaml

@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-04-22

+ 29 - 0
openspec/changes/archive/2026-04-22-fix-trojan-go-startup-failure/design.md

@@ -0,0 +1,29 @@
+## Context
+
+The trojan-go service runs as the `trojan` user (non-root) and needs to: (1) read TLS cert/key files from `/etc/letsencrypt/`, (2) bind to port 443. Currently, cert files are owned by root with restrictive permissions, and the systemd unit only uses `AmbientCapabilities` without `CapabilityBoundingSet`.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Ensure the `trojan` user can read the TLS certificate and key files
+- Ensure `CAP_NET_BIND_SERVICE` is properly granted to the trojan-go process
+- Maintain certbot auto-renewal with correct permission propagation
+
+**Non-Goals:**
+- Change the trojan-go binary or version
+- Replace certbot with a different certificate solution
+
+## Decisions
+
+**Copy cert files to a trojan-owned directory instead of changing /etc/letsencrypt permissions**
+
+Changing the entire `/etc/letsencrypt` tree to be readable by the `trojan` user is a security concern. Instead, copy the cert and key to `/etc/trojan-go/tls/` owned by the `trojan` user after each certbot renewal. This keeps the Let's Encrypt directory locked down while giving trojan exactly what it needs.
+
+**Add CapabilityBoundingSet alongside AmbientCapabilities**
+
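+`AmbientCapabilities` hands `CAP_NET_BIND_SERVICE` to the non-root process, but only if the capability is also present in the bounding set. Setting `CapabilityBoundingSet=CAP_NET_BIND_SERVICE` pins the bounding set explicitly, so the capability stays available even where the bounding set is otherwise restricted, and every other capability is dropped from the service.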
+
+## Risks / Trade-offs
+
+- [Cert copy lag on renewal] → The certbot `--deploy-hook` copies certs after renewal, so there's a brief window where trojan serves the old cert. Mitigation: the hook reloads trojan after copying.
+- [Extra disk I/O] → Negligible — two small PEM files copied per renewal.
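Review note: the cert copy lag risk above can be checked mechanically by comparing the copied files with the certbot originals. The following is a minimal sketch, assuming the two PEM file names used in this change; the function name `tls_copy_is_current` is hypothetical and not part of the role.

```shell
# Hypothetical helper (not part of the role): succeeds only when the copied
# fullchain.pem and privkey.pem are byte-identical to the certbot originals.
tls_copy_is_current() {
    local live_dir="$1" copy_dir="$2" name
    for name in fullchain.pem privkey.pem; do
        # cmp -s prints nothing and returns non-zero on any difference
        # or when either file is missing.
        cmp -s "$live_dir/$name" "$copy_dir/$name" || return 1
    done
    return 0
}

# Example call (paths taken from this change, domain is a placeholder):
#   tls_copy_is_current /etc/letsencrypt/live/example.com /etc/trojan-go/tls \
#       || echo "copied certs are stale"
```

Reading `privkey.pem` under `/etc/letsencrypt/` requires root, so a check like this would have to run with the same privileges as the hook.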

+ 23 - 0
openspec/changes/archive/2026-04-22-fix-trojan-go-startup-failure/proposal.md

@@ -0,0 +1,23 @@
+## Why
+
+The trojan-go service on the landing server exits immediately with status=1/FAILURE and enters an auto-restart loop. The most likely causes are: (1) the TLS certificate and key files under `/etc/letsencrypt/` are not readable by the `trojan` service user, and (2) the systemd unit uses `AmbientCapabilities` without `CapabilityBoundingSet`, which may not properly grant `CAP_NET_BIND_SERVICE` to the process.
+
+## What Changes
+
+- Fix TLS certificate file permissions: use a certbot `--deploy-hook` to copy certs with correct ownership for the `trojan` user, and update the trojan config template to point to the copied paths
+- Add `CapabilityBoundingSet=CAP_NET_BIND_SERVICE` to the systemd unit alongside `AmbientCapabilities`
+- Add a task to copy initial cert files after the first certbot run
+
+## Capabilities
+
+### New Capabilities
+<!-- none -->
+
+### Modified Capabilities
+- `trojan-landing`: TLS certificate access and systemd capability configuration must allow the trojan service to start successfully
+
+## Impact
+
+- `roles/trojan/templates/trojan.service.j2` — add `CapabilityBoundingSet`
+- `roles/trojan/tasks/main.yml` — add cert file copy tasks, update deploy-hook
+- `roles/trojan/templates/trojan-config.json.j2` — update cert/key paths to copied locations

+ 24 - 0
openspec/changes/archive/2026-04-22-fix-trojan-go-startup-failure/specs/trojan-landing/spec.md

@@ -0,0 +1,24 @@
+## MODIFIED Requirements
+
+### Requirement: TLS certificate is provisioned via Let's Encrypt
+The trojan role SHALL use certbot to obtain a TLS certificate for the landing server's domain, with automatic renewal. After provisioning or renewal, the certificate and key SHALL be copied to a trojan-owned directory (`/etc/trojan-go/tls/`) so the service user can read them.
+
+#### Scenario: Certificate provisioning
+- **WHEN** the trojan role runs with a configured domain name
+- **THEN** certbot obtains a TLS certificate for that domain
+- **THEN** the certificate and key are copied to `/etc/trojan-go/tls/` owned by the trojan user
+
+#### Scenario: Certificate auto-renewal
+- **WHEN** the certificate is within 30 days of expiry
+- **THEN** certbot renews it automatically via systemd timer or cron
+- **THEN** a deploy-hook copies the renewed certs to `/etc/trojan-go/tls/`
+- **THEN** the Trojan service is reloaded after renewal
+
+### Requirement: Trojan runs as a systemd service
+The trojan role SHALL create a systemd unit file for Trojan and ensure it is enabled and started. The unit SHALL include both `AmbientCapabilities` and `CapabilityBoundingSet` for `CAP_NET_BIND_SERVICE`.
+
+#### Scenario: Service is running
+- **WHEN** the trojan role completes
+- **THEN** the Trojan systemd service is enabled and running
+- **THEN** the service runs under a dedicated non-root user with `CAP_NET_BIND_SERVICE` for port 443
+- **THEN** the trojan user can read the TLS certificate and key files from `/etc/trojan-go/tls/`

+ 13 - 0
openspec/changes/archive/2026-04-22-fix-trojan-go-startup-failure/tasks.md

@@ -0,0 +1,13 @@
+## 1. Fix TLS certificate access
+
+- [x] 1.1 Add task to copy initial cert files to `/etc/trojan-go/tls/` after certbot obtains the certificate
+- [x] 1.2 Update certbot renewal hook to copy certs and reload trojan after renewal
+- [x] 1.3 Update `trojan-config.json.j2` to use `/etc/trojan-go/tls/` for cert and key paths
+
+## 2. Fix systemd capabilities
+
+- [x] 2.1 Add `CapabilityBoundingSet=CAP_NET_BIND_SERVICE` to `trojan.service.j2`
+
+## 3. Verify
+
+- [x] 3.1 Run `ansible-playbook site.yml --syntax-check` to confirm the playbook parses

+ 5 - 3
openspec/specs/trojan-landing/spec.md

@@ -21,24 +21,26 @@ The trojan role SHALL download and install the Trojan binary (trojan-go or troja
 - **THEN** the service is restarted
 
 ### Requirement: Trojan runs as a systemd service
-The trojan role SHALL create a systemd unit file for Trojan and ensure it is enabled and started.
+The trojan role SHALL create a systemd unit file for Trojan and ensure it is enabled and started. The unit SHALL include both `AmbientCapabilities` and `CapabilityBoundingSet` for `CAP_NET_BIND_SERVICE`.
 
 #### Scenario: Service is running
 - **WHEN** the trojan role completes
 - **THEN** the Trojan systemd service is enabled and running
 - **THEN** the service runs under a dedicated non-root user (with `CAP_NET_BIND_SERVICE` for port 443)
+- **THEN** the trojan user can read the TLS certificate and key files from `/etc/trojan-go/tls/`
 
 ### Requirement: TLS certificate is provisioned via Let's Encrypt
-The trojan role SHALL use certbot to obtain a TLS certificate for the landing server's domain, with automatic renewal.
+The trojan role SHALL use certbot to obtain a TLS certificate for the landing server's domain, with automatic renewal. After provisioning or renewal, the certificate and key SHALL be copied to a trojan-owned directory (`/etc/trojan-go/tls/`) so the service user can read them.
 
 #### Scenario: Certificate provisioning
 - **WHEN** the trojan role runs with a configured domain name
 - **THEN** certbot obtains a TLS certificate for that domain
-- **THEN** the certificate and key are accessible to the Trojan service
+- **THEN** the certificate and key are copied to `/etc/trojan-go/tls/` owned by the trojan user
 
 #### Scenario: Certificate auto-renewal
 - **WHEN** the certificate is within 30 days of expiry
 - **THEN** certbot renews it automatically via systemd timer or cron
+- **THEN** a deploy-hook copies the renewed certs to `/etc/trojan-go/tls/`
 - **THEN** the Trojan service is reloaded after renewal
 
 ### Requirement: Trojan listens on port 443 with TLS

+ 15 - 1
roles/trojan/tasks/main.yml

@@ -92,14 +92,28 @@
 
 - name: Deploy certbot renewal hook for trojan
   ansible.builtin.copy:
-    dest: /etc/letsencrypt/renewal-hooks/post/restart-trojan.sh
+    dest: /etc/letsencrypt/renewal-hooks/post/trojan-go.sh
     content: |
       #!/bin/bash
+      mkdir -p /etc/trojan-go/tls
+      cp /etc/letsencrypt/live/{{ trojan_domain }}/fullchain.pem /etc/trojan-go/tls/fullchain.pem
+      cp /etc/letsencrypt/live/{{ trojan_domain }}/privkey.pem /etc/trojan-go/tls/privkey.pem
+      chown -R {{ trojan_user }}:{{ trojan_user }} /etc/trojan-go/tls
       systemctl reload trojan-go
     owner: root
     group: root
     mode: "0755"
 
+- name: Copy initial TLS certificates to trojan-owned directory
+  ansible.builtin.shell: |
+    mkdir -p /etc/trojan-go/tls
+    cp /etc/letsencrypt/live/{{ trojan_domain }}/fullchain.pem /etc/trojan-go/tls/fullchain.pem
+    cp /etc/letsencrypt/live/{{ trojan_domain }}/privkey.pem /etc/trojan-go/tls/privkey.pem
+    chown -R {{ trojan_user }}:{{ trojan_user }} /etc/trojan-go/tls
+  args:
+    creates: /etc/trojan-go/tls/privkey.pem
+  notify: restart trojan
+
 - name: Deploy trojan-go configuration
   ansible.builtin.template:
     src: trojan-config.json.j2
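Review note: the hook in this hunk runs `systemctl reload` even when a `cp` fails, and it relies on `cp` defaults for file modes. The following is a sketch of a stricter variant, not the committed script. The function name `sync_tls_files` and the chosen modes (0644 for the chain, 0600 for the key) are assumptions for illustration.

```shell
# Hypothetical helper: copy both PEM files from a certbot lineage directory
# into a service-owned directory with explicit modes. It returns non-zero as
# soon as any step fails, so the caller can skip the reload.
sync_tls_files() {
    local src_dir="$1" dest_dir="$2" owner="${3:-}"
    mkdir -p "$dest_dir" || return 1
    # install(1) copies the file contents and sets the mode in one step.
    install -m 0644 "$src_dir/fullchain.pem" "$dest_dir/fullchain.pem" || return 1
    install -m 0600 "$src_dir/privkey.pem" "$dest_dir/privkey.pem" || return 1
    # Changing ownership needs root, so it is optional in this sketch.
    if [ -n "$owner" ]; then
        chown -R "$owner:$owner" "$dest_dir" || return 1
    fi
    return 0
}

# Possible use inside the hook (domain and user are placeholders):
#   sync_tls_files /etc/letsencrypt/live/example.com /etc/trojan-go/tls trojan \
#       && systemctl reload trojan-go
```

If the script were installed as a deploy hook instead of a post hook, certbot exports `RENEWED_LINEAGE` with the path of the renewed lineage, which could replace the hard-coded source directory.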

+ 2 - 2
roles/trojan/templates/trojan-config.json.j2

@@ -8,8 +8,8 @@
         "{{ trojan_password }}"
     ],
     "ssl": {
-        "cert": "{{ tls_cert_path }}",
-        "key": "{{ tls_key_path }}",
+        "cert": "/etc/trojan-go/tls/fullchain.pem",
+        "key": "/etc/trojan-go/tls/privkey.pem",
         "sni": "{{ trojan_domain }}"
     },
     "router": {

+ 1 - 0
roles/trojan/templates/trojan.service.j2

@@ -12,6 +12,7 @@ Restart=on-failure
 RestartSec=5
 LimitNOFILE=65536
 AmbientCapabilities=CAP_NET_BIND_SERVICE
+CapabilityBoundingSet=CAP_NET_BIND_SERVICE
 
 [Install]
 WantedBy=multi-user.target
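Review note: `--syntax-check` does not show whether the running process actually holds the capability. On Linux, the capability sets of a process appear as hexadecimal masks (`CapBnd`, `CapAmb`, `CapEff`, and others) in `/proc/<pid>/status`, and `CAP_NET_BIND_SERVICE` is bit 10. The following is a minimal sketch for inspecting them; both function names are hypothetical.

```shell
# Hypothetical helper: succeeds when bit 10 (CAP_NET_BIND_SERVICE) is set in a
# hexadecimal capability mask as printed in /proc/<pid>/status.
has_net_bind_service() {
    local mask="$1"
    [ $(( 0x$mask & (1 << 10) )) -ne 0 ]
}

# Hypothetical helper: print the hex mask of one capability set
# (for example CapBnd, CapAmb, or CapEff) for a given process ID.
cap_mask_of() {
    local pid="$1" field="$2"
    awk -v f="$field:" '$1 == f { print $2 }' "/proc/$pid/status"
}

# Possible use on the landing server:
#   pid=$(systemctl show -p MainPID --value trojan-go)
#   has_net_bind_service "$(cap_mask_of "$pid" CapEff)" && echo "may bind port 443"
```

With the unit from this change, one would expect `CapBnd` to contain only that single bit, since `CapabilityBoundingSet` removes every capability it does not list.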