@Chr1st0ph3rTurn3r
Contributor

No description provided.


## Download Failover Resiliency

SSR images can be downloaded from a variety of sources, depending on software access mode (eg. internet-only, prefer-conductor, conductor-only, offline-mode): the HA peer, both conductor nodes, artifactory, and the mist proxy to artifactory (cloud deployments only).

Should Mist be capitalized?


To improve resiliency to network connectivity issues, the SSR queries available versions from all sources before beginning the download. It compiles a list of sources where the requested version is available and begins the download. If more than 50% of requests to a source fail within a window of 10 requests, the SSR marks that source unavailable and moves on to the next source. The following priority order is used for sources:

In my mind the size of the window is more of an implementation detail and may be subject to change based on tuning. We may want to be less specific about that in case we decide to adjust it in the future. But this may be fine too. Not sure how likely we are to need to adjust it
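For anyone following this thread, here is a minimal sketch of the window-based failover the paragraph describes. It keeps the window size and failure ratio as tunable parameters, in line with the point above that they may change. The class name, the function name, the source names in the usage note, and the choice to judge a source only after a full window has been observed are all assumptions for illustration, not the SSR implementation.

```python
from collections import deque


class SourceHealth:
    """Tracks recent request outcomes for one download source (hypothetical)."""

    def __init__(self, name, window=10, max_failure_ratio=0.5):
        self.name = name
        self.window = window                        # number of recent requests considered
        self.max_failure_ratio = max_failure_ratio  # above this ratio, source is unavailable
        self.results = deque(maxlen=window)         # True = success, False = failure

    def record(self, ok):
        """Record the outcome of one request; old outcomes fall out of the window."""
        self.results.append(ok)

    @property
    def available(self):
        # Assumption: only judge a source once a full window has been observed.
        if len(self.results) < self.window:
            return True
        failures = self.results.count(False)
        return failures / self.window <= self.max_failure_ratio


def pick_source(sources):
    """Return the first still-available source in priority order, or None."""
    for source in sources:
        if source.available:
            return source
    return None
```

As a usage example, if a hypothetical `peer` source records six failures in its last ten requests, `pick_source([peer, conductor])` moves on to `conductor`; at exactly five failures the source is still considered available, because the threshold is "more than 50%".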


In the event that all sources have reached the threshold of consecutive failures and a download attempt has returned an error, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off.

When the timeout is enabled, the SSR waits for a configurable amount of time (default is 10800s) for the download to complete. When the timeout value is reached, the download is marked as **Failed** and the retry delay begins.

This is not quite accurate. The retry delay will begin once we have marked all download sources as unavailable, as described in the failover resilience section. If enabled, once this timeout is hit, the download will be entirely stopped and marked as a failure. Or in other words, the retries happen inside of this timeout, not after it.


### Sequenced HA Download

The SSR supports sequenced downloading; one node of an HA pair downloads an image from the remote repository, and the other node waits for it to complete. Once that download is complete, the second node downloads it from the first. When targeting an HA router, the download is sequenced by default. To disable this sequencing, use `request system software download simultaneous disable`.

One note about "the second node downloads it from the first": the peer is the first place that an HA router will attempt to download from, so in most cases this would be true. But if for whatever reason the connection to the peer went down, the router would move on and continue downloading from the conductor or remote sources. Not sure if that needs to be clarified or not.

Contributor

Does the download happen over the HA sync connection or the HA fabric?

I believe it's the HA sync connection


## Configuration

Three components: Onboarding conductor, router, Operational conductor.
Contributor

This is a customer-specific topology. We shouldn't limit this doc to just this use case. The doc should only talk about the router and conductor.


The next step in the process is to generate an onboarding token from conductor Web interface, command line, or using APIs. The generated tokens are signed by the conductor’s private key so that they cannot be altered once generated. The SSR supports two modes; Authority Wide and Router Specific tokens. These are mutually exclusive and are defined in the configuration.

#### Authority-Wide Tokens
Contributor

This concept is removed from the FS and should be deleted from the doc. We will only support per router tokens.

@Chr1st0ph3rTurn3r Chr1st0ph3rTurn3r requested review from BenMatase and agrawalkaushik and removed request for plessard128 November 24, 2025 18:13
Contributor

@BenMatase BenMatase left a comment

Feels like there is some duplicate information in the SCO doc.


Secure Conductor Onboarding (SCO) provides the ability to onboard a router to a conductor ensuring that each device proves possession of a private key, and that the connection is trusted and authenticated. SCO employs asymmetric cryptography (RSA key pairs) to perform digital signatures and verification. The secure conductor onboarding process leverages the physical or virtual TPM module for mutual authentication.

When a router has SCO enabled, asset-id based onboarding is disabled. Ports 4505 and 4506 are disabled on the conductor, so any devices not using this feature will fail to onboard to the conductor. In addition, if an SCO enabled device attempts to onboard using the legacy method, the onboarding is rejected.
Contributor

This isn't automatically being done yet. It's a manual step. Do we want to call out the caveat?


### Prerequisites

- The `secure-conductor-onboarding mode` must be enabled
Contributor

This makes it sound like the mode is only at the authority level. We don't have a mode at the authority level at this time.


To provide a secure and mutually authenticated onboarding mechanism, the following information must be configured.

- Pre-shared key: The onboarding pre-shared key is a 48-character alpha-numeric string, configured at the authority or the router level. This key is mandatory for the SCO process.
Contributor

Not at the authority level for now


- Conductor Public certificate: A public-private key certificate.
- Conductor CA certificate: Optionally, you can configure a public certificate signed by a preferred CA signing authority.
Contributor

Not optional

After the user generates an onboarding token, enter the token and other onboarding details in the onboarding UI or using CLI commands. There are two methods to onboard a router:
- Using the Command line: `secure-conductor-onboarding-token` command and `onboarding-config.json`.
Contributor

- Using the Command line: `create secure-conductor-onboarding token` command and `onboarding-config.json`.

4. The router connects to the conductor over port 930 using the SSH keys exchanged in previous steps.
5. The router is prepped and initialized by the conductor. During this process, the system goes through the reboot cycle.
Once the secure SSH tunnels are established, the SCO workflow concludes. All future communication between the router and conductor will occur on standard SSR to conductor ports such as 930, 4505, 4506, etc.
Contributor

If SCO happens, won't use 4505/4506 from that point on. Everything is over 930

`configure authority router secure-conductor-onboarding pre-shared-secret`
The pre-shared secret is a 48-character alpha-numeric string. When enabled, any empty PSK will auto generate a random 48-byte alphanumeric string using the FIPS-approved, highly secure DRBG function from OpenSSL. Once generated, the key does not automatically change. It can be updated by the user if necessary.
Contributor

This is not complete yet

### Token Contents
The next step in the process is to generate an onboarding token from the conductor Web interface, command line, or using APIs. The generated tokens are signed by the conductor’s private key so that they cannot be altered once generated. The SSR supports two modes; Authority-wide and Router-specific tokens. These are mutually exclusive and are defined in the configuration.
Contributor

I think this doc needs to be scrubbed of "authority wide" tokens for now

The following parameters are required, and are configured at the Router level.
`configure authority router secure-conductor-onboarding mode`
Contributor

This might not match the func spec exactly, but the router-level path is `configure authority router system secure-conductor-onboarding`. This applies to the other paths in the doc.

### Auto-resume Download on WAN Failures

In the event that all sources have reached the threshold of consecutive failures and a download attempt has failed, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off.
In the event that all sources have reached the threshold of consecutive failures and a download attempt has returned an error, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off. Use the `software-update download enable-timeout` command to enable the retry feature.

The enable-timeout field is separate from retries. The only thing it enables is the timeout described in the next paragraph, and retries will happen regardless of whether the timeout is enabled

In the event that all sources have reached the threshold of consecutive failures and a download attempt has returned an error, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off. Use the `software-update download enable-timeout` command to enable the retry feature.

When the timeout is enabled, the SSR waits for a configurable amount of time (default is 10800s) for the download to complete. When the timeout value is reached, the download is marked as **Failed** and the retry delay begins.
When the timeout is enabled (software-update download enable-timeout true) the SSR will wait for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".

Is it worth noting that the timeout is enabled by default?

The retry delay time is the longest time to wait between retry attempts. For example, the initial retry delay starts at 30 seconds. With each failure the delay is increased exponentially. However, when that calculated value reaches the maximum retry delay time, successive wait times for additional attempts do not exceed the maximium retry delay time. The default is 3600 seconds. A maximum number of times to retry can also be configured.

The retry timeout can be disabled. If it is disabled, the download will retry indefinitely.
If the retry timeout is disabled, the download will retry indefinitely

As mentioned above, the timeout is a separate mechanism from the retries, so I wouldn't necessarily describe it as a retry timeout. And the download would only retry indefinitely if both the timeout is disabled and the attempts is configured to 0.
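To make the distinction in this comment concrete, here is a minimal sketch of the two independent stop conditions as described in the thread: the overall timeout (when enabled) and the attempt limit (when non-zero). The `DownloadConfig` dataclass and both function names are hypothetical; only the field meanings and defaults come from the parameter descriptions in this PR.

```python
from dataclasses import dataclass


@dataclass
class DownloadConfig:
    """Hypothetical container mirroring the fields discussed in this PR."""
    enable_timeout: bool = True  # timeout is enabled by default
    timeout: int = 10800         # seconds allowed for the overall download
    attempts: int = 10           # 0 means no attempt limit


def should_give_up(cfg, elapsed_seconds, attempts_made):
    """Return True when either stop condition has been reached."""
    # The overall timeout applies only when it is enabled.
    if cfg.enable_timeout and elapsed_seconds >= cfg.timeout:
        return True
    # The attempt limit applies only when it is non-zero.
    if cfg.attempts != 0 and attempts_made >= cfg.attempts:
        return True
    return False


def retries_indefinitely(cfg):
    """Retries are unbounded only if the timeout is off AND attempts is 0."""
    return not cfg.enable_timeout and cfg.attempts == 0
```

With the defaults, `retries_indefinitely(DownloadConfig())` is `False`; it becomes `True` only for `DownloadConfig(enable_timeout=False, attempts=0)`, which matches the wording suggested in the comment.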


### Sequenced HA Download

The SSR supports sequenced downloading; one node of an HA pair downloads an image from the remote repository, and the other node waits for it to complete. Once that download is complete, the second node downloads it from the first. When targeting an HA router, the download is sequenced by default. To disable this sequencing, use `request system software download simultaneous disable`.

Update: I ended up making the download unsequenced by default. I may change that in the future, but in the beta we're giving Swift, it will be unsequenced.
To do a sequenced download, you would use `request system software download router RouterName version SSR-X.Y.Z sequenced`.

After the user generates an onboarding token, enter the token and other onboarding details in the onboarding UI or using CLI commands. There are two methods to onboard a router:
- Using the Command line: `create secure-conductor-onboarding-token` command and `onboarding-config.json`.
Contributor

The command still needs to be fixed

Contributor Author

What is wrong with it? I copied your command from the earlier review. Am I missing something?

To enable this feature on the conductor, verify the following:
- The `secure conductor onboarding mode` should not be disabled (see above).
Contributor

This line should be removed. The conductor/whole authority doesn't have a mode

The CA certificate is read from disk at the location given in `secure-conductor-onboarding ca-certificate`.
## Token Management
Contributor

This section is a dup of the Token Creation section and can be removed

In the event that all sources have reached the threshold of consecutive failures and a download attempt has returned an error, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off.

When the timeout is enabled (software-update download enable-timeout true) the SSR will wait for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".
The timeout is enabled by default (`software-update download enable-timeout true`). The SSR waits for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".

This is accurate, but something I hadn't thought of when reviewing before is that the retry configuration in the paragraph below is probably more significant than the timeout configuration, so I might swap the two paragraphs.

When the timeout is enabled (software-update download enable-timeout true) the SSR will wait for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".
The timeout is enabled by default (`software-update download enable-timeout true`). The SSR waits for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".

The retry delay time is the longest time to wait between retry attempts. For example, the initial retry delay starts at 30 seconds. With each failure the delay is increased exponentially. However, when that calculated value reaches the maximum retry delay time, successive wait times for additional attempts do not exceed the maximium retry delay time. The default is 3600 seconds. A maximum number of times to retry can also be configured.

typo in maximium


If the retry timeout is disabled, the download will retry indefinitely

Use the command `configure authority router system software-update download enable-timeout [enabled]` to enable auto-resume. The command parameters are listed below:

The enable-timeout field doesn't really enable auto-resume. It's just a way you can tune the behavior to meet your needs. Maybe something along the lines of this would be more accurate?

Use the command `configure authority router system software-update download` to adjust the download retry behavior. The command parameters are listed below:

- `enable-timeout`: True/false, default is true. This enables a time limit for the overall download.
- `timeout`: Amount of time in seconds that the SSR waits for the software download to complete. When the timeout value is reached the download is marked as **Failed**, and the retry delay begins. The default download wait time is 10800s. Range is 1800s - 604800s.
- `attempts`: The maximum number of attempts to download before considering the download as failed. If set to 0, the SSR will retry the download until the timeout is hit. Default is 10.
- `max retry delay`: The maximum amount of time in seconds to wait in between retry attempts. The retry delay will start off low and back off exponentially up to this duration. Range is 0 to 86400s. Default is 3600s.

maximum-retry-delay
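As an illustration of the capped exponential backoff that the `maximum-retry-delay` bullet describes, here is a minimal sketch. The 30-second initial delay and the 3600-second cap come from the text under review; the doubling factor and the function name `retry_delays` are assumptions, since the doc only says the delay increases exponentially.

```python
def retry_delays(attempts, initial_delay=30, max_retry_delay=3600, factor=2):
    """Return the wait (in seconds) before each retry, capped at max_retry_delay.

    The factor of 2 is an assumed growth rate, not a documented value.
    """
    delays = []
    delay = initial_delay
    for _ in range(attempts):
        # Never wait longer than the configured maximum.
        delays.append(min(delay, max_retry_delay))
        delay *= factor
    return delays
```

Under these assumptions, `retry_delays(10)` yields `[30, 60, 120, 240, 480, 960, 1920, 3600, 3600, 3600]`: the delay grows until it would exceed the cap, then stays at the cap for the remaining attempts.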

4. Decapsulation: The receiver uses their private keys (SK1,SK2, etc.) in reverse order to decrypt each layer:
- Cn−1 = Decapsulate(Cn, SKn). This is repeated until the original symmetric key `K` is retrieved.
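The reverse-order peeling in step 4 can be sketched with a toy stand-in. This is not ML-KEM: it uses a symmetric mask-and-tag layer built from the Python standard library purely to show why the layers must be removed last-in, first-out. A real implementation would use KEM key pairs (public key to encapsulate, private key to decapsulate). All function names here are hypothetical.

```python
import hashlib
import hmac


def _stream(key, length):
    """Derive a keystream of the given length from key using SHAKE-256."""
    return hashlib.shake_256(key).digest(length)


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def wrap_layer(data, key):
    """Toy layer: mask the data, then append a 32-byte HMAC-SHA256 tag."""
    body = _xor(data, _stream(key, len(data)))
    return body + hmac.new(key, body, hashlib.sha256).digest()


def unwrap_layer(blob, key):
    """Verify the tag for this layer, then remove the mask."""
    body, tag = blob[:-32], blob[-32:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or wrong layer order")
    return _xor(body, _stream(key, len(body)))


def wrap_all(k, keys):
    """Apply one layer per key, in order: C1, C2, ..., Cn."""
    c = k
    for key in keys:
        c = wrap_layer(c, key)
    return c


def unwrap_all(c, keys):
    """Peel the layers in reverse order until the original key K is recovered."""
    for key in reversed(keys):
        c = unwrap_layer(c, key)
    return c
```

Calling `unwrap_all(wrap_all(k, keys), keys)` returns the original `k`, while attempting to remove the first layer before the last one fails the tag check and raises `ValueError`.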

### Certificate Considerations
Contributor

ML-DSA, needed for PQC certificates, is not yet supported in the product and is not scheduled for 7.1.3-R2. I'm of the opinion that we should not be discussing PQC certificates here. ML-KEM does not need PQC certificates to operate.

Contributor Author

Mike, I'm not sure what in this section you are referring to. I'm not familiar with PQC (post quantum certificates?) or where ML-DSA is referenced. Can you clarify a bit? Thanks.

Contributor

Ah, the entire `### Certificate Considerations` section needs to be removed. The statement "In conjunction with ML-KEM, an X.509 certificate must be selected to align with the cryptographic requirements of the protocol." is false, as there are no certificate considerations taking place here for this release.


Modern compliance requirements and regulatory frameworks mandate encryption-at-rest for sensitive data, particularly in industries handling financial transactions, healthcare records, or government communications. High-security customers in financial services, government, and healthcare sectors require robust protection against data exfiltration to maintain their security posture and meet regulatory obligations. These requirements have evolved beyond simple access controls to demand cryptographic protection of stored credentials, configuration data, and private key material that could be exploited to compromise broader network infrastructure.

SSR Configuration Integrity prevents unauthorized access to SSR configuration files when the system is powered off and physically compromised, ensuring that sensitive routing configurations, authentication credentials, and network topology information cannot be extracted through direct storage access. The system protects private keys and certificates from extraction via physical storage access, preventing attackers from impersonating network nodes or intercepting encrypted communications. Most importantly, it meets compliance requirements for encryption-at-rest without impacting runtime performance, allowing organizations to satisfy regulatory mandates while maintaining the high-performance networking capabilities that SSR devices are designed to provide.

SSR Configuration Integrity prevents unauthorized access to SSR configuration files when the system is powered off and physically compromised, ensuring

This reads to me as muddling two different behaviors into one comment.

  • It prevents access when the system is powered off. (e.g. encryption at rest prevents data scrubbing)
  • It prevents SSR from running when physically compromised.

Perhaps break this apart along the lines of:
SSR Configuration Integrity prevents unauthorized access to SSR files when the system is powered off and prevents SSR operations when the system is compromised, ensuring...

Comment on lines 50 to 52
### Hardware Bootstrapper

The hardware bootstrapper is an existing component of the SSR, responsible for the initialization of the SSR during its first boot. It performs the enablement requirements check (detailed below) to verify that the system can support Configuration Integrity, and if so, go through the steps to enable it for the lifetime of the system. This workflow will be explained in more detail in a later section.

The integrity handler is responsible for almost everything that you wrote here, not the HWB. I think this entire paragraph can be removed.

Contributor

Agreed, let's remove the hardware bootstrapper section.

3. Pass unencrypted FEMK to fscrypt.
4. fscrypt uses the FEMK to automatically unlock the necessary encrypted directories.

If any of these steps fail, it is interpreted as an integrity event, an emergency log is generated (which is also broadcast to all consoles on the system) that the system has had its integrity compromised and it must be reprovisioned. The SSR will repeatedly try to start the integrity service to unlock the encrypted directories and fail, each time writing the emergency log.

We should probably mention that it blocks the operation of networking in the event of a failure, because we cannot operate normally if the integrity of the system has been violated. Any recovery steps require physical access, e.g. to reimage the system with a fresh ISO.


### Logging

Logging is handled through existing system components rather than a dedicated log category. During initial system provisioning, the Hardware Bootstrapper handles all Configuration Integrity initialization logging as part of its standard provisioning process. On subsequent boots, the systemd service that is responsible for unlocking encrypted directories logs all unlock operations and service status information through the systemd journal. This provides comprehensive visibility into the operational state of the encryption system during the boot sequence.

Remove HWB from this as well. Also, we should probably include `journalctl -u integrity-handler`.

3. Pass unencrypted FEMK to fscrypt.
4. fscrypt uses the FEMK to automatically unlock the necessary encrypted directories.

If any of these steps fail, it is interpreted as an integrity event, an emergency log is generated (which is also broadcast to all consoles on the system) that the system has had its integrity compromised and it must be reprovisioned. The SSR will repeatedly try to start the integrity service to unlock the encrypted directories and fail, each time writing the emergency log.
Contributor

Please add the output of the broadcast message to this section.


If any of these steps fail, it is interpreted as an integrity event, an emergency log is generated (which is also broadcast to all consoles on the system) that the system has had its integrity compromised and it must be reprovisioned. The SSR will repeatedly try to start the integrity service to unlock the encrypted directories and fail, each time writing the emergency log.

## Troubleshooting
Contributor

Add a subsection to troubleshooting for what to do when a system has been compromised: Zeroize -> Factory Reset, or RMA.

No factory reset unfortunately, because some of the directories we need for the factory reset will be locked, and we won't be able to unlock them due to the compromise. The only options are clean install or RMA.

Chr1st0ph3rTurn3r and others added 4 commits December 10, 2025 09:13
Co-authored-by: Mike Adams <75860404+madamsJuniper@users.noreply.github.com>
Co-authored-by: Adam Drescher <adrescher@juniper.net>
Co-authored-by: Adam Drescher <adrescher@juniper.net>
Co-authored-by: Adam Drescher <adrescher@juniper.net>
MichaelBaj previously approved these changes Dec 11, 2025
4. fscrypt uses the FEK to automatically unlock the necessary encrypted directories.

This systemd service handles the subsequent boots of the SSR after Configuration Integrity has been enabled. It runs a series of integrity checks, and identifies when the system is ready to continue operation after successful unlocking of the encrypted directories. When it is run, it performs the following sequence:
If any of these steps fail, it is interpreted as an integrity event. Network activities are blocked. An emergency log is generated and broadcast to all consoles on the system that the system integrity is compromised and it must be reprovisioned. The SSR will repeatedly try to start the integrity service to unlock the encrypted directories and fail, each time writing the emergency log.
Contributor

extra space after "system"
