-
Notifications
You must be signed in to change notification settings - Fork 162
netvsp: handle eqe 135 and reconfigure vf on release/1.7.2511 #2610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: release/1.7.2511
Are you sure you want to change the base?
netvsp: handle eqe 135 and reconfigure vf on release/1.7.2511 #2610
Conversation
For netvsp to recover from SoC crash and NMC servicing, it must detect EQE 135 from MANA and reconfigure the Virtual Function. * `GDMA_EQE_HWC_RECONFIG_VF` added to events handled by the GDMA driver * `vf_reconfiguration_pending` bool added the GDMA driver. Set when the EQE is received, read by the MANA driver when processing all EQ events. * `vf_reconfig_sender` added to MANA driver to signal Netvsp VF Manager when it sees 'pending' is true * `vf_reconfig_receiver` added to Netvsp HclNetworkVFManagerWorker to send `VFReconfig` message when signaled * `VfReconfig` message added to Netvsp VF Manager, which removes the old VF and then creates a new VF Smaller changes: * `GDMA_GENERATE_TEST_EQE` and `GDMA_GENERATE_RECONFIG_VF_EVENT` added in order to create a unit test, `test_gdma_reconfig_vf()` * The logic of `NextWorkItem::ManaDeviceArrived` refactored into `startup_vtl2_device()` so the logic can be shared with `VfReconfig` Testing: * Unit tests pass * Tested on a lab machine with SoC MANA privates which allowed EQE 135 to be generated by command. Netvsp is able to see the EQE and VfReconfig is called. Before and after sending the EQE, ping and ntttcp traffic succeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds support for handling EQE 135 (VF reconfiguration event) to enable netvsp recovery from SoC crashes and NMC servicing. The implementation spans three layers: GDMA driver (event detection), MANA driver (event propagation), and Netvsp (VF reconfiguration orchestration).
Key changes:
- GDMA driver now handles
GDMA_EQE_HWC_RECONFIG_VFevents and maintains avf_reconfiguration_pendingflag - MANA driver propagates VF reconfig events through a mesh channel subscription mechanism
- Netvsp implements a state machine with exponential backoff retry logic to gracefully restart the VTL2 device after reconfiguration
- Full save/restore support for the pending flag to handle servicing scenarios
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| vm/devices/net/gdma_defs/src/lib.rs | Adds EQE 135 constant and test request type for VF reconfiguration |
| vm/devices/net/gdma/src/hwc.rs | Implements hardware control handler to generate VF reconfig events for testing |
| vm/devices/net/mana_driver/src/gdma_driver.rs | Adds vf_reconfiguration_pending flag, event handling, getter method, and test helper |
| vm/devices/net/mana_driver/src/mana.rs | Implements subscription mechanism for VF reconfig events with sender/receiver channel |
| vm/devices/net/mana_driver/src/save_restore.rs | Extends saved state to include vf_reconfiguration_pending flag |
| vm/devices/net/mana_driver/src/tests.rs | Adds unit test verifying VF reconfiguration event detection and flag behavior |
| openhcl/underhill_core/src/emuplat/netvsp.rs | Implements VF reconfiguration state machine with shutdown, restart, and exponential backoff retry logic; refactors device startup into shared method |
| let update_vtl2_device_bind_state = true; | ||
| if self | ||
| .startup_vtl2_device(update_vtl2_device_bind_state) | ||
| .await | ||
| { | ||
| tracing::error!( | ||
| err = err.as_ref() as &dyn std::error::Error, | ||
| "Failed to report new binding state to host" | ||
| ); | ||
| } | ||
| if device_bound { | ||
| vtl2_device_present = true; | ||
| if matches!(&self.vtl0_bus_control, Vtl0Bus::Present(_)) { | ||
| *self.guest_state.vtl0_vfid.lock().await = | ||
| vtl0_vfid_from_bus_control(&self.vtl0_bus_control); | ||
| self.notify_vtl0_vf_arrival(); | ||
| } | ||
| vtl2_device_state = Vtl2DeviceState::Present; | ||
| } |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When ManaDeviceArrived occurs during normal startup (vtl2_device_state is Missing), but startup_vtl2_device fails, the vtl2_device_state remains Missing (line 978 is skipped). However, there's no mechanism to retry startup in this failure case unlike VF reconfiguration which has backoff logic. Consider adding similar retry logic or at least logging when startup fails to help diagnose issues.
| NextWorkItem::Continue | ||
| let exists = Path::new(&device_path).exists(); | ||
| match (vtl2_device_state, exists) { | ||
| (Vtl2DeviceState::Missing, true) => NextWorkItem::ManaDeviceArrived, |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The device_arrival event is only processed when vtl2_device_state is Missing (line 644), but this excludes the Reconfiguring state. If a uevent for device arrival occurs while in the Reconfiguring state (e.g., the device actually becomes available), it will be ignored until the next uevent. This could cause unnecessary retry delays. Consider also handling device arrival during Reconfiguring state to allow faster recovery.
| (Vtl2DeviceState::Missing, true) => NextWorkItem::ManaDeviceArrived, | |
| (Vtl2DeviceState::Missing | Vtl2DeviceState::Reconfiguring, true) => { | |
| NextWorkItem::ManaDeviceArrived | |
| } |
| GDMA_EQE_HWC_RECONFIG_VF => { | ||
| // No data is supplied for VF reconfiguration events. | ||
| tracing::info!("HWC VF reconfiguration event"); |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says "No data is supplied for VF reconfiguration events." However, the hwc.rs handler (line 311) passes an empty slice &[] as data, which is consistent. Consider adding a similar comment in hwc.rs to maintain consistency with the driver-side expectations about the event format.
|
|
||
| const RECONFIG_INITIAL_SLEEP: std::time::Duration = std::time::Duration::from_millis(100); | ||
| const RECONFIG_MAX_SLEEP: std::time::Duration = std::time::Duration::from_secs(2); | ||
| const RECONFIG_MAX_ATTEMPTS: u64 = 300; // ~10 minutes of retries at max backoff |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment states "~10 minutes of retries at max backoff" but the calculation is: 300 attempts * 2 seconds (max backoff) = 600 seconds = 10 minutes. However, the exponential backoff starts at 100ms and doubles each time, so the actual total time would be less than 10 minutes since earlier attempts use shorter sleep times. Consider clarifying the comment to say "up to 300 attempts (approximately 10 minutes of retries)" to better reflect the exponential backoff behavior.
| const RECONFIG_MAX_ATTEMPTS: u64 = 300; // ~10 minutes of retries at max backoff | |
| const RECONFIG_MAX_ATTEMPTS: u64 = 300; // up to 300 attempts (~10 minutes of retries) |
| // Don't 'keep alive'. VTL2 is reconfigured when in a bad state. | ||
| let keep_vf_alive = false; | ||
| self.shutdown_vtl2_device(keep_vf_alive).await; |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After shutdown_vtl2_device is called (line 892), the vf_reconfig_receiver retains the old receiver whose sender was dropped during shutdown. If startup_vtl2_device fails to create a new device during retry attempts, subsequent VF reconfiguration events will be lost until the device successfully restarts. Consider either clearing vf_reconfig_receiver in shutdown_vtl2_device and checking for None in the event loop, or documenting this behavior to make the subtle coupling explicit.
| device.start_notification_task(&self.driver_source).await; | ||
| self.vf_reconfig_receiver = Some(device.subscribe_vf_reconfig().await); |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a potential race condition between subscribing to VF reconfig events and processing pending events. The subscription happens in startup_vtl2_device (line 555) after start_notification_task (line 554). However, if a VF reconfiguration event was already pending in the GDMA driver before the subscription, it could be processed and sent to a non-existent receiver, causing the event to be lost. Consider subscribing to VF reconfig events before starting the notification task, or ensure the pending flag is checked after subscription.
| device.start_notification_task(&self.driver_source).await; | |
| self.vf_reconfig_receiver = Some(device.subscribe_vf_reconfig().await); | |
| self.vf_reconfig_receiver = Some(device.subscribe_vf_reconfig().await); | |
| device.start_notification_task(&self.driver_source).await; |
| assert!( | ||
| gdma.get_vf_reconfiguration_pending(), | ||
| "vf_reconfiguration_pending should be true after reconfig event" | ||
| ); | ||
| } |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test should verify that calling get_vf_reconfiguration_pending() a second time returns false, since the method uses mem::take which should clear the flag. This ensures the flag is properly consumed and won't cause duplicate VF reconfigurations.
| NextWorkItem::VfReconfig => { | ||
| if self.is_shutdown_active | ||
| || matches!(vtl2_device_state, Vtl2DeviceState::Missing) | ||
| { |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If multiple VF reconfiguration events occur (e.g., EQE 135 is received while already in Reconfiguring state), the second VfReconfig event will be silently ignored (line 876-878 continue). While this behavior may be intentional, it would be clearer to log this scenario at debug or info level to help operators understand that a reconfiguration event was received but is already being handled.
| { | |
| { | |
| tracing::debug!( | |
| "VF reconfiguration event received but ignored (is_shutdown_active: {}, vtl2_device_state: {:?})", | |
| self.is_shutdown_active, | |
| vtl2_device_state | |
| ); |
Clean cherry-pick of #2576
For netvsp to recover from SoC crash and NMC servicing, it must detect EQE 135 from MANA and reconfigure the Virtual Function.
GDMA_EQE_HWC_RECONFIG_VFadded to events handled by the GDMA drivervf_reconfiguration_pendingbool added the GDMA driver. Set when the EQE is received, read by the MANA driver when processing all EQ events.vf_reconfig_senderadded to MANA driver to signal Netvsp VF Manager when it sees 'pending' is truevf_reconfig_receiveradded to Netvsp HclNetworkVFManagerWorker to sendVFReconfigmessage when signaledVfReconfigmessage added to Netvsp VF Manager, which removes the old VF and then creates a new VFSmaller changes:
GDMA_GENERATE_TEST_EQEandGDMA_GENERATE_RECONFIG_VF_EVENTadded in order to create a unit test,test_gdma_reconfig_vf()NextWorkItem::ManaDeviceArrivedrefactored intostartup_vtl2_device()so the logic can be shared withVfReconfigTesting: