@weizhouapache
Member

Description

This PR fixes: #12107 #11879

Steps to reproduce the issue

  • Create a VPC with a redundant offering
  • Create a VPC tier and a VM
  • Check /etc/dnsmasq.d/cloud.conf

Expected: the VPC tier gateway is the first entry in the DNS option line

dhcp-option=tag:interface-eth2-0,6,<VPC tier gateway>,<DNS1>,<DNS2>
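
As a rough illustration of the expected result, the sketch below builds such a line and checks that the tier gateway comes first. The function names and sample addresses are hypothetical and are not taken from the CloudStack code base; only the `dhcp-option=tag:interface-<device>-<idx>,6,...` shape comes from the description above.

```python
def build_dns_option(device, idx, gateway, dns_servers):
    """Return a dnsmasq DNS (option 6) line with the tier gateway listed first."""
    # Drop empty entries and avoid listing the gateway twice.
    others = [d for d in dns_servers if d and d != gateway]
    servers = [gateway] + others
    return "dhcp-option=tag:interface-%s-%s,6,%s" % (device, idx, ",".join(servers))


def gateway_is_first(line, gateway):
    """Check that `gateway` is the first DNS server in a dhcp-option 6 line."""
    fields = [f.strip() for f in line.split(",")]
    # fields[0] is the tag part, fields[1] the option number, the rest are servers.
    return len(fields) > 2 and fields[1] == "6" and fields[2] == gateway


# Hypothetical addresses, for illustration only.
line = build_dns_option("eth2", 0, "10.1.1.1", ["8.8.8.8", "8.8.4.4"])
print(line)
print(gateway_is_first(line, "10.1.1.1"))
```

A check like `gateway_is_first` could be run against each option 6 line read from /etc/dnsmasq.d/cloud.conf when reproducing the issue.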

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@codecov

codecov bot commented Nov 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 17.78%. Comparing base (b5fd39f) to head (11a14ae).
⚠️ Report is 56 commits behind head on 4.22.

Additional details and impacted files
@@             Coverage Diff              @@
##               4.22   #12161      +/-   ##
============================================
+ Coverage     17.59%   17.78%   +0.18%     
- Complexity    15601    15923     +322     
============================================
  Files          5910     5911       +1     
  Lines        529780   539168    +9388     
  Branches      64729    68896    +4167     
============================================
+ Hits          93226    95873    +2647     
- Misses       426060   432683    +6623     
- Partials      10494    10612     +118     
Flag Coverage Δ
uitests 3.56% <ø> (-0.05%) ⬇️
unittests 18.87% <ø> (+0.20%) ⬆️

@weizhouapache
Member Author

@blueorangutan package

@blueorangutan

@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15857

@weizhouapache
Member Author

@blueorangutan test

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@Jayd603

Jayd603 commented Nov 28, 2025

This patch worked for me.

@weizhouapache weizhouapache marked this pull request as ready for review November 28, 2025 16:38
@weizhouapache
Member Author

@blueorangutan test

@blueorangutan

@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

@weizhouapache weizhouapache force-pushed the 4.22-fix-vpc-rvr-dns-list branch from f5b4060 to e5d7cf2 on November 30, 2025 12:59
@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 15869

@blueorangutan

[SF] Trillian test result (tid-14892)
Environment: kvm-ol8 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 51401 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr12161-t14892-kvm-ol8.zip
Smoke tests completed. 149 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@apache apache deleted a comment from blueorangutan Dec 1, 2025
@apache apache deleted a comment from blueorangutan Dec 1, 2025
@DaanHoogland
Contributor

@weizhouapache, does this need testing for non-VPC, non-redundant routers? cc @nvazquez @Pearl1594.

@weizhouapache
Member Author

> @weizhouapache, does this need testing for non-VPC, non-redundant routers? cc @nvazquez @Pearl1594.

yes @DaanHoogland

the change was introduced in #9102, but may be related to the Netris plugin too (#10458)
@nvazquez @Pearl1594

@Jayd603

Jayd603 commented Dec 3, 2025

I just hit the below error trying to deploy an instance with this fix in place. 169.254.39.227 is the current BACKUP VPC router. Will try to debug further.

2025-12-03 16:52:27,797 DEBUG [c.c.a.t.Request] (AgentManager-Handler-9:[]) (logid:) Seq 28-8558246666887546921: Processing:  { Ans: , MgmtId: 90520744930075, via: 28, Ver: v1, Flags: 10, [{"com.cloud.agent.api.routing.GroupAnswer":{"results":["null - success: Creating file in VR, with ip: 169.254.39.227, file: monitor_service.json.23cd5b3b-c47e-457e-bc28-9bb079d34f9d","null - failed: java.io.IOException: Stream closed

Upon second deployment attempt, it tried the PRIMARY router and also failed. hmm.

@weizhouapache
Member Author

> monitor_service.json.23cd5b3b-c47e-457e-bc28-9bb079d34f9d

@Jayd603 can you run the following commands in the VPC VR?

cd /var/cache/cloud
cp processed/monitor_service.json.23cd5b3b-c47e-457e-bc28-9bb079d34f9d.gz .
gzip -dk monitor_service.json.23cd5b3b-c47e-457e-bc28-9bb079d34f9d.gz
update_config.py monitor_service.json.23cd5b3b-c47e-457e-bc28-9bb079d34f9d
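
For readers who prefer to script the copy and decompress steps, here is a minimal Python sketch of the first three commands. The function name `restore_processed_config` is hypothetical; the `/var/cache/cloud` and `processed/` paths are the ones used in the commands above. Running `update_config.py` on the restored file is left as a manual step.

```python
import gzip
import shutil
from pathlib import Path


def restore_processed_config(name, cache_dir="/var/cache/cloud"):
    """Decompress processed/<name>.gz into <cache_dir>/<name> and return the new path.

    The archive under processed/ is left in place, similar to `gzip -dk`.
    """
    cache = Path(cache_dir)
    src = cache / "processed" / (name + ".gz")
    dst = cache / name
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

After calling `restore_processed_config("monitor_service.json.23cd5b3b-c47e-457e-bc28-9bb079d34f9d")` on the VR, the restored file can be passed to `update_config.py` as in the last command above.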

@Jayd603

Jayd603 commented Dec 3, 2025

> monitor_service.json.23cd5b3b-c47e-457e-bc28-9bb079d34f9d
>
> @Jayd603 can you run the following commands in the VPC VR?
>
> cd /var/cache/cloud
> cp processed/monitor_service.json.23cd5b3b-c47e-457e-bc28-9bb079d34f9d.gz .
> gzip -dk monitor_service.json.23cd5b3b-c47e-457e-bc28-9bb079d34f9d.gz
> update_config.py monitor_service.json.23cd5b3b-c47e-457e-bc28-9bb079d34f9d

I noticed I had used tabs in the Python file (durr). I fixed that, but now I get other errors:

(r-95-VM) Resource [Host:28] is unreachable: Host 28: Unable to start instance due to Unable to start VM:fcdb10f4-1bde-4cc0-a04f-afd20a2392f2 due to error in finalizeStart, not retrying

I fixed the .py files and attempted a router reboot; it totally fails now.

Also, that file you pasted does not exist.

I'm re-deploying fresh routers without your patch - after that, other than modifying the .py files on each router, what else do I need to do to test this?
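
The tab problem mentioned above is easy to hit when patching the VR's .py files by hand, since Python 3 rejects indentation that mixes tabs and spaces inconsistently. Below is a small sketch for spotting tab-indented lines before rebooting a router. The function name is hypothetical; the standard library's `tabnanny` module offers a similar check.

```python
def lines_with_tab_indent(source):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Isolate the leading whitespace of the line.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            flagged.append(number)
    return flagged


sample = "def f():\n\treturn 1\n    # spaces are fine\n"
print(lines_with_tab_indent(sample))  # prints [2]
```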

@Jayd603

Jayd603 commented Dec 3, 2025

> monitor_service.json.23cd5b3b-c47e-457e-bc28-9bb079d34f9d
>
> @Jayd603 can you run the following commands in the VPC VR?

After deploying fresh routers, I was able to apply your patch and reboot them both, and was then able to deploy with password functionality. Not sure what happened, but it seems OK now. I'm doing more testing and will report back if I notice anything.

@weizhouapache
Member Author

> monitor_service.json.23cd5b3b-c47e-457e-bc28-9bb079d34f9d
>
> @Jayd603 can you run the following commands in the VPC VR?
>
> After deploying fresh routers, I was able to apply your patch and reboot them both, and was then able to deploy with password functionality. Not sure what happened, but it seems OK now. I'm doing more testing and will report back if I notice anything.

Good, good to know.

@weizhouapache
Member Author

@blueorangutan package

@weizhouapache
Member Author

verified by
#11879 (comment)

@Pearl1594 Pearl1594 self-assigned this Jan 26, 2026
@Pearl1594
Contributor

@blueorangutan package

@blueorangutan

@Pearl1594 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✖️ debian ✔️ suse15. SL-JID 16537

@Pearl1594 Pearl1594 removed their assignment Jan 26, 2026
Contributor

@Pearl1594 Pearl1594 left a comment

It appears to work as intended, adding the VPC tier gateway to the DNS list in /etc/dnsmasq.d/cloud.conf for both standard and redundant VPC setups. However, for redundant VPCs, only the primary VR includes the tier gateway in the DNS line. When I stop the primary VR and the backup becomes primary, the VPC tier gateway is no longer present in the dhcp-option line for DNS. I'm not sure if that's expected behaviour.
Also, regarding a comment on this possibly impacting NSX: it may not, as there is a parameter, userouteripresolver, available when creating a VPC network with the NSX provider. When set to true, the VR IP is used as the nameserver; otherwise DNS1 and DNS2 are used. cc @nvazquez
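
A comparison like the primary/backup check described above can be scripted. The sketch below parses the text of a cloud.conf file and reports the tags whose DNS list does not start with the tier gateway. The function names and sample addresses are hypothetical, and the sketch assumes a single tier (one gateway for all tags), which a real multi-tier VPC would not satisfy.

```python
def dns_options(conf_text):
    """Map each dnsmasq tag to its list of DNS servers (DHCP option 6)."""
    prefix = "dhcp-option=tag:"
    result = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line.startswith(prefix):
            continue
        fields = [f.strip() for f in line[len(prefix):].split(",")]
        # fields[0] is the tag, fields[1] the option number, the rest are values.
        if len(fields) >= 2 and fields[1] == "6":
            result[fields[0]] = fields[2:]
    return result


def tags_missing_gateway(conf_text, gateway):
    """Return the tags whose DNS list does not start with the tier gateway."""
    return sorted(
        tag for tag, servers in dns_options(conf_text).items()
        if not servers or servers[0] != gateway
    )


# Hypothetical file contents from a primary and a backup VR.
primary = "dhcp-option=tag:interface-eth2-0,6,10.1.1.1,8.8.8.8,8.8.4.4\n"
backup = "dhcp-option=tag:interface-eth2-0,6,8.8.8.8,8.8.4.4\n"
print(tags_missing_gateway(primary, "10.1.1.1"))
print(tags_missing_gateway(backup, "10.1.1.1"))
```

Running the same check on both routers makes it easier to tell a real failover bug from a case where only one VR carries the patched scripts.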

@weizhouapache
Member Author

@Pearl1594 @nvazquez

can you have a look at how to make it work with both NSX/Netris and regular VPC?

I think the 4.19 code works for regular VPC

    if gn.get_dns() and device:
        sline = "dhcp-option=tag:interface-%s-%s,6" % (device, idx)
        dns_list = [x for x in gn.get_dns() if x]
        if self.config.is_dhcp() and not self.config.use_extdns():
            guest_ip = self.config.address().get_guest_ip()
            if guest_ip and guest_ip in dns_list and ip not in dns_list:
                # Replace the default guest IP in VR with the ip in additional IP ranges, if shared network has multiple IP ranges.
                dns_list.remove(guest_ip)
                dns_list.insert(0, ip)
        line = "dhcp-option=tag:interface-%s-%s,6,%s" % (device, idx, ','.join(dns_list))
        self.conf.search(sline, line)
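
To make the list handling in the snippet above easier to follow outside the VR code, here is a standalone restatement with plain arguments. The function name and parameters are hypothetical stand-ins for the `gn` and `self.config` calls, and the addresses are made up for illustration.

```python
def build_dns_list(dns, ip, guest_ip, is_dhcp=True, use_extdns=False):
    """Return the DNS servers to advertise, mirroring the list logic of the snippet above."""
    # Drop empty entries, as the list comprehension in the snippet does.
    dns_list = [x for x in dns if x]
    if is_dhcp and not use_extdns:
        # Swap the default guest IP for `ip` only when `ip` is not already listed.
        if guest_ip and guest_ip in dns_list and ip not in dns_list:
            dns_list.remove(guest_ip)
            dns_list.insert(0, ip)
    return dns_list


# Additional IP range: the default guest IP is replaced by the range's own IP.
print(build_dns_list(["10.1.1.1", "8.8.8.8", ""], ip="10.1.2.1", guest_ip="10.1.1.1"))
# Same IP as the guest IP: the list is left unchanged.
print(build_dns_list(["10.1.1.1", "8.8.8.8"], ip="10.1.1.1", guest_ip="10.1.1.1"))
```

Note that the replacement only happens when the guest IP is already in the list returned by `gn.get_dns()`, so the result depends on what that call returns.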

closing this PR

@Pearl1594
Contributor

This fix partly works, @weizhouapache. My only confusion was with redundant VRs: when the backup VR becomes the primary one, it doesn't have the tier gateway in the DNS list.

@weizhouapache
Member Author

weizhouapache commented Jan 27, 2026

> This fix partly works, @weizhouapache. My only confusion was with redundant VRs: when the backup VR becomes the primary one, it doesn't have the tier gateway in the DNS list.

@Pearl1594
we should fix it then

@Pearl1594
Contributor

I think I made a mistake while testing, my bad. I had copied the cloud-scripts.tgz and agent.zip files to the hosts and tested on an existing env; one of the VRs had the fix while the other didn't, leading to the discrepancy I observed. On an env built from this PR, it works as expected. Apologies @weizhouapache. I'm re-opening this PR.

@Pearl1594 Pearl1594 reopened this Jan 27, 2026
Contributor

@Pearl1594 Pearl1594 left a comment

tested - lgtm

@Pearl1594
Contributor

@blueorangutan test

@blueorangutan

@Pearl1594 a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

Redundant VPC - cloud-init can no longer retrieve passwords from VPC router password server

5 participants