User Details
- User Since
- Apr 3 2017, 6:23 PM (452 w, 2 d)
- Availability
- Available
- IRC Nick
- xionox
- LDAP User
- Ayounsi
- MediaWiki User
AYounsi (WMF)
Yesterday
Thanks! Those are still alerting in eqiad:
I was looking into that for the LLDP issue; here are some Redfish paths that could be useful in that context:
Tue, Dec 2
The loopbacks are also in Puppet: https://github.com/search?q=repo%3Awikimedia%2Foperations-puppet%20198.35.26.193&type=code
Mon, Dec 1
I guess we're good here.
I think this is all done, we have cookbooks in place.
I have no idea why restbase would be excluded; Cassandra should be fine, and this is the only Cassandra cluster like this. That just leaves the restbase software itself (yes, the service still runs on the cluster nodes 😢). Can we canary one or two nodes before letting it rip on the full cluster?
Thu, Nov 27
Sure, thanks! I also discovered it just now by reviewing the sub-tasks of T253173: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK).
Only Data-Persistence services are left in the IPv6-less world (see subtasks).
@jcrespo, do you know if that bug got fixed and if we could have that daemon listen on IPv6 now?
All solved.
As far as I know this is all solved now. At least according to the Netbox "network" report.
We're good for thanos!
Paging alerting added. I'll keep the LibreNMS one enabled for now and only disable it later, once we're sure the new one works fine.
Fri, Nov 21
The current cable already goes to lsw1-a2 (https://netbox.wikimedia.org/dcim/cables/7147/), so a3 is probably the next one.
ganeti is fully dual stack, nice!
New updated list: we're at 642 hosts, if the MySQL query from the initial task is still the right way to count.
That's up from 90 in 2020, probably because we've been ramping up our 10G hosts.
Thanks. Since it's only a few, option 1 seems best to me: much less complex to set up and maintain.
Set up the new FQDN, ask people to migrate, give them X months, check activity on the old one, send a reminder email, then move to the CDN.
Management routers are physically single-homed in the old design (eqsin, codfw, eqiad), probably because it was best not to over-engineer it, and the mgmt network is not critical to the infra.
Thu, Nov 20
Thanks, yeah, that must be the reason:
>>> spicerack.redfish('sretest1005').hw_model
9
>>> spicerack.redfish('sretest2004').hw_model
9
>>> spicerack.redfish('sretest2004').generation
16
>>> spicerack.redfish('sretest1005').generation
14
It seems almost certain this is some bug in their HTTP client, presumably they do the HEAD request to set up the file read at the application layer, and when this fails to happen the client does not properly initialise itself to read the downloaded file.
Yep, fully agree with that!
@Papaul, could you have a look at the BIOS of sretest1005?
Not sure if it has been discussed, but what do you think of using Calico's VXLAN or IP-in-IP overlay?
From what I understand it seems like the perfect solution for this use case, and is quite similar to what we would have wanted for CloudVPS if we were to redo it from scratch.
One liners:
>>> spicerack.redfish('sretest2004').scp_dump().components['NIC.Integrated.1-1-1'].get('Broadcom_LLDPNearestBridge')
'Disabled'
>>> spicerack.redfish('cirrussearch2115').scp_dump().components['NIC.Integrated.1-1-1'].get('Broadcom_LLDPNearestBridge')
'Enabled'
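For a wider check, the same attribute could be read for every integrated NIC in one pass; a minimal sketch, assuming dump.components behaves like a regular dict of component names to attribute mappings (as the one-liners above suggest):

# Hypothetical sweep over one host's SCP dump (dict-style iteration is an assumption).
dump = spicerack.redfish('cirrussearch2115').scp_dump()
for component, attributes in dump.components.items():
    if component.startswith('NIC.'):
        print(component, attributes.get('Broadcom_LLDPNearestBridge'))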
Wed, Nov 19
Thanks, looks like I missed it on my first pass, but it seems doable through Redfish on Dell:
>>> dump.set('NIC.Integrated.1-2-1', 'Broadcom_LLDPNearestBridge', 'Disabled')
Updated value for attribute NIC.Integrated.1-2-1 -> Broadcom_LLDPNearestBridge: Enabled => Disabled
Broadcom_LLDPNearestNonTPMRBridge exists too.
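Putting it together, disabling both Broadcom LLDP agents on a port could look like the sketch below; the attribute names come from the dumps above, and pushing the modified config back to the iDRAC is deliberately left out since I haven't checked the exact call for that step:

# Sketch only: flip both Broadcom LLDP agents to Disabled for one NIC port.
dump = spicerack.redfish('sretest2004').scp_dump()
for attribute in ('Broadcom_LLDPNearestBridge', 'Broadcom_LLDPNearestNonTPMRBridge'):
    dump.set('NIC.Integrated.1-2-1', attribute, 'Disabled')
# The modified dump still has to be imported back into the iDRAC (not shown here).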
Haven't dug yet, but maybe an option is to install Broadcom's niccli tool: https://docs.broadcom.com/docs/Linux_Niccli-233.0.198.0
Looks like it was a false hope; I looked at cirrussearch2115, which is showing the same behavior:
lsw1-d3-codfw> show lldp neighbors | match xe-0/0/43
xe-0/0/43    -    04:32:01:db:9c:10    NIC 1/10Gb SFP+ DA    Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_229.2.52.0
xe-0/0/43    -    d0:c1:b5:00:3c:ae    eno12399np0           cirrussearch2115.codfw.wmnet
I might have found something in Redfish for Dell:
r = spicerack.redfish('sretest2004')
dump = r.scp_dump()
dump.config['SystemConfiguration']['Components'][6]['Attributes'][689]
{'Name': 'NIC.1#TopologyLldp', 'Value': 'Disabled', 'Set On Import': 'True', 'Comment': 'Read and Write'}
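Instead of hard-coding the component and attribute indexes, the raw config could be searched for the TopologyLldp attributes; a rough sketch, assuming each component entry carries an FQDD key next to its Attributes list as in Dell's SCP JSON:

# Hypothetical walk of the raw SCP config looking for any *TopologyLldp attribute.
dump = spicerack.redfish('sretest2004').scp_dump()
for component in dump.config['SystemConfiguration']['Components']:
    for attribute in component.get('Attributes', []):
        if attribute['Name'].endswith('TopologyLldp'):
            print(component.get('FQDD'), attribute['Name'], attribute['Value'])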
As the IPs are already available, we should change the cr3/cr4/mr1 loopbacks ahead of time, in a different maintenance window, so it's one less thing to worry about during the maintenance.
Same for the cr3-cr4 link, and maybe the cr3-mr1 and cr4-mr1 links, but those will move to the switches shortly after.
Looks great, thanks!
Thanks for the great writeup. We should unfortunately look at upgrading Netbox first.
TBD if we need to spend time on a workaround.
Let's open a different task for magru. drmrs is more urgent as they're end of support (and older). magru is to be done when we have time (lower priority).
Tue, Nov 18
How did these hosts get pushed into production if the PXE is set incorrectly? How could they have been installed at all?
It's possible that their config got changed after the most recent re-image, for example by moving them to their 10G NIC after the initial re-image on the 1G NIC without updating PXE.
It also means that they will fail their next re-image.
Looks like it's slowly getting better as well.
@ssingh started working on this with https://gerrit.wikimedia.org/r/1206424 in T410047: No free IPs on public1-ulsfo vlan (Nov 2025), boldly assigning the task to him :)
Mon, Nov 17
You can use 198.35.26.5/28. It's marked as reserved for infra, but we don't need it (and we will need it even less after the network upgrade).
My bad! I turned them off after adding the transit/peering saturation alerts, forgetting the transport and core links... I'll take care of them.
I personally prefer to use the first (ok second) address in each v6 subnet as the gateway, i.e. 2a02:ec80:400:1::1/64
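For illustration, that convention is simply index 1 of the subnet, which Python's ipaddress module gives directly:

import ipaddress

# Gateway as the "second" address of the v6 subnet.
subnet = ipaddress.ip_network('2a02:ec80:400:1::/64')
print(f"{subnet[1]}/{subnet.prefixlen}")  # 2a02:ec80:400:1::1/64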
Sounds good to me.
Oops, I'm still catching up. Sounds great to minimize user impact. Do we know how many systems pull from our rsync? Maybe it's not worth the hassle of the tcp-proxy if the number is low enough.
Could option 3 be something like what's currently being done for Gerrit? https://phabricator.wikimedia.org/T365259
@Papaul, is that something you could look into? Is there a way to disable the NIC's LLDP through the BIOS menu?
Maybe there's a solution in the last comment of that thread: https://www.dell.com/community/en/conversations/rack-servers/how-to-disable-lldp-on-broadcom-57414-nic/647f8904f4ccf8a8de88349b?commentId=647f9badf4ccf8a8defa5418
Once the router change is done, therefore, we need to somehow adjust the netmask on all the existing hosts on the vlan. Probably the simplest way to do this is for us to go through them one-by-one, change the netmask in /etc/network/interfaces, and reboot the host.
Then update the host IP in Netbox or, better, run the Netbox puppetdb import script for each host for a proper sync-up.
Sep 12 2025
Feel free to re-open if you disagree, but it looks like we might not need that heavy port (and tooling) change. Load has been acceptable for a while.
We still have logs full of:
SSHD_LOGIN_FAILED: Login failed for user 'root' from host 'xxxx'
but it doesn't seem to be an actual issue (or too heavy a flood).
Sep 11 2025
Above patch worked successfully: https://netbox.wikimedia.org/extras/changelog/?request_id=7940ab40-742b-47fb-98c6-fba8e4e2989b
Those alerts got moved to AM for the core routers and switches. They are not alerting for management routers anymore.
Sep 10 2025
We need to allow port number 48 on the Nokias, but not port number 0, as they start from 1.
We already (and lazily) do min_value=0, max_value=48, which was to accommodate both SONiC and Junos. We could add additional validators per platform though, something like the sketch after the link below:
https://github.com/wikimedia/operations-software-netbox-extras/blob/master/customscripts/provision_server.py#L136
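A per-platform check could be as simple as the sketch below; the platform keys and bounds are illustrative assumptions, not what provision_server.py does today:

# Illustrative per-platform switch-port ranges (bounds are assumptions, to be confirmed per vendor).
PORT_RANGES = {
    'junos': (0, 47),   # Juniper port numbers start at 0 (e.g. xe-0/0/43)
    'sonic': (0, 48),
    'nokia': (1, 48),   # Nokia port numbers start at 1
}

def validate_switchport(platform: str, port: int) -> None:
    """Raise ValueError if the port number is outside the platform's allowed range."""
    low, high = PORT_RANGES.get(platform, (0, 48))
    if not low <= port <= high:
        raise ValueError(f'Port {port} out of range [{low}-{high}] for {platform}')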
There is no particular rush; let's say before the end of 2025?
Sep 9 2025
The anchor doesn't contain any sensitive data, so yep it can be unplugged and recycled anytime.
All the tooling, metrics, and examples are there for the service owners to set up their alerting, like Traffic did for DNS.
Closing that parent task to focus on the remaining sub-task.
All good!
The physical anchor has been replaced by a VM; moving that task to DCops to recycle the failed hardware: https://netbox.wikimedia.org/dcim/devices/1287/
Closing that never-ending tracking task to focus on more specific sub-tasks now that all the groundwork is done.
@Vgutierrez @ssingh could that be a good opportunity to see how drmrs handles the loss of a switch/rack ?
Well, we managed to get Rancid to work with Nokia, so that's not really needed.
cloudsw2-d5-eqiad is now gone.
@Papaul, would you be OK taking care of that?
Evaluation is done, and @jhathaway has rolled out UUID + MAC fallback DHCP (with the --no82 cookbook parameter). The next step will be to make it the default.