User Details
- User Since
- Apr 3 2017, 6:23 PM (452 w, 2 d)
- Availability
- Available
- IRC Nick
- xionox
- LDAP User
- Ayounsi
- MediaWiki User
AYounsi (WMF)
Yesterday
Thanks! Those are still alerting in eqiad:
I was looking into that for the LLDP issue; here are some Redfish paths that could be useful in that context:
Tue, Dec 2
The loopbacks are also in Puppet: https://github.com/search?q=repo%3Awikimedia%2Foperations-puppet%20198.35.26.193&type=code
Mon, Dec 1
I guess we're good here.
I think this is all done, we have cookbooks in place.
I have no idea why restbase would be excluded; Cassandra should be fine, and this is the only Cassandra cluster like this. That just leaves the restbase software itself (yes, the service still runs on the cluster nodes 😢). Can we canary one or two nodes before letting it rip on the full cluster?
Thu, Nov 27
Sure, thanks! I also discovered it just now by reviewing the sub-tasks of T253173: Some clusters do not have DNS for IPv6 addresses (TRACKING TASK).
Only Data-Persistence services are left in the IPv6-less world (see subtasks).
@jcrespo, do you know if that bug got fixed and if we could have that daemon listen on IPv6 now?
All solved.
As far as I know this is all solved now. At least according to the Netbox "network" report.
We're good for thanos!
Paging alerting added. I'll keep the LibreNMS one enabled for now and only disable it later, once we're sure the new one works fine.
Fri, Nov 21
The current cable already goes to lsw1-a2 (https://netbox.wikimedia.org/dcim/cables/7147/), so a3 is probably the next one.
ganeti is fully dual stack, nice!
New updated list: we're at 642 hosts, if the MySQL query from the initial task is still the right way to count.
That's up from 90 in 2020, probably because we've been ramping up our 10G hosts.
Thanks. Since it's only a few, option 1 seems best to me: much less complex to set up and maintain.
Set up the new FQDN, ask people to migrate, give them X months, check activity on the old one, send a reminder email, then move to the CDN.
Management routers are physically single-homed in the old design (eqsin, codfw, eqiad), probably because it was best not to over-engineer it, and the mgmt network is not critical to the infra.
Thu, Nov 20
Thanks, yeah, that must be the reason:
>>> spicerack.redfish('sretest1005').hw_model
9
>>> spicerack.redfish('sretest2004').hw_model
9
>>> spicerack.redfish('sretest2004').generation
16
>>> spicerack.redfish('sretest1005').generation
14
It seems almost certain this is some bug in their HTTP client, presumably they do the HEAD request to set up the file read at the application layer, and when this fails to happen the client does not properly initialise itself to read the downloaded file.
Yep, fully agree with that!
@Papaul, could you have a look at the BIOS of sretest1005?
Not sure if it has been discussed, but what do you think of using Calico's VXLAN or IP-in-IP overlay?
From what I understand it seems like the perfect solution for this use case, and is quite similar to what we would have wanted for CloudVPS if we were to redo it from scratch.
One liners:
>>> spicerack.redfish('sretest2004').scp_dump().components['NIC.Integrated.1-1-1'].get('Broadcom_LLDPNearestBridge')
'Disabled'
>>> spicerack.redfish('cirrussearch2115').scp_dump().components['NIC.Integrated.1-1-1'].get('Broadcom_LLDPNearestBridge')
'Enabled'
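For a wider check, the same attribute could be read for every integrated NIC in one pass; a minimal sketch, assuming dump.components behaves like a regular dict of component names to attribute mappings (as the one-liners above suggest):

# Hypothetical sweep over one host's SCP dump (dict-style iteration is an assumption).
dump = spicerack.redfish('cirrussearch2115').scp_dump()
for component, attributes in dump.components.items():
    if component.startswith('NIC.'):
        print(component, attributes.get('Broadcom_LLDPNearestBridge'))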
Wed, Nov 19
Thanks, looks like I missed it on my first pass, but it seems doable through Redfish on Dell:
>>> dump.set('NIC.Integrated.1-2-1', 'Broadcom_LLDPNearestBridge', 'Disabled')
Updated value for attribute NIC.Integrated.1-2-1 -> Broadcom_LLDPNearestBridge: Enabled => Disabled
Broadcom_LLDPNearestNonTPMRBridge exists too.
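Putting it together, disabling both Broadcom LLDP agents on a port could look like the sketch below; the attribute names come from the dumps above, and pushing the modified config back to the iDRAC is deliberately left out since I haven't checked the exact call for that step:

# Sketch only: flip both Broadcom LLDP agents to Disabled for one NIC port.
dump = spicerack.redfish('sretest2004').scp_dump()
for attribute in ('Broadcom_LLDPNearestBridge', 'Broadcom_LLDPNearestNonTPMRBridge'):
    dump.set('NIC.Integrated.1-2-1', attribute, 'Disabled')
# The modified dump still has to be imported back into the iDRAC (not shown here).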
Haven't dug yet, but maybe an option is to install Broadcom's niccli tool: https://docs.broadcom.com/docs/Linux_Niccli-233.0.198.0
Looks like it was a false hope; I looked at cirrussearch2115, which is showing the same behavior:
lsw1-d3-codfw> show lldp neighbors | match xe-0/0/43
xe-0/0/43    -    04:32:01:db:9c:10    NIC 1/10Gb SFP+ DA    Broadcom Adv. Dual 10Gb Ethernet fw_version:AFW_229.2.52.0
xe-0/0/43    -    d0:c1:b5:00:3c:ae    eno12399np0           cirrussearch2115.codfw.wmnet
I might have found something in Redfish for Dell:
r = spicerack.redfish('sretest2004')
dump = r.scp_dump()
dump.config['SystemConfiguration']['Components'][6]['Attributes'][689]
{'Name': 'NIC.1#TopologyLldp', 'Value': 'Disabled', 'Set On Import': 'True', 'Comment': 'Read and Write'}
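Instead of hard-coding the component and attribute indexes, the raw config could be searched for the TopologyLldp attributes; a rough sketch, assuming each component entry carries an FQDD key next to its Attributes list as in Dell's SCP JSON:

# Hypothetical walk of the raw SCP config looking for any *TopologyLldp attribute.
dump = spicerack.redfish('sretest2004').scp_dump()
for component in dump.config['SystemConfiguration']['Components']:
    for attribute in component.get('Attributes', []):
        if attribute['Name'].endswith('TopologyLldp'):
            print(component.get('FQDD'), attribute['Name'], attribute['Value'])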
As the IPs are already available, we should change the cr3/cr4/mr1 loopbacks ahead of time, in a different maintenance window, so it's one less thing to worry about during the maintenance.
Same for the cr3-cr4 link, and maybe the cr3-mr1 and cr4-mr1 links, but those will move to the switches shortly after.
Looks great, thanks!
Thanks for the great writeup. We should unfortunately look at upgrading Netbox first.
TBD if we need to spend time on a workaround.
Let's open a different task for magru. drmrs is more urgent as they're end of support (and older). magru is to be done when we have time (lower priority).
Tue, Nov 18
How did these hosts get pushed into production if the PXE is set incorrectly? How could they have been installed at all?
It's possible that their config got changed after the most recent re-image, for example by moving them to their 10G NIC after the initial re-image on the 1G NIC without updating PXE.
It also means that they will fail their next re-image.
Looks like it's slowly getting better as well.
@ssingh started working on this with https://gerrit.wikimedia.org/r/1206424 in T410047: No free IPs on public1-ulsfo vlan (Nov 2025), boldly assigning the task to him :)
Mon, Nov 17
You can use 198.35.26.5/28. It's marked as reserved for infra, but we don't need it (and we will need it even less after the network upgrade).
My bad! I turned them off after adding the transit/peering saturation alerts, forgetting the transport and core links... I'll take care of them.
I personally prefer to use the first (ok second) address in each v6 subnet as the gateway, i.e. 2a02:ec80:400:1::1/64
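For illustration, that convention is simply index 1 of the subnet, which Python's ipaddress module gives directly:

import ipaddress

# Gateway as the "second" address of the v6 subnet.
subnet = ipaddress.ip_network('2a02:ec80:400:1::/64')
print(f"{subnet[1]}/{subnet.prefixlen}")  # 2a02:ec80:400:1::1/64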
Sounds good to me.
Oops, I'm still catching up. Sounds great to minimize user impact. Do we know how many systems pull from our rsync? Maybe it's not worth the hassle of the tcp-proxy if the number is low enough.
Could option 3 be something like what's currently being done for Gerrit? https://phabricator.wikimedia.org/T365259
@Papaul, is that something you could look into? Is there a way to disable the NIC's LLDP through the BIOS menu?
Maybe there's a solution in the last comment of that thread: https://www.dell.com/community/en/conversations/rack-servers/how-to-disable-lldp-on-broadcom-57414-nic/647f8904f4ccf8a8de88349b?commentId=647f9badf4ccf8a8defa5418
Once the router change is done, therefore, we need to somehow adjust the netmask on all the existing hosts on the vlan. Probably the simplest way to do this is for us to go through them one-by-one, change the netmask in /etc/network/interfaces, and reboot the host.
Then update the host IP in Netbox or, better, run the Netbox puppetdb import script for each host for a proper sync-up.
Sep 12 2025
Feel free to re-open if you disagree, but it looks like we might not need that heavy port (and tooling) change. Load has been acceptable for a while.
We still have logs full of:
SSHD_LOGIN_FAILED: Login failed for user 'root' from host 'xxxx'
but it doesn't seem to be an actual issue (or too heavy a flood).
Sep 11 2025
Above patch worked successfully: https://netbox.wikimedia.org/extras/changelog/?request_id=7940ab40-742b-47fb-98c6-fba8e4e2989b
Those alerts got moved to AM for the core routers and switches. They are not alerting for management routers anymore.
Sep 10 2025
We need to allow port number 48 on the Nokias, but not port number 0, as they start from 1.
We already (and lazily) do min_value=0, max_value=48, which was to accommodate both SONiC and Junos. We could add additional validators per platform though, something like the sketch after the link below:
https://github.com/wikimedia/operations-software-netbox-extras/blob/master/customscripts/provision_server.py#L136
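A per-platform check could be as simple as the sketch below; the platform keys and bounds are illustrative assumptions, not what provision_server.py does today:

# Illustrative per-platform switch-port ranges (bounds are assumptions, to be confirmed per vendor).
PORT_RANGES = {
    'junos': (0, 47),   # Juniper port numbers start at 0 (e.g. xe-0/0/43)
    'sonic': (0, 48),
    'nokia': (1, 48),   # Nokia port numbers start at 1
}

def validate_switchport(platform: str, port: int) -> None:
    """Raise ValueError if the port number is outside the platform's allowed range."""
    low, high = PORT_RANGES.get(platform, (0, 48))
    if not low <= port <= high:
        raise ValueError(f'Port {port} out of range [{low}-{high}] for {platform}')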
There is no particular rush; let's say before the end of 2025?
Sep 9 2025
The anchor doesn't contain any sensitive data, so yep it can be unplugged and recycled anytime.
All the tooling, metrics, and examples are there for the service owners to set up their alerting, like Traffic did for DNS.
Closing that parent task to focus on the remaining sub-task.
All good!
The physical anchor has been replaced by a VM; moving that task to DCops to recycle the failed hardware: https://netbox.wikimedia.org/dcim/devices/1287/
Closing that never-ending tracking task to focus on more specific sub-tasks now that all the groundwork is done.
@Vgutierrez @ssingh could that be a good opportunity to see how drmrs handles the loss of a switch/rack ?
Well, we managed to get Rancid to work with Nokia, so that's not really needed.
cloudsw2-d5-eqiad is now gone.
@Papaul, would you be OK taking care of that?
Evaluation is done, and @jhathaway has rolled out UUID + MAC fallback DHCP (with the --no82 cookbook parameter). The next step will be to make it the default.