Timeline for Ops in event-driven paradigm

Current License: CC BY-SA 4.0

9 events

when	what		by	license	comment
Jun 23, 2020 at 5:25	comment	added	Yuri		@9000 Yes, I mentioned a correlation ID such as `X-request-id`. It is a very basic utility, though. It doesn't seem to really help much when an event travels through multiple stages of 1-to-many queues/topics/?, right?
Jun 23, 2020 at 5:19	vote	accept	Yuri
Jun 23, 2020 at 1:24	comment	added	9000		A very simple thing we use is passing an `x-request-id` header across all requests, and making it available inside every logger in each request handler. The header is the industry standard; Envoy proxy sets it unless the caller already did. Traceability skyrocketed: just filter by the value of this header. For best results, also pass a sequence ID, because clocks on several nodes of a distributed system can diverge by tens or hundreds of milliseconds, so timestamps don't give you the correct sequence. Good logging is helpful, too.
Jun 22, 2020 at 21:54	answer	added	Kain0_0		timeline score: 3
Jun 19, 2020 at 18:51	comment	added	Yuri		@Kain0_0 Interesting ideas! Although I'm still not sure about quantifying the tooling-related costs/savings. Can you post an answer so I can accept it?
Jun 18, 2020 at 7:05	comment	added	Kain0_0		I would argue that the software hasn't been properly developed if it cannot be operated in production efficiently. Contrast `X developers * Y hours` vs. `A incidents * B operators * C hours + Overtime + Downtime`. The business is not paying to build software, but to operate it. The upfront cost is real, but running costs will almost always swamp it. That same tooling, if done well, can also be used to test integration systems, which lets testing (an expensive component of development) be done with greater certainty and at lower cost.
Jun 18, 2020 at 6:48	comment	added	Yuri		Private-facing tooling that is developed along with the application seems to fix the ops issues. However, the additional cost of such a solution must surely be rather huge.
Jun 18, 2020 at 6:41	comment	added	Kain0_0		Tooling. You are going to need to get the developers on board with providing you with tools to help aggregate and make a picture out of the data, along with the ability to replay events (or generate new messages to get the system repaired). You are also going to need to push back on the business to get the funds/bandwidth to get these tools and keep them current with new work.
Jun 17, 2020 at 22:19	history	asked	Yuri	CC BY-SA 4.0
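
The `x-request-id` approach described in the Jun 23 comment by 9000 could look roughly like the sketch below. It is only an illustration under assumptions, not anyone's actual implementation: the names `request_id_var`, `sequence_var`, `RequestContextFilter`, `build_logger` and `handle_request` are made up, the `x-sequence-id` header name is hypothetical (the comment only says "pass a sequence ID"), and header keys are assumed to be lowercase already. The handler reuses the caller's request ID or generates one, makes it available to every log line through a logging filter, and numbers log lines with a counter that continues from the value the caller passed.

```python
import contextvars
import itertools
import logging
import uuid

# Hypothetical context variables holding the per-request correlation data.
request_id_var = contextvars.ContextVar("request_id", default="-")
sequence_var = contextvars.ContextVar("sequence", default=None)


class RequestContextFilter(logging.Filter):
    """Attach the current request ID and the next sequence number to each record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        counter = sequence_var.get()
        record.seq = next(counter) if counter is not None else 0
        return True


def build_logger(stream):
    """Create a logger whose every line carries the request ID and sequence number."""
    logger = logging.getLogger("app")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(request_id)s seq=%(seq)d %(message)s")
    )
    handler.addFilter(RequestContextFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(headers, logger):
    """Handle one request and return the headers to forward downstream."""
    # Reuse the caller's ID if present, otherwise start a new one.
    request_id = headers.get("x-request-id") or str(uuid.uuid4())
    # Continue numbering after the value the caller sent (assumed header name).
    counter = itertools.count(int(headers.get("x-sequence-id", "0")) + 1)
    id_token = request_id_var.set(request_id)
    seq_token = sequence_var.set(counter)
    try:
        logger.info("request received")
        logger.info("request handled")
        # The outgoing call itself takes the next sequence number.
        return {"x-request-id": request_id, "x-sequence-id": str(next(counter))}
    finally:
        sequence_var.reset(seq_token)
        request_id_var.reset(id_token)
```

Filtering the aggregated logs by the request ID then yields all lines for one request, and sorting by `seq` orders them without relying on node clocks. A single counter like this only orders a linear chain of calls; as Yuri's reply points out, fan-out over 1-to-many queues or topics would need something more.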