Timeline for Ops in event-driven paradigm
Current License: CC BY-SA 4.0
9 events
| when | what | by | license | comment | |
|---|---|---|---|---|---|
| Jun 23, 2020 at 5:25 | comment | added | Yuri | @9000 Yes, I mentioned a correlation ID such as X-request-id. Although, it is a very basic utility. It doesn't seem to really help much when an event travels through multiple stages of 1-to-many queues/topics/?, right? | |
| Jun 23, 2020 at 5:19 | vote | accept | Yuri | ||
| Jun 23, 2020 at 1:24 | comment | added | 9000 | A very simple thing we use is passing an x-request-id header across all requests, and making it available inside every logger in each request handler. The header is a de facto industry standard; Envoy proxy sets it unless the caller already did. Traceability skyrocketed: just filter by the value of this header. For best results, pass a sequence ID too, because clocks on several nodes of a distributed system can diverge by tens or hundreds of milliseconds, so timestamps don't give you the correct sequence. Good logging is helpful, too. | |
| Jun 22, 2020 at 21:54 | answer | added | Kain0_0 | timeline score: 3 | |
| Jun 19, 2020 at 18:51 | comment | added | Yuri | @Kain0_0 Interesting ideas! Although I'm still not sure about quantifying the tooling-related costs/savings. Can you post an answer so I can accept it? | |
| Jun 18, 2020 at 7:05 | comment | added | Kain0_0 | I would argue that the software hasn't been properly developed if it cannot be operated in production efficiently. Contrast X developers * Y hours vs. A incidents * B operators * C hours + Overtime + Downtime. The business is not paying to build software, but to operate it. The upfront development is an expense, but running costs will almost always swamp it. Also, that same tooling, if done well, can be used to test integration systems, which lets testing (an expensive component of development) be done with greater certainty and at lower cost. | |
| Jun 18, 2020 at 6:48 | comment | added | Yuri | Private-facing tooling which is developed along with the application seems to fix the ops issues. However, the additional cost of such a solution must surely be rather huge. | |
| Jun 18, 2020 at 6:41 | comment | added | Kain0_0 | Tooling. You are going to need to get the developers on board with providing you with tools to help aggregate and make a picture out of the data, along with the ability to replay events (or generate new messages to get the system repaired). You are also going to need to push back on the business to get the funds/bandwidth to get these tools and keep them current with new work. | |
| Jun 17, 2020 at 22:19 | history | asked | Yuri | CC BY-SA 4.0 |
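
The correlation-ID idea from the comments above (an `x-request-id` header made available to every logger, plus a sequence ID so ordering does not depend on node clocks) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `x-sequence-id` header name, the `handle_request` and `make_logger` helpers, and the log format are all hypothetical, and a real fan-out through 1-to-many topics would need each consumer to copy both headers onto every message it emits.

```python
import contextvars
import itertools
import logging
import uuid

# Hypothetical per-request tracing state, visible to every logger call.
request_id_var = contextvars.ContextVar("request_id", default="-")
sequence_var = contextvars.ContextVar("sequence", default=None)


class RequestContextFilter(logging.Filter):
    """Attach the current request ID and the next sequence number to a record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        counter = sequence_var.get()
        record.seq = next(counter) if counter is not None else 0
        return True


def make_logger(name, stream=None):
    """Build a logger whose lines start with the request ID and sequence number."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(logging.Formatter("%(request_id)s #%(seq)d %(message)s"))
    handler.addFilter(RequestContextFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(headers, logger):
    """Process one incoming message and return headers for outgoing messages."""
    # Reuse the caller's ID if present; otherwise start a new trace.
    request_id = headers.get("x-request-id") or str(uuid.uuid4())
    # "x-sequence-id" is an assumed header name, not a standard one.
    start = int(headers.get("x-sequence-id", 0))
    counter = itertools.count(start + 1)
    request_id_var.set(request_id)
    sequence_var.set(counter)

    logger.info("request received")
    logger.info("publishing event")

    # Downstream consumers continue counting from this value.
    return {"x-request-id": request_id, "x-sequence-id": str(next(counter))}
```

Filtering the aggregated logs by the request ID then yields every stage that handled the event, and sorting by the sequence number gives an ordering along each path that does not rely on timestamps. Parallel branches of a fan-out would share sequence numbers, so this gives only a partial order.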