Source Link
Hans-Martin Mosner

Based on my experience, the answer is the usual one: it depends. Even in the specific scenario you described, it depends on the exact scope and on what you want to monitor and troubleshoot. Starting to monitor "everything" is quite demanding.

  • Do you want to know how frequently such events occur? Let's go with metrics.
  • Do you want to know the exact reason such events couldn't be processed? Traces enriched with custom tags.
  • Do you want to add domain information? Structured logs. Information about the transport of your data is available out of the box with traces; adding logs for that kind of information is expensive in cost, resources consumed, and maintenance.
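As a sketch of the structured-logs option, standard-library Python can emit each log record as one JSON object so domain fields become queryable; the field names (`event_type`, `reason`) are illustrative, not from the original answer:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line so a log pipeline can index fields."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Domain fields attached via the `extra` dict, if any were given.
            **getattr(record, "domain", {}),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("events")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# An event that could not be processed, logged with domain context.
logger.info("event rejected",
            extra={"domain": {"event_type": "order.created",
                              "reason": "schema mismatch"}})
```

The same pattern works with any structured-logging backend; the point is that domain attributes travel as fields, not as text baked into the message.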

Sampling is indeed an issue, but it is an issue of how much you invest, not of the tool. With Application Insights or Datadog you can simply pay more and you will not hit it. Most of the time, though, it is better to reduce the amount of data stored and save only the telemetry you actually need. Still, selecting which data to keep can be hard depending on the system you are working on; an alternative is not to rely on external products but to run your own monitoring platform (e.g. Prometheus, Grafana, Tempo, Loki, or Elasticsearch with Kibana and Logstash). I would avoid custom solutions built from generic tools, or use them only if I don't plan to invest in or expand them. Somewhere you have to invest time, money, and resources.
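One common way to cut stored telemetry, assuming head-based sampling fits your case, is a deterministic decision on the trace ID, so every service seeing the same trace keeps or drops it whole; the 10% ratio below is purely illustrative:

```python
import random

def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic head sampling: keep a trace iff its ID falls below
    ratio * max_id. All services make the same decision for the same
    trace ID, so traces are kept or dropped as a whole."""
    max_id = 2 ** 128  # trace IDs are 128 bits wide, as in W3C Trace Context
    return trace_id < ratio * max_id

# Over many random trace IDs, roughly 10% survive.
kept = sum(should_sample(random.getrandbits(128), 0.10) for _ in range(10_000))
print(f"kept {kept} of 10000 traces")
```

This is the idea behind ratio-based samplers such as OpenTelemetry's TraceIdRatioBased; more sophisticated setups use tail sampling, deciding after the trace completes, at the cost of buffering.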

Once you define what you need to monitor in your flows and how, all the rest will follow. And, in my personal opinion, start small: just metrics or traces. Once people start using the monitoring platform, more requests will come. Just like a product shaped by customer requests, it is a feature > user > feedback loop; don't expect it to be a time-boxed activity, it is a constant process.
