13 votes

Grafana - detecting abnormal behavior of applications

Welcome to the world of data visualization and pattern identification. In short: there's no solution for the problem you are encountering. You say: Abnormal behavior can easily be recognized in ...
Arseni Mourzenko
9 votes
Accepted

Reliability vs Fault Tolerance

I haven't got the book, but the first page has this: Seems to me that unless one of the chapters specifically defines "Fault Tolerance" somewhere, they are just using "reliability" ...
Ewan
  • 84.4k
7 votes
Accepted

Building a program that truly deletes everything

Truly irrecoverable deletion of data is very difficult to achieve. This is not related to algorithms but to physical properties of storage media. You can only hope to reduce the risk (or ...
Christophe
  • 82.2k
7 votes
Accepted

How do I ensure my product is correct the first time?

The solution is actually to hire software developers who have been doing that kind of work before, and to prepare for an enormous bill. If you are asking for help here, then frankly you don’t have a ...
gnasher729
  • 49.4k
5 votes
Accepted

Best practices for Heartbeat in distributed systems

Your solution is the obvious one. When each service receives a heartbeat from one of its sources, note the source and time, and when that service would send a heartbeat (to its sinks), it checks ...
Caleth
  • 12.4k
5 votes

Testing can detect the presence of error but not the absence of error, why?

I write a C function to return the sum of two integers. uint64_t sum (uint64_t x, uint64_t y) { if (x == 928349189543712948 && y == 1037485168329895349) return x + y - 1; ...
gnasher729
  • 49.4k
5 votes
Accepted

When should I be worried of Time of check time of use vulnerabilities during database queries?

So, the best option that came to my mind is asking myself whether the portion of code would really cause harm if exploited. In this case the user may delete a post milliseconds after another process strips ...
JimmyJames
  • 30.9k
5 votes

Grafana - detecting abnormal behavior of applications

Not sure about Grafana, but most logging stacks offer some sort of machine learning anomaly detection these days, e.g. https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection ...
Ewan
  • 84.4k
4 votes

What is the crux of difference between N version programming and self monitoring architecture?

The difference is in what is done if the outputs are different: In the self-monitoring architecture, if the outputs are different then a fault is indicated; no recovery is possible - i.e. this is a ...
Philip Kendall
4 votes

Reliability vs Fault Tolerance

A reliable nuclear reactor keeps producing power without a life-threatening meltdown. Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to ...
candied_orange
3 votes

Building a program that truly deletes everything

This is a matter of opinion and/or marketing. In Linux such a program is called a shredder. Overwriting with random data prior to overwriting with 0s is recommended. Such programs don't usually claim ...
Tulains Córdova
3 votes
Accepted

How to prevent bugs in business-level configurations with similar discipline as in source code?

The iron rule of software is: Garbage in, garbage out. To cope with this hard fact of life, you need to address the requirements that you've discovered. Configuration process: The configuration ...
Christophe
  • 82.2k
2 votes

Are the terms stable and reliable interchangeable?

In the context of evaluating libraries, the terms mean completely different things. A reliable library is one that does its job without intermittent failures. A stable library is one that doesn't ...
Sebastian Redl
2 votes

How to prevent bugs in business-level configurations with similar discipline as in source code?

You wrote "we do code review, unit testing and integration testing" (and I guess you also use source control). All those techniques can be applied to configuration files (or schedules) as well - at ...
Doc Brown
  • 220k
2 votes

When should I be worried of Time of check time of use vulnerabilities during database queries?

"using the one which is the safest would slow down the code." If you think correct code is slow, you want to see the performance of incorrect code, once you factor in all the business malfunction, ...
Steve
  • 12.6k
2 votes

Defining SLI / SLO for ETL and Reporting Application

The terms "SLI," "SLO," and "SLA" have precise meanings that apply across the spectrum of scale, domain, and abstraction. Although most literature focuses on ...
asthasr
  • 3,469
2 votes

Testing can detect the presence of error but not the absence of error, why?

This is a question related to proof and evidence. When you have a test suite to help you in the verification and validation, you cannot be sure that the tests cover all the potential situations ...
Christophe
  • 82.2k
1 vote
Accepted

Thoughts of Google Cloud App Engine Reliability

If you need high availability where one minute of downtime is not acceptable, a single cloud provider is not enough. You need multiple providers to have high availability at that level, even then it's ...
Ryathal
  • 13.5k
1 vote

How to prevent bugs in business-level configurations with similar discipline as in source code?

Some people don't realize this, but handling configuration is a software problem of its own, with its own set of design challenges. Sadly, because your software is unique to your company, your ...
Diane M
  • 2,116
1 vote

How to ensure that every log event will be delivered to the GrayLog

Have a look at the Graylog Extended Log Format (GELF). It supports TCP, although only for uncompressed data. You must trade off network bandwidth versus logging reliability and perform some tests of ...
helb
  • 1,420
1 vote

Best practices for Heartbeat in distributed systems

A "heartbeat" is solving the wrong problem. The consumer of the microservices needs to guard against serving stale data when any one of the microservices goes down. In fact, a heartbeat, even ...
Greg Burghardt
1 vote

Best practices for Heartbeat in distributed systems

OK so. As I understand it, you have this: DataSource: pushes occasional messages to Clients. Client: listens for DataSource messages. Problem: Because the DataSource sends messages intermittently, if ...
Ewan
  • 84.4k

Only top scored, non community-wiki answers of a minimum length are eligible