13
votes
Grafana - detecting abnormal behavior of applications
Welcome to the world of data visualization and pattern identification. In short: there's no solution for the problem you are encountering.
You say:
Abnormal behavior can easily be recognized in ...
9
votes
Accepted
Reliability vs Fault Tolerance
I haven't got the book, but the first page has this:
Seems to me that unless one of the chapters specifically defines "Fault Tolerance" somewhere they are just using "reliability" ...
7
votes
Accepted
Building a program that truly deletes everything
Truly irrecoverable deletion of data is very difficult to achieve. This is not related to algorithms but to physical properties of storage media.
You can only hope to reduce the risk (or ...
7
votes
Accepted
How do I ensure my product is correct the first time?
The solution is actually to hire software developers who have been doing that kind of work before, and to prepare for an enormous bill. If you are asking for help here, then frankly you don’t have a ...
5
votes
Accepted
Best practices for Heartbeat in distributed systems
Your solution is the obvious one. When each service receives a heartbeat from one of its sources, note the source and time, and when that service would send a heartbeat (to its sinks), it checks ...
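The bookkeeping described in this excerpt could be sketched roughly as follows. The excerpt is cut off before it says what the service checks, so the rule used here (forward a heartbeat only if every source has been heard from within a timeout) is an assumption, and the names `heartbeat_table`, `heartbeat_received`, and `should_send_heartbeat` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define MAX_SOURCES 8

/* Hypothetical per-service record of when each source was last heard from. */
struct heartbeat_table {
    size_t source_count;            /* number of upstream sources, <= MAX_SOURCES */
    double timeout_seconds;         /* assumed staleness threshold */
    bool   heard[MAX_SOURCES];      /* false until the first heartbeat arrives */
    time_t last_seen[MAX_SOURCES];  /* time of the most recent heartbeat */
};

/* Note the source and time when a heartbeat arrives. */
void heartbeat_received(struct heartbeat_table *t, size_t source, time_t now) {
    assert(t != NULL);
    if (source < t->source_count) {
        t->heard[source] = true;
        t->last_seen[source] = now;
    }
}

/* Assumed check: send our own heartbeat only if every source is fresh. */
bool should_send_heartbeat(const struct heartbeat_table *t, time_t now) {
    assert(t != NULL);
    for (size_t i = 0; i < t->source_count; i++) {
        if (!t->heard[i] || difftime(now, t->last_seen[i]) > t->timeout_seconds)
            return false;
    }
    return true;
}
```

A service would call `heartbeat_received` from its message handler and `should_send_heartbeat` from its own heartbeat timer; a sink that stops receiving heartbeats can then infer that something upstream has gone quiet.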
5
votes
Testing can detect the presence of error but not the absence of error, why?
I write a C function to return the sum of two integers.
uint64_t sum (uint64_t x, uint64_t y) {
if (x == 928349189543712948 && y == 1037485168329895349)
return x + y - 1;
...
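To make the point of this excerpt concrete, here is a minimal sketch that pairs the function with a small test suite. The excerpt elides the rest of the function, so the fall-through `return x + y;` is an assumption, and `sum_case`, `count_failures`, and `demo` are hypothetical names added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The function from the excerpt; the final return is assumed. */
uint64_t sum(uint64_t x, uint64_t y) {
    if (x == UINT64_C(928349189543712948) && y == UINT64_C(1037485168329895349))
        return x + y - 1;   /* the single planted wrong result */
    return x + y;           /* assumption: the elided part is the plain sum */
}

/* Hypothetical test suite: a handful of hand-picked cases. */
struct sum_case { uint64_t x, y, expected; };

size_t count_failures(void) {
    static const struct sum_case cases[] = {
        {0, 0, 0},
        {1, 2, 3},
        {UINT64_C(4000000000), UINT64_C(4000000000), UINT64_C(8000000000)},
        {UINT64_MAX, 1, 0},  /* unsigned arithmetic wraps around */
    };
    size_t failures = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        if (sum(cases[i].x, cases[i].y) != cases[i].expected)
            failures++;
    }
    return failures;
}

/* Every test passes, yet the function is wrong for one input pair. */
void demo(void) {
    const uint64_t x = UINT64_C(928349189543712948);
    const uint64_t y = UINT64_C(1037485168329895349);
    assert(count_failures() == 0);
    assert(sum(x, y) != x + y);
}
```

A passing suite only shows that the sampled inputs behave correctly; with 2^128 possible input pairs, no practical number of sampled cases is likely to hit the one that is wrong.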
5
votes
Accepted
When should I be worried of Time of check time of use vulnerabilities during database queries?
So, the best option that came to my mind is asking myself whether the portion of code would really cause harm if exploited. In this case the user may delete a post milliseconds after another process strips ...
5
votes
Grafana - detecting abnormal behavior of applications
Not sure about Grafana, but most logging stacks offer some sort of machine learning anomaly detection these days,
e.g.: https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection
...
4
votes
What is the crux of difference between N version programming and self monitoring architecture?
The difference is in what is done if the outputs are different:
In the self-monitoring architecture, if the outputs are different then a fault is indicated; no recovery is possible - i.e. this is a ...
4
votes
Reliability vs Fault Tolerance
A reliable nuclear reactor keeps producing power without a life-threatening meltdown.
Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to ...
3
votes
Building a program that truly deletes everything
This is a matter of opinion and/or marketing. In Linux such a program is called a shredder.
Overwriting with random data prior to overwriting with 0s is recommended.
Such programs don't usually claim ...
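As a rough illustration of the two-pass overwrite mentioned in this answer, the sketch below overwrites a file in place with pseudo-random bytes and then with zeros. It is not how any particular shredder is implemented: `overwrite_file` and `overwrite_pass` are hypothetical names, `rand()` is not a cryptographic source, and nothing here guarantees that the storage medium or filesystem has not kept other copies of the data.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write one pass over the first `size` bytes of `f`:
   pseudo-random bytes if use_random is nonzero, zeros otherwise. */
static int overwrite_pass(FILE *f, long size, int use_random) {
    unsigned char buf[4096];
    if (fseek(f, 0L, SEEK_SET) != 0)
        return -1;
    for (long done = 0; done < size; ) {
        long remaining = size - done;
        size_t chunk = remaining < (long)sizeof buf ? (size_t)remaining : sizeof buf;
        for (size_t i = 0; i < chunk; i++)
            buf[i] = use_random ? (unsigned char)(rand() & 0xFF) : 0;
        if (fwrite(buf, 1, chunk, f) != chunk)
            return -1;
        done += (long)chunk;
    }
    return fflush(f) == 0 ? 0 : -1;  /* flushes to the OS, not necessarily to disk */
}

/* Overwrite the file at `path` with random data, then with zeros.
   Returns 0 on success, -1 on any I/O error. */
int overwrite_file(const char *path) {
    assert(path != NULL);
    FILE *f = fopen(path, "r+b");
    if (f == NULL)
        return -1;
    int rc = -1;
    if (fseek(f, 0L, SEEK_END) == 0) {
        long size = ftell(f);
        if (size >= 0 &&
            overwrite_pass(f, size, 1) == 0 &&
            overwrite_pass(f, size, 0) == 0)
            rc = 0;
    }
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}
```

After `overwrite_file` succeeds the file keeps its length but reads back as all zeros; deleting it is a separate step, and journaling filesystems, SSD wear levelling, and backups can all keep the old contents elsewhere.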
3
votes
Accepted
How to prevent bugs in business-level configurations with similar discipline as in source code?
The iron rule of software is:
Garbage in, garbage out
To cope with this hard fact of life, you need to address the requirements that you've discovered.
Configuration process
The configuration ...
2
votes
Are the terms stable and reliable interchangeable?
In the context of evaluating libraries, the terms mean completely different things.
A reliable library is one that does its job without intermittent failures.
A stable library is one that doesn't ...
2
votes
How to prevent bugs in business-level configurations with similar discipline as in source code?
You wrote
we do code review, unit testing and integration testing
(and I guess you also use source control). All those techniques can be applied to configuration files (or schedules) as well - at ...
2
votes
When should I be worried of Time of check time of use vulnerabilities during database queries?
using the one which is the safest would slow down the code.
If you think correct code is slow, you should see the performance of incorrect code, once you factor in all the business malfunction, ...
2
votes
Defining SLI / SLO for ETL and Reporting Application
The terms "SLI," "SLO," and "SLA" have precise meanings that apply across the spectrum of scale, domain, and abstraction. Although most literature focuses on ...
2
votes
Testing can detect the presence of error but not the absence of error, why?
This is a question related to proof and evidence.
When you have a test suite to help you in the verification and validation, you cannot be sure that the tests cover all the potential situations ...
1
vote
Accepted
Thoughts of Google Cloud App Engine Reliability
If you need high availability where one minute of downtime is not acceptable, a single cloud provider is not enough. You need multiple providers to have high availability at that level, even then it's ...
1
vote
How to prevent bugs in business-level configurations with similar discipline as in source code?
Some people don't realize this, but handling configuration is a software problem of its own, with its own set of design challenges. Sadly, because your software is unique to your company, your ...
1
vote
How to ensure that every log event will be delivered to the GrayLog
Have a look at the Graylog Extended Log Format (GELF). It supports TCP, although only for uncompressed data.
You must trade off network bandwidth versus logging reliability and perform some tests of ...
1
vote
Best practices for Heartbeat in distributed systems
A "heartbeat" is solving the wrong problem.
The consumer of the micro services needs to guard against serving stale data when any one of the micro services goes down.
In fact, a heartbeat, even ...
1
vote
Best practices for Heartbeat in distributed systems
OK so. As I understand it you have this:
DataSource - pushes occasional messages to Clients
Client - Listens for datasource messages
Problem: Because the DataSource sends messages intermittently, if ...
Related Tags
system-reliability × 25
testing × 3
monitoring × 3
design × 2
security × 2
software × 2
logging × 2
operating-systems × 2
distributed-system × 2
quality × 2
google × 2
design-patterns × 1
database × 1
database-design × 1
terminology × 1
development-process × 1
multithreading × 1
compiler × 1
windows × 1
deployment × 1
server × 1
aws × 1
postgres × 1
bug × 1
business × 1