Hottest 'system-reliability' Answers

13 votes

Grafana - detecting abnormal behavior of applications

Welcome to the world of data visualization and pattern identification. In short: there's no solution for the problem you are encountering. You say: Abnormal behavior can easily be recognized in ...

Arseni Mourzenko

139k

answered Jun 15 at 16:08

9 votes

Accepted

Reliability vs Fault Tolerance

I haven't got the book, but the first page has this: Seems to me that, unless one of the chapters specifically defines "Fault Tolerance" somewhere, they are just using "reliability" ...

Ewan

84.4k

answered Mar 29 at 12:47

7 votes

Accepted

Building a program that truly deletes everything

A truly irrecoverable deletion of data is very difficult to achieve. This is not related to algorithms but to the physical properties of storage media. You can only hope to reduce the risk (or ...

Christophe

82.2k

answered Jun 17, 2021 at 13:39

7 votes

Accepted

How do I ensure my product is correct the first time?

The solution is actually to hire software developers who have done that kind of work before, and to prepare for an enormous bill. If you are asking for help here, then frankly you don’t have a ...

gnasher729

49.4k

answered Jul 7, 2019 at 16:45

5 votes

Accepted

Best practices for Heartbeat in distributed systems

Your solution is the obvious one. When each service receives a heartbeat from one of its sources, note the source and time, and when that service would send a heartbeat (to its sinks), it checks ...

Caleth

12.4k

answered Apr 6, 2018 at 10:10

5 votes

Testing can detect the presence of error but not the absence of error, why?

I write a C function to return the sum of two integers. uint64_t sum (uint64_t x, uint64_t y) { if (x == 928349189543712948 && y == 1037485168329895349) return x + y - 1; ...

gnasher729

49.4k

answered Nov 24, 2019 at 19:40

5 votes

Accepted

When should I be worried of Time of check time of use vulnerabilities during database queries?

So, the best option that came to my mind is asking myself whether the portion of code would really cause harm if exploited. In this case the user may delete a post milliseconds after another process strips ...

JimmyJames

30.9k

answered Feb 14, 2024 at 19:08

5 votes

Grafana - detecting abnormal behavior of applications

Not sure about Grafana, but most logging stacks offer some sort of machine learning anomaly detection these days, e.g. https://www.elastic.co/docs/explore-analyze/machine-learning/anomaly-detection ...

Ewan

84.4k

answered Jun 16 at 8:01

4 votes

What is the crux of difference between N version programming and self monitoring architecture?

The difference is in what is done if the outputs are different: In the self-monitoring architecture, if the outputs are different then a fault is indicated; no recovery is possible - i.e. this is a ...

Philip Kendall

26.1k

answered Oct 17, 2021 at 15:03

4 votes

Reliability vs Fault Tolerance

A reliable nuclear reactor keeps producing power without a life-threatening meltdown. Reliability engineering is a sub-discipline of systems engineering that emphasizes the ability of equipment to ...

candied_orange

120k

answered Mar 29 at 12:47

3 votes

Building a program that truly deletes everything

This is a matter of opinion and/or marketing. In Linux such a program is called a shredder. Overwriting with random data prior to overwriting with 0s is recommended. Such programs don't usually claim ...

Tulains Córdova

39.6k

answered Jun 17, 2021 at 12:47

3 votes

Accepted

How to prevent bugs in business-level configurations with similar discipline as in source code?

The iron rule of software is: garbage in, garbage out. To cope with this hard fact of life, you need to address the requirements that you've discovered. Configuration process: The configuration ...

Christophe

82.2k

answered Sep 15, 2019 at 8:14

2 votes

Are the terms stable and reliable interchangeable?

In the context of evaluating libraries, the terms mean completely different things. A reliable library is one that does its job without intermittent failures. A stable library is one that doesn't ...

Sebastian Redl

15.6k

answered May 17, 2019 at 11:58

2 votes

How to prevent bugs in business-level configurations with similar discipline as in source code?

You wrote: "we do code review, unit testing and integration testing" (and I guess you also use source control). All those techniques can be applied to configuration files (or schedules) as well - at ...

Doc Brown

220k

answered Sep 15, 2019 at 7:01

2 votes

When should I be worried of Time of check time of use vulnerabilities during database queries?

"using the one which is the safest would slow down the code." If you think correct code is slow, you want to see the performance of incorrect code, once you factor all the business malfunction, ...

Steve

12.6k

answered Feb 14, 2024 at 14:31

2 votes

Defining SLI / SLO for ETL and Reporting Application

The terms "SLI," "SLO," and "SLA" have precise meanings that apply across the spectrum of scale, domain, and abstraction. Although most literature focuses on ...

asthasr

3,469

answered Sep 8, 2022 at 18:21

2 votes

Testing can detect the presence of error but not the absence of error, why?

This is a question related to proof and evidence. When you have a test suite to help you in the verification and validation, you cannot be sure that the tests cover all the potential situations ...

Christophe

82.2k

answered Nov 24, 2019 at 14:18

1 vote

Accepted

Thoughts of Google Cloud App Engine Reliability

If you need high availability where one minute of downtime is not acceptable a single cloud provider is not enough. You need multiple providers to have high availability at that level, even then it's ...

Ryathal

13.5k

answered Feb 14, 2020 at 13:52

1 vote

How to prevent bugs in business-level configurations with similar discipline as in source code?

Some people don't realize this, but handling configuration is a software problem of its own, with its own set of design challenges. Sadly, because your software is unique to your company, your ...

Diane M

2,116

answered Sep 15, 2019 at 0:57

1 vote

How to ensure that every log event will be delivered to the GrayLog

Have a look at the Graylog Extended Log Format (GELF). It supports TCP, although only for uncompressed data. You must trade off network bandwidth versus logging reliability and perform some tests of ...

helb

1,420

answered Dec 14, 2018 at 10:08

1 vote

Best practices for Heartbeat in distributed systems

A "heartbeat" is solving the wrong problem. The consumer of the micro services needs to guard against serving stale data when any one of the micro services goes down. In fact, a heartbeat, even ...

Greg Burghardt

46.2k

answered Apr 6, 2018 at 16:15

1 vote

Best practices for Heartbeat in distributed systems

OK so. As I understand it, you have this: DataSource: pushes occasional messages to Clients. Client: listens for DataSource messages. Problem: Because the DataSource sends messages intermittently, if ...

Ewan

84.4k

answered Apr 6, 2018 at 14:33
