Let's say, in a rather big application suite with multiple more or less integrated products, data is stored across multiple databases. Some of them are SQL-ish DB clusters, some are MongoDB clusters.
Some entities (= "rows" or "documents", depending on the type of DB) are stored in several databases. Many (if not all) entities depend on entities of another kind.
Now, the problem is: data consistency and validation. A lot of data is not in sync with other data, with the intended schema, or with the intended application logic.
Let's take an example and say the system is about pets. We have an SQL table called `pet` with a row for dogs. The `pet` table has a `food` column containing document IDs that reference a MongoDB collection named `petFood`. However, it could happen that:
- The document with the linked ID does not exist in the `petFood` collection.
- Some `petFood` documents have no `pet` linked to them.
- The data might contradict itself, like the `dogs` row being linked to a document with a `suitableForDogs: false` property.
Additionally, there might be data consistency problems within a single MongoDB document. E.g. a `petFood` document could be set to `availableInAsia: false` and still have a `distributorInAsia: ACME LTD` property (which doesn't make sense, because something that is not available in Asia cannot have a distributor there).
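To make this concrete, here is a minimal sketch of those broken states in Python (the dicts, IDs, and field values are hypothetical stand-ins; in reality this data lives in the SQL cluster and the MongoDB cluster):

```python
# Hypothetical in-memory stand-ins for the real storage: the SQL `pet`
# table and the MongoDB `petFood` collection from the example above.

pet_rows = [
    {"name": "dogs", "food": "food-1"},    # links to a petFood document
    {"name": "cats", "food": "food-404"},  # broken link: no such document
]

pet_food_docs = {
    "food-1": {
        "suitableForDogs": False,          # contradicts the dogs row linking here
        "availableInAsia": False,
        "distributorInAsia": "ACME LTD",   # contradicts availableInAsia: false
    },
    "food-2": {
        "suitableForDogs": True,           # orphan: no pet row links to it
    },
}
```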
You may ask how the data ended up with these issues. Well, there could be various reasons:
- In some situations, stuff is changed directly in DB or with an ad-hoc script instead of in the application.
- There is old data not up to date with current application logic.
- Bugs in the application code leave some data in the wrong state.
- DB migration problems happen, backup restores only work partially, and stuff like that.
- ... etc etc.
These things do happen and they will happen. Hence, manually cleaning up once and then hoping that the data will never become messy again is pointless.
The same goes for the well-meant advice you always hear regarding MongoDB: "Enforce schema and validity at the application level". It simply does not work like that.
So, the question is: What to do? I need to find a solution that makes sure that the data stays consistent and valid over time.
The best solution I could come up with:
- Programmatically define all the rules that play a role in data consistency/validity.
- Run a regular "checker script" which checks that all the rules are followed everywhere.
- If not, the checker notifies the responsible people, like: "Hey, the `dogs` row in the `pet` table is assigned to a `petFood` document which has `suitableForDogs: false`. Please fix it!"
- Maybe add some kind of threshold, like: a problem must be seen in two consecutive runs before anyone is notified (in order to exclude cases where the check happens to run during an async operation). A sketch of the whole idea follows this list.
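Here is a minimal sketch of that checker (everything in it — the rule functions, the JSON state file, the `notify` callback — is hypothetical; the real version would query the actual SQL and MongoDB clusters instead of taking in-memory data):

```python
import json
from pathlib import Path

# Hypothetical rule functions. Each takes the loaded data and returns a
# list of human-readable violation messages.

def rule_linked_food_exists(pet_rows, pet_food_docs):
    return [
        f"pet row '{row['name']}' links to a missing petFood document '{row['food']}'"
        for row in pet_rows
        if row["food"] not in pet_food_docs
    ]

def rule_no_orphan_food(pet_rows, pet_food_docs):
    linked = {row["food"] for row in pet_rows}
    return [
        f"petFood document '{doc_id}' has no pet linked to it"
        for doc_id in pet_food_docs
        if doc_id not in linked
    ]

def rule_suitable_for_dogs(pet_rows, pet_food_docs):
    return [
        f"row '{row['name']}' in pet table is assigned to a petFood document "
        f"which has suitableForDogs: false"
        for row in pet_rows
        if row["name"] == "dogs"
        and pet_food_docs.get(row["food"], {}).get("suitableForDogs") is False
    ]

def rule_asia_consistency(pet_rows, pet_food_docs):
    return [
        f"petFood document '{doc_id}' has availableInAsia: false "
        f"but also a distributorInAsia"
        for doc_id, doc in pet_food_docs.items()
        if doc.get("availableInAsia") is False and "distributorInAsia" in doc
    ]

RULES = [rule_linked_food_exists, rule_no_orphan_food,
         rule_suitable_for_dogs, rule_asia_consistency]

STATE_FILE = Path("checker_state.json")  # remembers the previous run's findings

def run_checker(pet_rows, pet_food_docs, notify):
    violations = {msg for rule in RULES for msg in rule(pet_rows, pet_food_docs)}

    # Threshold: only notify about a violation that was already present in
    # the previous run, so a check racing an async operation is ignored.
    previous = set(json.loads(STATE_FILE.read_text())) if STATE_FILE.exists() else set()
    for msg in sorted(violations & previous):
        notify(f"Hey, please fix this: {msg}")

    STATE_FILE.write_text(json.dumps(sorted(violations)))
```

Fed with the broken example data from the first sketch and a `notify` callback like `print`, the second run of `run_checker` would report all four problems.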
But that means a lot of work and there is no technical guarantee that the people who get these notifications will react accordingly.
So, what would be a better technical approach?
(I'm not asking for organizational measures like 'Take away database access from people who notoriously and regularly mess up the data'.)