Let's say, in a rather big application suite with multiple more or less integrated products, data is stored across multiple databases. Some of them are SQL-ish DB clusters, some are MongoDB clusters.
Some entities (= "rows" or "documents", depending on the type of DB) are stored in several databases. Many (if not all) entities depend on entities of another kind.
Now, the problem is: data consistency and validation. A lot of data is not in sync with other data, with the intended schema, or with the intended application logic.
Let's take an example and say the system is about pets. We have an SQL table called `pet` with a row for dogs. The `pet` table has a `food` column containing document IDs that reference a MongoDB collection named `petFood`. However, it could happen that:
- The document with the linked ID does not exist in the `petFood` collection.
- Some `petFood` documents have no `pet` linked to them.
- The data might contradict itself, like the `dogs` row being linked to a document with a `suitableForDogs: false` property.
Additionally, there might be data consistency problems within a single MongoDB document. E.g. a `petFood` document could be set to `availableInAsia: false` and still have a `distributorInAsia: ACME LTD` property (which doesn't make sense, because something that is not available in Asia cannot have a distributor there).
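To make this concrete, here is a minimal sketch of those broken states in Python (the dicts, IDs, and field values are hypothetical stand-ins; in reality this data lives in the SQL cluster and the MongoDB cluster):

```python
# Hypothetical in-memory stand-ins for the real storage: the SQL `pet`
# table and the MongoDB `petFood` collection from the example above.

pet_rows = [
    {"name": "dogs", "food": "food-1"},    # links to a petFood document
    {"name": "cats", "food": "food-404"},  # broken link: no such document
]

pet_food_docs = {
    "food-1": {
        "suitableForDogs": False,          # contradicts the dogs row linking here
        "availableInAsia": False,
        "distributorInAsia": "ACME LTD",   # contradicts availableInAsia: false
    },
    "food-2": {
        "suitableForDogs": True,           # orphan: no pet row links to it
    },
}
```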
You may ask how the data ended up with these issues. Well, there could be various reasons:
- In some situations, stuff is changed directly in DB or with an ad-hoc script instead of in the application.
- There is old data not up to date with current application logic.
- Bugs in the application code leave some data in the wrong state.
- DB migration problems happen, backup restores only work partially, and stuff like that.
- ... etc etc.
These things do happen and they will happen. Hence, manually cleaning up once and then hoping that the data will never become messy again is pointless.
The same goes for the well-meant advice you always hear regarding MongoDB: "Enforce schema and validity at the application level". It simply does not work like that.
So, the question is: What to do? I need to find a solution that makes sure that the data stays consistent and valid over time.
The best solution I could come up with:
- Programmatically define all the rules that play a role in data consistency/validity.
- Run a regular "checker script" which checks that all the rules are followed everywhere.
- If not, the checker notifies the responsible people, like: "Hey, the `dogs` row in the `pet` table is assigned to a `petFood` document which has `suitableForDogs: false`. Please fix it!"
- Maybe add some kind of threshold, like: a problem must be seen in two consecutive runs before anyone is notified (in order to exclude cases where the check happens to run during an async operation). A sketch of the whole idea follows this list.
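Here is a minimal sketch of that checker (everything in it — the rule functions, the JSON state file, the `notify` callback — is hypothetical; the real version would query the actual SQL and MongoDB clusters instead of taking in-memory data):

```python
import json
from pathlib import Path

# Hypothetical rule functions. Each takes the loaded data and returns a
# list of human-readable violation messages.

def rule_linked_food_exists(pet_rows, pet_food_docs):
    return [
        f"pet row '{row['name']}' links to a missing petFood document '{row['food']}'"
        for row in pet_rows
        if row["food"] not in pet_food_docs
    ]

def rule_no_orphan_food(pet_rows, pet_food_docs):
    linked = {row["food"] for row in pet_rows}
    return [
        f"petFood document '{doc_id}' has no pet linked to it"
        for doc_id in pet_food_docs
        if doc_id not in linked
    ]

def rule_suitable_for_dogs(pet_rows, pet_food_docs):
    return [
        f"row '{row['name']}' in pet table is assigned to a petFood document "
        f"which has suitableForDogs: false"
        for row in pet_rows
        if row["name"] == "dogs"
        and pet_food_docs.get(row["food"], {}).get("suitableForDogs") is False
    ]

def rule_asia_consistency(pet_rows, pet_food_docs):
    return [
        f"petFood document '{doc_id}' has availableInAsia: false "
        f"but also a distributorInAsia"
        for doc_id, doc in pet_food_docs.items()
        if doc.get("availableInAsia") is False and "distributorInAsia" in doc
    ]

RULES = [rule_linked_food_exists, rule_no_orphan_food,
         rule_suitable_for_dogs, rule_asia_consistency]

STATE_FILE = Path("checker_state.json")  # remembers the previous run's findings

def run_checker(pet_rows, pet_food_docs, notify):
    violations = {msg for rule in RULES for msg in rule(pet_rows, pet_food_docs)}

    # Threshold: only notify about a violation that was already present in
    # the previous run, so a check racing an async operation is ignored.
    previous = set(json.loads(STATE_FILE.read_text())) if STATE_FILE.exists() else set()
    for msg in sorted(violations & previous):
        notify(f"Hey, please fix this: {msg}")

    STATE_FILE.write_text(json.dumps(sorted(violations)))
```

Fed with the broken example data from the first sketch and a `notify` callback like `print`, the second run of `run_checker` would report all four problems.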
But that means a lot of work and there is no technical guarantee that the people who get these notifications will react accordingly.
So, what would be a better technical approach?
(I'm not asking for organizational measures like 'Take away database access from people who notoriously and regularly mess up the data'.)