Duplicate BLOB in Postgres duplicate the storage requirements?

Question

I uploaded a huge 2GB file in postgres in a Blob tuple field. In example for id #1

If I upload the same file a second time (in example for id #2). Do I need the double space requirements?

As far as I know the Postgres file is stored as a OID-reference-value.

After uploading a duplicate, I noticed the OID differ for same BLOBs.

Does that mean a 100% duplicate BLOB is stored on the storage-device?

J.D. · Accepted Answer · 2023-08-07 12:31:13Z

0

You generally want to avoid storing files in a database system, that's what a file system is best for. Typically it's better to store a reference to the location of the file in the database instead. This makes management of the database better, and reduces your overhead (e.g. with backups, etc).

But if you must store the file in the database, you should do it in a normalized way if the same file will be referenced by multiple rows of data. To accomplish this, you can have a separate Files table, and the main table has a foreign key reference to that table, so multiple rows can reference the same file without storing redundant data.

And your application should do an UPSERT on the Files table to ensure it never INSERTs the same file twice.

answered Aug 7, 2023 at 12:31

J.D.

41.1k12 gold badges64 silver badges145 bronze badges

Why do you think a filesystem is better for files? I see only downsides.

Grim
– Grim

2023-08-07 12:52:38 +00:00
Commented Aug 7, 2023 at 12:52
@Grim Well, the name "file system" should be a giveaway that it's the right system for managing files. The pros of storing the files on disk and referencing their location in the database are a number of things inclusive of: not bloating your database, faster querying, not bloating the backups of the database, faster recovery time, reduced redundancy of the file itself, easier management of the files themselves, etc.

J.D.
– J.D.

2023-08-07 17:29:59 +00:00
Commented Aug 7, 2023 at 17:29
My dbs queries does not going slower. I would not say it bloats the database, I would say it sovles the single-source-of-truth. I am happy with the fact that a backup of the database also backups the referenced files, so the backup have no broken-file-references, let me phrase it as corrupted foreign-keys. The redundancy is a db-design problem. A file forces you to solve the owner/permission-problem, it is not acid since its there (or not) indepnednendly of the commit, it forces you to have a directory, it forces you to think about vulnerability (unpack petabytes in a 1kb zip, ADS bullshit etc

Grim
– Grim

2023-08-07 23:01:09 +00:00
Commented Aug 7, 2023 at 23:01
You dont need a complex secure and backup plan than "just backup the db", you automatically have a network-filesystem, you have easy statictis source for the overall-storage-requirement, you can balance the datasource across nodes easier than on file-systems, you can order, index and find files by name or/and size without having os-specific find commands faster and easier. You can consistent guarantee no ambiguity in a db but never using files. A unique filename is no precondition in the same folder. Windows does not create unexpected thumbs.db files, linux does not create .trash-folders.

Grim
– Grim

2023-08-07 23:16:21 +00:00
Commented Aug 7, 2023 at 23:16
@Grim There's unfortunately too much subjectivity & incorrectness with what you said, that I don't have the energy to reply to it all. But an example of what you said that is incorrect is "you can order, index and find files by name or/and size without having os-specific find commands faster and easier". This is no different than if you stored the files on disk and the paths in the database, indexed accordingly. Again, backups become bloated and restore times take longer. Regardless of your personal opinion, the general professional consensus is that files are better stored in a file system.

J.D.
– J.D.

2023-08-08 01:40:48 +00:00
Commented Aug 8, 2023 at 1:40

| Show 7 more comments

Grim · Accepted Answer · 2023-08-07 13:36:36Z

On insert without select oid from other table, yes.

If I update the blob from id #1 with the bytes from id #2, it does not duplicate the bytes, instead the OID are the same for #1 and #2.

Example (163125 has 2.734 bytes and 162540 has 11.093.682 bytes):

rm=# SELECT pg_database_size('rm') / 1024 / 1024;
 ?column? 
----------
    29236
(1 Zeile)

rm=# 
rm=# 
rm=# 
rm=# 
rm=# update "DATA_KONTOR_ENTRY_VERSION" set entry_bytes =(select entry_bytes from "DATA_KONTOR_ENTRY_VERSION" where id = 162540) where id = 163125;
UPDATE 1
rm=# 
rm=# 
rm=# 
rm=# 
rm=# SELECT pg_database_size('rm') / 1024 / 1024;
 ?column? 
----------
    29236
(1 Zeile)

rm=# select entry_bytes from "DATA_KONTOR_ENTRY_VERSION" where id = 163125 or id = 162540;
 entry_bytes 
-------------
      536951
      536951
(2 Zeilen)

rm=#

Stack Exchange Network

Duplicate BLOB in Postgres duplicate the storage requirements?

2 Answers 2

Hot Network Questions

Duplicate BLOB in Postgres duplicate the storage requirements?

2 Answers 2

Related

Hot Network Questions