I was wondering: without using any kind of user info (id, nickname, age, etc.), what would be the best way to ensure filename uniqueness using PHP in a large database with high traffic, most probably with many users acting simultaneously? I am using $file = time(), for example, but I would like to know whether this suffices when two users run this code at the same time (at a very large scale: 10000 users at the same time, with the same function run 200 times at the same time).
4 Answers
If you don't want to hash the entire file to generate a unique id, a couple of other ways come to mind:
You can use an 'auto increment' column in a database. Each insert gives you a unique id, managed by the database. Then base your file name off of that.
You can create a unique identifier from the existing session id or remote IP address and the time. You might even use the file size as well. Concatenating them together should prevent file name collisions far more reliably than using the time alone.
You can implement some other single-process service that distributes unique ids upon request. The requesting PHP script would request an id and wait until one was returned before proceeding. It's unlikely that simply distributing ids would be a bottleneck, even at very high levels of traffic.
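As a rough illustration of the second option above, here is a minimal sketch that concatenates the session id, remote IP address, file size, and the current time, then hashes the result so it is safe to use as a file name. The function name `candidate_name` and the choice of SHA-256 are assumptions for this example, not anything prescribed above. Two uploads from the same session with the same size in the same microsecond would still collide, so this lowers the odds without guaranteeing uniqueness.

```php
<?php
// Hypothetical helper: builds a file name from request details plus the time.
// This reduces the chance of a collision; it does not guarantee uniqueness.
function candidate_name(string $sessionId, string $remoteIp, int $fileSize): string
{
    $parts = [
        $sessionId,
        $remoteIp,
        (string) $fileSize,
        sprintf('%.6F', microtime(true)), // time with microsecond precision
    ];

    // Hash the concatenation so the result contains only hex characters.
    return hash('sha256', implode('|', $parts));
}
```

In a request handler this might be called as `candidate_name(session_id(), $_SERVER['REMOTE_ADDR'], $_FILES['upload']['size'])`, where the `upload` field name is also an assumption.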
Use a GUID. This is the canonical solution to this problem. See https://en.wikipedia.org/wiki/Globally_unique_identifier. Also https://www.rfc-editor.org/rfc/rfc4122.
A GUID is a 128-bit identifier that you can use to uniquely identify just about anything you like. The large number of bits reduces the risk of collision to the point where it can be ignored. You'll never see one.
Find a reputable algorithm, or use a library function from your existing libraries. A quick web search found several in PHP.
Convert it to a string in the usual way, e.g. {21EC2020-3AEA-1069-A2DD-08002B30309D}. Problem solved.
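If no library is at hand, a random (version 4) identifier in the RFC 4122 layout can be sketched in a few lines. This is one possible approach, assuming PHP 7 or later for `random_bytes()`; the function name `uuid_v4` is made up for this example, and the output omits the surrounding braces.

```php
<?php
// Hypothetical helper: random (version 4) UUID as described in RFC 4122.
function uuid_v4(): string
{
    $bytes = random_bytes(16); // 128 bits from the system CSPRNG

    // Set the version field (high 4 bits of byte 6) to 0100, i.e. version 4.
    $bytes[6] = chr((ord($bytes[6]) & 0x0f) | 0x40);
    // Set the variant field (high 2 bits of byte 8) to 10, the RFC 4122 variant.
    $bytes[8] = chr((ord($bytes[8]) & 0x3f) | 0x80);

    // 32 hex characters, split into 8 groups of 4, laid out as 8-4-4-4-12.
    return vsprintf('%s%s-%s-%s-%s-%s%s%s', str_split(bin2hex($bytes), 4));
}
```

A file name could then be built as `uuid_v4() . '.dat'`, with the extension chosen to suit the upload.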
You can use tempnam(). It generates a filename that is guaranteed to be unique and not used in a given directory. It is atomic and so is free of any race conditions.
However, one call might return “abc” and a call by another process might also return “abc”. Don’t know php enough, but there should be a call that creates a file with a unique name in an atomic way - and then a call from another process cannot use the same name because the first file exists. – gnasher729, 2025-10-11 15:17
@gnasher729: another process would not be able to return the same value, as the file would already be created in the directory. Essentially, the directory structure serves as a sort of a global lock here. Naturally, two machines may end up with same names (I think it is possible given the current implementation, although I don't understand it very well). An easy way to avoid that would be to pass the machine name as a prefix to the tempnam() function. – Arseni Mourzenko, 2025-10-11 19:25
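A minimal sketch of the `tempnam()` approach, including the prefix idea from the comment, might look like the following. The wrapper name `reserve_unique_file` is an assumption for this example. Note that `tempnam()` creates the file, which is what reserves the name, and that some platforms use only the first few characters of the prefix.

```php
<?php
// Hypothetical wrapper around tempnam(): creates an empty file with a
// unique name in $dir and returns its full path.
function reserve_unique_file(string $dir, string $prefix): string
{
    $path = tempnam($dir, $prefix);
    if ($path === false) {
        throw new RuntimeException("Could not create a unique file in $dir");
    }

    // The file now exists on disk, so no other call can be given this name.
    return $path;
}
```

To follow the comment's suggestion, the prefix could be derived from the machine name, for example `reserve_unique_file('/var/uploads', gethostname() . '_')`, where `/var/uploads` is a placeholder directory.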
In simple terms, in a multitasking environment there is something called an atom, a semaphore, or a spin-lock. They differ somewhat from each other and I won't get into the details of that, but they are all similar concepts that ensure only one thing at a time can happen in critical code. They all rely on the system being designed so that one physical resource is locked by a caller in a way that prevents any others from locking it until the caller releases it. Then another caller can lock it while it does its "critical" thing. It comes right down to one single CPU machine instruction being the lock.
Having said all that, while it is possible to approximate this sort of absolute lock with a time-based value, the downside is that as the number of simultaneously executing processes increases, so does the probability of a collision, which I think is one of your concerns. This is true no matter how fine a clock resolution you choose.
In terms of performance and scalability there is a big trade-off between the first method above (a true lock) and the second (a time-based value): the first creates an absolute bottleneck that all processes must pass through, while the second creates almost no bottleneck.
So in terms of the "best way", well, that depends on whether you need a truly absolute lock, or whether a very good lock that can fail with a certain statistical probability will suffice.
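To make the "absolute lock" side of that trade-off concrete, here is a minimal sketch of a counter file guarded by an exclusive `flock()`. The function name `next_id` and the counter-file layout are assumptions for this example. `flock()` is advisory and only coordinates processes that share the same file, so this covers a single machine, and every caller queues on the same lock.

```php
<?php
// Hypothetical helper: hands out strictly increasing ids, one caller at a time.
function next_id(string $counterFile): int
{
    // 'c+' opens for reading and writing and creates the file if missing,
    // without truncating it.
    $fh = fopen($counterFile, 'c+');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $counterFile");
    }

    try {
        // Block until this process holds the exclusive lock.
        if (!flock($fh, LOCK_EX)) {
            throw new RuntimeException("Cannot lock $counterFile");
        }

        // Critical section: keep it as short as possible.
        $next = (int) stream_get_contents($fh) + 1; // empty file counts as 0
        ftruncate($fh, 0);
        rewind($fh);
        fwrite($fh, (string) $next);
        fflush($fh);

        flock($fh, LOCK_UN);
        return $next;
    } finally {
        fclose($fh); // closing also releases the lock if still held
    }
}
```

A file name could then be derived from the returned id, for example `sprintf('%012d.dat', next_id('/var/uploads/.counter'))`, where the path is a placeholder.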
(One of the cooler pieces of software I wrote years ago was the guts of a multitasking operating system. Disk access was shared between all tasks, so it had to be absolutely single-threaded at any one time. It was important to keep that locked section of code as thin as possible so that the bottleneck blocked as little as possible.)
As a side note, there is a similar hardware-related problem involving the possible collision of electrical signals, which you might find interesting to read about and which may shed light on your question. It's called metastability.
$file = time(). I'd rather use microtime() instead
"slartibartfast" plus a time stamp should work great. Or you know, a sane directory structure such that you're not that worried about giving a file a name that suggest what its contents are.