[nycphp-talk] [OT] number of files in a directory?
Marc Antony Vose
suzerain at suzerain.com
Tue Jan 3 02:18:31 EST 2006
Hey y'all:
Thanks for the thoughtful notes. As usual, people here have given me
entirely new ways to think about things.
In my case, I'm not accepting uploads from web site visitors; all
these images are coming from the owner of the site. They are a store
that sells one-of-a-kind rare items, and at any given time there will
be an inventory of ~75,000.
Each item in the inventory will have something like 6-10 images, each
used for different purposes (for example, we will be displaying
rotatable versions of the products via Flash).
However, they've indicated a future desire to be able to offer people
order histories, and so forth, so for the time being sold items will
be kept in the system. So the probability of getting into the
millions pretty quickly with images is definitely there.
It seems, then, like the most sensible idea for me is to create md5()
hashed directories based on the product ID number (rather than image
ID number), as I don't think it's necessary for me to store
information about every image in the database.
So, if I translated the product ID number into something like
IMAGES_DIR/ab5/81d
I could then store inside that directory any images which are needed
for the particular product. Something like:
12345_signature.jpg
12345_rotate_1.jpg
12345_rotate_2.jpg
12345_rotate_3.jpg
12345_rotate_4.jpg
12345_rotate_5.jpg
12345_rotate_6.jpg
12345_thumb.jpg
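A minimal sketch of that layout, in case it helps to see it spelled out. The base path and the helper names (product_image_dir, product_image_path) are made up for illustration and are not from this thread; the only part taken from the discussion is the three-plus-three character split of the md5() hash:

```php
<?php
// Hypothetical base directory; the thread only ever calls it IMAGES_DIR.
const IMAGES_DIR = '/var/www/images';

// Map a product ID to its two-level hashed directory.
function product_image_dir(int $productId): string
{
    $hash = md5((string) $productId);
    return IMAGES_DIR . '/' . substr($hash, 0, 3) . '/' . substr($hash, 3, 3);
}

// Build the full path for one image of a product, e.g. "thumb" or "rotate_1".
function product_image_path(int $productId, string $label): string
{
    return product_image_dir($productId) . '/' . $productId . '_' . $label . '.jpg';
}

echo product_image_path(12345, 'thumb'), "\n";
// prints /var/www/images/827/ccb/12345_thumb.jpg
```

Because the product ID is part of every filename, the directory only has to spread files around; it does not have to be unique per product.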
I suppose it is probable under this scenario that two different
products will end up with the same path, since we're only using the
first 6 characters of the hash, but it shouldn't really matter as
long as I have the images keyed with the product ID as well.
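A back-of-the-envelope check of how often that sharing would happen, assuming md5() spreads product IDs evenly over the directories (an assumption, though a reasonable one for this purpose). Six hex characters give 16^6 leaf directories, and the usual pair-counting estimate n(n-1)/(2N) gives the expected number of product pairs landing in the same one:

```php
<?php
// Rough estimate only; assumes the hash distributes IDs uniformly.
$leafDirs = 16 ** 6;   // 16,777,216 possible leaf directories (6 hex characters)
$products = 75000;     // inventory size mentioned earlier in this message

// Expected number of product pairs sharing a leaf directory: n(n-1) / (2N).
$expectedSharedPairs = $products * ($products - 1) / (2 * $leafDirs);

printf("%d leaf directories, about %.0f shared pairs\n", $leafDirs, $expectedSharedPairs);
```

With these numbers the estimate comes out in the neighborhood of 170 pairs, so shared directories are expected but rare, and each would hold only two products' worth of files.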
Anyone see any major red flags with this strategy?
Cheers,
Marc
>max goldberg wrote:
>> The downside is that you have to make sure your code really keeps track
>> of your file system and you aren't accessing it by hand. Another thing
>> you might worry about using md5 is collisions. If this is a mission
>> critical system, you may want to avoid md5 as it is possible (but
>> somewhat unlikely) you will encounter collisions. I've read anyone with
>> a decent computer can create an md5 collision in about an hour, so
>> that's something to keep in mind.
>
>Yeah, this is probably the best solution. To avoid collisions what
>you want to do is assign a unique database ID to every asset, use that
>ID to create the MD5 hash, then store the asset with a filename
>containing that unique ID. That should eliminate collisions. The worst
>that can happen is that you'll have two different files in the same
>directory but with different filenames, which is cool.
>
>A function like this could be used to both plant the file in the MD5
>filesystem and extract its path later on based on that unique ID:
>
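>function get_upload_target($file_id) {
>    // Hash the unique ID; md5() expects a string.
>    $hash_id = md5((string) $file_id);
>    // Two directory levels from the first six hex characters.
>    $subdir = substr($hash_id, 0, 3) . '/' . substr($hash_id, 3, 3);
>    return $subdir;
>}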
>
>Use case: someone uploads the file "mykitty.jpg" and it's inserted into
>the database as id=1234. get_upload_target(1234) returns:
>
> 81d/c9b
>
>The file is then written as $ASSET_DIR/81d/c9b/1234
>
>Or 1234.jpg, or 1234.mykitty.jpg, whatever. I like to give the file a
>recognizable file type extension.
>
>To extract that file later, just run the ID through get_upload_target()
>again to build the filesystem path.
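For what it's worth, here is a rough sketch of the write side built on the quoted function. store_asset() and its parameters are hypothetical names of mine, not something from the quoted message, and copy() stands in for whatever actually delivers the file (for a real HTTP upload, move_uploaded_file() would be the usual choice):

```php
<?php
// Same helper as in the quoted message.
function get_upload_target($file_id) {
    $hash_id = md5((string) $file_id);
    return substr($hash_id, 0, 3) . '/' . substr($hash_id, 3, 3);
}

// Hypothetical helper: copy a file into the hashed tree and return its final path.
function store_asset(string $assetDir, int $fileId, string $sourcePath, string $ext): string
{
    $dir = $assetDir . '/' . get_upload_target($fileId);

    // Create both directory levels if missing; the second is_dir() check
    // covers the case where another request created them in the meantime.
    if (!is_dir($dir) && !mkdir($dir, 0755, true) && !is_dir($dir)) {
        throw new RuntimeException("Could not create $dir");
    }

    $target = $dir . '/' . $fileId . '.' . $ext;
    if (!copy($sourcePath, $target)) {
        throw new RuntimeException("Could not write $target");
    }
    return $target;
}
```

Reading the file back needs no lookup table: the same ID fed through get_upload_target() rebuilds the directory, and the ID plus extension gives the filename.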