[nycphp-talk] [OT] number of files in a directory?

Anirudh Zala arzala at gmail.com
Tue Jan 3 00:42:04 EST 2006


On Tue, 03 Jan 2006 06:06:58 +0530, max goldberg <max.goldberg at gmail.com> wrote:

> I started a site a couple years ago which had a similar problem. Users can
> create pages and each page had one or many assets.
>
> At first I just did /content/page_id/assets.ext, after I got to 32,000
> directories, it stopped working. At that point I moved to a system where I
> had 250 directories like, 0, 1000, 2000, 3000, etc. Each of those
> directories had 1000 sub directories in them, each named after the page_id.
> After I got to around 300,000 sub directories and a little over a million
> files, I moved to a completely md5 based system.
>
> I tried to avoid keeping track of files in the database as it tends to get
> messy, but I suppose it's inevitable. The benefits of an md5 system for me
> outweigh a non md5 system. This may not be the case for you but the main
> drawing points to me were:

Sometimes you have to store image-related values in the database, because you need to keep information such as file size, height, width and original file name, and for that DB storage is unavoidable. I also prefer to store values like size, height and width in the DB directly, so I can display them anywhere rather than computing them on the fly with functions like "getimagesize()". But if you don't expect to need that information and only have to display the images, you can avoid using the DB.
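
As a rough sketch of the idea above, the values could be collected once at upload time and then written to whatever table you use. The function name image_metadata() and the array keys are made up for illustration, and getimagesizefromstring() assumes PHP 5.4 or later:

```php
<?php
// Hypothetical helper: gather the values worth storing next to an upload,
// so they do not have to be recomputed on every page view.
function image_metadata(string $bytes, string $originalName): array
{
    // Reads only the image header; returns false for non-image data.
    $info = getimagesizefromstring($bytes);
    if ($info === false) {
        throw new InvalidArgumentException('Not a recognised image');
    }

    return [
        'original_name' => $originalName,
        'size_bytes'    => strlen($bytes),
        'width'         => $info[0],
        'height'        => $info[1],
        'mime'          => $info['mime'],
    ];
}
```

The returned array maps directly onto columns, so the display code only needs a SELECT instead of touching the file.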

>
> 1) Lowered server I/O, there were quite a few duplicated files. Each time
> two identical files were read, they were pulled from two different spots on
> the disk. My site quickly grew to using huge amounts of I/O. (around
> 15,000-20,000 hits a minute on the content server). This definitely helped
> out for me as some files were duplicated over 1,000 times.
>
> 2) Made it a lot easier to keep track of what was being used and what
> wasn't. Without a DB back end I couldn't tell which files I could delete and
> which I needed to keep without writing a script that basically checked every
> directory for a matching entry in the database.
>
> 3) Lowered disk space. Again duplicate files.
>
> 4) Allowed me to ban certain images and other files, mass delete things that
> had a certain md5 attached to it. This is very useful if you will ever need
> to moderate or have troublesome users.
>
> The downside is that you have to make sure your code really keeps track of
> your file system and you aren't accessing it by hand. Another thing you
> might worry about using md5 is collisions. If this is a mission critical
> system, you may want to avoid md5 as it is possible (but somewhat unlikely)
> you will encounter collisions. I've read anyone with a decent computer can
> create an md5 collision in about an hour, so that's something to keep in
> mind.
>

I don't understand how there can be collisions when generating a random hash. Consider the method below for generating a 16-character random hash.

$hash = substr(md5(md5(time() . rand() . $_SERVER['REMOTE_ADDR'] . microtime()) . time()), 0, 16);

Can you ever get a duplicate hash with the method above? I don't think so, although using "md5()" alone might produce one. In any case, the main purpose of putting a hash value in the file name is to prevent theft: this way people can't download images directly by guessing the directory and file name structure of your website. Moreover, a file name like

RECORDID_HASH.EXT (e.g. 123456_aswe34567bg.jpg)

which combines the primary key (123456) of the particular record the image is attached to with the 10 to 16 character hash mentioned above, serves both purposes. Since RECORDID is always unique, you will never have duplicate file names, that is for sure.
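
A minimal sketch of that naming scheme follows. The function name asset_filename() and its parameters are hypothetical, and random_bytes() (PHP 7 and later) is swapped in here for the md5(time() . rand() ...) construction, since it gives an unguessable suffix without relying on the clock:

```php
<?php
// Hypothetical helper: builds RECORDID_HASH.EXT with a random hex suffix.
// The record ID guarantees uniqueness; the suffix only makes the name
// hard to guess.
function asset_filename(int $recordId, string $ext, int $hashLen = 16): string
{
    if ($hashLen < 2 || $hashLen % 2 !== 0) {
        throw new InvalidArgumentException('hashLen must be an even number >= 2');
    }

    // Each random byte becomes two hex characters.
    $hash = bin2hex(random_bytes(intdiv($hashLen, 2)));

    return sprintf('%d_%s.%s', $recordId, $hash, ltrim($ext, '.'));
}
```

Calling asset_filename(123456, 'jpg') yields a name of the form 123456_<16 hex characters>.jpg; the generated name would have to be stored with the record, because it cannot be recomputed later.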

> The way I structured my file system was three levels of single character
> directories.
>
> /content/a-f0-9/a-f0-9/a-f0-9/filename.ext (4096 directories (16*16*16))
>
> This way I can take any asset md5 and figure out its location on the file
> system without database access, and leaves ample room for expansion, as well
> as moving large (or small) chunks to other servers. At this point I am using
> this system for over a million files and most of the sub directories only
> have a few hundred files in them tops.

This is another good mechanism for storing a large number of files efficiently. But I would like to see a real example of this "filename.ext". If it is just something like "ASSET.ext" (i.e. 1.jpg, 1234.jpg, 4567.jpg), then I would say your system is prone to image theft. An md5 or another kind of hash protects against that here, since thieves can't guess the exact file names of your images, and even if they try hard they won't gain much.
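
For reference, the three-level layout Max describes could be sketched roughly like this. It assumes the stored file name is the md5 itself, which his mail does not actually say, and asset_path() is a made-up name:

```php
<?php
// Hypothetical helper: maps an md5 to /content/x/y/z/<md5>.<ext>, where
// x, y and z are the first three hex characters (16*16*16 = 4096 dirs).
function asset_path(string $md5, string $ext, string $root = '/content'): string
{
    $md5 = strtolower($md5);
    if (preg_match('/^[0-9a-f]{32}$/', $md5) !== 1) {
        throw new InvalidArgumentException('Expected a 32-character hex md5');
    }

    return sprintf('%s/%s/%s/%s/%s.%s', $root, $md5[0], $md5[1], $md5[2], $md5, $ext);
}
```

With something like asset_path(md5_file($upload), 'jpg'), identical uploads land on the same path, which is where the de-duplication Max mentions comes from, and the location can be recomputed from the md5 without a database lookup.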

>
> If you decided to use a single directory approach you will most likely run
> into quite a few problems. I remember when I had around 15-20,000 files,
> everything I did in that directory became extra slow. With the setup I'm
> using now I don't really get any lag.
>
> Hope that helped.
> -Max
>
>
> On 12/31/05, Marc Antony Vose <suzerain at suzerain.com> wrote:
>>
>> Hey all:
>>
>> First of all:  Happy New Year!
>>
>> Secondly: I am rebuilding a site that was coded somewhat sloppily,
>> and they have product images all stored in one directory (a script
>> that I am not writing auto-uploads them to the web server from
>> elsewhere).  Presently, this directory contains about 33,000 files.
>> It will be more like 75,000 when the site launches, if things remain
>> the same.
>>
>> The question is:  should I be worried about this, or was this only a
>> problem several years ago? (I remember people at one time attempting
>> to not put too many files in one place.)
>>
>> If I should be worried, what could happen?  Will we ever reach a hard
>> limit of files per directory?
>>
>> Is it better if each product instead has its own directory inside
>> there (i.e., 75,000 directories), each with as many files as we need
>> inside, or is that just the same problem?
>>
>> Cheers,
>>
>> --
>> Marc Antony Vose
>> http://www.suzerain.com/
>>
>> Imagination is more important than knowledge.
>> -- Albert Einstein
>> _______________________________________________
>> New York PHP Talk Mailing List
>> AMP Technology
>> Supporting Apache, MySQL and PHP
>> http://lists.nyphp.org/mailman/listinfo/talk
>> http://www.nyphp.org
>>
>



-- 
-----------------------------------------------------
Anirudh Zala (Production Manager)
ASPL, http://www.aspl.in
Ph: +91 281 245 1894
arzala at gmail.com
-----------------------------------------------------


