[nycphp-talk] [OT] number of files in a directory?
tedd
tedd at sperling.com
Mon Jan 2 19:57:13 EST 2006
>I started a site a couple years ago which had a similar problem.
>Users can create pages and each page had one or many assets.
>
>At first I just did /content/page_id/assets.ext, but after I got to
>32,000 directories, it stopped working. At that point I moved to a
>system where I had 250 directories like, 0, 1000, 2000, 3000, etc.
>Each of those directories had 1000 sub directories in them, each
>named after the page_id. After I got to around 300,000 sub
>directories and a little over a million files, I moved to a
>completely md5 based system.
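
The intermediate layout described above (top-level directories 0, 1000, 2000, and so on, each holding up to 1000 page directories) can be sketched roughly as follows. This is only an illustration in Python: the function name `bucket_path`, the `/content` root default, and the rule "round page_id down to a multiple of 1000" are assumptions inferred from the description, not Max's actual code.

```python
from pathlib import PurePosixPath

BUCKET_SIZE = 1000  # page directories per top-level bucket, per the description


def bucket_path(page_id: int, root: str = "/content") -> PurePosixPath:
    """Return <root>/<bucket>/<page_id> (hypothetical helper).

    The bucket is page_id rounded down to a multiple of BUCKET_SIZE,
    so ids 1000..1999 all live under <root>/1000/.
    """
    if page_id < 0:
        raise ValueError("page_id must be non-negative")
    bucket = (page_id // BUCKET_SIZE) * BUCKET_SIZE
    return PurePosixPath(root) / str(bucket) / str(page_id)


print(bucket_path(1234))  # /content/1000/1234
```

With 250 buckets this scheme tops out at 250,000 pages before new top-level directories are needed, which is consistent with the move to a different layout at around 300,000 subdirectories.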
>
>I tried to avoid keeping track of files in the database as it tends
>to get messy, but I suppose it's inevitable. The benefits of an md5
>system for me outweigh a non md5 system. This may not be the case
>for you but the main drawing points to me were:
>
>1) Lowered server I/O. There were quite a few duplicated files, and
>each time two identical files were read, they were pulled from two
>different spots on the disk. My site quickly grew to using huge
>amounts of I/O (around 15,000-20,000 hits a minute on the content
>server). Deduplicating definitely helped, as some files were
>duplicated over 1,000 times.
>
>2) Made it a lot easier to keep track of what was being used and
>what wasn't. Without a DB back end I couldn't tell which files I
>could delete and which I needed to keep without writing a script
>that basically checked every directory for a matching entry in the
>database.
>
>3) Lowered disk space usage. Again, duplicate files.
>
>4) Allowed me to ban certain images and other files, and mass delete
>anything that had a certain md5 attached to it. This is very useful if
>you ever need to moderate content or deal with troublesome users.
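
The four points above all follow from naming each file after the md5 of its contents. A minimal sketch of that idea, in Python, might look like the following. Everything here is hypothetical: the class `ContentStore`, its method names, and the in-memory ban set stand in for whatever code and database tables the real site uses, and the store is kept as one flat directory purely for brevity.

```python
import hashlib
import os


class ContentStore:
    """Toy content-addressed store: files are named by the md5 of their bytes."""

    def __init__(self, root: str):
        self.root = root
        self.banned = set()  # md5 hex digests that may not be stored
        os.makedirs(root, exist_ok=True)

    def put(self, data: bytes) -> str:
        """Store data once and return its digest; duplicates cost no extra disk."""
        digest = hashlib.md5(data).hexdigest()
        if digest in self.banned:
            raise PermissionError("content with md5 %s is banned" % digest)
        path = os.path.join(self.root, digest)
        if not os.path.exists(path):  # identical content is already on disk
            with open(path, "wb") as fh:
                fh.write(data)
        return digest

    def ban(self, digest: str) -> None:
        """Refuse future uploads of this content and delete the stored copy."""
        self.banned.add(digest)
        path = os.path.join(self.root, digest)
        if os.path.exists(path):
            os.remove(path)

    def orphans(self, referenced: set) -> set:
        """Digests on disk that no database row (passed in as a set) refers to."""
        return set(os.listdir(self.root)) - set(referenced)
```

Because every upload of the same bytes maps to the same name, a thousand uploads of one image end up as one file read from one spot on disk, and banning or cleaning up becomes a lookup by digest rather than a directory scan.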
>
>The downside is that you have to make sure your code really keeps
>track of your file system and you aren't accessing it by hand.
>Another thing you might worry about using md5 is collisions. If this
>is a mission critical system, you may want to avoid md5 as it is
>possible (but somewhat unlikely) you will encounter collisions. I've
>read anyone with a decent computer can create an md5 collision in
>about an hour, so that's something to keep in mind.
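
One way to hedge against the collision worry above is to treat a matching md5 only as a hint and confirm with a byte-for-byte comparison before declaring two files identical. The sketch below is in Python; the function name `same_content` and the chunk size are illustrative assumptions, not something taken from Max's system.

```python
def same_content(path: str, data: bytes, chunk_size: int = 65536) -> bool:
    """Return True only if the file at path holds exactly the bytes in data."""
    offset = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                # End of file: equal only if all of data was consumed too.
                return offset == len(data)
            if data[offset:offset + len(chunk)] != chunk:
                return False
            offset += len(chunk)
```

A store could call this when an upload's digest already exists on disk: if the comparison fails, the two files collide and the new one needs a different name instead of being silently merged with the old one.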
>
>The way I structured my file system was three levels of single
>character directories.
>
>/content/[a-f0-9]/[a-f0-9]/[a-f0-9]/filename.ext (16*16*16 = 4096 directories)
>
>This way I can take any asset md5 and figure out its location on
>the file system without database access, and it leaves ample room for
>expansion, as well as moving large (or small) chunks to other
>servers. At this point I am using this system for over a million
>files and most of the sub directories only have a few hundred files
>in them tops.
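
The mapping from an md5 to a location under three levels of single-character directories can be sketched as below, in Python. Two details are assumptions, since the post does not spell them out: that the first three hex characters of the digest name the directories, and that the file itself is named with the full digest plus its extension. The helper name `asset_path` is likewise made up for illustration.

```python
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")


def asset_path(md5_hex: str, ext: str = "", root: str = "/content") -> PurePosixPath:
    """Map a 32-character md5 hex digest to <root>/x/y/z/<digest><ext>.

    Assumption: x, y, z are the first three hex characters of the digest.
    """
    md5_hex = md5_hex.lower()
    if len(md5_hex) != 32 or not set(md5_hex) <= HEX_DIGITS:
        raise ValueError("expected a 32-character hex md5 digest")
    a, b, c = md5_hex[0], md5_hex[1], md5_hex[2]
    return PurePosixPath(root) / a / b / c / (md5_hex + ext)


print(asset_path("5d41402abc4b2a76b9719d911017c592", ".txt"))
# /content/5/d/4/5d41402abc4b2a76b9719d911017c592.txt
```

Since md5 output is close to uniformly distributed, a million files spread over 4096 leaf directories averages about 244 files each, which matches the "few hundred files" figure. The path is computed from the digest alone, so no database lookup is needed to find a file.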
>
>If you decide to use a single directory approach you will most
>likely run into quite a few problems. I remember when I had around
>15,000-20,000 files, everything I did in that directory became extra
>slow. With the setup I'm using now I don't really get any lag.
>
>Hope that helped.
>-Max
>
>
This topic is beginning to sound like a problem that a binary tree
might solve. Does anyone have any references for PHP b-trees?
tedd
--
--------------------------------------------------------------------------------
http://sperling.com/