[nycphp-talk] Best way to accomplish this task

Mitch Pirtle mitch.pirtle at gmail.com
Mon Feb 15 09:32:35 EST 2010


On Sun, Feb 14, 2010 at 8:49 PM, Anthony Papillion <papillion at gmail.com> wrote:
> Hello Everyone,
>
> I'm designing a system that will work on a schedule. Users will submit data
> for processing into the database and then, every minute, a PHP script will
> pass through the db looking for unprocessed rows (marked pending) and
> process them.
>
> The problem is, I may eventually have a few million records to process at a
> time. Each record could take anywhere from a few seconds to a few minutes to
> perform the required operations on. My concern is making sure that the
> script, on the next scheduled pass, doesn't grab the records currently being
> processed and start processing them again.
>
> Right now, I'm thinking of accomplishing this by updating a 'status' field
> in the database. So unprocessed records would have a status of 'pending',
> records being processed would have a status of 'processing' and completely
> processed records will have a status of 'complete'.
>
> For some reason, I see this as ugly but that's the only way I can think of
> making sure that records aren't processed more than once. So when I select
> records to process, I'm ONLY selecting ones with the status of 'pending',
> which means they are new, unprocessed.
>
> Is there a better, more elegant way of doing this or is this pretty much it?

Hey Anthony,

I'd add two columns:

thingy_status
0 - not processed
1 - in process
2 - processed

process_pid
(int)

Basically I'm assuming that you're running your PHP code on a *nix
machine (Linux, OS X, commercial Unix...), where each process has a
PID. Store the PID of the process acting on that row. That way, when
the next worker looks for a row to process, it can also check whether
any in-process rows have been abandoned, i.e. their stored PID no
longer belongs to a running worker on that machine.
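
Here is a rough sketch of how the status/PID scheme could look. It is
written in Python against SQLite only so that it is self-contained; the
table name `thingy`, the `payload` column, and the helper names are all
made up for illustration, and the same two SQL statements should carry
over to a PHP script talking to MySQL or similar. It assumes all workers
run on one *nix machine, since a PID means nothing on another host.

```python
import os
import sqlite3

# Status values from the column description above.
NOT_PROCESSED, IN_PROCESS, PROCESSED = 0, 1, 2


def make_table(conn):
    """Create the hypothetical work table with the two extra columns."""
    conn.execute(
        "CREATE TABLE thingy ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT,"
        " thingy_status INTEGER NOT NULL DEFAULT 0,"
        " process_pid INTEGER)"
    )


def pid_alive(pid):
    """Return True if a process with this PID exists on this machine."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing, it only checks existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def claim_one(conn, my_pid):
    """Claim one unprocessed row; return its id, or None if none are left."""
    while True:
        row = conn.execute(
            "SELECT id FROM thingy WHERE thingy_status = ? ORDER BY id LIMIT 1",
            (NOT_PROCESSED,),
        ).fetchone()
        if row is None:
            return None
        # Repeating the status check in WHERE is what prevents double
        # processing: if another worker claimed the row first, no row
        # matches, rowcount is 0, and we look for the next candidate.
        cur = conn.execute(
            "UPDATE thingy SET thingy_status = ?, process_pid = ?"
            " WHERE id = ? AND thingy_status = ?",
            (IN_PROCESS, my_pid, row[0], NOT_PROCESSED),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]


def release_abandoned(conn):
    """Reset in-process rows whose worker is gone; return how many."""
    rows = conn.execute(
        "SELECT id, process_pid FROM thingy WHERE thingy_status = ?",
        (IN_PROCESS,),
    ).fetchall()
    released = 0
    for row_id, pid in rows:
        if not pid_alive(pid):
            conn.execute(
                "UPDATE thingy SET thingy_status = ?, process_pid = NULL"
                " WHERE id = ? AND thingy_status = ?",
                (NOT_PROCESSED, row_id, IN_PROCESS),
            )
            released += 1
    conn.commit()
    return released
```

Each scheduled pass would call release_abandoned() first and then loop
on claim_one(conn, os.getpid()) until it returns None, setting
thingy_status to 2 after each row is finished. One caveat: PIDs get
reused eventually, so a stale row can occasionally look alive; storing
a claim timestamp next to the PID would be one way to cover that.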

Keeps your logic clean and simple. Anything more sophisticated than
this and you might as well look into a hosted message queue such as
Amazon SQS. :-)

-- Mitch


