[nycphp-talk] Php in the twilight zone
Jack Scott
lists at jack-scott.com
Fri Apr 21 10:24:04 EDT 2006
On Fri, 2006-04-21 at 16:21 +0300, Iulian Manea wrote:
> The script is used for spidering a site, which is quite big .. so the 20
> minutes isn't that much. But each time the script finds a new link it
> flushes it to the browser, so the connection shouldn't timeout or anything
> ...
This doesn't fix your immediate problem, but if you are on *nix you
could run wget, lynx, or webBot to spider the site and then parse
the results.
I have had to do this in the past and used wget to recursively spider a
site and create HTML files locally. Once that is done, I grep the results
and pipe them to sed and/or (g,n)awk to fine-tune the output.
There are a ton of similar Windows utilities out there as well, if that
is your platform.
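A minimal sketch of the wget-then-grep workflow described above, assuming
GNU wget, grep, and sed are installed. The function names, the depth limit,
the wait interval, and the example URL are all made up for illustration;
they are not the commands originally used.

```shell
#!/bin/sh
# Hypothetical helper: mirror a site into a local directory with wget.
# $1 is the start URL, $2 is the directory to save into.
mirror_site() {
    # -r     recurse through links
    # -l 5   stop at depth 5 (arbitrary illustrative limit)
    # -np    never ascend above the starting directory
    # -w 1   wait one second between requests
    # -P     directory prefix for the saved files
    wget -r -l 5 -np -w 1 -P "$2" "$1"
}

# Hypothetical helper: list the unique href targets found in the
# mirrored files under directory $1.
extract_links() {
    # grep: -r recurse, -h omit file names, -o print only the match,
    #       -i ignore case (-o is a common extension, not strict POSIX)
    # sed:  strip the leading href=" and the trailing quote
    grep -rhoi 'href="[^"]*"' "$1" |
        sed -e 's/^[Hh][Rr][Ee][Ff]="//' -e 's/"$//' |
        sort -u
}

# Example use (not run here, since it hits the network):
#   mirror_site http://www.example.com/ ./mirror
#   extract_links ./mirror
```

Since the crawl runs from the command line rather than through the web
server, it is not tied to a browser connection. The link pattern only
catches double-quoted href attributes, so anything fancier would need
more sed or awk on top.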
Hope this helps,
Jack