[nycphp-talk] PCRE expression for tokenizing?
Dan Cech
dcech at phpwerx.net
Mon Jul 21 18:45:42 EDT 2008
Michael B Allen wrote:
> On Mon, Jul 21, 2008 at 6:08 PM, Dan Cech <dcech at phpwerx.net> wrote:
>> Michael B Allen wrote:
> So is there any way to say "capture anything that didn't match" (aside
> from creating a sub-expression that explicitly excludes all of the
> tokens)?
AFAIK no, but you could probably do something like:
preg_match('@^(.*?)(~|\*\*|//|=====|====|===|==|=|$)@',$string,$m);
Which would give you anything before the first token (or end if there
are no more tokens) in $m[1] and the first token (or nothing if there
are no more tokens) in $m[2].
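A rough sketch of how that match could be run in a loop to walk the whole
string. The function name tokenize_iteratively() is made up for
illustration, and two details differ from the one-liner above and are my
own assumptions: the /s modifier lets text span newlines, and \z replaces
$ so that a trailing newline cannot produce an empty match. It assumes
PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper: returns a list of ['text'|'token', value] pairs.
function tokenize_iteratively(string $string): array
{
    // Same alternation as above; longer '=' runs are listed first so they win.
    $pattern = '@^(.*?)(~|\*\*|//|=====|====|===|==|=|\z)@s';
    $out = [];

    while ($string !== '') {
        if (!preg_match($pattern, $string, $m)) {
            break; // not expected: \z always allows a match at the end
        }
        if ($m[1] !== '') {
            $out[] = ['text', $m[1]];   // everything before the next token
        }
        if ($m[2] !== '') {
            $out[] = ['token', $m[2]];  // the token itself
        }
        // Drop the consumed prefix (text plus token) before the next pass.
        $string = substr($string, strlen($m[0]));
    }

    return $out;
}

print_r(tokenize_iteratively('a==b'));
```

For 'a==b' this should yield text "a", token "==", text "b", which is the
kind of stream a state machine can consume one element at a time.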
>>> Can someone recommend a good PCRE expression for tokenizing like this?
>> If you want to end up with everything in an array, you might want to look at
>> preg_split with the PREG_SPLIT_DELIM_CAPTURE flag.
>>
>> Something like:
>>
>> $tokens = preg_split('@(~|\*\*|//|=====|====|===|==|=)@', $string, -1,
>>     PREG_SPLIT_DELIM_CAPTURE);
>
> No, I need the tokens (or rather I need to know which token matched)
> for the state-machine that follows so preg_split probably isn't going
> to do the trick.
That's what the PREG_SPLIT_DELIM_CAPTURE flag does: it returns the
delimiters as well (note that the flags are the fourth argument to
preg_split, after the limit). You can iterate over the returned array
and you'll get either a token or a piece of text in each element.
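Something along these lines shows the iteration. It is only a sketch:
split_with_tokens() is a hypothetical name, and adding
PREG_SPLIT_NO_EMPTY plus the in_array() check to tell tokens from text
are my own choices, not the only way to do it. A text piece can never
equal one of the tokens, because any such substring would have been
split out as a delimiter.

```php
<?php
// Hypothetical helper: returns a list of ['text'|'token', value] pairs.
function split_with_tokens(string $string): array
{
    $tokens = ['~', '**', '//', '=====', '====', '===', '==', '='];

    // Limit -1 means "no limit"; the flags go in the fourth argument.
    $parts = preg_split(
        '@(~|\*\*|//|=====|====|===|==|=)@',
        $string,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $out = [];
    foreach ($parts as $part) {
        // Captured delimiters are exactly the token strings; the rest is text.
        $kind = in_array($part, $tokens, true) ? 'token' : 'text';
        $out[] = [$kind, $part];
    }

    return $out;
}

foreach (split_with_tokens('==Title== some **bold** text') as [$kind, $value]) {
    echo $kind, ': ', $value, "\n";
}
```

The state machine then just switches on the kind and value of each
element in order.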
An iterative preg_match setup will most likely be more memory efficient
but also slower.
Dan