[nycphp-talk] PCRE expression for tokenizing?
Dan Cech
dcech at phpwerx.net
Mon Jul 21 18:08:08 EDT 2008
Michael B Allen wrote:
> I'm trying to write a Wiki syntax tokenizer using preg_match. Meaning I
> want to match any token like '~', '**', '//', '=====', ... etc but if
> none of those tokens match I want to match any valid printable string.
>
> The expression I have so far is the following:
>
> @(~)|(\*\*)|(//)|(=====)|(====)|(===)|(==)|(=)|([[:print:]]*)@
>
> The problem with this is that the [[:print:]] class matches the entire
> input. Strangely if I use [a-zA-Z0-9 ]* instead it works (but of
> course I want to support more than ASCII and a space).
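The reason for this is that your token characters ('~', '*', '/', '=')
are themselves printable, so they are included in [[:print:]] but not in
[a-zA-Z0-9 ]. When none of the token alternatives match at the current
offset, the greedy [[:print:]]* runs straight through any later tokens
to the end of the input.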
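A small sketch of the difference, using a shortened version of the
pattern (only the '**' and '//' tokens) and the sample input from the
original message. The variable names are just for illustration:

```php
<?php
$input = 'The **fox** jumped //over// the fence';

// [[:print:]] also matches '*' and '/', so the greedy fallback
// alternative consumes the whole line in one match.
preg_match('@(\*\*)|(//)|([[:print:]]*)@', $input, $m);
$greedy = $m[0];

// A class without the token characters stops in front of the first '*'.
preg_match('@(\*\*)|(//)|([a-zA-Z0-9 ]*)@', $input, $m);
$limited = $m[0];

var_dump($greedy === $input); // bool(true)
var_dump($limited);           // string(4) "The "
```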
> Meaning given the input:
>
> [The **fox** jumped //over// the fence]
>
> I want each call to preg_match to return tokens (while advancing the
> offset accordingly of course):
>
> [The ]
> [**]
> [fox]
> [**]
> [ jumped ]
> [//]
> [over]
> [//]
> [ the fence]
>
> Can someone recommend a good PCRE expression for tokenizing like this?
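If you want to stay with preg_match and an advancing offset, one
possible sketch is to keep [[:print:]] as the fallback but put a
negative lookahead in front of each character so the run stops wherever
a token begins. The function name tokenize() is made up for this
example, and ={1,5} is assumed to be an acceptable shorthand for the
five '=' alternatives. For input beyond ASCII, something like \P{C}
with the /u modifier might stand in for [[:print:]], but that is
untested here.

```php
<?php
// Hypothetical helper: returns the tokens of $input in order.
function tokenize(string $input): array
{
    // \G anchors each match at the current offset. The last alternative
    // takes printable characters one at a time, but only while no token
    // starts at that position.
    $pattern = '@\G(?:~|\*\*|//|={1,5}|(?:(?!~|\*\*|//|=)[[:print:]])+)@';

    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    // Every successful match is at least one byte long, so the offset
    // always advances. The loop stops at the end of the input or at the
    // first character that is neither a token nor printable.
    while ($offset < $length
        && preg_match($pattern, $input, $m, 0, $offset) === 1) {
        $tokens[] = $m[0];
        $offset += strlen($m[0]);
    }

    return $tokens;
}

$tokens = tokenize('The **fox** jumped //over// the fence');
print_r($tokens);
```

A lone '*' or '/' is treated as ordinary text here, since the lookahead
only rejects '**' and '//'.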
If you want to end up with everything in an array, you might want to
look at preg_split with the PREG_SPLIT_DELIM_CAPTURE argument.
Something like:
$tokens = preg_split('@(~|\*\*|//|=====|====|===|==|=)@', $string, -1,
    PREG_SPLIT_DELIM_CAPTURE);
May do what you're after.
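As a runnable sketch against the sample input: preg_split takes the
limit as its third argument and the flags as its fourth, so -1 (no
limit) is passed before the flags. PREG_SPLIT_NO_EMPTY is an addition
here, assumed to be wanted so that adjacent tokens do not leave empty
strings in the result.

```php
<?php
$string = 'The **fox** jumped //over// the fence';

// The capturing group around the alternation makes preg_split return
// the delimiters themselves as array elements.
$tokens = preg_split(
    '@(~|\*\*|//|=====|====|===|==|=)@',
    $string,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

print_r($tokens);
```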
Dan