[nycphp-talk] PCRE expression for tokenizing?

Michael B Allen ioplex at gmail.com
Mon Jul 21 17:48:06 EDT 2008


I am trying to write a Wiki syntax tokenizer using preg_match. That is,
I want to match any token like '~', '**', '//', '=====', etc., but if
none of those tokens match I want to match any valid printable string.

The expression I have so far is the following:

  @(~)|(\*\*)|(//)|(=====)|(====)|(===)|(==)|(=)|([[:print:]]*)@

The problem with this is that the [[:print:]] class matches the entire
input. Strangely, if I use [a-zA-Z0-9 ]* instead it works (but of
course I want to support more than ASCII letters, digits and a space).
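
The difference between the two classes can be reproduced with a short script. This is only a sketch: it assumes the square brackets around the sample input are delimiters rather than part of the string, and the variable names are made up for illustration. The point it shows is that '*' and '/' are themselves printable, so a greedy [[:print:]]* never has a reason to stop at a markup token, while [a-zA-Z0-9 ]* stops at the first character outside the class.

```php
<?php
// Sample input from the post, assuming the [ ] around it are only delimiters.
$input = 'The **fox** jumped //over// the fence';

// Original expression: the last alternative is a greedy [[:print:]]*.
$greedy = '@(~)|(\*\*)|(//)|(=====)|(====)|(===)|(==)|(=)|([[:print:]]*)@';
preg_match($greedy, $input, $m);
var_dump($m[0] === $input);   // true: the whole input is one match

// Narrower class: '*' is not in it, so the run ends before '**'.
$narrow = '@(~)|(\*\*)|(//)|(=====)|(====)|(===)|(==)|(=)|([a-zA-Z0-9 ]*)@';
preg_match($narrow, $input, $n);
var_dump($n[0] === 'The ');   // true: only the leading text is matched
```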

Given the input:

  [The **fox** jumped //over// the fence]

I want each call to preg_match to return tokens (while advancing the
offset accordingly of course):

  [The ]
  [**]
  [fox]
  [**]
  [ jumped ]
  [//]
  [over]
  [//]
  [ the fence]
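
One possible way to get this token stream, offered as a sketch under assumptions rather than a definitive answer: make the catch-all alternative consume one character at a time, guarded by a negative lookahead that refuses to step onto the start of a markup token. The function name wiki_tokenize is hypothetical, the 'A' modifier is used to anchor each match at the current offset, and '.' with the 's' and 'u' modifiers stands in for [[:print:]] on the assumption that the input is valid UTF-8 and that control characters do not need to be rejected.

```php
<?php
// Hypothetical helper, not from the original post.
function wiki_tokenize(string $input): array
{
    // Markup tokens first; ={1,5} covers '=' through '====='.
    // Fallback: one or more characters, none of which starts a token.
    // A = anchor at the offset, s = '.' matches newlines, u = UTF-8 mode.
    $re = '@~|\*\*|//|={1,5}|(?:(?!~|\*\*|//|=).)+@Asu';

    $tokens = [];
    $offset = 0;               // byte offset, as preg_match expects
    $length = strlen($input);

    while ($offset < $length) {
        if (preg_match($re, $input, $m, 0, $offset) !== 1) {
            break;             // no match or error, e.g. invalid UTF-8
        }
        $tokens[] = $m[0];
        $offset += strlen($m[0]);   // every alternative consumes >= 1 byte
    }
    return $tokens;
}

// Usage with the sample input (square brackets treated as delimiters only).
print_r(wiki_tokenize('The **fox** jumped //over// the fence'));
```

A lone '*' or '/' is not a token in this sketch, so it falls through to the text alternative and becomes part of the surrounding text.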

Can someone recommend a good PCRE expression for tokenizing like this?

Mike

-- 
Michael B Allen
PHP Active Directory SPNEGO SSO
http://www.ioplex.com/


