blog@mcZen

Software, life, and leisure

RegEx: Capturing between Matching Strings

July 20, 2007 5:28 AM

mike

Regular Expressions are very powerful when parsing documents. There are many great tutorials out there that explain all the ins and outs. Wikipedia has a great list of links to help you figure out how to get started using regular expressions in your code. I’ve found regular-expressions.info to be particularly useful.

I’ve used regular expressions most often when parsing HTML. Parsing out attributes is easy because quotes cannot be nested inside a quoted value. The regular expression for this is:
1. \s(?'attribute'[a-zA-Z]+)\s*\=\s*\"(?'value'[^\"]*)\"
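To make expression #1 concrete, here is a minimal sketch in Python. The helper name parse_attributes and the sample tag are assumptions made up for illustration. Python writes named groups as (?P<name>...) where .NET uses (?'name'...), and the sketch assumes a + after the character class so that attribute names longer than one letter can match.

```python
import re

# Python spelling of expression #1: (?P<name>...) replaces .NET's (?'name'...).
# The + after [a-zA-Z] is an assumption so multi-letter names such as href match.
ATTRIBUTE = re.compile(r'\s(?P<attribute>[a-zA-Z]+)\s*=\s*"(?P<value>[^"]*)"')


def parse_attributes(tag):
    """Return (name, value) pairs for each double-quoted attribute in tag."""
    return [(m.group("attribute"), m.group("value"))
            for m in ATTRIBUTE.finditer(tag)]


# Hypothetical sample input.
sample = '<a href="http://www.mczen.com" target="_blank">'
print(parse_attributes(sample))
# [('href', 'http://www.mczen.com'), ('target', '_blank')]
```

Because [^"]* can never run past a closing quote, each value stops at the first quote that follows it, which is why the unencoded case is the easy one.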
Of course life isn’t so easy, and I ran into having to parse an HTML-encoded HTML string. I didn’t want to decode the entire document before matching strings, so I started looking for a way to parse out the attributes while they were still encoded.
2. \s(?'attribute'[a-zA-Z]+)\s*\=\s*\&quot\;(?'value'.*?)\&quot\;
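The encoded variant can be sketched the same way in Python. The helper name parse_encoded_attributes and the sample string are hypothetical, and the sketch again assumes a + after the character class. In Python neither & nor ; needs a backslash, so the escapes are dropped here.

```python
import re

# Python spelling of expression #2. The delimiter is the five-character
# entity &quot; so a negated character class no longer works and the
# value is captured with a lazy .*? instead.
ENCODED_ATTRIBUTE = re.compile(
    r'\s(?P<attribute>[a-zA-Z]+)\s*=\s*&quot;(?P<value>.*?)&quot;'
)


def parse_encoded_attributes(text):
    """Return (name, value) pairs from an HTML-encoded tag, without decoding it."""
    return [(m.group("attribute"), m.group("value"))
            for m in ENCODED_ATTRIBUTE.finditer(text)]


# Hypothetical sample input: an anchor tag after HTML encoding.
encoded = '&lt;a href=&quot;http://www.mczen.com&quot; target=&quot;_blank&quot;&gt;'
print(parse_encoded_attributes(encoded))
# [('href', 'http://www.mczen.com'), ('target', '_blank')]
```

The lazy quantifier is what keeps each value from running on to a later &quot; entity, at least as long as nothing after it in the pattern forces the engine to backtrack.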
This works simply enough, but then I needed to validate that an anchor tag (a) had only an href attribute. The match above doesn’t filter for this, so atomic grouping was what I needed. Atomic grouping, written (?>something), discards its backtracking positions once the group has matched. Here is the validating expression:
3. \&lt\;a\s+href\s*\=\s*\&quot\;(?>(?'value'.*?)\&quot\;)\s*\&gt\;
This expression matches
&lt;a href=&quot;http://www.mczen.com&quot;&gt;
but does not match
&lt;a href=&quot;http://www.mczen.com&quot; target=&quot;_blank&quot;&gt;
You might expect the lazy .*? from #2 to be enough to reject the second tag. It isn’t: a lazy quantifier only prefers the shortest match. Without the atomic group, when the closing &gt; isn’t found after the first &quot;, the engine backtracks and extends the value to the last &quot; before the &gt;, so the value swallows the target attribute and the tag still matches. With #3, the second tag fails because of the atomic group (?>(?'value'.*?)\&quot\;). Once the group has matched, its backtracking positions are discarded. When &gt; isn’t found next, the engine cannot go back and extend the value, so the match fails.

You can use the same technique to capture between any pair of matching strings. For instance, to find the text between occurrences of mike in the following statement, you could use the regex: mike(?>(?'match'.*?)mike)
When saying mike and mike, you should consider saying mike squared.
This captures only " and ". It does not capture ", you should consider saying ", because the second mike was already consumed as the end of the first match and cannot also open a new one.
