Leveraging regular expressions within MCMS implementations

What are regular expressions?

Regular expressions (regex) are one of the few things that even some of the more experienced developers haven’t wrapped their brains around. As Roy Osherove puts it:

Regular expressions are among the most “mysterious” and “black art-like” technologies out there for developers.
Reference: The Regulator help

For those who aren’t familiar with regex, they provide a flexible and powerful text processor. You can use them to search for specific substrings within a string or do specialized search-and-replace operations. Mastering them can dramatically improve the performance of your code and speed up the completion of many of your tasks.

“I can already do that using string.Substring() and string.Replace()… why do I care about regex?”

Good point. Regex allows you to look for patterns within strings using expressions and conditionals within your search query. To demonstrate the power of regex consider the following: You need to replace all email addresses within a string with a hyperlinked string that when clicked, will trigger a new mail within the user’s default email program. Sure, you can do it without regex, but look how easy it is using the .NET built-in regex support:

public string EncodeEmailLinks(string stringToBeEncoded) {
  // The named group "email" captures each address so the replacement can reuse it.
  string regex = @"(?<email>\w+[\w.]*@[\w\.]+\.\w+)";
  System.Text.RegularExpressions.RegexOptions options =
  (
    System.Text.RegularExpressions.RegexOptions.IgnorePatternWhitespace
    | System.Text.RegularExpressions.RegexOptions.Multiline
    | System.Text.RegularExpressions.RegexOptions.IgnoreCase
  );
  System.Text.RegularExpressions.Regex reg = new System.Text.RegularExpressions.Regex(regex, options);
  // Wrap every match in a mailto: link; ${email} inserts the captured address.
  return reg.Replace(stringToBeEncoded, "<a href=\"mailto:${email}\">${email}</a>");
}

That only took 4 lines! Hopefully you can see the power of regular expressions… if not… well that’s not the point of this post; I want to share with you how I think it could help you with your MCMS sites.

How can regular expressions be used in a MCMS site to your advantage?

A powerful text processor can be applied in many different ways. Within MCMS, we could use it to:

Validate a GUID
Extract a GUID from a URL
Find (and replace) specific strings within content for validation
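
To make the first two ideas concrete, here is a minimal sketch of validating a GUID and pulling one out of a URL. The class and method names (GuidPatterns, IsGuid, ExtractGuid) and the sample URL in the note below are made up for illustration; they aren’t part of MCMS or the PAPI.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names here are assumptions, not MCMS APIs.
public static class GuidPatterns
{
    // 8-4-4-4-12 hexadecimal digits, optionally wrapped in braces.
    // Note: the braces are checked independently, so "{...guid" without
    // a closing brace would also be accepted by this simple pattern.
    private const string GuidBody =
        @"\{?[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}\}?";

    // Anchored: the whole input must be a GUID.
    private static readonly Regex WholeGuid = new Regex("^" + GuidBody + "$");

    // Unanchored, with a named group so the GUID can be pulled out of a longer string.
    private static readonly Regex EmbeddedGuid = new Regex("(?<guid>" + GuidBody + ")");

    public static bool IsGuid(string candidate)
    {
        return candidate != null && WholeGuid.IsMatch(candidate);
    }

    // Returns the first GUID found in the URL, or null when there is none.
    public static string ExtractGuid(string url)
    {
        Match match = EmbeddedGuid.Match(url ?? string.Empty);
        return match.Success ? match.Groups["guid"].Value : null;
    }
}
```

With this sketch, GuidPatterns.ExtractGuid("/pages/item.aspx?id={3F2504E0-4F89-11D3-9A0C-0305E82C3301}") returns the GUID including its braces, and GuidPatterns.IsGuid("not-a-guid") returns false.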

The cases above are only a few of the possibilities. Sure, there are other ways to do it. In fact, regular expressions, like many technologies, aren’t always the best way to solve a problem. Let me show you one way I used regex within a recent MCMS project… you may even be doing the same thing we were doing before we implemented the regex solution!

First, a brief introduction: we were using XML, XSLT, and ASP.NET caching to deliver a very workable navigation solution for a MCMS site. This solution involved a lot of what I call “channel walking” where the code has to walk up and down the channel tree similar to what you may do with XML documents. After implementing a regex solution the code was much more readable and maintainable, and even after removing all caching, it was still noticeably faster!

The site’s channel structure was something like the tree to the right.

The SiteRoot is the root of the site. I’m going to refer to the channels under SiteRoot as sections. Within the Markets section, each market was represented with its own channel. In addition, a market may serve a non-English-speaking audience, so it may need one or more additional translations. Each translation of the market content was represented with its own channel.

The global navigation had the following requirements:

When the current posting was within the root of the site (within the SiteRoot channel), the navigation needed to show a list of all the top-level channels.
When the current posting was within the AboutUs, NewsAndEvents, or ContactUs channel, it needed to show the contents of the current active section. In addition, it needed to show a link back to the homepage of the site and links back to each of the other non-active sections.
When the current posting was within a specific market, it needed to show the contents of the current market. In addition, it needed to show a link back to the homepage of the site and links back to each of the other non-active sections.
When the current posting was within a specific market that had multiple translations, it needed to show the contents of the current market’s language, as well as links to each of the other languages. In addition, it needed to show a link back to the homepage of the site and links back to each of the other non-active sections.

The language requirements for this site were unusual. Unlike many sites where the entire site is translated, the only translated parts were the product pages of those markets that needed multiple translations. Every market that had multiple translations had, at a minimum, an English version. When a user clicked on such a market, they would be taken to the English version by default and could switch to different languages from there.

Each time the navigation was generated, the following needed to be determined:

Which section is the current active posting in?
If the current active posting is within a market, is that market multilingual?
- If not, simply show all channels within that market, expanding the node that’s currently active.
- If so, show all channels within that language, expanding the node that’s currently active, and then add “View in [language]” links to the end of the navigation so the user can switch between languages.

Note: this site didn’t utilize channel rendering scripts… an empty channel was never shown in the navigation and each channel had a default posting named “default”. So, every ChannelItem.Path had a posting name at the end of the path.

Building MCMS navigation without regex

The site originally didn’t use any regex at all.

The old way of building the navigation involved a considerable amount of channel walking. Starting from the current active posting, the code walked up the tree to find the current section, and then back down to see what language (if any) it was in, etc. It then needed to get references to each of the sections that weren’t active as well as the SiteRoot channel. Once all that was determined, the code would then walk the tree some more and create an XML document that represented what the final navigation structure would look like for the currently active channel. This XML structure was then transformed to HTML using an XSLT document stored as an embedded resource within the project. The resulting HTML was then added to cache for 10 minutes, using the current channel’s unique GUID as the key.

Wow, that sounds like a pain… yes it was. The code was never easy to read and it wasn’t very flexible. The code determined whether it was in a market, and whether that market was multilingual, with custom properties. This means there were quite a few custom properties that needed to be added and managed when certain channels were created. Because the MCMS PAPI doesn’t support programmatic creation of custom properties on channels, this was a manual support issue. Not ideal.

However, the solution worked, and it was quite fast (thanks to ASP.NET caching and the OutputCache directive’s VaryByParam attribute). “If it isn’t broke, don’t fix it” could have easily been applied here… but I wanted to see what impact regex could have on this solution.

Building MCMS navigation with regex

As with any MCMS site, some “channel walking” is necessary… it’s just not avoidable in large and complex sites. However, the navigation we built could surely have been easier to maintain. The new solution replaced much of that walking with ChannelItem.Path analysis to determine where the current posting sat within the channel tree, and we did that analysis with regular expressions.

First, we created a static class that contained a list of regular expressions used to match specific cases.

For example, if the current active posting was “CompanyHistory” within the AboutUs section, the current ChannelItem.Path would be: /Channels/SiteRoot/AboutUs/CompanyHistory

We would use the following regex string to determine if we were in the AboutUs section: ^/Channels/SiteRoot/AboutUs. That regex string would match any and all postings within the AboutUs section.
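
A static class along those lines might look like the sketch below. The class and member names (NavigationPatterns, IsInAboutUs) are my own placeholders, not the project’s actual code. I’ve also assumed a trailing slash on each pattern so that a sibling channel whose name merely starts with “AboutUs” isn’t matched; that is safe here only because every path ends with a posting name.

```csharp
using System.Text.RegularExpressions;

// Hypothetical holder for the section patterns; all names are assumptions.
public static class NavigationPatterns
{
    private const RegexOptions Options =
        RegexOptions.IgnoreCase | RegexOptions.Compiled;

    // One anchored pattern per top-level section of the site.
    public static readonly Regex AboutUs =
        new Regex(@"^/Channels/SiteRoot/AboutUs/", Options);
    public static readonly Regex NewsAndEvents =
        new Regex(@"^/Channels/SiteRoot/NewsAndEvents/", Options);
    public static readonly Regex ContactUs =
        new Regex(@"^/Channels/SiteRoot/ContactUs/", Options);
    public static readonly Regex Markets =
        new Regex(@"^/Channels/SiteRoot/Markets/", Options);

    // True for any posting inside the AboutUs section, at any depth.
    public static bool IsInAboutUs(string channelItemPath)
    {
        return AboutUs.IsMatch(channelItemPath);
    }
}
```

Creating the Regex objects once as static fields means each navigation request reuses them instead of re-parsing the pattern strings.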

Note

For these query samples, I’ll show how they work using The Regulator. The top pane of the IDE contains the regex query. The lower right panel contains the strings to test the query against and the lower left panel shows the strings that matched the regex query.

Granted, that’s pretty straightforward and easy. So let’s pick one that’s slightly more complex. We’d use each regex query string within an if statement to determine which condition it matched. We needed some regex strings that would determine the following:

Is the current posting in the root Markets section and not within a specific market?
Is the current posting within a specific market, but not a market that’s multilingual?
Is the current posting within a specific multilingual market?

The regex query string used to match the first condition would be: (^/Channels/SiteRoot/Markets/\w+[^(/\w+)]?)

The first part of this query, ^/Channels/SiteRoot/Markets/\w+, matches any ChannelItem.Path within the Markets section. But we only want postings within the root of the Markets section, not deeper channels. The last part of the query, [^(/\w+)]?, is meant to say “do not match any strings that have an additional ‘/sometext’ after the Markets section.” Keep in mind that [^...] is a negated character class, so it rejects individual characters rather than the ‘/sometext’ sequence as a whole; check the exact text that was matched before relying on it. See this screenshot to see the query in action:
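
If you want a strict yes/no answer for this first condition in code, one option is to anchor the end of the pattern with $ instead. The sketch below puts the query from the text next to that anchored variant so the difference is easy to test; the class and member names are placeholders of mine, and the anchored pattern is my assumption rather than the query used on the project.

```csharp
using System.Text.RegularExpressions;

// Hypothetical comparison of two ways to test "posting is in the Markets root".
public static class MarketsRootCheck
{
    // The query from the text. The trailing [^(/\w+)]? is an optional negated
    // character class, so IsMatch also succeeds on deeper paths by matching
    // only their "/Channels/SiteRoot/Markets/<name>" prefix.
    public static readonly Regex FromText =
        new Regex(@"(^/Channels/SiteRoot/Markets/\w+[^(/\w+)]?)");

    // Assumed stricter variant: exactly one path segment (the posting name)
    // may follow "Markets/", and then the string must end.
    public static readonly Regex Anchored =
        new Regex(@"^/Channels/SiteRoot/Markets/\w+$");

    public static bool IsInMarketsRoot(string channelItemPath)
    {
        return Anchored.IsMatch(channelItemPath);
    }
}
```

For /Channels/SiteRoot/Markets/default both patterns match. For /Channels/SiteRoot/Markets/NorthAmerica/default the anchored pattern does not match, while the query from the text still matches the leading part of the path.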

The next regex query we need is one that will determine when we’re within a specific market, but not at the market root or within a multilingual channel: ^/Channels/SiteRoot/Markets/\w+/\w+[^(\w{2}/)]

The first part of this query, ^/Channels/SiteRoot/Markets/\w+, looks just like the first query above. The next little bit, /\w+, matches any postings or subchannels under the root Markets section. Since the root Markets section won’t have any postings under it and only contains subchannels, or markets, this will work for our situation. The last part of the query, [^(\w{2}/)], is meant to tell the regex parser to ignore any two-character string, which is what we’re using for each language translation. But what if we had a valid two-character channel that wasn’t a language? I’ll come back to that in a minute. See this screenshot to see the query in action:
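
Another way to express “a market, but not one of its two-character language channels” is a negative lookahead, which rejects a whole sequence rather than single characters. This is a sketch under my own assumptions (the class name, member names, and the lookahead pattern are not from the project), and like the query above it still treats every two-character channel as a language.

```csharp
using System.Text.RegularExpressions;

// Hypothetical check for "inside a market that is not multilingual".
public static class SingleLanguageMarketCheck
{
    // (?!\w{2}/) is a negative lookahead: the match fails when the segment
    // right after the market name is exactly two word characters followed by
    // a slash, which is how the language channels are named on this site.
    public static readonly Regex NotMultilingual =
        new Regex(@"^/Channels/SiteRoot/Markets/\w+/(?!\w{2}/)\w+");

    public static bool IsInSingleLanguageMarket(string channelItemPath)
    {
        return NotMultilingual.IsMatch(channelItemPath);
    }
}
```

With this pattern, /Channels/SiteRoot/Markets/Europe/Products/default matches, while /Channels/SiteRoot/Markets/NorthAmerica/en/Products/default and /Channels/SiteRoot/Markets/default do not.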

The last regex query we’d need is one that will determine when we’re within a multilingual market: ^/Channels/SiteRoot/Markets/\w+/\w{2}/\w+

The first part of this query, ^/Channels/SiteRoot/Markets/\w+, is straight from the first query above. The next part, /\w{2}/\w+, looks for two characters surrounded by forward slashes and another word. Perfect! See this screenshot to see the query in action.
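
Putting the three conditions into the if statements mentioned earlier could look roughly like this. It is a sketch, not the project’s code: the enum, the class, and the field names are invented, and I’ve ordered the checks from most specific to least specific so the single-language case can use a plain “any market” pattern.

```csharp
using System.Text.RegularExpressions;

// Hypothetical result type for the navigation logic.
public enum NavigationContext
{
    Other,
    MarketsRoot,
    SingleLanguageMarket,
    MultilingualMarket
}

public static class NavigationClassifier
{
    // A posting directly in the Markets section (assumed end-anchored variant).
    private static readonly Regex MarketsRootPattern =
        new Regex(@"^/Channels/SiteRoot/Markets/\w+$");

    // The multilingual query from the text: market, two-character channel, more path.
    private static readonly Regex MultilingualPattern =
        new Regex(@"^/Channels/SiteRoot/Markets/\w+/\w{2}/\w+");

    // Anything at least one level below a market.
    private static readonly Regex AnyMarketPattern =
        new Regex(@"^/Channels/SiteRoot/Markets/\w+/\w+");

    public static NavigationContext Classify(string channelItemPath)
    {
        if (MarketsRootPattern.IsMatch(channelItemPath))
            return NavigationContext.MarketsRoot;
        if (MultilingualPattern.IsMatch(channelItemPath))
            return NavigationContext.MultilingualMarket;
        // Reached only when the path is in a market but not in a language channel.
        if (AnyMarketPattern.IsMatch(channelItemPath))
            return NavigationContext.SingleLanguageMarket;
        return NavigationContext.Other;
    }
}
```

The order of the checks matters: the “any market” pattern also matches multilingual paths, so it has to come after the multilingual test.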

Let’s jump back to our second example… what if we have non-language two-character channels? We need to test whether each two-character channel really is a language. One way is to use a feature provided by the regex processor to extract the exact two characters out of the string and then use the CultureInfo class to see if it’s a valid language. To do this, we need to use the grouping feature. Let’s change our second query to the following: ^/Channels/SiteRoot/Markets/\w+/\w+(\w{2})*

Great, as this screenshot shows, we are now getting everything within a specific market!

Closer, but no cigar. What we need is to get anything two characters long, immediately within a specific market, surrounded by forward slashes. This is where named groups come in handy: ^/Channels/SiteRoot/Markets/\w+(/(?<language>\w{2})/)+

You’ll see we added ?<language> to the query, just before the \w{2}. As you can see from this screenshot, it found FL, en, and es as the potential languages (I added two state channels under NorthAmerica for testing purposes).
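
In code, reading a named group comes down to Match.Groups["language"]. The sketch below collects the candidate codes from a list of paths; the class name, the method name, and the sample paths in the note are placeholders of mine, and the pattern is the simpler single-group form used in the listing further down.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper that lists the two-character channels found under markets.
public static class LanguageCandidates
{
    // The named group "language" captures the two-character channel that
    // sits immediately below the market channel.
    private static readonly Regex Candidate =
        new Regex(@"^/Channels/SiteRoot/Markets/\w+/(?<language>\w{2})/");

    // Returns each distinct candidate code in the order it was first seen.
    public static List<string> Find(IEnumerable<string> channelItemPaths)
    {
        var found = new List<string>();
        foreach (string path in channelItemPaths)
        {
            Match match = Candidate.Match(path);
            if (!match.Success)
                continue;

            string code = match.Groups["language"].Value;
            if (!found.Contains(code))
                found.Add(code);
        }
        return found;
    }
}
```

Given the paths /Channels/SiteRoot/Markets/NorthAmerica/FL/default, /Channels/SiteRoot/Markets/NorthAmerica/en/default, /Channels/SiteRoot/Markets/NorthAmerica/es/Products/default and /Channels/SiteRoot/Markets/Europe/Products/default, Find returns FL, en and es.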

Now, we just take these potential languages and see if any errors are thrown when we try to load the CultureInfo class with one of these languages. The code below shows this test.

using System;
using System.Text.RegularExpressions;
using System.Globalization;

public class RegularExpression {
  public static void Main() {
    string channelToTest = "/Channels/SiteRoot/Markets/NorthAmerica/FL/default";
    bool test = IsLanguage(channelToTest);
    if (test)
      Console.WriteLine("Channel contains a language");
    else
      Console.WriteLine("Channel does not contain a language");
    Console.ReadLine();
  }

  public static bool IsLanguage(string channelToTest) {
    string regex = @"^/Channels/SiteRoot/Markets/\w+/(?<language>\w{2})/+";
    RegexOptions options = (
      RegexOptions.IgnorePatternWhitespace
      | RegexOptions.Multiline
      | RegexOptions.IgnoreCase
    );
    Regex reg = new Regex(regex, options);
    Match possibleLanguageMatch = reg.Match(channelToTest);
    if (possibleLanguageMatch.Success) {
      try {
        // The constructor throws when the name is not a culture it recognizes.
        new CultureInfo(possibleLanguageMatch.Groups["language"].Value);
        return true;
      } catch (ArgumentException) {
        return false;
      }
    } else
      return false;
  }
}
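
If you’d rather not use exceptions for control flow, one alternative is to compare the candidate against the cultures the framework already knows about. This is only a sketch: LanguageLookup and IsKnownLanguage are names I made up, and the result depends on which cultures are installed on the machine running the code.

```csharp
using System;
using System.Globalization;

// Hypothetical exception-free variant of the language test.
public static class LanguageLookup
{
    // True when the code equals the two-letter ISO language name of a
    // neutral (language-only) culture known to the runtime.
    public static bool IsKnownLanguage(string code)
    {
        if (string.IsNullOrEmpty(code))
            return false;

        foreach (CultureInfo culture in CultureInfo.GetCultures(CultureTypes.NeutralCultures))
        {
            // Skip the invariant culture, which is not a real language.
            if (culture.Equals(CultureInfo.InvariantCulture))
                continue;

            if (string.Equals(culture.TwoLetterISOLanguageName, code,
                              StringComparison.OrdinalIgnoreCase))
                return true;
        }
        return false;
    }
}
```

On a machine with the usual culture data installed, I’d expect en and es to be recognized and FL to be rejected, which is the distinction we’re after.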

Summary

I hope this example has shown you how you can use regular expressions within your MCMS projects and how they can assist you. Since I’ve started using them my productivity has gone up and my applications are much more performant. I’ve listed a few resources below that will help you get off the ground with regex, just as they helped me.

Additional Resources:

ISerializable.com (Roy Osherove’s blog) – quite a few regex posts, two specific ones I want to point you towards:
ISerializable.com :: Introduction to Regular Expressions
ISerializable.com :: Practical Parsing Using Groups in Regular Expressions
The Regulator - Roy Osherove’s free regular expression testing tool (what I used in the screenshots)… invaluable!
MSDN: System.Text.RegularExpressions
MSDN: Regular Expression Language Elements
Sams Teach Yourself Regular Expressions in 10 Minutes by Ben Forta (Sams, ISBN 0672325667) – fantastic “get off the ground” type book for regex… I constantly go back to it
Mastering Regular Expressions by Jeffrey Friedl (O’Reilly, ISBN 0596002890)
