Regular Expression Examples

One of the things that I love about string manipulation is the existence of regular expressions. For this reason, I have decided to share a few examples that may help those who are learning about regular expressions, so that they can understand them a bit better.


JavaScript – General Variable Name

In many languages, a variable must start off with a letter and may be followed by letters, numbers, and/or the underscore character.  Knowing this, we could use the following regular expression in JavaScript to match a variable name:

var expVarIFlag = /^[A-Z]\w*$/i;

Basically, the above regular expression matches a string only if the whole string fits the pattern for a variable name. I wrote [A-Z] rather than [A-Za-z] because in JavaScript you can put an “i” flag after the regular expression to make it case-insensitive. Another thing to note is that I used the \w class, which matches a single word character: any letter (A to Z regardless of case), any digit (0 to 9), or the underscore character. I used the asterisk instead of the plus sign because a variable name may be just one letter.

NOTE: Although this regular expression may work for other languages, in JavaScript, a variable name can also start off with an underscore or dollar sign.
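
As a rough sketch of the same idea in Python (the function names and the JavaScript-flavored pattern are my own illustrations, not part of the original example), the general rule and the looser JavaScript rule from the note could look like this. It is a simplification: real JavaScript identifiers also allow many Unicode characters and exclude reserved words.

```python
import re

# General rule from the text: a letter first, then letters, digits, or
# underscores. re.ASCII limits \w to [A-Za-z0-9_], as in JavaScript.
GENERAL_NAME = re.compile(r"^[A-Z]\w*$", re.IGNORECASE | re.ASCII)

# Hypothetical JavaScript-flavored variant: also allows "_" or "$" as the
# first character and "$" in later positions.
JS_NAME = re.compile(r"^[A-Z_$][\w$]*$", re.IGNORECASE | re.ASCII)


def is_general_name(text):
    """Return True if text fits the general variable-name pattern."""
    return GENERAL_NAME.fullmatch(text) is not None


def is_js_name(text):
    """Return True if text fits the simplified JavaScript pattern."""
    return JS_NAME.fullmatch(text) is not None
```

Using fullmatch() rather than match() is a deliberate choice here: in Python, "$" also matches just before a trailing newline, and fullmatch() rules that case out.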


PostgreSQL – Date (MM/DD/YYYY)

Even though using a regular expression shouldn’t be the way to completely validate a date, you can do so partially with the following in PostgreSQL:

SELECT id, text
FROM answers
WHERE text ~ '^(0\\d|1[012])/([012]\\d|3[01])/\\d{4}$';

The above query pulls every answer whose text looks like a date in MM/DD/YYYY form. The pattern works as follows:

  1. First it specifies that the first two characters are a 0 followed by another digit or a 1 followed by either a 0, 1, or 2.
  2. Next should come a forward slash.
  3. Next should be either…
    1.  0, 1, or 2 followed by any digit
    2. or 3 followed by a 0 or 1
  4. Finally should be another forward slash followed by four digits.

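One thing to notice is that in order to properly escape the class inside of a string (which is what we have to do here in PostgreSQL), you have to escape the backslash so that it will be interpreted as one backslash in front of the next character, thus rendering “\\d” as “\d”. This applies when PostgreSQL treats backslashes in ordinary string literals as escape characters. With standard_conforming_strings turned on (the default since PostgreSQL 9.1), backslashes in ordinary string literals are taken literally, so you would either write a single backslash or use the E'...' string syntax with the doubled backslashes.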
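
Since the pattern only checks the shape of the text, a second step is needed to reject values such as 02/30/2024. Here is a minimal sketch of that two-step check in Python (the function names are hypothetical, and handing the final check to datetime.strptime() is my own choice, not something the query above does):

```python
import re
from datetime import datetime

# Same shape check as the PostgreSQL pattern: MM/DD/YYYY.
DATE_SHAPE = re.compile(r"(0\d|1[012])/([012]\d|3[01])/\d{4}")


def looks_like_date(text):
    """Return True if text has the MM/DD/YYYY shape (partial validation)."""
    return DATE_SHAPE.fullmatch(text) is not None


def is_real_date(text):
    """Return True only if text has the right shape and names a real day."""
    if not looks_like_date(text):
        return False
    try:
        datetime.strptime(text, "%m/%d/%Y")
    except ValueError:
        # The shape was fine, but the calendar has no such day.
        return False
    return True
```

Note that the shape check alone accepts 00/00/2000 and 02/30/2024, which is exactly why the regular expression is only a partial validation.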


PHP – Hexadecimal Color Code

In CSS, a color code can be in many different forms. One accepted form is hexadecimal. The hex form can be three characters or six characters long. It can start off with a number sign, but this symbol isn’t required. Knowing all of this, we could use the following in PHP to validate the hex color:

$pattern = '/^#?([0-9A-F]{3}){1,2}$/i';
$validHex = preg_match($pattern, $_GET['hex']);

The preg_match() function is used to validate the GET parameter called “hex” against our regular expression:

  1. First it specifies that the first character may be a number sign (#).
  2. Next I have defined a parenthesized group which matches any three hexadecimal digits.
  3. After that, I am specifying that my parenthesized group pattern may appear once or two times in a row and that no other characters should follow.
  4. Finally, you will notice that I am again using the “i” flag to indicate that this is a case-insensitive pattern.
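
To see which inputs the pattern accepts and rejects, here is a small sketch of the same check in Python (the function name is hypothetical, and fullmatch() stands in for the ^ and $ anchors):

```python
import re

# Optional "#", then one or two groups of three hexadecimal digits.
HEX_COLOR = re.compile(r"#?([0-9A-F]{3}){1,2}", re.IGNORECASE)


def is_hex_color(text):
    """Return True if text is a 3- or 6-digit hex color, with or without '#'."""
    return HEX_COLOR.fullmatch(text) is not None
```

Because the group must repeat exactly once or twice, four- and five-digit values such as #abcd and #12345 are rejected.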

Python – Simple Image File Names

Let’s use Python now to check to see if a file name looks like a valid image name:

# Import the regular expression library
import re

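# Define and compile the regular expression. A raw string keeps the
# backslashes intact, so "\\" inside the class excludes a literal backslash.
pat = r'^[^/\\?%*:|"<>]+\.(jpg|png|gif|bmp)$'
reImg = re.compile(pat, re.I)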

# Get the file name from the user (on Python 2, use raw_input() instead)
fileName = input("File name:  ")

# Determine if the file name is an image name
isImage = reImg.match(fileName) is not None

The regular expression created does the following:

  1. First makes sure that the string starts off with one or more characters which are none of the following:  /  \  ?  %  *  :  |  "  <  >
  2. In the end it checks that a dot is found followed by one of the following extensions which must appear at the end of the string:
    1. jpg
    2. png
    3. gif
    4. bmp
  3. It is also important to note that by using “re.I“, I specified that casing would be ignored.

The code should basically prompt the user for a file name and then validate the string entered to determine if it matches the regular expression for an image.  The boolean value indicating whether or not it is an image is stored in the isImage variable.


VBScript – Format Large Integer With Commas

The following is how you could use a regular expression to insert commas into a number (integer):

' Setup the RegExp for testing if input is an integer.
Dim re : Set re = new RegExp
re.Pattern = "^(0|-?[1-9]\d*)$"

' Get the input integer from the user.
input = InputBox("Enter an integer", "Your Integer", 123456789)

' If the input is an integer...
If re.Test(input) Then
  ' Modify the pattern to input the commas correctly.
  re.Pattern = "(\d)(?=(\d{3})+$)"
  re.Global = True

  ' Reformat the integer, if given.
  newInput = re.Replace(input, "$1,")

  ' Display the input formatted with commas.
  MsgBox input & " became " & newInput

' If the input is not an integer, tell the user so.
Else
  MsgBox "The input given wasn't recognized as an integer."
End If

The first regular expression tests that the input is either simply a zero or an optional minus sign followed by one or more digits, the first of which is non-zero. In other words, the first pattern makes sure that the input is an integer that doesn’t start with a zero (unless it is zero). The second regular expression is what inserts the comma(s) in the right place(s). It finds every digit that is followed by one or more complete groups of three digits running to the end of the string. Because the group starts with “?=”, it is a look-ahead: the digits it checks are not consumed by the match, so they are still available to be matched on the next pass through.
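
The same two patterns carry over to other languages almost unchanged. As a minimal sketch in Python (the function name is hypothetical, and raising ValueError for bad input is my own choice in place of the message box):

```python
import re

# Either "0", or an optional minus sign and digits with no leading zero.
INTEGER = re.compile(r"(0|-?[1-9]\d*)", re.ASCII)

# A digit followed (look-ahead only) by complete groups of three digits
# that run to the end of the string.
GROUPING = re.compile(r"(\d)(?=(\d{3})+$)", re.ASCII)


def add_commas(text):
    """Insert thousands separators into a string holding an integer."""
    if INTEGER.fullmatch(text) is None:
        raise ValueError("not recognized as an integer: %r" % text)
    # \1 puts the matched digit back, followed by the comma.
    return GROUPING.sub(r"\1,", text)
```

For example, add_commas("123456789") gives "123,456,789", while add_commas("999") returns the string unchanged because no digit is followed by a full group of three.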

JavaScript Snippet – RegExp.prototype.clone()

There have been many times when I needed to simply modify the flags (options) for a regular expression in JavaScript. Unfortunately, the global, ignoreCase, multiline, and sticky (for Firefox) flags are read-only. In addition, in engines that predate ES2015, when creating a new regular expression from an old one, you can’t supply a regular expression as the first parameter of the constructor followed by a string indicating the desired flags. For this reason, I wrote a definition for a cloning function for regular expressions that provides the ability to supply different flags:

(function(stickySymbol) {
  RegExp.prototype.clone = function(options) {
    // If the options are not in string format...
    if(options + "" !== options) {
      // If the options evaluate to true, use the properties to construct
      // the flags.
      if(options) {
        options = (options.ignoreCase ? "i" : "")
          + (options.global ? "g" : "")
          + (options.multiline ? "m" : "")
          + (options.sticky ? "y" : "");
      }
      // If the options evaluate to false, use the current flags.
      else {
        options = (this + "").replace(/[\s\S]+\//, "");
      }
    }

    // Return the new regular expression, making sure to only include the
    // sticky flag if it is available.
    return new RegExp(this.source, options.replace("y", stickySymbol));
  };
})("sticky" in /s/ ? "y" : "");

After the above definition is executed, declarations such as the following can be made:

// regular expression that matches every lowercased vowel
var reLowerCaseVowels = /[aeiou]/g;

// clone of the regular expression that matches every lowercased vowel
var reLowerCaseVowels2 = reLowerCaseVowels.clone();

// regular expressions that match every vowel (case-insensitive),
// specifying the flags via string
var reVowels = reLowerCaseVowels.clone("ig");

// regular expressions that match every vowel (case-insensitive),
// specifying the flags via object literal form
var reVowels2 = reLowerCaseVowels.clone({
  ignoreCase : true,
  global : true
});
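
For comparison, compiled patterns in Python are also read-only, and the same clone-with-new-flags idea can be sketched there. This is only an analogy under my own assumptions: the clone_pattern name is hypothetical, and Python has no counterpart to the global flag, since that choice is made by calling findall() or sub() instead.

```python
import re


def clone_pattern(compiled, flags=None):
    """Recompile a pattern, keeping its own flags unless new ones are given."""
    if flags is None:
        flags = compiled.flags
    return re.compile(compiled.pattern, flags)


# Pattern that matches every lowercase vowel.
lower_vowels = re.compile(r"[aeiou]")

# Clone with the same flags.
same_vowels = clone_pattern(lower_vowels)

# Clone that matches every vowel, regardless of case.
any_vowels = clone_pattern(lower_vowels, re.IGNORECASE)
```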

VBScript – RegExp Replace Using A Callback Function

One of the nice things about JavaScript is its functional nature. This is especially nice when it comes to dealing with strings and regular expressions. For instance, in JavaScript, you can use the following code to capitalize every other letter in a string:

var str = "where in the world is carmen sandiego?";
var strWeird = str.replace(/(.)(.)/g, function(a,b,c) {
  return b.toUpperCase() + c;
});
alert(strWeird); // WhErE In tHe wOrLd iS CaRmEn sAnDiEgO?

Cool stuff, right? Wouldn’t it be nice to be able to use a similar approach in VBScript? Believe it or not, you can. Here is the definition for a function which will allow you to do something similar:

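' str is passed ByVal because it is reassigned below; VBScript passes
' arguments ByRef by default, which would also change the caller's variable.
Function RegExpReplace(re, ByVal str, replacement)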
	' If replacement is a string, use the native RegExp.Replace function.
	If TypeName(replacement) = "String" Then
		RegExpReplace = re.Replace(str, replacement)
	' Since replacement is not a string, call replacement with every match
	' object and replace the match with the return value.
	Else
		Dim mc, m, ret, offset
		offset = 0
		Set mc = re.Execute(str)
		For Each m In mc
			ret = replacement(m)
			str = Left(str, m.FirstIndex - offset) & ret _
				& Mid(str, m.FirstIndex + m.Length - offset + 1)
			offset = offset + m.Length - Len(ret)
		Next
		RegExpReplace = str
	End If
End Function

The above function takes three parameters: the regular expression, the string that may be changed and the replacement function (or string). Now the question is, how do we pass the function (or at least a reference to it)? We can do this by taking the name of the function and using the GetRef function to get a reference to it. The following is the equivalent of what was done in JavaScript at the onset of this post:

Function fnUp1(objMatch)
  ' Uppercase the first captured character and keep the second as is.
  fnUp1 = UCase(objMatch.Submatches(0)) & objMatch.Submatches(1)
End Function

Dim re: Set re = New RegExp
re.Pattern = "(.)(.)"
re.Global = True

Dim str: str = "where in the world is carmen sandiego?"
Dim strWeird: strWeird = RegExpReplace(re, str, GetRef("fnUp1"))
MsgBox strWeird ' WhErE In tHe wOrLd iS CaRmEn sAnDiEgO?

Okay, of course the code is not as short as it is in JavaScript, because of the way that regular expressions must be created and the fact that anonymous functions don’t exist in VBScript, but this is just a simple example. You may need to use this function in many different places in your code.

The other thing that I briefly mentioned is that the third parameter may be a string instead of a reference to a function. This is basically a shortcut for the RegExp.Replace function which natively exists.
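
The offset bookkeeping inside RegExpReplace is the part that is easiest to get wrong, so here is a sketch of the same algorithm in Python (the regexp_replace name is hypothetical, and Python's built-in re.sub() already accepts a function, so this is purely illustrative). All matches are found in the original string first, and offset tracks how much shorter the working string has become so far.

```python
import re


def regexp_replace(pattern, text, replacement):
    """Replace each match with a string or with a callback's return value."""
    # A plain string goes straight to the native substitution.
    if isinstance(replacement, str):
        return pattern.sub(replacement, text)

    result = text
    offset = 0  # characters lost so far (negative if the result has grown)
    for match in pattern.finditer(text):
        new_text = replacement(match)
        length = match.end() - match.start()
        start = match.start() - offset
        result = result[:start] + new_text + result[start + length:]
        offset += length - len(new_text)
    return result


def upper_first(match):
    """Uppercase the first captured character and keep the second as is."""
    return match.group(1).upper() + match.group(2)


weird = regexp_replace(
    re.compile(r"(.)(.)"), "where in the world is carmen sandiego?", upper_first
)
```

Here weird ends up as "WhErE In tHe wOrLd iS CaRmEn sAnDiEgO?", the same result as the JavaScript and VBScript versions.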

Now you see that it is possible to script in a functional way with VBScript. Still, as is evidenced by the examples, JavaScript (and JScript) can usually accomplish the same thing with less code. :D

jPaq – Changes In Wildcard Expression Parsing

After reviewing this page, I realized that I need to fix some of the ways that jPaq converts wildcard expressions into regular expressions. The first thing I need to do is have the @ character act as a meta-character equivalent to the + meta-character in a regular expression. The next thing I need to change is the effect of prefixing a character with the ~ character. In reality, this character is actually doing what both the \ character and the ^ do. This means that the \ character will need to start acting like the escape character (which will not be hard since that automatically happens in regular expressions). That also means that the following translations will have to occur:

Wildcard    RegExp
^t          \t
^^          \^
^s          \u00A0
Match Beginning & End of Words

One of the things I had to study up on while creating jPaq was regular expressions, because I wanted those who know how to use wildcard expressions, but not regular expressions, to be able to use equivalent regular expressions. One of the more interesting tasks that I had was approximating the word-boundary meta-characters:

  • <
  • >

The less than sign represents the beginning of a word, while the greater than sign represents the end of a word. The question is, how do you represent these two in the form of a regular expression? To do so, I am using a positive look-ahead grouping. Here is the regular expression for matching the beginning of a word:

/(?=\b\w)/

Here is the regular expression for matching the end of a word:

/(?=\b\W|\b$)/

As of right now, jPaq creates regular expressions which do the same thing, but the two expressions above are better optimized. Therefore, in the next version, I will definitely use them to shorten and optimize the generated expressions. In fact, tomorrow I will talk about what else needs to change in jPaq to better approximate the wildcard expressions that are available in Microsoft Word.
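
To check where the two look-ahead expressions actually match, here is a small sketch in Python, whose \b, \w, and \W behave the same way for plain ASCII text (the function names are hypothetical). Both patterns match an empty string, so what matters is the position of each match.

```python
import re

WORD_START = re.compile(r"(?=\b\w)")     # position just before a word
WORD_END = re.compile(r"(?=\b\W|\b$)")   # position just after a word


def word_starts(text):
    """Return the index at which each word begins."""
    return [m.start() for m in WORD_START.finditer(text)]


def word_ends(text):
    """Return the index just past the last character of each word."""
    return [m.start() for m in WORD_END.finditer(text)]
```

For "hello world", word_starts() returns [0, 6] and word_ends() returns [5, 11].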