10.4 The Regex Class

The .NET Framework provides an object-oriented approach to regular expression matching and replacement.

The Framework Class Library namespace System.Text.RegularExpressions is the home to all the .NET Framework objects associated with regular expressions. The central class for regular expression support is Regex, which represents an immutable, compiled regular expression. Example 10-9 rewrites Example 10-8 to use regular expressions and thus solve the problem of searching for more than one type of delimiter.

Example 10-9. Using the Regex class for regular expressions
Option Strict On
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Namespace RegularExpressions

    Class Tester

        Public Sub Run( )
            Dim s1 As String = "One,Two,Three Liberty Associates, Inc."
            Dim theRegex As New Regex(" |, |,")
            Dim sBuilder As New StringBuilder( )
            Dim id As Integer = 1

            Dim subString As String
            For Each subString In theRegex.Split(s1)
                sBuilder.AppendFormat("{0}: {1}"  _
                  & Environment.NewLine, id, subString)
                id = id + 1
            Next subString
            Console.WriteLine("{0}", sBuilder.ToString( ))
        End Sub 'Run

        Public Shared Sub Main( )
            Dim t As New Tester( )
            t.Run( )
        End Sub 'Main
    End Class 'Tester
End Namespace 'RegularExpressions

Output:
1: One
2: Two
3: Three
4: Liberty
5: Associates
6: Inc.

Example 10-9 begins by creating a string, s1, identical to the string used in Example 10-8:

Dim s1 As String = "One,Two,Three Liberty Associates, Inc."

and a regular expression that will be used to search that string:

Dim theRegex As New Regex(" |, |,")

One of the overloaded constructors for Regex takes a regular expression string as its parameter.

This can be a bit confusing. In the context of a VB.NET program, which is the regular expression: the text passed in to the constructor or the Regex object itself? It is true that the text string passed to the constructor is a regular expression in the traditional sense of the term. From an object-oriented VB.NET point of view, however, the argument to the constructor is just a string of characters; it is the Regex object that is the regular expression object.
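The distinction can be seen in a few lines. The following is a minimal sketch, not part of the numbered examples (the module name PatternVersusRegex is made up for illustration): the pattern is an ordinary String, while the Regex object built from it does the matching and hands the original pattern back through ToString( ).

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

' Hypothetical module name, used only for this sketch
Module PatternVersusRegex
    Sub Main()
        ' The pattern is just a String of characters
        Dim pattern As String = " |, |,"

        ' The Regex object is built from that pattern
        Dim theRegex As New Regex(pattern)

        ' Regex.ToString() returns the pattern passed to the constructor
        Console.WriteLine(theRegex.ToString() = pattern)   ' True

        ' It is the Regex object, not the string, that performs the match
        Console.WriteLine(theRegex.IsMatch("One,Two"))     ' True
    End Sub
End Module
```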

The rest of the program proceeds like Example 10-8 except that rather than calling Split() on string s1, the Split( ) method of Regex is called. Regex.Split( ) acts in much the same way as String.Split( ), returning an array of strings as a result of matching the regular expression pattern within theRegex.
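To see why the regular expression helps, it may be useful to compare the two Split( ) methods side by side. The sketch below assumes that Example 10-8 (not reproduced in this section) splits on individual delimiter characters; the module name SplitComparison is invented for illustration.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

' Hypothetical module name, used only for this sketch
Module SplitComparison
    Sub Main()
        Dim s1 As String = "One,Two,Three Liberty Associates, Inc."

        ' String.Split treats each delimiter character separately, so the
        ' comma followed by a space after "Associates" yields an empty string
        Dim byChars() As String = s1.Split(New Char() {" "c, ","c})
        Console.WriteLine(byChars.Length)    ' 7

        ' Regex.Split can treat ", " as a single delimiter
        Dim byRegex() As String = Regex.Split(s1, " |, |,")
        Console.WriteLine(byRegex.Length)    ' 6
    End Sub
End Module
```

The extra element in the first result is the empty string between the comma and the space in "Associates, Inc."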

Regex.Split( ) is overloaded. The simplest version is called on an instance of Regex as shown in Example 10-9. There is also a shared version of this method, which takes a string to search and the pattern to search with, as illustrated in Example 10-10.

Example 10-10. Using the shared Split( ) method
Option Strict On
Imports System
Imports System.Text
Imports System.Text.RegularExpressions


Namespace RegularExpressions

    Class Tester

        Public Sub Run( )
            Dim s1 As String = "One,Two,Three Liberty Associates, Inc."
            Dim sBuilder As New StringBuilder( )
            Dim id As Integer = 1

            Dim subString As String
            For Each subString In Regex.Split(s1, " |, |,")
                sBuilder.AppendFormat("{0}: {1}" _
                  & Environment.NewLine, id, subString)
                id = id + 1
            Next subString
            Console.WriteLine("{0}", sBuilder.ToString( ))
        End Sub 'Run


        Public Shared Sub Main( )
            Dim t As New Tester( )
            t.Run( )
        End Sub 'Main
    End Class 'Tester
End Namespace 'RegularExpressions

Example 10-10 is identical to Example 10-9 except that it does not instantiate an object of type Regex. Instead, Example 10-10 uses the shared version of Split( ), which takes two arguments: a string to be searched and a regular expression string that represents the pattern to match.

The instance Split( ) method is also overloaded. One version takes a count that limits the number of substrings returned; another also takes the position within the target string at which the search will begin.
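These overloads might be exercised as in the following sketch. It reuses the string and pattern from Example 10-9; the module name SplitOverloads and the particular count and position values are chosen only for illustration.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

' Hypothetical module name, used only for this sketch
Module SplitOverloads
    Sub Main()
        Dim s1 As String = "One,Two,Three Liberty Associates, Inc."
        Dim theRegex As New Regex(" |, |,")
        Dim piece As String

        ' Split(input, count): return at most three substrings;
        ' the last one holds the unsplit remainder of the string
        For Each piece In theRegex.Split(s1, 3)
            Console.WriteLine(piece)
        Next piece

        ' Split(input, count, startat): begin looking for delimiters
        ' at character position 8 (an arbitrary position for this demo)
        For Each piece In theRegex.Split(s1, 2, 8)
            Console.WriteLine(piece)
        Next piece
    End Sub
End Module
```

The first loop should print One, Two, and then the remainder Three Liberty Associates, Inc. on a single line. In the second loop, delimiters that occur before the starting position are left alone.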

10.4.1 Using Match and MatchCollection

Two additional classes in the .NET RegularExpressions namespace allow you to search a string repeatedly and to return the results in a collection. The collection returned is of type MatchCollection, which consists of zero or more Match objects. Two important properties of a Match object are its length and its value, each of which can be read, as illustrated in Example 10-11.

Example 10-11. Using MatchCollection and Match
Option Strict On
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Namespace RegularExpressions

    Class Tester

        Public Sub Run( )
            Dim string1 As String = "This is a test string"
            Dim theReg As New Regex("(\S+)\s")

            Dim theMatches As MatchCollection = theReg.Matches(string1)

            Dim theMatch As Match
            For Each theMatch In theMatches

                Console.WriteLine("theMatch.Length: {0}", _
                   theMatch.Length)

                If theMatch.Length <> 0 Then
                    Console.WriteLine("theMatch: {0}", _
                       theMatch.ToString( ))
                End If

            Next theMatch

        End Sub 'Run

        Public Shared Sub Main( )
            Dim t As New Tester( )
            t.Run( )
        End Sub 'Main
    End Class 'Tester
End Namespace 'RegularExpressions


Output:     
theMatch.Length: 5
theMatch: This
theMatch.Length: 3
theMatch: is
theMatch.Length: 2
theMatch: a
theMatch.Length: 5
theMatch: test     

Example 10-11 creates a simple string to search:

Dim string1 As String = "This is a test string"

and a trivial regular expression to search it:

Dim theReg As New Regex("(\S+)\s")

The string \S finds nonwhitespace, and the plus sign indicates one or more. The string \s (note lowercase) indicates whitespace. Thus, together, this string looks for any nonwhitespace characters followed by whitespace.

The output shows that the first four words were found. The final word was not found because it is not followed by a space. If you insert a space after the word string and before the closing quote marks, this program will find that word as well.
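A quick way to check this is to count the matches with and without the trailing space. This is a small sketch using the same pattern as Example 10-11; the module name TrailingSpace is made up.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

' Hypothetical module name, used only for this sketch
Module TrailingSpace
    Sub Main()
        Dim theReg As New Regex("(\S+)\s")

        ' No whitespace follows "string", so the final word is not matched
        Console.WriteLine(theReg.Matches("This is a test string").Count)    ' 4

        ' With a trailing space, the final word is matched as well
        Console.WriteLine(theReg.Matches("This is a test string ").Count)   ' 5
    End Sub
End Module
```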

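The Length property is the length of the captured substring and will be discussed in Section 10.4.3, later in this chapter. Each match here includes the whitespace character that follows the word, which is why the Length reported for This is 5 rather than 4.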

10.4.2 Using Regex Groups

It is often convenient to group subexpression matches together so that you can parse out pieces of the matching string. For example, you might want to match on IP addresses and group all IP addresses found anywhere within the string.

IP addresses are used to locate computers on a network, and typically have the form nnn.nnn.nnn.nnn (such as 209.204.146.22).

The Group class allows you to create groups of matches based on regular expression syntax, and represents the results from a single grouping expression.

A grouping expression names a group and provides a regular expression; any substring matching the regular expression will be added to the group. For example, to create an ip group you might write:

"(?<ip>(\d|\.)+)\s" 

The Match class derives from Group and has a collection called "Groups," which contains all the groups your Match finds.

Example 10-12 illustrates the creation and use of the Groups collection and Group classes.

Example 10-12. Using the Group class
Option Strict On
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Namespace RegularExpressions

    Class Tester

        Public Sub Run( )
            Dim string1 As String = _
             "04:03:27 127.0.0.0 LibertyAssociates.com"

            ' time = one or more digits or colons 
            ' followed by a space
            ' ip address = one or more digits or dots 
            ' followed by space
            ' site = one or more characters
            Dim regString As String = "(?<time>(\d|\:)+)\s" & _
            "(?<ip>(\d|\.)+)\s" & _
            "(?<site>\S+)"

            Dim theReg As New Regex(regString)
            Dim theMatches As MatchCollection = theReg.Matches(string1)

            Dim theMatch As Match
            For Each theMatch In theMatches
                If theMatch.Length <> 0 Then
                    Console.WriteLine( _
                        "theMatch: {0}", _
                        theMatch.ToString( ))
                    Console.WriteLine( _
                        "time: {0}", _
                       theMatch.Groups("time"))
                    Console.WriteLine( _
                         "ip: {0}", _
                        theMatch.Groups("ip"))
                    Console.WriteLine( _
                         "site: {0}", _
                        theMatch.Groups("site"))
                End If
            Next theMatch

        End Sub 'Run

        Public Shared Sub Main( )
            Dim t As New Tester( )
            t.Run( )
        End Sub 'Main
    End Class 'Tester
End Namespace 'RegularExpressions

Output:
theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com
time: 04:03:27
ip: 127.0.0.0
site: LibertyAssociates.com 

Again, Example 10-12 begins by creating a string to search:

Dim string1 As String = _
             "04:03:27 127.0.0.0 LibertyAssociates.com"

This string might be one of many recorded in a web server log file or produced as the result of a search of the database. In this simple example there are three columns: one for the time of the log entry, one for an IP address, and one for the site, each separated by spaces. In a real-life problem, of course, you might need to do more complex searches and use other delimiters.

In Example 10-12, you create a single Regex object to search strings of this type and break them into three groups: time, ip address, and site. The regular expression string is fairly simple (as regular expressions go), so the example is easy to understand. Keep in mind, however, that in a real search you would probably use only a part of the source string rather than the entire source string, as shown here:

Dim regString As String = "(?<time>(\d|\:)+)\s" & _
"(?<ip>(\d|\.)+)\s" & _
"(?<site>\S+)"

Let's focus on the characters that create the group:

(?<time>

The parentheses create a group. Everything between the opening parenthesis (just before the question mark) and the matching closing parenthesis (in this case, after the plus sign) is a single group:

(?<time>(\d|\:)+)

The string ?<time> names that group time, and the group is associated with the text matched by the regular expression (\d|\:)+. Together with the \s that follows the group, this can be interpreted as "one or more digits or colons followed by a space."

Similarly, the string ?<ip> names the ip group, and ?<site> names the site group. As Example 10-11 does, Example 10-12 asks for a collection of all the matches:

Dim theMatches As MatchCollection = theReg.Matches(string1)

Example 10-12 iterates through the Matches collection, finding each Match object.

If the Length of theMatch is not 0, a match was found, and the code prints the entire match:

If theMatch.Length <> 0 Then
    Console.WriteLine( _
        "theMatch: {0}", _
        theMatch.ToString( ))

Here's the output:

theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com

It then gets the "time" group from theMatch.Groups collection and prints that value:

Console.WriteLine( _
    "time: {0}", _
   theMatch.Groups("time"))

This produces the output:

time: 04:03:27

The code then obtains ip and site groups:

Console.WriteLine( _
     "ip: {0}", _
    theMatch.Groups("ip"))
Console.WriteLine( _
     "site: {0}", _
    theMatch.Groups("site"))

This produces the output:

ip: 127.0.0.0
site: LibertyAssociates.com

In Example 10-12, the Matches collection has only one Match. It is possible, however, for the pattern to match more than once within a string. To see this, modify string1 in Example 10-12 to provide several log file entries instead of one, as follows:

Dim string1 As String = "04:03:27 127.0.0.0 LibertyAssociates.com " & _
"04:03:28 127.0.0.0 foo.com " & _
"04:03:29 127.0.0.0 bar.com "

This creates three matches in the MatchCollection, theMatches. Here's the resulting output:

theMatch: 04:03:27 127.0.0.0 LibertyAssociates.com
time: 04:03:27
ip: 127.0.0.0
site: LibertyAssociates.com
theMatch: 04:03:28 127.0.0.0 foo.com
time: 04:03:28
ip: 127.0.0.0
site: foo.com
theMatch: 04:03:29 127.0.0.0 bar.com
time: 04:03:29
ip: 127.0.0.0
site: bar.com

In this example, theMatches contains three Match objects. Each time through the For Each loop, you retrieve the next Match in the collection and display its contents:

For Each theMatch In theMatches

For each of the Match items found, you can print out the entire match, various groups, or both.
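As one illustration of printing only selected groups, the following sketch runs the three-entry string through the same pattern and prints just the time and site of each entry, using the Value property of Group. The module name LogEntries and the output format are made up for this sketch.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

' Hypothetical module name, used only for this sketch
Module LogEntries
    Sub Main()
        Dim string1 As String = _
            "04:03:27 127.0.0.0 LibertyAssociates.com " & _
            "04:03:28 127.0.0.0 foo.com " & _
            "04:03:29 127.0.0.0 bar.com "

        Dim theReg As New Regex("(?<time>(\d|\:)+)\s" & _
            "(?<ip>(\d|\.)+)\s" & _
            "(?<site>\S+)")

        Dim theMatch As Match
        For Each theMatch In theReg.Matches(string1)
            ' Group.Value holds the text captured by the named group
            Console.WriteLine("{0} -> {1}", _
                theMatch.Groups("time").Value, _
                theMatch.Groups("site").Value)
        Next theMatch
    End Sub
End Module
```

This should print one line per log entry, such as 04:03:28 -> foo.com for the second entry.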

10.4.3 Using CaptureCollection

Each time a Regex object matches a subexpression, a Capture instance is created and added to a CaptureCollection collection. Each capture object represents a single capture. Each group has its own capture collection of the matches for the subexpression associated with the group.

A key property of the Capture object is its length, which is the length of the captured substring. When you ask Match for its length, it is Capture.Length that you retrieve because Match derives from Group, which in turn derives from Capture.

The regular expression inheritance scheme in .NET allows Match to include in its interface the methods and properties of these parent classes. In a sense, a Group is-a Capture: it is a capture that encapsulates the idea of grouping subexpressions. A Match, in turn, is-a Group: it is the encapsulation of all the groups of subexpressions making up the entire match for this regular expression. (See Chapter 5 for more about the is-a relationship and other relationships.)
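The inheritance chain can be checked directly. The sketch below reuses the pattern from Example 10-11 and treats a Match as a Capture; the module name MatchInheritance is invented for illustration.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

' Hypothetical module name, used only for this sketch
Module MatchInheritance
    Sub Main()
        Dim theMatch As Match = _
            Regex.Match("This is a test string", "(\S+)\s")

        ' A Match is-a Group, and a Group is-a Capture
        Console.WriteLine(TypeOf theMatch Is Group)      ' True
        Console.WriteLine(TypeOf theMatch Is Capture)    ' True

        ' Length is declared on Capture and inherited by Match;
        ' the first match is "This" plus the following space
        Dim asCapture As Capture = theMatch
        Console.WriteLine(asCapture.Length)              ' 5
    End Sub
End Module
```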

Typically, you will find only a single Capture in a CaptureCollection; but that need not be so. Consider what would happen if you were parsing a string in which the company name might occur in either of two positions. To group these together in a single match you create the ?<company> group in two places in your regular expression pattern:

Dim regString As String = "(?<time>(\d|\:)+)\s" & _
"(?<company>\S+)\s" & _
"(?<ip>(\d|\.)+)\s" & _
"(?<company>\S+)\s"

This regular expression group captures any matching string of characters that follows time, and also any matching string of characters that follows ip. Given this regular expression, you are ready to parse the following string:

Dim string1 As String = "04:03:27 Jesse 0.0.0.127 Liberty "

The string includes names in both the positions specified. Here is the result:

theMatch: 04:03:27 Jesse 0.0.0.127 Liberty
time: 04:03:27
ip: 0.0.0.127
Company: Liberty

What happened? Why is the Company group showing Liberty? Where is the first term, which also matched? The answer is that the second term overwrote the first. The group, however, has captured both; its Captures collection can show that to you, as illustrated in Example 10-13.

Example 10-13. Captures collection
Option Strict On
Imports System
Imports System.Text
Imports System.Text.RegularExpressions

Namespace RegularExpressions

    Class Tester

        Public Sub Run( )
            Dim string1 As String = _
             "04:03:27 Jesse 0.0.0.127 Liberty  " 

            ' time = one or more digits or colons 
            ' followed by a space
            ' company = one or more nonwhitespace characters
            ' followed by a space (this group appears twice)
            ' ip address = one or more digits or dots 
            ' followed by a space
            Dim regString As String = "(?<time>(\d|\:)+)\s" & _
            "(?<company>\S+)\s" & _
            "(?<ip>(\d|\.)+)\s" & _
            "(?<company>\S+)\s"

            Dim theReg As New Regex(regString)
            Dim theMatches As MatchCollection = theReg.Matches(string1)

            Dim theMatch As Match
            For Each theMatch In theMatches
                If theMatch.Length <> 0 Then
                    Console.WriteLine( _
                        "theMatch: {0}", _
                        theMatch.ToString( ))
                    Console.WriteLine( _
                        "time: {0}", _
                       theMatch.Groups("time"))
                    Console.WriteLine( _
                         "ip: {0}", _
                        theMatch.Groups("ip"))
                    Console.WriteLine( _
                         "Company: {0}", _
                        theMatch.Groups("company"))

                    Dim cap As Capture
                    For Each cap In _
                       theMatch.Groups("company").Captures
                        Console.WriteLine( _
                           "cap: {0}", cap.ToString( ))
                    Next
                End If
            Next theMatch

        End Sub 'Run

        Public Shared Sub Main( )
            Dim t As New Tester( )
            t.Run( )
        End Sub 'Main
    End Class 'Tester
End Namespace 'RegularExpressions

Output:
theMatch: 04:03:27 Jesse 0.0.0.127 Liberty
time: 04:03:27
ip: 0.0.0.127
Company: Liberty
cap: Jesse
cap: Liberty

The inner For Each loop iterates through the Captures collection for the company group:

Dim cap As Capture
For Each cap In _
   theMatch.Groups("company").Captures

Let's review how this line is parsed. The compiler begins by finding the collection that it will iterate. theMatch is an object that has a collection named Groups. The Groups collection has a default property (as explained in the previous chapter) that takes a string and returns a single Group object. Thus, the following line returns a single Group object:

theMatch.Groups("company")

The Group object has a collection named Captures. Thus, the following line returns a Captures collection for the Group stored at Groups("company") within the theMatch object:

theMatch.Groups("company").Captures

The For Each loop iterates over the Captures collection, extracting each element in turn and assigning it to the local variable cap, which is of type Capture. You can see from the output that there are two capture elements: Jesse and Liberty. The second one overwrites the first in the group, and so the displayed value is just Liberty, but by examining the Captures collection you can find both values that were captured.
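The relationship between a group's value and its captures can be sketched as follows. The string and pattern are those of Example 10-13; the module name CaptureDetails is made up, and the use of the Index property to show where each capture was found is an addition for illustration.

```vbnet
Option Strict On
Imports System
Imports System.Text.RegularExpressions

' Hypothetical module name, used only for this sketch
Module CaptureDetails
    Sub Main()
        Dim string1 As String = "04:03:27 Jesse 0.0.0.127 Liberty "
        Dim theReg As New Regex("(?<time>(\d|\:)+)\s" & _
            "(?<company>\S+)\s" & _
            "(?<ip>(\d|\.)+)\s" & _
            "(?<company>\S+)\s")

        Dim company As Group = theReg.Match(string1).Groups("company")

        ' The group's own value is its most recent capture
        Console.WriteLine(company.Value)             ' Liberty
        Console.WriteLine(company.Captures.Count)    ' 2

        ' Each Capture records where it was found as well as what it matched
        Dim cap As Capture
        For Each cap In company.Captures
            Console.WriteLine("{0} at index {1}", cap.Value, cap.Index)
        Next cap
    End Sub
End Module
```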

