Writing regular expressions involves more than learning the mechanics. You not only have to learn how to describe patterns, you also have to recognize the context in which they appear. You have to be able to think through the level of detail that is necessary in a regular expression, based on the context in which the pattern will be applied.
The same thing that makes writing regular expressions difficult is what makes writing them interesting: the variety of occurrences or contexts in which a pattern appears. This complexity is inherent in language itself, just as you can't always understand an expression by looking up each word in the dictionary.
The process of writing a regular expression involves three steps:
1. Knowing what it is you want to match and how it might appear in the text.
2. Writing a pattern to describe what you want to match.
3. Testing the pattern to see what it matches.
This is virtually the same process that a programmer follows to develop a program. Step 1 might be considered the specification, which should reflect an understanding of the problem to be solved as well as how to solve it. Step 2 is analogous to the actual coding of the program, and step 3 involves running the program and testing it against the specification. Steps 2 and 3 form a loop that is repeated until the program works satisfactorily.
Testing your description of what you want to match ensures that the description works as expected. It usually uncovers a few surprises. Carefully examining the results of a test, comparing the output against the input, will greatly improve your understanding of regular expressions. You might consider evaluating the results of a pattern-matching operation as follows:
Hits: The lines that I wanted to match.
Misses: The lines that I didn't want to match.
Omissions: The lines that I didn't match but wanted to match.
False alarms: The lines that I matched but didn't want to match.
Trying to perfect your description of a pattern is something that you work at from opposite ends: you try to eliminate the "hits that should be misses" by limiting the possible matches and you try to capture the "misses that should be hits" by expanding the possible matches.
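One way to make this evaluation concrete is to sort the output of a test run into the four outcomes automatically. The following is only a minimal sketch in Python: the function name `evaluate`, the category names in the dictionary, and the sample lines are all assumptions made for illustration, and it presumes you can state in advance which lines you want to match.

```python
import re

def evaluate(pattern, lines, wanted):
    """Sort each line into one of the four outcomes of a test run.

    `wanted` is the set of lines we hoped to match (an assumption of
    this sketch: the desired result is known ahead of time).
    """
    regex = re.compile(pattern)
    results = {"hits": [], "misses": [], "omissions": [], "false_alarms": []}
    for line in lines:
        matched = regex.search(line) is not None
        if matched and line in wanted:
            results["hits"].append(line)          # matched, and wanted
        elif not matched and line not in wanted:
            results["misses"].append(line)        # not matched, not wanted
        elif not matched and line in wanted:
            results["omissions"].append(line)     # wanted, but not matched
        else:
            results["false_alarms"].append(line)  # matched, but not wanted
    return results

# Hypothetical sample input for illustration only.
sample_lines = [
    "what is a regular expression?",
    "What does this pattern match?",
    "that is all",
    "nothing to see here",
]
wanted_lines = {sample_lines[0], sample_lines[1]}

too_narrow = evaluate("what", sample_lines, wanted_lines)
too_broad = evaluate("hat", sample_lines, wanted_lines)
print(too_narrow["omissions"])
print(too_broad["false_alarms"])
```

On this sample, the pattern `what` leaves the capitalized line as an omission, while the looser pattern `hat` picks it up at the cost of one false alarm. Working from both ends means adjusting the pattern until both of those lists are empty.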
The difficulty is especially apparent when you must describe patterns using fixed strings. Each character you remove from the fixed-string pattern increases the number of possible matches. For instance, while searching for the string "what," you determine that you'd like to match "What" as well. The only fixed-string pattern that will match both "What" and "what" is "hat," the longest string common to both. It is obvious, though, that searching for "hat" will produce unwanted matches. Each character you add to a fixed-string pattern decreases the number of possible matches. The string "them" is going to produce fewer matches than the string "the."
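The trade-off with fixed strings is easy to observe by counting matches. The sketch below is illustrative only: the sample lines and the helper `count_matching` are made up for this example, and the search strings are simply shorter and longer variants of one another.

```python
# Hypothetical sample input for illustration only.
text_lines = [
    "what is the plan?",
    "What about them?",
    "that hat is theirs",
    "ask them later",
]

def count_matching(fixed, lines):
    """Count the lines that contain the fixed string anywhere."""
    return sum(1 for line in lines if fixed in line)

# Shorter fixed strings match more lines; longer ones match fewer.
for fixed in ("what", "hat", "the", "them"):
    print(fixed, count_matching(fixed, text_lines))
```

On this sample, dropping one character from `what` to get `hat` raises the count from one line to three, and adding one character to `the` to get `them` lowers it from four lines to two.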
Using metacharacters in patterns provides greater flexibility in extending or narrowing the range of matches. Metacharacters, used in combination with literals or other metacharacters, can be used to expand the range of matches while still eliminating the matches that you do not want.
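As a rough illustration of that flexibility, a bracket expression can accept either capitalization of a word without giving up the rest of the literal, and a word-boundary anchor can then narrow the result again. This sketch uses Python's `re` module rather than any particular Unix tool, and the sample lines and variable names are assumptions chosen for the example.

```python
import re

# Hypothetical sample input for illustration only.
lines = [
    "what is the plan?",
    "What about them?",
    "that hat is theirs",
    "somewhat later",
]

# [Ww] expands the match to both capitalizations while keeping "hat"
# from matching on its own.
broad = re.compile(r"[Ww]hat")

# \b narrows the match to the whole word, excluding "somewhat".
narrow = re.compile(r"\b[Ww]hat\b")

broad_matches = [line for line in lines if broad.search(line)]
narrow_matches = [line for line in lines if narrow.search(line)]

print(broad_matches)
print(narrow_matches)
```

Here the first pattern matches the two lines you want plus "somewhat later," and the second matches only the two lines that contain the word by itself. Neither pattern matches "that hat is theirs," which a fixed-string search for the common substring would have picked up.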
- from O'Reilly & Associates' sed & awk