Presented by Matt Casto (http://google.com/profiles/mattcasto)
There are many different ways to write regular expressions to achieve the same result. Examples at http://www.regexlive.com
http://del.icio.us/mattcasto/regex for informative websites.
- http://regexpal.com –> JavaScript regular expression tester
- The Regulator – Open Source tool with Intellisense, by Roy Osherove
Stephen Cole Kleene invented regular expressions as a notation for mathematical regular sets
Ken Thompson used regular sets for searching in QED and ed
grep – Global Regular Expression Print
Henry Spencer wrote the regex library that the Perl and Tcl languages used
Why Should You Care? Find duplicate words in a file
- Output lines that contain duplicate words
- Find doubled words that span lines
- Ignore capitalization
- Ignore HTML tags
Using regular expressions reduces the solution from 30+ lines to 5 lines of code
Mastering Regular Expressions by Jeffrey Friedl is very good for examples
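As a rough illustration of the doubled-word problem, here is a minimal Python sketch. It is not the code from the talk or from Friedl's book. The sample text and the names `text`, `doubled` and `found` are made up for the example. It handles capitalization and words that span lines, but it does not attempt to skip HTML tags.

```python
import re

# Hypothetical sample text; the doubled "the" spans a line break.
text = "This is is a test.\nParis in the\nthe spring."

# \b(\w+)  a whole word, captured as group 1
# \s+      whitespace between the words, which may include a newline
# \1\b     the same word again (backreference), ending at a word boundary
doubled = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

# Collect the first word of every doubled pair.
found = [m.group(1) for m in doubled.finditer(text)]
print(found)
```

The backreference is what keeps this short: the pattern itself remembers the word it just saw, so no bookkeeping code is needed.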
Use System.Text.RegularExpressions for .NET help
Literal characters
- Any character except a small list of reserved characters is a literal (is, a, etc.)
- is –> Jack is a boy
- a –> Jack is a boy
- Literal characters ARE case sensitive – capitalization matters!
- RegexOptions.Compiled makes matching much faster because the RegEx is compiled up front rather than interpreted at runtime, at the cost of a slower startup.
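The literal examples above can be tried out with a short Python sketch. Python's `re` module is an assumption here, since the talk targets .NET. In particular, `re.compile` only caches the parsed pattern and is a loose analogue of RegexOptions.Compiled, not an equivalent. The sentence "This is it" is made up for the example.

```python
import re

sentence = "Jack is a boy"

# A literal pattern matches exactly those characters, wherever they occur.
is_match = re.search("is", sentence)
a_matches = re.findall("a", sentence)   # the "a" in "Jack" and the word "a"

# Literals are case sensitive unless told otherwise.
strict = re.search("jack", sentence)                  # None: "J" is not "j"
relaxed = re.search("jack", sentence, re.IGNORECASE)  # matches "Jack"

# Compiling once and reusing the pattern object avoids re-parsing the pattern.
# (Loose Python analogue only; .NET's RegexOptions.Compiled works differently.)
word_is = re.compile("is")
count = len(word_is.findall("This is it"))

print(is_match.span(), a_matches, strict, relaxed.group(), count)
```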
Special characters
- Special characters (like +) are matched by escaping them with a backslash (\)
- + for instance, should be searched for with \+
- Some characters, such as { and } are only reserved depending on context
- Non-Printable characters
- \t –tab
- \r – carriage return
- etc.
- Period character matches any single character except a newline by default (dangerous because of overuse)
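A small Python sketch of escaping, non-printable characters and the period follows. The sample strings are invented for illustration, and `re.escape` is a Python helper (the .NET counterpart would be Regex.Escape).

```python
import re

# Unescaped, + is a quantifier; escaped, it is a literal plus sign.
plus_signs = re.findall(r"\+", "1+1=2")

# re.escape produces an escaped pattern for arbitrary literal text.
escaped = re.escape("1+1")
exact = re.fullmatch(escaped, "1+1")

# Non-printable characters have their own escapes, such as \t for tab.
fields = re.split(r"\t", "name\tage\tcity")

# The period matches almost anything, which is why overusing it is risky.
dot_hits = re.findall(r"gr.y", "grey gray gr8y gr-y")

# By default the period does not match a newline.
dot_newline = re.search(r"a.b", "a\nb")

print(plus_signs, fields, dot_hits, dot_newline)
```

Note how `gr.y` accepts `gr8y` and `gr-y` as readily as `grey`. A character class such as `gr[ae]y` is usually the safer choice.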
- Character classes
- Used to match only one of the characters inside square brackets
- [Gg]r[ae]y –> Grayson drives a grey car
- Hyphen is a reserved character inside a character class, indicates a range
- [0-9a-fA-F] – Matches any single hexadecimal digit
- Caret inside a character class negates the match
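- q[^u] –> Qatar Iraqi Iraq (only the qi in Iraqi matches; Qatar has a capital Q, and the final Iraq is not caught because [^u] must match a character and there is nothing after the q)
- Most special characters (such as + or .) are treated as literals inside of character classes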
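The character class examples above, run through Python's `re` module as a quick check. The variable names and the hex and punctuation sample strings are made up for this sketch.

```python
import re

# One character from each set: G or g, then r, then a or e, then y.
grays = re.findall(r"[Gg]r[ae]y", "Grayson drives a grey car")

# A hyphen inside a class denotes a range; this class is one hex digit.
is_hex = re.fullmatch(r"[0-9a-fA-F]+", "1aF3") is not None
not_hex = re.fullmatch(r"[0-9a-fA-F]+", "xyz") is not None

# A leading caret negates the class, but the class must still consume
# a character, so the q at the very end of the text cannot match.
q_hits = re.findall(r"q[^u]", "Qatar Iraqi Iraq")

# Inside a class, characters such as + * . stand for themselves.
punctuation = re.findall(r"[+*.]", "a+b*c.d")

print(grays, is_hex, not_hex, q_hits, punctuation)
```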
- Shorthand character classes
- [\s] – whitespace: space, tab, CR, LF
- [\w] – word or [A-Za-z0-9_]
- [\d] – digit or [0-9]
- [\D] – non-digit or [^\d]
- [\W] – non-word or [^\w]
- [\S] – non-space or [^\s]
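A brief Python sketch of the shorthand classes. The sample text is invented. One assumption to be aware of: by default Python (like .NET) treats \w, \d and \s as Unicode-aware, so the `re.ASCII` flag is used here to get exactly the ASCII sets listed above.

```python
import re

text = "Room 42,\tfloor_3"

# re.ASCII restricts \w, \d and \s to the ASCII sets given in the notes.
digits = re.findall(r"\d", text, re.ASCII)
words = re.findall(r"\w+", text, re.ASCII)
spaces = re.findall(r"\s", text, re.ASCII)

# The uppercase forms are the negations of the lowercase ones.
non_digits = re.findall(r"\D+", "a1b22c", re.ASCII)
same_negation = (re.findall(r"\W", text, re.ASCII)
                 == re.findall(r"[^\w]", text, re.ASCII))

print(digits, words, spaces, non_digits, same_negation)
```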
- Repetition
- Asterisk repeats the preceding character class 0 or more times
- <[A-Za-z][A-Za-z0-9]*> –> <HTML>
- Plus repeats the preceding character class 1 or more times
- <[A-Za-z0-9]+> Matches <HTML> and <1> but not <>
- Question mark repeats the preceding character class 0 or 1 times, in effect making it optional.
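The three quantifiers can be compared side by side in Python. The tag patterns come from the notes; the `colou?r` pattern and the sample strings are added here purely for illustration.

```python
import re

# * allows zero or more: one letter, then any number of letters or digits.
strict_tag = re.compile(r"<[A-Za-z][A-Za-z0-9]*>")
strict_hits = strict_tag.findall("<HTML> <h1> <1>")

# + requires at least one, so <> is rejected but <1> is accepted.
loose_tag = re.compile(r"<[A-Za-z0-9]+>")
loose_hits = loose_tag.findall("<HTML> <1> <>")

# ? makes the preceding item optional (illustrative pattern).
optional_hits = re.findall(r"colou?r", "color colour")

print(strict_hits, loose_hits, optional_hits)
```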
- Anchors
- Caret anchor matches the position before the first character in a string
- ^vac –> vacation evacuation (doesn’t match evacuation because it’s not at the beginning of the string)
- Dollar sign matches position after the LAST character in a string
- tion$ –> vacation evacuation
- \A and \Z are anchors (not character classes) that match only at the start and end of the whole string, even when multiline mode lets ^ and $ match at line breaks
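A Python sketch of the anchors, using the vacation/evacuation example from the notes plus a two-line variant made up for this example. One hedge: in Python \Z matches only at the very end of the string, while .NET's \Z also allows a single trailing newline, so behavior at a final line break may differ.

```python
import re

text = "vacation evacuation"

# ^ anchors to the start of the string, so only "vacation" qualifies.
start_hit = re.search(r"^vac", text)

# $ anchors to the end, so only the "tion" of "evacuation" qualifies.
end_hit = re.search(r"tion$", text)

# In multiline mode ^ and $ also match at line breaks; \A and \Z do not.
lines = "vacation\nevacuation"
caret_multi = re.findall(r"^\w+", lines, re.MULTILINE)
a_multi = re.findall(r"\A\w+", lines, re.MULTILINE)
dollar_multi = re.findall(r"tion$", lines, re.MULTILINE)
z_multi = re.findall(r"tion\Z", lines, re.MULTILINE)

print(start_hit.start(), end_hit.start(), caret_multi, a_multi)
```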
- Word Boundaries
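- \b is an anchor that matches at a word boundary: the position between a word character (\w) and a non-word character, or the start or end of the string
- \b4\b –> 4 orders of 44lbs of C4 (matches only the standalone 4)
- \B is the negation of \b: it matches at any position that is not a word boundary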
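A short Python check of word boundaries on the sample sentence from the notes. The variable names are made up. The last line shows why the pattern should not be wrapped in square brackets: in Python (and in .NET) \b inside a character class means a backspace character, so the class simply matches every 4.

```python
import re

text = "4 orders of 44lbs of C4"

# \b4\b: a 4 with a word boundary on both sides, i.e. a standalone 4.
standalone = [m.start() for m in re.finditer(r"\b4\b", text)]

# \B4\B: a 4 with word characters on both sides (the second 4 in "44lbs").
embedded = [m.start() for m in re.finditer(r"\B4\B", text)]

# Inside a character class \b is a backspace, so this matches any 4.
class_hits = re.findall(r"[\b4\b]", text)

print(standalone, embedded, len(class_hits))
```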