Introduction to Regular Expressions Session Notes

By riceboyler on Aug 14 2009 | 0 Comments

Presented by Matt Casto (http://google.com/profiles/mattcasto)

There are many different ways to write a regular expression that achieves the same result. Examples at http://www.regexlive.com

See http://del.icio.us/mattcasto/regex for informative websites.

  • http://regexpal.com –> JavaScript regular expression tester
  • The Regulator – open source tool with IntelliSense, by Roy Osherove

Stephen Cole Kleene invented regular expressions as a notation for mathematical regular sets

Ken Thompson used regular expressions for searching in the QED and ed editors

grep – Global Regular Expression Print

Henry Spencer wrote the regex library that Perl and Tcl used

Why Should You Care? Example: find duplicate words in a file

  • Output lines that contain duplicate words
  • Find doubled words that span lines
  • Ignore capitalization
  • Ignore HTML tags

Using regular expressions reduces the solution from 30+ lines of code to 5
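
The four requirements above can be sketched in a few lines. This is a minimal Python illustration, not the solution shown in the session: the pattern, the function name find_doubled_words, and the sample text are all assumptions made for this example, and it returns the doubled words rather than printing whole lines.

```python
import re

# Hypothetical pattern for this sketch: a word, then a gap made of whitespace
# and/or HTML tags, then the same word again (\1 is a backreference).
# re.IGNORECASE ignores capitalization; \s also matches newlines, so doubled
# words that span lines are found as well.
DOUBLED = re.compile(r"\b(\w+)((?:\s|<[^>]+>)+)\1\b", re.IGNORECASE)

def find_doubled_words(text):
    """Return the first copy of every doubled word found in text."""
    return [m.group(1) for m in DOUBLED.finditer(text)]

sample = "This is is a test.\nThe <b>the</b> cat sat on\non the mat."
print(find_doubled_words(sample))  # ['is', 'The', 'on']
```

The \b at each end keeps the "is" inside "This" from counting as the first half of a pair.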

Mastering Regular Expressions by Jeffrey Friedl is very good for examples

For .NET, use the System.Text.RegularExpressions namespace

Literal characters

  • A literal is any character except a small list of reserved characters, and it matches itself (is, a, etc.)
    • is –> Jack is a boy
    • a –> Jack is a boy
  • Literal characters ARE case sensitive – capitalization matters!
  • RegexOptions.Compiled can make matching much faster because it compiles the regex to IL instead of interpreting it at runtime, at the cost of a slower startup.
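
A small sketch of literal matching and case sensitivity, using the sample sentence from the notes. It is written with Python's re module rather than the .NET classes, so the flag name re.IGNORECASE and the variable names are assumptions of this example. Note that Python's re.compile only parses the pattern once for reuse; it is not a direct equivalent of RegexOptions.Compiled.

```python
import re

text = "Jack is a boy"

# A literal pattern matches itself at the first place it occurs.
print(re.search(r"is", text).span())   # (5, 7)
print(re.search(r"a", text).span())    # (1, 2): the "a" in "Jack", not the word "a"

# Literals are case sensitive unless a flag says otherwise.
print(re.search(r"jack", text))                          # None
print(re.search(r"jack", text, re.IGNORECASE).group())   # Jack

# Parse the pattern once and reuse the resulting object for many searches.
word_is = re.compile(r"is")
print(word_is.findall("This is his"))  # ['is', 'is', 'is']
```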

Special characters

  • To match a special character (like +) literally, escape it with a backslash (\)
    • For instance, search for + with \+
  • Some characters, such as { and }, are reserved only in certain contexts
  • Non-Printable characters
    • \t – tab
    • \r – carriage return
    • etc.
  • Period character matches any single character except, by default, a newline (dangerous because it is easily overused)
    • a.boy –> Jack is a boy
  • Character classes
    • Used to match only one of the characters inside square brackets
      • [Gg]r[ae]y –> Grayson drives a grey car
    • Hyphen is a reserved character inside a character class; it indicates a range
      • [0-9a-fA-F] – matches any single hex digit
    • Caret at the start of a character class negates the match
      • q[^u] – Qatar Iraqi Iraq (matches only the qi in Iraqi: Qatar has a capital Q, and the final Iraq is not caught because [^u] must still match a character after the q)
    • Most special characters are treated as literals inside a character class (only ], \, ^ and - are reserved there)
    • Shorthand character classes
      • [\s] – whitespace: space, tab, CR, LF
      • [\w] – word character, or [A-Za-z0-9_]
      • [\d] – digit, or [0-9]
      • [\D] – non-digit, or [^\d]
      • [\W] – non-word character, or [^\w]
      • [\S] – non-whitespace, or [^\s]
  • Repetition
    • Asterisk repeats the preceding character or character class 0 or more times
      • <[A-Za-z][A-Za-z0-9]*> –> <HTML>
    • Plus repeats the preceding character or character class 1 or more times
      • <[A-Za-z0-9]+> –> matches <HTML> and <1> but not <>
    • Question mark repeats the preceding character or character class 0 or 1 times, in effect making it optional
      • colou?r –> matches color and colour
  • Anchors
    • Caret anchor matches the position before the first character in a string
      • ^vac –> vacation evacuation (doesn’t match evacuation because it’s not at the beginning of the string)
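    • Dollar sign anchor matches the position after the last character in a string
      • tion$ –> vacation evacuation (matches only the tion in evacuation, because vacation is not at the end of the string)
    • \A and \Z are anchors, not character classes; they match only at the start and end of the string, even when multiline mode changes the meaning of ^ and $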
  • Word Boundaries
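    • \b is an anchor that matches at a word boundary: the position between a word character (\w) and a non-word character, or at the start or end of the string next to a word character
      • \b4\b –> 4 orders of 44lbs of C4 (matches only the first, standalone 4; do not put \b in square brackets, where it means backspace)
    • \B is the negation of \b: it matches at any position that is not a word boundary
      • \Bat\B –> at cat battery (matches only the at inside battery)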
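
The examples in this list can be checked with a short script. This is a sketch in Python's re module rather than .NET; for the basic features shown here the two behave the same way. The sample strings are taken from the notes where they exist; "1+1=2", "<HTML> <1> <>" and "at cat battery" are made up for this illustration.

```python
import re

# Escaping: \+ matches a literal plus sign.
print(re.findall(r"\+", "1+1=2"))                               # ['+']

# Period matches any single character (except a newline by default).
print(re.findall(r"a.boy", "Jack is a boy"))                    # ['a boy']

# Character classes, and a negated class that still needs a character to match.
print(re.findall(r"[Gg]r[ae]y", "Grayson drives a grey car"))   # ['Gray', 'grey']
print(re.findall(r"q[^u]", "Qatar Iraqi Iraq"))                 # ['qi']

# Repetition: + needs at least one character between the angle brackets.
print(re.findall(r"<[A-Za-z0-9]+>", "<HTML> <1> <>"))           # ['<HTML>', '<1>']

# Anchors: ^ matches only at the start, $ only at the end.
print(re.findall(r"^vac", "vacation evacuation"))               # ['vac']
print(re.search(r"tion$", "vacation evacuation").start())       # 15

# Word boundaries: \b is written outside square brackets
# (inside a character class, \b means backspace).
print([m.start() for m in re.finditer(r"\b4\b", "4 orders of 44lbs of C4")])  # [0]
print([m.start() for m in re.finditer(r"\Bat\B", "at cat battery")])          # [8]
```

The printed offsets are zero-based positions in the sample string, which makes it easy to see which occurrence matched.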
