The Regular Expressions cheat sheet is a one-page reference sheet. It is a guide to patterns in regular expressions, and is not specific to any single language.
There are a few small changes from the first version of the Regular Expressions Cheat Sheet (which you can still download if you prefer). The most obvious change may be that it now looks different. Hopefully it's now clearer and a little easier to find the information you're looking for.
About This Guide
I have included a little more detail in this document where I felt it would be helpful to those less familiar with regular expressions, to demonstrate some of the items on the sheet. Please feel free to let me know if any additions would be helpful.
Please also note that not everything on this sheet will work with every language that has regular expression support. Different languages use regular expressions in different ways, and in some, support is incomplete.
Anchors in regular expressions refer to the start and end of things, such as a string or a word. The anchor characters and symbols on the sheet represent these positions. For example, a pattern that matches a string starting with numbers might be "^[0-9]+", where "^" represents the start of the string.
Without the "^" symbol, the pattern would match any string with a digit in it.
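As a minimal sketch of the difference the anchor makes, here is the same check in Python. The pattern "^[0-9]+" and the variable names are assumptions chosen for illustration; other languages spell the function calls differently.

```python
import re

# Assumed pattern: one or more digits, anchored to the start of the string.
starts_with_digits = re.compile(r"^[0-9]+")
# The same pattern without the anchor matches digits anywhere.
contains_digits = re.compile(r"[0-9]+")

anchored_hit = starts_with_digits.search("123abc")
anchored_miss = starts_with_digits.search("abc123")
unanchored_hit = contains_digits.search("abc123")

print(anchored_hit.group())    # 123
print(anchored_miss)           # None
print(unanchored_hit.group())  # 123
```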
Character Classes in regular expressions match a selection of characters at once. For example, "\d" will match any digit from 0 to 9 inclusive. "\w" will match letters, digits and (in most implementations) the underscore, and "\W" will match everything that "\w" does not. A pattern to identify letters, numbers or whitespace could be "[\w\s]".
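A short Python sketch of a character class in use follows. The sample string is made up for illustration, and note that Python's "\w" also accepts the underscore.

```python
import re

# Assumed pattern: any single letter, digit, underscore or whitespace character.
word_or_space = re.compile(r"[\w\s]")

# The "!" is the only character in the sample that the class rejects.
matches = word_or_space.findall("a1 !_")
print(matches)  # ['a', '1', ' ', '_']
```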
The POSIX classes are quite similar to the idea behind character classes, allowing you to use a shortcut to represent a particular group of characters. Not every regular expression engine supports them.
Almost everyone has some trouble with assertions at first. They are tricky to get to grips with, but once you are familiar with them, you will use them alarmingly often. They provide a way to say "I want to find every word in this document with a q in it, as long as that q isn't followed by 'werty'". A pattern for this could be "[^\s]*q(?!werty)[^\s]*".
The above code starts by matching non-whitespace characters ([^\s]*), then a q (err ... q). Then the parser reaches the lookahead assertion. This makes the q conditional. The q will only be matched if the assertion is true. In this case, the assertion is a negative assertion. It will be true if what it checks for is not found.
So, it checks the next few characters against the pattern it has (werty). If they are found, the assertion is false, and so it will "ignore" the q - it will not match. If it doesn't find "werty", the assertion is true, and the q is matched. It then carries on checking for non-whitespace characters.
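The walk-through above can be tried out with a sketch like the following. The pattern is reconstructed from the description (non-whitespace characters, a q, a negative lookahead for "werty", more non-whitespace characters), and the sample text is made up.

```python
import re

# Reconstructed pattern: a word containing a q that is not followed by "werty".
q_not_werty = re.compile(r"[^\s]*q(?!werty)[^\s]*")

sample = "qwerty quick aqua"
words = q_not_werty.findall(sample)

# "qwerty" is skipped because its only q fails the negative assertion.
print(words)  # ['quick', 'aqua']
```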
Finally, there is a selection of sample patterns. These patterns are intended to allow you to look at how regular expressions might be used in day-to-day work, and the various ways you can use regular expressions. Please note, however, that they will not necessarily work in every language, as each has its own idiosyncrasies and varying support for regular expressions.
Quantifiers allow you to specify a part of a pattern that must be matched a certain number of times. For example, if you wanted to find out if a document contained between 10 and 20 (inclusive) of the letter "a" in a row, you could use the pattern "a{10,20}".
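Here is a minimal Python sketch of a counted quantifier, assuming the pattern "a{10,20}". The test strings are invented for the example.

```python
import re

# Assumed pattern: between 10 and 20 of the letter "a" in a row.
run_of_a = re.compile(r"a{10,20}")

too_short = run_of_a.search("a" * 9)
in_range = run_of_a.search("b" + "a" * 12 + "b")
too_long = run_of_a.search("a" * 25)

print(too_short)              # None
print(len(in_range.group()))  # 12
print(len(too_long.group()))  # 20
```

Note that a run of 25 still produces a match, because it contains a run of 20. If longer runs should be rejected, the pattern needs something on either side to stop it matching inside a longer run.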
Quantifiers are "greedy" by default. So the quantifier "+", which means "one or more", will match as many items as possible. This can be a problem on occasion, so you can tell a quantifier to not be greedy (to be "lazy"), using a modifier. Consider the following pattern, in which the quotation marks are part of the pattern: ".*"
This will match text contained in quotation marks. However, you may have a string like this:
<a href="helloworld.htm" title="Hello World">Hello World</a>
The pattern above will match the following from the above string:
"helloworld.htm" title="Hello World"
It has been too greedy, matching as much text as it could.
A pattern such as ".*?" will also match any characters contained in quotation marks. This non-greedy version (note the "?" modifier) will match as little as possible of the string, so it will match each item in quotation marks separately: "helloworld.htm" and "Hello World".
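The greedy and lazy behaviour described above can be compared side by side. This is only a sketch: the two patterns are assumed from the description, and the string is the anchor tag used in the example above.

```python
import re

html = '<a href="helloworld.htm" title="Hello World">Hello World</a>'

# Greedy: ".*" grabs as much as it can between the first and last quotation mark.
greedy = re.findall(r'".*"', html)
# Lazy: ".*?" stops at the first closing quotation mark it can find.
lazy = re.findall(r'".*?"', html)

print(greedy)  # ['"helloworld.htm" title="Hello World"']
print(lazy)    # ['"helloworld.htm"', '"Hello World"']
```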
Regular expressions use symbols to represent certain things. However, that presents a problem if you want to detect a character in a string where that character is a symbol. A period (".") for example, in a regular expression, represents "any character except the new line character". If you want to find a period in a string, you can't just use "." as a pattern - it will match just about everything. So, you need to tell the parser to treat the period as a literal period rather than a special character. This you do with an escape character.
An escape character precedes the special character and tells the parser to ignore the special meaning of what follows and treat it literally. There are certain characters that will need to be escaped in the majority of patterns and languages, and you can find these characters listed at the bottom right of the cheat sheet.
The pattern to match a period is "\.".
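To see the difference the escape character makes, a small Python sketch with a made-up three-character string:

```python
import re

text = "a.b"

# Unescaped, the period matches any character except a newline.
any_char = re.findall(r".", text)
# Escaped, it matches only a literal period.
literal_period = re.findall(r"\.", text)

print(any_char)        # ['a', '.', 'b']
print(literal_period)  # ['.']
```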
Other special characters in regular expressions represent unusual elements in text. New lines and tabs, for example, can be typed using a keyboard, but are likely to trip up programming languages. The special characters use the escape character as well, to tell the regular expression parser that the following character is to be treated as a special character rather than a normal letter or number.
String replacement is covered in more detail in the "Groups and Ranges" section below. One small point to note, however, is the existence of "passive" (non-capturing) groups, written "(?:...)". These are groups that are ignored for the purposes of replacement: they are not numbered, so they cannot be referenced. This is very useful when you want to match something that requires an "or" section, but don't want it in the replacement.
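A passive group might be used as in the sketch below. The pattern and the sample sentence are hypothetical, and Python writes group references in a replacement as "\1" rather than "$1".

```python
import re

# Hypothetical pattern: the "or" section (Mr|Ms) is passive, so the name
# that follows it becomes group 1 rather than group 2.
title_and_name = re.compile(r"(?:Mr|Ms) (\w+)")

stripped = title_and_name.sub(r"\1", "Mr Smith met Ms Jones")

print(title_and_name.groups)  # 1
print(stripped)               # Smith met Jones
```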
Groups and Ranges
Groups and ranges are very useful. Ranges are perhaps the easiest place to begin. They allow you to specify a selection of characters to match. For example, if you wanted to see if a string contained hexadecimal characters (zero to nine and a to f), you would use the range "[0-9a-f]".
If you wanted to see if a string did not contain the same, you would use a negative range such as "[^0-9a-f]", which in this case will match any character that isn't zero to nine or a to f.
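Both kinds of range can be sketched in Python as follows. The patterns assume lowercase hexadecimal only, as in the description above; adding "A-F" inside the brackets would cover uppercase too.

```python
import re

# Assumed patterns: a range for lowercase hex characters and its negation.
hex_char = re.compile(r"[0-9a-f]")
non_hex_char = re.compile(r"[^0-9a-f]")

sample = "x9fz"
hex_found = hex_char.findall(sample)
non_hex_found = non_hex_char.findall(sample)

print(hex_found)      # ['9', 'f']
print(non_hex_found)  # ['x', 'z']
```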
Groups are essential to regular expressions, and are most often used when you want to use "or" in a pattern, when you want to reference part of a pattern later in the same pattern, or when using regular expression string replacement.
Using "or" is very simple: the pattern "(ab|bc)" will match "ab" or "bc".
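A quick sketch of "or" inside a group, assuming the pattern "(ab|bc)" and an invented sample string:

```python
import re

# Assumed pattern: a group that matches either "ab" or "bc".
ab_or_bc = re.compile(r"(ab|bc)")

# In "abc" the "ab" is found first, so the "bc" overlapping it is not reported.
found = ab_or_bc.findall("abc bcd xyz")

print(found)  # ['ab', 'bc']
```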
If you want to reference a previous group in a regular expression, you would use "\n", where "n" is the number of the group. You might need a pattern to match "aaa" or "bbb", followed by numbers, followed by the same 3 letters. This would be done with groups, using a pattern such as "(aaa|bbb)[0-9]+\1".
The above matches "aaa or bbb", and groups the match with the brackets. This is followed by a pattern for one or more numbers ("[0-9]+"), then finally "\1". The "\1" backreferences the first group, and looks for the same thing. It will match the matched text from the string, not the pattern, so "aaa123bbb" will not match the above pattern, as the "\1" will be looking for "aaa" to follow the numbers.
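The backreference behaviour described above can be checked with a sketch like this one. The pattern is reconstructed from the description, and the test strings are examples only.

```python
import re

# Reconstructed pattern: "aaa" or "bbb", then digits, then the same text again.
repeated = re.compile(r"(aaa|bbb)[0-9]+\1")

same_letters = repeated.search("aaa123aaa")
mixed_letters = repeated.search("aaa123bbb")

print(same_letters.group())  # aaa123aaa
print(mixed_letters)         # None
```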
String replacement is one of the most useful tools of regular expressions. You can use "$n" to reference groups matched with the pattern when replacing text. Let's say you want to make every instance of the word "wish" bold in a block of text. You would use a regular expression replacement function for this, which might look a little like this:
replace(pattern, replacement, subject)
The pattern is first, and would be something like the following (you would need a few extra characters for this specific function): "([^A-Za-z0-9])(wish)([^A-Za-z0-9])"
This will find any instance of the word wish where it is preceded and followed by any non-alphanumeric character.
Your replacement can then be "$1<b>$2</b>$3".
This replacement will replace the whole pattern matched above. We start with the first character matched above ($1) (the first non-alphanumeric one), otherwise we'll be deleting characters from the block of text. The same applies at the end ($3) of the match. In the middle, we add the HTML tags for bold text (though you should use CSS or <strong>, of course), with the second group matched in the pattern ($2).
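Here is how the replacement might look in Python, as a rough sketch. The pattern is reconstructed from the description above, the sample text is made up, and Python's re.sub takes its arguments in the same order as the generic function but writes group references as "\1" rather than "$1".

```python
import re

# Reconstructed pattern: "wish" with a non-alphanumeric character on each side.
wish = re.compile(r"([^A-Za-z0-9])(wish)([^A-Za-z0-9])")

text = "I wish you would wish harder, wishful thinker."
# Put the surrounding characters back (groups 1 and 3) around the bold word.
bolded = wish.sub(r"\1<b>\2</b>\3", text)

print(bolded)
# I <b>wish</b> you would <b>wish</b> harder, wishful thinker.
```

Because the pattern requires a character on each side, a "wish" at the very start or end of the text would not be matched by this version.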
Pattern modifiers are used in several languages, most notably Perl. These allow you to change how the parser works. For example, the "i" modifier will tell the parser to ignore case.
In Perl, regular expressions are delimited by the same character at the beginning and end. This can be almost any character (often "/"), and is used like so: /pattern/
Modifiers would be added at the end of this, like so: /pattern/i
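Languages without Perl-style delimiters usually offer the same modifiers in another form. As a sketch, Python passes them as flags, so the "i" modifier becomes re.IGNORECASE. The sample string is invented.

```python
import re

subject = "A PATTERN here"

# Without a modifier the match is case-sensitive.
case_sensitive = re.search(r"pattern", subject)
# re.IGNORECASE plays the role of the "i" modifier.
case_insensitive = re.search(r"pattern", subject, re.IGNORECASE)

print(case_sensitive)            # None
print(case_insensitive.group())  # PATTERN
```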
Finally, the last section of the cheat sheet lists the meta-characters. These are the characters that have special meaning in regular expressions, so if you want to use them literally, they must be escaped.
So, if you wanted to match text consisting of a bracket, you would need to use the pattern "\(".
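A short Python sketch of escaping a meta-character, using a made-up string. It also shows re.escape, a standard library helper that adds the escape characters for you.

```python
import re

text = "f(x)"

# An escaped bracket matches a literal "(".
open_brackets = re.findall(r"\(", text)
# re.escape produces the escaped form of a literal string.
escaped = re.escape("(")

# An unescaped "(" on its own is an unfinished group, so it fails to compile.
try:
    re.compile("(")
    unescaped_compiles = True
except re.error:
    unescaped_compiles = False

print(open_brackets)       # ['(']
print(escaped)             # \(
print(unescaped_compiles)  # False
```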
The Regular Expressions Cheat Sheet is released under a Creative Commons License (Attribution, Non-Commercial, Share Alike).