Regular expressions in PHP
PHP has three sets of functions that allow you to work with regular expressions. Programmers do not use these powerful functions often, because creating patterns seems difficult. It is also not easy to find a basic, simple regular expression tutorial on a single web page. So I would like to give it a try: collect the information in one place and make it easy and interesting to learn. I will also include some very useful regular expressions at the end of this post.
About regular expressions
A regular expression is a pattern that can match various strings. Regular expressions became widely known through Unix tools such as ed and grep, where they were introduced to make string operations easier. They are really useful in PHP programming because they can replace a lot of hand-written string-handling code. Simple examples are validating an email address or a phone number.
Different sets of regular expressions
PHP provides three main sets of regular expression functions.
1. preg – The preg functions take the regular expression in Perl syntax, which means the pattern is enclosed in delimiters, usually slashes (/pattern/). To include a literal slash (/) in such a pattern, escape it with a backslash (\/). The examples in this post use the preg functions.
2. ereg – The ereg functions take the regular expression as a plain string in POSIX extended syntax, without delimiters. They were deprecated in PHP 5.3 and removed in PHP 7.0, so new code should use the preg functions instead.
3. mb_ereg – These are very similar to the ereg functions, but where ereg treats a string as a series of 8-bit characters, mb_ereg can work with multibyte characters.
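To make the Perl-syntax point concrete, here is a minimal sketch of how delimiters and escaping work with preg_match(). The sample path and the variable names are made up for illustration.

```php
<?php
// A preg pattern is written between delimiters; "/" is the most common choice.
$path = 'images/2010/photo.jpg'; // illustrative sample string

// The slashes inside the pattern are escaped as \/ because "/" is the delimiter.
$withSlashDelimiter = preg_match('/images\/[0-9]{4}\//', $path);

// Choosing another delimiter, such as "#", avoids the escaping.
$withHashDelimiter = preg_match('#images/[0-9]{4}/#', $path);

echo $withSlashDelimiter, ' ', $withHashDelimiter;
// output : 1 1
```

Both calls test the same thing; picking a delimiter that does not occur in the pattern simply keeps the pattern easier to read.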
Enough theory, let’s get practical.
Operators and purposes
Operator | Purpose |
. (period) | Match any single character |
^ (caret) | Match at the beginning of a line or string |
$ (dollar sign) | Match at the end of a line or string |
A | Match an uppercase letter A |
a | Match a lowercase letter a |
| (pipe) | OR operator: match the expression on either side |
\d | Match any single digit |
\D | Match any single non-digit character |
\w | Match any single alphanumeric character or underscore |
[A-Z] | Match any of uppercase A to Z |
[^A-Z] | Match any character except uppercase A to Z |
[0-9] | Match any digit from 0-9 |
[^0-9] | Match any character that is not a digit |
X? | Match zero or one capital letter X |
X* | Match zero or more capital Xes |
X+ | Match one or more capital Xes |
X{n} | Match exactly n capital Xes (e.g. A{2}) |
X{n,m} | Match at least n and no more than m capital Xes; if you omit m (X{n,}), the expression matches n or more Xes |
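The shorthand classes for digits, non-digits and word characters are written with a leading backslash in a preg pattern (\d, \D, \w). Here is a small sketch of them in action; the sample string is my own.

```php
<?php
$sample = 'Room 42'; // illustrative sample string

// \d matches one digit; preg_match() stores the first match in $m[0].
preg_match('/\d/', $sample, $m);
$firstDigit = $m[0]; // "4"

// \D matches one non-digit character.
preg_match('/\D/', $sample, $m);
$firstNonDigit = $m[0]; // "R"

// \w+ matches a run of letters, digits or underscores.
preg_match_all('/\w+/', $sample, $all);
$words = $all[0]; // ["Room", "42"]

echo $firstDigit, ' ', $firstNonDigit, ' ', implode(',', $words);
// output : 4 R Room,42
```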
Basic syntax explanation
I will explain the operators and syntax that are used most often.
The use of “^” and “$” (start with and end with)
“^Aaa” – matches any string that starts with “Aaa”;
“aa test$” – matches a string that ends in the substring “aa test”;
“^abc$” – a string that starts and ends with “abc” — that could only be “abc” itself!
“abcd” – a string that has the text “abcd” in it.
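The four cases above can be checked with preg_match(), which returns 1 when the pattern matches and 0 when it does not. This is only a sketch with sample strings of my own.

```php
<?php
$startsWith = preg_match('/^Aaa/', 'Aaa bbb');             // 1: starts with "Aaa"
$notAtStart = preg_match('/^Aaa/', 'bbb Aaa');             // 0: "Aaa" is not at the start
$endsWith   = preg_match('/aa test$/', 'this is aa test'); // 1: ends with "aa test"
$exact      = preg_match('/^abc$/', 'abc');                // 1: the whole string is "abc"
$notExact   = preg_match('/^abc$/', 'abcd');               // 0: extra character at the end
$contains   = preg_match('/abcd/', 'xxabcdxx');            // 1: found anywhere in the string

echo "$startsWith $notAtStart $endsWith $exact $notExact $contains";
// output : 1 0 1 1 0 1
```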
The use of “*”,”+” and ”?” (zero or more, one or more and zero or one)
“ab*” – matches a string that has an ‘a’ followed by zero or more b’s (“a”, “ab”, “abbb”, etc.);
“ab+” – same, but there must be at least one ‘b’ (“ab”, “abbb”, etc.);
“ab?” – matches a string that has an ‘a’ followed by zero or one ‘b’ (“a” or “ab”);
“a?b+$” – a possible ‘a’ followed by one or more ‘b’s ending a string.
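To see how the three quantifiers differ, the sketch below runs each one against the same sample strings. I have anchored the patterns with ^ and $ (my own addition) so that the whole string has to fit; without the anchors, all three would find a match somewhere inside “abbb”.

```php
<?php
$subjects = ['a', 'ab', 'abbb'];
$patterns = ['/^ab*$/', '/^ab+$/', '/^ab?$/'];

$results = [];
foreach ($patterns as $pattern) {
    $row = [];
    foreach ($subjects as $subject) {
        $row[] = preg_match($pattern, $subject); // 1 = match, 0 = no match
    }
    $results[$pattern] = implode(' ', $row);
    echo $pattern, ' : ', $results[$pattern], "\n";
}
// output (columns are a, ab, abbb):
// /^ab*$/ : 1 1 1
// /^ab+$/ : 0 1 1
// /^ab?$/ : 1 1 0
```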
Specify range of number of occurrences
“ab{2}” – matches a string that has an a followed by exactly two b’s (“abb”);
“ab{2,}” – there are at least two b’s (“abb”, “abbbb”, etc.);
“ab{3,5}” – from three to five b’s (“abbb”, “abbbb”, or “abbbbb”).
Note: always specify the first value of a range (i.e. “{0,2}”, not “{,2}”).
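Here is a quick sketch of the range syntax, again with anchors added (my own choice) so that extra b’s cause a failure instead of being ignored.

```php
<?php
$exactlyTwo = preg_match('/^ab{2}$/', 'abb');        // 1: exactly two b's
$threeBs    = preg_match('/^ab{2}$/', 'abbb');       // 0: three b's is one too many
$atLeastTwo = preg_match('/^ab{2,}$/', 'abbbb');     // 1: two or more b's
$inRange    = preg_match('/^ab{3,5}$/', 'abbbb');    // 1: four b's is within 3 to 5
$tooMany    = preg_match('/^ab{3,5}$/', 'abbbbbb');  // 0: six b's is out of range
$upToTwo    = preg_match('/^ab{0,2}$/', 'a');        // 1: zero b's is allowed

echo "$exactlyTwo $threeBs $atLeastTwo $inRange $tooMany $upToTwo";
// output : 1 0 1 1 0 1
```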
Specify range of occurrences of a sequence
“a(bc)*” – matches a string that has an a followed by zero or more copies of the sequence “bc”;
“a(bc){1,5}” – one to five occurrences of “bc”.
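Parentheses make the quantifier apply to the whole sequence instead of a single character. A small sketch, with anchors and sample strings of my own:

```php
<?php
$zeroCopies  = preg_match('/^a(bc)*$/', 'a');         // 1: zero copies of "bc"
$threeCopies = preg_match('/^a(bc)*$/', 'abcbcbc');   // 1: three copies of "bc"
$halfCopy    = preg_match('/^a(bc)*$/', 'abcb');      // 0: the trailing "b" is not a full "bc"
$withinLimit = preg_match('/^a(bc){1,5}$/', 'abcbc'); // 1: two copies is within 1 to 5
$needsOne    = preg_match('/^a(bc){1,5}$/', 'a');     // 0: at least one "bc" is required

echo "$zeroCopies $threeCopies $halfCopy $withinLimit $needsOne";
// output : 1 1 0 1 0
```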
Using OR (|) operator
“euro|dollar” – matches a string that has either “euro” or “dollar” in it;
“(b|cd)ef” – a string that has either “bef” or “cdef”;
“(a|b)*c” – a string that has any sequence of a’s and b’s (in any order, possibly empty) followed by a c;
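The sketch below tries the OR operator with preg_match(). It also uses the optional third argument, which receives the matched text, to show which branch of the alternation was taken. The sample strings are my own.

```php
<?php
// Returns 1: the subject contains "dollar".
$hasCurrency = preg_match('/euro|dollar/', 'price in dollars');

// $m[0] is the whole match, $m[1] is the part captured by the parentheses.
preg_match('/(b|cd)ef/', 'abcdefg', $m);
$whole  = $m[0]; // "cdef"
$branch = $m[1]; // "cd"

// (a|b)* accepts a's and b's in any order, including none at all.
$mixed = preg_match('/^(a|b)*c$/', 'abbac'); // 1
$onlyC = preg_match('/^(a|b)*c$/', 'c');     // 1
$other = preg_match('/^(a|b)*c$/', 'abxc');  // 0: "x" is neither a nor b

echo "$hasCurrency $whole $branch $mixed $onlyC $other";
// output : 1 cdef cd 1 1 0
```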
Using period(.) operator
“a.[0-9]” – matches a string that has an a followed by one character and a digit;
“^.{3}$” – a string with exactly 3 characters.
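A short sketch of the period operator follows. The last two lines show that a literal dot has to be escaped as \. when you really mean a dot; the sample strings are my own.

```php
<?php
$anyThenDigit = preg_match('/a.[0-9]/', 'ax7');  // 1: "x" fills the "." position
$noMiddleChar = preg_match('/a.[0-9]/', 'a7');   // 0: nothing is left for [0-9]
$threeChars   = preg_match('/^.{3}$/', 'abc');   // 1: exactly three characters
$fourChars    = preg_match('/^.{3}$/', 'abcd');  // 0: four characters
$literalDot   = preg_match('/^a\.b$/', 'a.b');   // 1: \. matches a real dot
$notADot      = preg_match('/^a\.b$/', 'axb');   // 0: \. does not match "x"

echo "$anyThenDigit $noMiddleChar $threeChars $fourChars $literalDot $notADot";
// output : 1 0 1 0 1 0
```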
Bracket (“[]”) expressions
They specify which characters are allowed in a single position of a string
“[ab]” – matches a string that has either an a or a b (that’s the same as “a|b”);
“^[a-zA-Z]” – a string that starts with a letter;
“[0-9]%” – a string that has a single digit before a percent sign;
“,[a-zA-Z0-9]$” – a string that ends in a comma followed by an alphanumeric character.
Inside a bracket expression, a “^” placed as the first character negates the list, i.e. the expression matches any character that is NOT in the list.
“[^a-zA-Z0-9]” – matches a string that has a character other than a letter or a digit.
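The bracket expressions above can be tried out like this. It is only a sketch, and the sample strings are made up.

```php
<?php
$startsWithLetter = preg_match('/^[a-zA-Z]/', 'hello 1'); // 1
$startsWithDigit  = preg_match('/^[a-zA-Z]/', '1 hello'); // 0

// The third argument receives the matched text: the digit right before "%".
preg_match('/[0-9]%/', 'about 75% done', $m);
$percent = $m[0]; // "5%"

$endsCommaAlnum = preg_match('/,[a-zA-Z0-9]$/', 'x,y');   // 1

// A negated list finds characters that are NOT letters or digits.
$hasSpecial = preg_match('/[^a-zA-Z0-9]/', 'abc_123');    // 1: the underscore
$noSpecial  = preg_match('/[^a-zA-Z0-9]/', 'abc123');     // 0

echo "$startsWithLetter $startsWithDigit $percent $endsCommaAlnum $hasSpecial $noSpecial";
// output : 1 0 5% 1 1 0
```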
Some useful regular expression patterns
i. Regular expression pattern to match or replace the special characters in a string. It is really helpful when you want to rename files so that they contain no special characters, or to check whether a string contains special characters. The % signs at both ends of the pattern are the delimiters.
pattern : “%[^a-zA-Z0-9]%”
code sample:
<?php
$string = '$%abcd*.06';
// Replace every character that is not a letter or a digit with an underscore.
echo preg_replace('%[^a-zA-Z0-9]%', '_', $string);
//output : __abcd__06
?>
If you want to exempt some special characters from replacement, just include them inside the brackets after the “^” symbol.
pattern : “%[^a-zA-Z0-9.$]%”
code sample:
<?php
$string = '$%abcd*.06';
// The dot and the dollar sign are now in the list, so they are left alone.
echo preg_replace('%[^a-zA-Z0-9.$]%', '_', $string);
//output : $_abcd_.06
?>
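For the file renaming case mentioned above, the same idea can be wrapped in a small function. This is only a sketch: sanitizeFileName() is a hypothetical helper name of my own, the file name is made up, and the type declarations assume PHP 7 or later.

```php
<?php
// sanitizeFileName() is a hypothetical helper, not a built-in PHP function.
function sanitizeFileName(string $name): string
{
    // Keep letters, digits and the dot; replace every other character with "_".
    return preg_replace('%[^a-zA-Z0-9.]%', '_', $name);
}

$clean = sanitizeFileName('my report (final).pdf'); // made-up file name
echo $clean;
// output : my_report__final_.pdf
```

The dot is kept in the list so that the file extension survives the replacement.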
ii. Regular expression patterns to match a valid email address
pattern : “^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$”
code sample :
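<?php
$email = 'user@example.com'; // illustrative placeholder address
// preg_match() returns 1 on a match and 0 otherwise; eregi() was removed in PHP 7.
echo preg_match('/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/', $email);
//output : 1
?>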
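To get a feel for what this pattern accepts and rejects, here is a sketch that runs it against a few made-up addresses. isValidEmail() is a hypothetical helper name of my own, the dot before the ending is assumed to be escaped as \., and the type declarations assume PHP 7 or later.

```php
<?php
// isValidEmail() is a hypothetical helper, not a built-in PHP function.
function isValidEmail(string $email): bool
{
    $pattern = '/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/';
    return preg_match($pattern, $email) === 1;
}

// All addresses below are made-up examples.
$checks = [
    'user@example.com'     => isValidEmail('user@example.com'),     // true
    'user.name@mail.co.uk' => isValidEmail('user.name@mail.co.uk'), // true
    'user@example'         => isValidEmail('user@example'),         // false: no dot after the domain
    'user@@example.com'    => isValidEmail('user@@example.com'),    // false: two @ signs
    'user@example.museum'  => isValidEmail('user@example.museum'),  // false: ending longer than 5 characters
];

foreach ($checks as $address => $ok) {
    echo $address, ' : ', $ok ? 'valid' : 'invalid', "\n";
}
```

The last case shows the limit of a simple pattern like this one: it is a quick sanity check and rejects some addresses that are perfectly real.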
I will add some more commonly used patterns soon.