Regular expressions in PHP

PHP has three sets of functions for working with regular expressions. Programmers often avoid these powerful functions because patterns seem difficult to write, and it is not easy to find a basic, simple regular expression tutorial on a single web page. This post is my attempt to collect that information and make it easy and interesting to learn. I also include some very useful regular expressions at the end of this post.

About regular expressions

A regular expression is a pattern that can match various strings. Regular expressions became widely used through Unix tools such as grep and sed, where they make string operations easier. They are really useful in PHP programming because they can replace a lot of hand-written string-handling code. A simple example is validating an email address or a phone number.

Different sets of regular expressions

PHP provides three main sets of regular expression functions.
1.    preg – All the preg functions require the regular expression to be written in Perl syntax, which means the pattern is enclosed in delimiters, usually forward slashes (for example “/abc/”). If you want to include a slash (/) in the expression itself, escape it with a backslash (\/). (I use the preg functions in my examples.)
2.    ereg – The ereg functions take the regular expression as a plain string, without delimiters. Note that the ereg functions were deprecated in PHP 5.3 and removed in PHP 7, so new code should use the preg functions instead.
3.    mb_ereg – These are very similar to the ereg functions, but while ereg treats a string as a series of 8-bit characters, mb_ereg can work with multibyte characters.

Enough theory, let’s get practical.
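
To make the delimiter rule concrete, here is a minimal sketch. The sample path is made up for illustration; the point is only that a slash inside a slash-delimited pattern needs a backslash, while a different delimiter avoids the escaping.

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$path = 'images/2010/photo.jpg';

// With "/" as the delimiter, a literal slash in the pattern is written as "\/".
$withSlashDelimiter = preg_match('/^images\/\d{4}\//', $path);

// Choosing another delimiter (here "#") avoids the escaping altogether.
$withHashDelimiter = preg_match('#^images/\d{4}/#', $path);

var_dump($withSlashDelimiter, $withHashDelimiter); // int(1) int(1)
```

Both calls test the same thing, so which delimiter to pick is mostly a matter of readability.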

Operators and purposes

Operator          Purpose
. (period)        Match any single character (except a newline, by default)
^ (caret)         Match at the beginning of a line or string
$ (dollar sign)   Match at the end of a line or string
A                 Match an uppercase letter A
a                 Match a lowercase letter a
|                 OR operator
\d                Match any single digit
\D                Match any single non-digit character
\w                Match any single alphanumeric character or underscore
[A-Z]             Match any uppercase letter from A to Z
[^A-Z]            Match any character except uppercase A to Z
[0-9]             Match any digit from 0 to 9
[^0-9]            Match any character that is not a digit from 0 to 9
X?                Match zero or one capital letter X
X*                Match zero or more capital Xes
X+                Match one or more capital Xes
X{n}              Match exactly n capital Xes (e.g. A{2})
X{n,m}            Match at least n and no more than m capital Xes; if you omit m (X{n,}), the expression matches at least n Xes
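
As a quick illustration of the shorthand classes \d, \D and \w, the sketch below pulls the first match of each out of a made-up string. The sample text and variable names are my own assumptions, not part of any API.

```php
<?php
// Hypothetical sample string, for illustration only.
$text = 'Room 42B';

// \d+ : the first run of one or more digits.
preg_match('/\d+/', $text, $digits);

// \D+ : the first run of one or more non-digit characters.
preg_match('/\D+/', $text, $nonDigits);

// \w+ : the first run of "word" characters (letters, digits, underscore).
preg_match('/\w+/', $text, $word);

echo $digits[0], "\n";    // 42
echo $nonDigits[0], "\n"; // "Room " (including the trailing space)
echo $word[0], "\n";      // Room
```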

Basic syntax explanation

I will explain the operators and syntax that are used most often.

The use of “^” and “$” (start with and end with)

“^Aaa” – matches any string that starts with “Aaa”;

“aa test$” – matches a string that ends with the substring “aa test”;

“^abc$” – matches a string that starts and ends with “abc”, which can only be “abc” itself (PCRE also tolerates a single trailing newline);

“abcd” – matches a string that has the text “abcd” anywhere in it.
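
The four anchor patterns can be tried out with preg_match(), as in this small sketch. The test strings are invented for illustration, and each pattern is wrapped in “/” delimiters as the preg functions expect.

```php
<?php
// Each pattern from the list above, wrapped in "/" delimiters for preg_match().
$startsWithAaa = preg_match('/^Aaa/', 'Aaa bbb');             // 1: starts with "Aaa"
$endsWithTest  = preg_match('/aa test$/', 'this is aa test'); // 1: ends with "aa test"
$exactlyAbc    = preg_match('/^abc$/', 'abcd');               // 0: the extra "d" breaks it
$containsAbcd  = preg_match('/abcd/', 'xxabcdxx');            // 1: found in the middle

var_dump($startsWithAaa, $endsWithTest, $exactlyAbc, $containsAbcd);
```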

The use of “*”, “+” and “?” (zero or more, one or more, and zero or one)

“ab*” –  matches a string that has an ‘a’ followed by zero or more b’s (“a”, “ab”, “abbb”, etc.);

“ab+” – matches a string that has an ‘a’ followed by one or more b’s (“ab”, “abbb”, etc.);

“ab?” – an ‘a’ followed by zero or one ‘b’;

“a?b+$” – an optional ‘a’ followed by one or more b’s at the end of a string.
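
One way to see the difference between the three quantifiers is to filter a list of candidates with preg_grep(). This is only a sketch: the candidate strings are made up, and I anchor every pattern with ^ and $ so that the whole string has to fit.

```php
<?php
// Hypothetical candidate strings, chosen only for illustration.
$candidates = ['a', 'ab', 'abbb', 'b', 'xyz'];

// preg_grep() returns the array entries that match; array_values() reindexes them.
$zeroOrMore = array_values(preg_grep('/^ab*$/', $candidates));  // a, ab, abbb
$oneOrMore  = array_values(preg_grep('/^ab+$/', $candidates));  // ab, abbb
$zeroOrOne  = array_values(preg_grep('/^ab?$/', $candidates));  // a, ab
$optionalA  = array_values(preg_grep('/^a?b+$/', $candidates)); // ab, abbb, b

echo implode(', ', $optionalA), "\n"; // ab, abbb, b
```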

Specify range of number of occurrences

“ab{2}” –  matches a string that has an a followed by exactly two b’s (“abb”);

“ab{2,}” –  there are at least two b’s (“abb”, “abbbb”, etc.);

“ab{3,5}” – from three to five b’s (“abbb”, “abbbb”, or “abbbbb”).

Note: always specify the first value of a range (i.e. “{0,2}”, not “{,2}”).
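
The sketch below shows which part of a longer string each range actually picks up. The subject string is an assumption of mine; because the patterns are not anchored, they match inside it, and each range takes as many b’s as it is allowed to.

```php
<?php
// Hypothetical subject: one "a" followed by six "b"s.
$subject = 'abbbbbb';

preg_match('/ab{2}/', $subject, $exactlyTwo);    // exactly two b's
preg_match('/ab{2,}/', $subject, $twoOrMore);    // two or more b's
preg_match('/ab{3,5}/', $subject, $threeToFive); // three to five b's

echo $exactlyTwo[0], "\n";  // abb
echo $twoOrMore[0], "\n";   // abbbbbb
echo $threeToFive[0], "\n"; // abbbbb
```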

Specify range of occurrences of a sequence

“a(bc)*” –  matches a string that has an a followed by zero or more copies of the sequence “bc”;

“a(bc){1,5}” –  one to five occurrences  of “bc.”
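
Here is a small sketch of quantifiers applied to a whole group. The test strings are built with str_repeat() and are purely illustrative; the patterns are anchored so that the limits on the number of copies become visible.

```php
<?php
// Hypothetical test strings: "a", "abcbc", and "a" followed by six copies of "bc".
$none = 'a';
$two  = 'a' . str_repeat('bc', 2);
$six  = 'a' . str_repeat('bc', 6);

$starNone  = preg_match('/^a(bc)*$/', $none);     // 1: zero copies is fine
$starTwo   = preg_match('/^a(bc)*$/', $two);      // 1
$rangeNone = preg_match('/^a(bc){1,5}$/', $none); // 0: at least one "bc" is required
$rangeTwo  = preg_match('/^a(bc){1,5}$/', $two);  // 1
$rangeSix  = preg_match('/^a(bc){1,5}$/', $six);  // 0: more than five copies

var_dump($starNone, $starTwo, $rangeNone, $rangeTwo, $rangeSix);
```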

Using OR  (|) operator

“euro|dollar” – matches a string that has either “euro” or “dollar” in it;

“(b|cd)ef” –  a string that has either “bef” or “cdef”;

“(a|b)*c” –  a string that has any sequence of a’s and b’s (possibly empty) followed by a c;
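
The OR patterns can be checked like this. The sample sentence is invented; preg_match_all() is used for the first pattern so that every alternative found in the text is collected.

```php
<?php
// Hypothetical sample sentence.
$text = 'Pay 10 euro or 12 dollar';

// preg_match_all() returns the number of matches and fills $found[0] with them.
$count = preg_match_all('/euro|dollar/', $text, $found);

$bef   = preg_match('/(b|cd)ef/', 'xbef');   // 1
$cdef  = preg_match('/(b|cd)ef/', 'cdef');   // 1
$mixed = preg_match('/^(a|b)*c$/', 'abbac'); // 1: any mix of a's and b's, then c
$onlyC = preg_match('/^(a|b)*c$/', 'c');     // 1: the a/b part may be empty

echo $count, ': ', implode(', ', $found[0]), "\n"; // 2: euro, dollar
```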

Using period(.) operator

“a.[0-9]” –  matches a string that has an a followed by one character and a digit;

“^.{3}$” –  a string with exactly 3 characters.
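
A short sketch of the period operator follows. The test strings are my own; the last line relies on the default PCRE behaviour that the period does not match a newline.

```php
<?php
$aAnyDigit   = preg_match('/a.[0-9]/', 'ax7'); // 1: "a", any character, then a digit
$aDigitOnly  = preg_match('/a.[0-9]/', 'a7');  // 0: the "." takes the 7, nothing is left for [0-9]
$threeChars  = preg_match('/^.{3}$/', 'abc');  // 1
$fourChars   = preg_match('/^.{3}$/', 'abcd'); // 0
$withNewline = preg_match('/^.{3}$/', "a\nc"); // 0: "." does not match a newline by default

var_dump($aAnyDigit, $aDigitOnly, $threeChars, $fourChars, $withNewline);
```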

Bracket (“[]”) expressions

They specify which characters are allowed in a single position of a string:

“[ab]” –  matches a string that has either an a or a b (that’s the same as “a|b”);

“^[a-zA-Z]” –  a string that starts with a letter;

“[0-9]%” –  a string that has a single digit before a percent sign;

“,[a-zA-Z0-9]$” – a string that ends in a comma followed by an alphanumeric character.

Inside a bracket expression, a “^” placed right after the opening bracket negates the list, i.e. the expression matches any character that is NOT in the specified list.

“[^a-zA-Z0-9]” – matches a string that contains a character that is not a letter or a digit.
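
Bracket expressions work with every preg function, not just preg_match(). In this sketch (sample strings are assumptions of mine) the negated class is used with preg_split() to cut a string wherever a run of non-alphanumeric characters appears.

```php
<?php
$startsWithLetter = preg_match('/^[a-zA-Z]/', '9 lives');               // 0: starts with a digit
$percentCount     = preg_match_all('/[0-9]%/', 'up 5% then 7%', $hits); // 2: "5%" and "7%"
$endsCommaAlnum   = preg_match('/,[a-zA-Z0-9]$/', 'a,b,c');             // 1: ends in ",c"

// Split on every run of characters that are NOT letters or digits.
$words = preg_split('/[^a-zA-Z0-9]+/', 'one, two;three');

echo implode(' | ', $words), "\n"; // one | two | three
```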

Some useful regular expression patterns

i. Regular expression pattern to match or replace the special characters in a string. It is really helpful when you want to rename files so that they contain no special characters, or to check whether a string contains special characters. The “%” signs at both ends of the pattern are the delimiters.

pattern : “%[^a-zA-Z0-9]%”

code sample:

<?php
$string = '$%abcd*.06';
// Replace every character that is not a letter or a digit with "_".
echo preg_replace('%[^a-zA-Z0-9]%', '_', $string);
// output: __abcd__06
?>

If you want to make exceptions for some of the special characters, just include them inside the brackets after the “^” symbol.

pattern : “%[^a-zA-Z0-9.$]%”

code sample:

<?php
$string = '$%abcd*.06';
// "." and "$" are now part of the allowed list, so they are kept.
echo preg_replace('%[^a-zA-Z0-9.$]%', '_', $string);
// output: $_abcd_.06
?>

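For the file-renaming use case, the two ideas can be wrapped in small helper functions. This is only a sketch under my own assumptions: sanitize_filename() and has_special_chars() are hypothetical names, not PHP built-ins, and I add a “+” so that a run of special characters collapses into a single underscore.

```php
<?php
// sanitize_filename() is a hypothetical helper name, not a PHP built-in.
function sanitize_filename(string $name): string
{
    // Replace every run of characters other than letters, digits and "." with one "_".
    return preg_replace('%[^a-zA-Z0-9.]+%', '_', $name);
}

// has_special_chars() is likewise hypothetical.
function has_special_chars(string $value): bool
{
    // True when at least one character is neither a letter nor a digit.
    return preg_match('%[^a-zA-Z0-9]%', $value) === 1;
}

echo sanitize_filename('my photo (1).jpg'), "\n"; // my_photo_1_.jpg
var_dump(has_special_chars('abcd06'));            // bool(false)
```
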
ii. Regular expression pattern to match a valid email address

pattern :  “/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/”

code sample :

<?php
$email = 'user@example.com'; // placeholder address
// eregi() was removed in PHP 7, so preg_match() with "/" delimiters is used.
echo preg_match('/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/', $email);
// output: 1
?>

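As a rough sketch, the email check can be wrapped in a function that returns a boolean. It uses a version of the pattern with the dot before the domain suffix escaped. The name looks_like_email() and the sample addresses are my own assumptions. The last line shows PHP’s built-in filter_var() with FILTER_VALIDATE_EMAIL for comparison, which may be the safer choice when you do not need a custom rule.

```php
<?php
// looks_like_email() is a hypothetical helper name, not a PHP built-in.
function looks_like_email(string $email): bool
{
    $pattern = '/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/';
    return preg_match($pattern, $email) === 1;
}

$simpleOk  = looks_like_email('user@example.com');      // true
$noSuffix  = looks_like_email('user@example');          // false: no ".xx" part
$subdomain = looks_like_email('user@mail.example.com'); // false: this simple pattern is strict

// Built-in alternative: returns the address on success, false on failure.
$builtinOk = filter_var('user@example.com', FILTER_VALIDATE_EMAIL) !== false; // true

var_dump($simpleOk, $noSuffix, $subdomain, $builtinOk);
```

Note that the pattern is deliberately simple and rejects some valid addresses, such as the subdomain example above.
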
I will add some more commonly used patterns soon.

Useful links

http://www.ibm.com/developerworks/library/os-php-regex1/

http://in3.php.net/manual/en/book.pcre.php
