Regular expressions in Perl

in utopian-io •  7 years ago  (edited)

A substantial chunk of what all programs do when they're running is process strings. For word processors, this is obvious – we've all used the 'search and replace' facility – but even spreadsheets have to process the function names you feed them and then the parameters that you assign to each function. In short, virtually every program that you'll encounter will process strings at some point. The better string processing is supported in your language, the easier it is to write programs that use it.

In Perl, string processing is used so heavily that functionality such as 'search and replace' and 'find' have their own shorthand. If you look through Perl documentation, trying to locate functions with names like find() or replace() is a waste of time. As we'll see, there's a far more powerful way of processing strings than this.

Find/match

One of the most useful string functions is to find a substring within a longer string so that we can make a decision about which way a program should flow (a topic we'll cover next month). For the moment, we'll just use Perl's 'if-then' structure to demonstrate our point. So, in effect, we're looking for the Perl equivalent to 'if (find(sub_string, main_string)) then ...'. In Perl, this is the match operator that's usually written in the form:

if (m/sub_string/) {

... where 'sub_string' is the literal text to look for. If the substring is found, the operator returns the value '1', which can be used in 'if-then' structures, as we'll see in next month's Perl Masterclass. So, where has our 'main_string' disappeared to? The answer is that if no string is specified, Perl uses its default variable, which has the curious name '$_'.

In many pieces of Perl code, you'll be processing some string and the result will be assigned to the '$_' variable, so that you can continue processing it on the next line without having to assign the output of the function to any particular variable.

If you want the match operator (m//) to test our '$main_string' scalar variable instead of '$_', you need to use the binding operator (=~) like so:

if ($main_string =~ m/sub_str/) {

Or, using something a little more like what you would find in a real program:

$x = "some string";
if ($x =~ m/ring/) {
    print "$x has a ring\n";
}
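To make the difference between the default variable and the binding operator concrete, here is a minimal sketch. The word list and the variable names (@found, $main_string, $hit) are made up for illustration. The first loop relies on 'foreach' placing each item in '$_'; the second test names its variable explicitly.

```perl
use strict;
use warnings;

my @found;    # collects the words that contain 'ring'

# foreach with no loop variable puts each item in $_ in turn
foreach ("spring", "summer", "string") {
    # a bare m// tests $_, so no binding operator is needed
    push @found, $_ if m/ring/;
}
print "matched: @found\n";        # matched: spring string

# testing a named variable needs the binding operator
my $main_string = "wintering";
my $hit = ($main_string =~ m/ring/) ? "yes" : "no";
print "wintering: $hit\n";        # wintering: yes
```

Both forms perform the same test; the only difference is which string the match operator looks at.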

Search and replace

Finding that a substring exists within a longer one has many uses when it comes to making decisions, but it's handy if you can perform substitutions as well. Perl's equivalent of replace ('long_string', 'search_string', 'replacement_string') is simply:

s/srch_string/rplce_string/;

... using the default variable. As 's///' does something on its own, we don't need to use it in a flow control structure in order to make use of it. You can even replace text with an empty string if you want to delete it. To work on a variable other than '$_', we use the binding operator as we did with 'm//':

$a = "Turnkey solutions"; 
$a =~ s/rn/r/;
print "$a\n";
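The following sketch pulls those points together: a substitution bound to a named variable, an empty replacement that deletes text, and a bare 's///' working on '$_'. The variable names ($text, $count, $trimmed, @words) are invented for this example. It also shows that 's///' returns the number of substitutions it made.

```perl
use strict;
use warnings;

# s/// on a named variable through the binding operator;
# the return value is the number of substitutions made
my $text  = "Turnkey solutions";
my $count = ($text =~ s/rn/r/);
print "$text ($count change)\n";    # Turkey solutions (1 change)

# an empty replacement deletes the matched text
my $trimmed = "Turkey solutions";
$trimmed =~ s/ solutions//;
print "$trimmed\n";                 # Turkey

# with no binding operator, s/// works on $_
my @words = ("Turnkey", "burnt");
s/rn/r/ foreach @words;             # foreach aliases $_ to each element
print "@words\n";                   # Turkey burt
```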

... which prints 'Turkey solutions'. As you may have gathered, there's a lot more to it than just this.

Alternative representations

If you have a search string (whether it's in a match or a search and replace) that contains a number of '/'s, you're going to run into a problem of having to escape many characters. For example, supposing you wanted to check that a string contained 'http://', you're going to end up with ...

m/http:\/\//

... as you need to escape the forward slashes with backslashes. To avoid this, you can use an alternative delimiter to the forward slash, such as '!' or '%', or just about any character you're not likely to use in the search string. In addition, you can use characters that have a left and right form, such as square brackets or braces. In a match, the pair simply encloses the pattern; in a search and replace, each part gets its own pair, so the right form and then the left form appear in the middle. If that wasn't enough, if you stick to forward slashes, you can do without the 'm' in a match. The following are equivalent:

m/perl/ 
/perl/ 
m!perl! 
m{perl}

and, for search and replace:

s/visualbasic/perl/ 
s%visualbasic%perl%
s{visualbasic}{perl} 
s{visualbasic}<perl> 
s{visualbasic}/perl/

As you can see, you can mix them, including using mirrored characters with non-mirrored, as long as you start off with a mirrored pair.
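As a quick check that the delimiter really is just a matter of convenience, this sketch runs the 'http://' test with both forward slashes and braces, and then performs the same substitution with three delimiter styles. The URL and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $url = "http://example.com/index.html";    # made-up address

# both patterns look for the same text; braces avoid the escaping
my $with_slashes = ($url =~ m/http:\/\//) ? 1 : 0;
my $with_braces  = ($url =~ m{http://})   ? 1 : 0;
print "slashes: $with_slashes, braces: $with_braces\n";   # slashes: 1, braces: 1

# the same freedom applies to s///; each line copies $lang, then edits the copy
my $lang = "visualbasic";
(my $copy1 = $lang) =~ s/visualbasic/perl/;
(my $copy2 = $lang) =~ s{visualbasic}{perl};
(my $copy3 = $lang) =~ s%visualbasic%perl%;
print "$copy1 $copy2 $copy3\n";               # perl perl perl
```

Braces tend to be the most readable choice when the pattern contains slashes, as URLs and file paths usually do.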

Extending the match

When we looked at modifying string scalars last month using '\U\E', we showed you a glimpse of the power of Perl. Using special escaped characters, we can search for numbers, letters, any characters, specific numbers of characters, and so on. For our 'match/search' string, we've seen how to specify particular words but, suppose we're looking for two different spellings of a word, such as 'while' and 'whilst'. We could do the match twice or we could combine the two to read:

m/whil(st|e)/

... which uses the vertical bar (normally a shifted backslash on the keyboard) to say 'or'. In effect, this says: the first four letters of the search string are 'whil' and the following character(s) can either be 'st' or 'e'. Note that this works with 's///' as well.
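Here is a short sketch of the alternation in both a match and a substitution. The word list and variable names (@hits, $line) are invented for illustration. Note that 'meanwhile' matches too, because the pattern only needs to be found somewhere inside the string.

```perl
use strict;
use warnings;

# keep the words that contain 'while' or 'whilst'
my @hits;
foreach my $word ("while", "whilst", "whale", "meanwhile") {
    push @hits, $word if $word =~ m/whil(st|e)/;
}
print "@hits\n";     # while whilst meanwhile

# the same pattern in a substitution settles on one spelling
my $line = "Wait whilst it loads";
$line =~ s/whil(st|e)/while/;
print "$line\n";     # Wait while it loads
```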

In addition to stating specific groups of characters as alternatives, we can specify a set of characters for a single position in a 'match/search' string. We do this by enclosing them in square brackets, so if we're looking for an embedded digit, we could use [0123456789]. However, if we wanted to find any letter of the alphabet, the code could get a bit long! Perl has a shorthand for this: instead of writing each digit, you can get away with [0-9], and instead of every lower case letter, [a-z]. Note that as the minus sign signifies a range, if you want to look for a minus sign itself, you should escape it first, like so: [\-].

When looking for letters (any case) and digits, we can combine these ranges into [0-9a-zA-Z]. If you include the underscore in those characters, you can use the shortcut '\w' and, for [0-9], you can use '\d'. In addition to '\d' and '\w', there's also '\s', which is equivalent to [ \t\n\r\f], or space, tab, new line, carriage return and form feed. Finally, we can use a full stop to represent any single character except a new line.

Just to extend this a little further, if we wanted to find any character other than the letters 'a' to 'f' (for example), we can put a caret at the start of the set so that it reads [^a-f]. To match anything outside one of the three shorthand classes, we can either use the caret ([^\d], for example) or use the uppercase version of the shorthand ('\D', or [\D] inside brackets), which will find anything but a digit.
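The sketch below tries each kind of character class against a small test string. The strings and variable names are made up for this example; each test stores 1 when the pattern is found and 0 when it is not.

```perl
use strict;
use warnings;

my $sample = "order_7 - ok";    # made-up test string

my $digit_range = ($sample =~ m/[0-9]/)    ? 1 : 0;   # 1: '7' is in the range
my $digit_short = ($sample =~ m/\d/)       ? 1 : 0;   # 1: \d finds the same '7'
my $word_run    = ($sample =~ m/r\w\d/)    ? 1 : 0;   # 1: 'r', '_', '7'
my $minus       = ($sample =~ m/\s[\-]\s/) ? 1 : 0;   # 1: minus sign between spaces
my $any_char    = ($sample =~ m/o.d/)      ? 1 : 0;   # 1: 'o', any character, 'd'
print "$digit_range $digit_short $word_run $minus $any_char\n";   # 1 1 1 1 1

# negated classes look for a character outside the set
my $outside_af = ("facade" =~ m/[^a-f]/) ? 1 : 0;     # 0: every letter is a to f
my $non_digit  = ("2024"   =~ m/\D/)     ? 1 : 0;     # 0: digits only
my $has_other  = ("20x4"   =~ m/[^\d]/)  ? 1 : 0;     # 1: 'x' is not a digit
print "$outside_af $non_digit $has_other\n";          # 0 0 1
```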

And more

Finding one character is fine but if we had to repeat our character specification for each character, it would become laborious and inflexible. The search string would become long and if the number of characters that we could find was specified too closely, variations would cause it to fail, such as if we needed to match 2, 3 or 4 characters.

If you've used DOS or a shell in one of the Unices, you'll probably be familiar with the wild cards '*' and '?', which represent any number of any characters, and any one character respectively.

In Perl, we use similar wild cards to repeat search patterns, as in '[a-z]+'. The star means that the preceding character or class may occur any number of times, from 0 to infinity, so if we were looking at '5260385' and used 'm/[a-z]*/', we would get a positive match, as would 'm/d*/', because [a-z] and the letter 'd' both occur zero times.

While this isn't a great deal of use on its own, it can be very useful if you're looking for something embedded within another string. To make this more useful, we also have '?', which means 0 or 1 occurrences, and '+', which means one or more occurrences. If that isn't enough, you can even specify how many occurrences, so if you wanted to match between 2 and 4 letters, you could use 'm/[a-z]{2,4}/'.
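A minimal sketch of the four repetition forms follows. The test strings and variable names are invented for illustration. The last line uses 's///' to show how much text '{2,4}' actually takes: it consumes as many letters as it can, up to the limit of four.

```perl
use strict;
use warnings;

my $number = "5260385";
my $star = ($number =~ m/[a-z]*/) ? 1 : 0;   # 1: zero letters still counts
my $plus = ($number =~ m/[a-z]+/) ? 1 : 0;   # 0: '+' needs at least one letter
print "star: $star, plus: $plus\n";          # star: 1, plus: 0

# '?' makes the preceding character optional
my @spellings;
foreach my $word ("color", "colour", "colr") {
    push @spellings, $word if $word =~ m/colou?r/;
}
print "@spellings\n";                        # color colour

# {2,4} takes at least two and at most four letters
my $short = ("a1b2" =~ m/[a-z]{2,4}/) ? 1 : 0;   # 0: never two letters in a row
(my $masked = "abcdefgh") =~ s/[a-z]{2,4}/#/;    # replaces 'abcd' only
print "short: $short, masked: $masked\n";        # short: 0, masked: #efgh
```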



