Regular Expressions — Basics

Cyber Grover 🐱‍💻
6 min read · Feb 24, 2023

Regular expressions, also known as regex, are sequences of characters that form a search pattern. They are heavily used in programming and text-editing applications to search for, manipulate, and validate text based on a specific pattern or set of rules.

Regular expressions allow for complex search patterns, making it possible to match a wide range of strings, including specific characters, digits, and words.

Regular expressions use a combination of characters and metacharacters to define a pattern. In addition to the basic metacharacters, regular expressions can also include quantifiers, character classes, and grouping constructs to create more complex search patterns.

If the definition above sounds like a foreign language to you, no worries. Read along and it will become clearer. I recommend that you practice along with me to see regex in action. To start, create a file (I called mine test.txt) and copy the following text into it. We will be using the ‘egrep’ tool, which ships with most Linux distributions (it is equivalent to grep -E).

Syntax for egrep: egrep <pattern> <file>
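
If you would rather follow along in code than on the command line, here is a minimal sketch of the same idea in Python, using its built-in re module. Treat it as a rough stand-in and not as how egrep works internally: the helper name grep_lines is made up for illustration, and Python's regex flavor differs from egrep's in some details beyond the basics covered here.

```python
import re

def grep_lines(pattern, lines):
    """Return the lines that contain at least one match for pattern.

    Hypothetical helper that mimics the basic behaviour of
    `egrep <pattern> <file>`: a line is kept if the pattern
    matches anywhere inside it.
    """
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Two sentences from the practice text, one per "line".
sample = [
    "To Sherlock Holmes she is always the woman.",
    "I have seldom heard him mention her under any other name.",
]

matching = grep_lines("Holmes", sample)
print(matching)  # ['To Sherlock Holmes she is always the woman.']
```

To run it against your own test.txt, you could pass open("test.txt") as the second argument, since a file object yields its lines one at a time.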

Practice Text

To Sherlock Holmes she is always the woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. He never spoke of the softer passions, save with a gibe and a sneer. They were admirable things for the observer — excellent for drawing the veil from men’s motives and actions. But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbing than a strong emotion in a nature such as his. And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory.

That is the first paragraph from one of my favorite books, The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle. You can use any text you want; the results will be different, but the regex stays the same.

test.txt file for practice

Literal characters

This is the simplest kind of regex. A regular expression that consists of a single literal character matches that character in the text. For example, the regular expression ‘x’ matches the letter ‘x’ in a text string. Here I searched for ‘Irene’ in the file, and it highlighted every occurrence of “Irene” in the text.

Regex search for Irene

Do note that case matters in regex. If I search for ‘irene’, I get no results back.

Regex search for irene
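
The two searches above can be reproduced with a small sketch in Python's re module (my choice for illustration; the article itself uses egrep). The text variable holds just two sentences from the practice paragraph, and the re.IGNORECASE flag at the end is an extra I added to show one way around case sensitivity.

```python
import re

# Two sentences taken from the practice text.
text = (
    "It was not that he felt any emotion akin to love for Irene Adler. "
    "And yet there was but one woman to him, and that woman was the late Irene Adler."
)

# A literal pattern matches exactly those characters, in that order.
exact = re.findall("Irene", text)
print(exact)  # ['Irene', 'Irene']

# Case matters: the lowercase pattern finds nothing.
lowercase = re.findall("irene", text)
print(lowercase)  # []

# Optional extra: ask the engine to ignore case.
ignoring_case = re.findall("irene", text, flags=re.IGNORECASE)
print(ignoring_case)  # ['Irene', 'Irene']
```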

Character Classes

Character classes, or charsets, match any one character from a specific set and are defined using square brackets ([]). The regular expression [aeiou] matches any lowercase vowel in a text string or file. The order of characters inside the square brackets does not matter.

Regex search for [aeiou]

You can define character ranges using ‘-’. The regex [a-d] matches only the characters a, b, c, and d.

Regex search for [a-d]

This means [a-zA-Z] matches any letter from a to z, in either upper or lower case.
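
Here is a short sketch of these character classes in Python's re module (again my stand-in for egrep). I used the name “Irene Adler” from the practice text as the input so the results stay small enough to check by eye.

```python
import re

text = "Irene Adler"

# Any single lowercase vowel. The capital 'I' and 'A' are NOT matched.
vowels = re.findall("[aeiou]", text)
print(vowels)  # ['e', 'e', 'e']

# The order inside the brackets makes no difference.
same_vowels = re.findall("[uoiea]", text)
print(same_vowels == vowels)  # True

# A range: any one of a, b, c, d.
a_to_d = re.findall("[a-d]", text)
print(a_to_d)  # ['d']

# Two ranges in one class: any letter, upper or lower case.
letters = re.findall("[a-zA-Z]", text)
print("".join(letters))  # IreneAdler (the space is not a letter)
```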

Negated Character Classes

Negated character classes match any character that is not in a specified set of characters. They are denoted by the ‘^’ character placed immediately after the opening bracket.

For example, the regex pattern ‘[^abc]’ matches any character that is not ‘a’, ‘b’, or ‘c’.

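Remember we searched for all vowels in the test file above. Here is the same example, but this time we are searching for anything but lowercase vowels. Note that this matches more than just consonants: spaces, punctuation, and uppercase letters are matched too.

Regex search for non-vowels [^aeiou]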

Negated character classes can be useful when you want to match any character except for a certain set of characters. They can also be combined with other regex features, such as quantifiers, to create more complex patterns.
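
A quick sketch of a negated class, using Python's re module as a stand-in for egrep. It is only an illustration on a two-word input, but it makes one thing visible: [^aeiou] means “anything that is not a lowercase vowel”, which includes spaces and capital letters.

```python
import re

text = "Irene Adler"

# Every character that is not a lowercase vowel.
# The capital 'I', the capital 'A' and the space all qualify.
not_vowels = re.findall("[^aeiou]", text)
print(not_vowels)  # ['I', 'r', 'n', ' ', 'A', 'd', 'l', 'r']

# Combined with the '+' quantifier (covered in the next section),
# consecutive non-vowels are grouped into runs.
runs = re.findall("[^aeiou]+", text)
print(runs)  # ['Ir', 'n', ' Adl', 'r']
```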

Quantifiers

Quantifiers are used to specify how many times a character, group, or character class should be matched. They can be used to match a single character, a sequence of characters, or a group of characters.

Some common quantifiers in regex:

  • ‘*’ (asterisk): Matches zero or more occurrences of the preceding character or group. For example, the regex [A-Z][a-z]* looks for any word that begins with a capital letter ([A-Z]) followed by lowercase letters ([a-z]). The ‘*’ quantifier matches the [a-z] class zero or more times.
  • ‘+’ (plus): Matches one or more occurrences of the preceding character or group. For example, the regex [A-Z][a-z]+ looks for any word that begins with a capital letter ([A-Z]) followed by lowercase letters ([a-z]). The ‘+’ quantifier matches the [a-z] class one or more times.

Notice the subtle difference between the two. With the ‘*’ quantifier, which matches the preceding charset zero or more times, the standalone ‘I’ is selected by the regex engine. That is not the case with the ‘+’ quantifier, which needs at least one lowercase letter after the capital.
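
The difference is easy to see in a small sketch. I am using Python's re module here as a stand-in for egrep, and the sentence in text is adapted from the practice paragraph so that it contains a standalone ‘I’ as well as two capitalized names.

```python
import re

text = "I have seldom heard him mention Irene Adler."

# Capital letter followed by ZERO or more lowercase letters:
# the lone 'I' qualifies because zero lowercase letters is allowed.
with_star = re.findall("[A-Z][a-z]*", text)
print(with_star)  # ['I', 'Irene', 'Adler']

# Capital letter followed by ONE or more lowercase letters:
# the lone 'I' is skipped because it is followed by a space.
with_plus = re.findall("[A-Z][a-z]+", text)
print(with_plus)  # ['Irene', 'Adler']
```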

  • ‘?’ (question mark): Matches zero or one occurrence of the preceding character or group, which makes that character or group optional. The pattern will match with or without it in the text.

To demonstrate, I created a new file with the text “color is red and colour is blue”. When I search with the regex colou?r, the regex engine finds both color and colour in the file. Here the ‘?’ makes the ‘u’ optional.
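
The same experiment as a sketch in Python's re module (my stand-in for egrep), with the sentence held in a string instead of a file:

```python
import re

text = "color is red and colour is blue"

# The '?' makes the 'u' optional, so both spellings match.
spellings = re.findall("colou?r", text)
print(spellings)  # ['color', 'colour']
```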

The ‘{}’ are used as quantifiers to indicate how many times a character or group should be repeated.

  • ‘{n}’ (curly braces with a number n): Matches exactly n occurrences of the preceding character or group.

Example: The regular expression ‘(ha){3}’ matches ‘hahaha’.

  • ‘{n,}’ (curly braces with a number n and a comma): Matches n or more occurrences of the preceding character or group.

Example: The regular expression ‘a{2,}b?’ matches strings such as ‘aa’, ‘aaa’, ‘aaaa’, ‘aab’, or ‘aaab’. Here it is looking for two or more ‘a’ characters. Notice the ‘?’ after ‘b’, which makes it optional.

  • ‘{n,m}’ (curly braces with two numbers n and m separated by a comma): Matches between n and m occurrences of the preceding character or group.

Example: The regular expression ‘ab{2,4}c’ matches ‘abbc’, ‘abbbc’, or ‘abbbbc’.
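
The three brace forms can be checked with a sketch like the one below, written with Python's re module as a stand-in for egrep. The helper fully_matches is a name I made up; it uses re.fullmatch, which requires the whole string to match. That is stricter than egrep, which is happy to find the pattern anywhere inside a line, but it makes the counting rules easier to see.

```python
import re

def fully_matches(pattern, candidate):
    # Hypothetical helper: True only if the ENTIRE candidate matches.
    return re.fullmatch(pattern, candidate) is not None

# {n}: exactly three repetitions of the group 'ha'.
exactly_three = [s for s in ["haha", "hahaha", "hahahaha"]
                 if fully_matches("(ha){3}", s)]
print(exactly_three)  # ['hahaha']

# {n,}: two or more 'a', then an optional 'b'.
two_or_more = [s for s in ["a", "aa", "aaaa", "aab", "ab"]
               if fully_matches("a{2,}b?", s)]
print(two_or_more)  # ['aa', 'aaaa', 'aab']

# {n,m}: 'a', then between two and four 'b', then 'c'.
two_to_four = [s for s in ["abc", "abbc", "abbbbc", "abbbbbc"]
               if fully_matches("ab{2,4}c", s)]
print(two_to_four)  # ['abbc', 'abbbbc']
```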

Anchors

Anchors are special characters that match a position in a string, rather than matching any character in the string. The two most common anchors are the caret (^) and the dollar sign ($).

The caret (^) anchor matches the beginning of a line or string. The regular expression “^Hello” matches any string that begins with the word “Hello”.

The dollar sign ($) anchor matches the end of a line or string. The regular expression “world$” matches any string that ends with the word “world”.
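
Both anchors in one sketch, using Python's re module as a stand-in for egrep. The four sample lines are invented for illustration; each one is tested on its own, the way egrep tests a file line by line.

```python
import re

# Made-up sample lines.
lines = ["Hello world", "Hello there", "Say Hello", "brave new world"]

# '^' pins the match to the start of the line.
starts_with_hello = [line for line in lines if re.search("^Hello", line)]
print(starts_with_hello)  # ['Hello world', 'Hello there']

# '$' pins the match to the end of the line.
ends_with_world = [line for line in lines if re.search("world$", line)]
print(ends_with_world)  # ['Hello world', 'brave new world']
```

“Say Hello” contains the word but does not start with it, so the ^Hello search skips it.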

In this brief article, we went over the basics of regular expressions. This is just the tip of the iceberg. We also have alternation, grouping and capturing, backreferences and a whole lot more.
