Pietro Passarelli

Regex - an easier approach



How to build more complex regex without going insane

“Regular expressions are a way to describe patterns in string data.”

  • Excerpt From: Haverbeke, Marijn. “Eloquent JavaScript.” iBooks. This material may be protected by copyright.

Defining the problem

Let's say we want to write a regex to match time-codes, to test a given time-code has a valid format

We consider the notation hours, minutes, seconds, milliseconds. Such as 01:17:23.736 where the milliseconds that can be represented either with a . or a , hh:mm:ss.mmm Or hh:mm:ss,mmm.

When writing a regex you'd be tempted to mash all the symbols together, test it and tweak it until it works, and get it done and over with asap. An alternative approach is [ giving a variable to patterns and then combining those patterns this makes it easier to read, easier to write, and most importantly easier to think about in a systematic way.

exploring the problem

For this example I'm using JavaScript, but most languages have support for regex with only minor tweaks to transpose them from one language to another.

We know that \d is equivalent to any digit character [0-9].

So just by looking at our time-code string 01:17:23.736 hh:mm:ss.mmm we can already describe it as:

/\d\d:\d\d:\d\d.\d\d\d/

This can be refactored using the notation {number} defines how many time the instance is expected to occur. so where writing \d{2} is equivalent to write \d\d, where we are expecting two digits one after the other.

/\d{2}:\d{2}:\d{2}.\d{3}/

Square brackets such as [something somethingElse] means any something or somethingElse. And any character in brackets loose it's special value.

So if we want to see milliseconds with , and milliseconds with , we would write

/\d{2}:\d{2}:\d{2}[ . ,]\d{3}/

Which in JavaScript you would run by defining a variable with the regex pattern, and the method .test on that variable passing the string you want to test as argument. Last but not least using console.log to print out the outcome.

const timeCode =  /\d{2}:\d{2}:\d{2}[ . ,]\d{3}/;

console.log(timeCode.test("01:17:23.736"));

This is also equivalent to writing it without using the variable declaration if you want to have a one liner.

console.log( /\d{2}:\d{2}:\d{2}[ . ,]\d{3}/.test("01:17:23.736"));

a more sensible approach

Now you might have noticed that I got carried away and didn't really follow the approach I set out to show you and the code is not very readable especially in the last one liner, I didn't define variables with regex patterns and combined those to make it more manageable and readable, as well as flexible if the time-code I am looking for where to change.

So here's how you'd go about following this more sensible approach.

Once again you got your time-code string 01:17:23.736 hh:mm:ss.mmm

Now you think, what have you got as smallest unit. We got hours,minutes, and milliseconds that are all 2 digit characters.

So I could start to write a pattern to match that

const twoDigit = /\d{2}/;

Then we see we got milliseconds which is a 3 digit characters.

const mmm = /\d{3}/;

which we could already write as

const timeCode = /twoDigit:twoDigit:twoDigit.mmm/;

which would look like:

// a simple readable approach to writing regex using variables.
const twoDigit = /\d{2}/;
const mmm = /\d{3}/;
const timeCode = /twoDigit:twoDigit:twoDigit.mmm/;
console.log(timeCode.test("01:17:23.736"));

but timecode come in pairs

but generally timecode, for example in subtitles files come in pairs, for example in an srt file, so here's in an example of how we did a regex, but now need to expand on it to match a larger string.

00:00:06,500 --> 00:00:10,790
some text from a subtitle file srt

it can be usefull

“A part of a regular expression that is surrounded in parentheses counts as a single element as far as the operators following it are concerned.”

  • Excerpt From: Haverbeke, Marijn. “Eloquent JavaScript.”

for this example if we just consider the string with the time-codes, let's say we have parsed it and isolated it into its own string.

00:00:06,500 --> 00:00:10,790

considering the code written in the previous section, we can now add to it as follow:

// a simple readable approach to writing regex using variables.
const twoDigit = /\d{2}/;
const mmm = /\d{3}/;
const timeCode = /twoDigit:twoDigit:twoDigit,mmm/;

const twoTimeCodes = /timeCode --> timeCode/;

console.log(twoTimeCodes.test("00:00:06,500 --> 00:00:10,790"));

how much easier was that? we just had to add this line:

const twoTimeCodes = /timeCode --> timeCode/;

and change what variable we were using to run the test to this last one.

console.log(twoTimeCodes.test("00:00:06,500 --> 00:00:10,790"));