Pietro Passarelli

Regex - an easyer approach

Regex - an “easyer” approach

“Regular expressions are a way to describe patterns in string data.”

Excerpt From: Haverbeke, Marijn. “Eloquent JavaScript.” iBooks. This material may be protected by copyright.

defining the problem

Let’s say we want to write a regex to match timecodes, to test a given timecode has a valid format

We consider the notation hours, minutes, seconds, milliseconds. Such as 01:17:23.736 where the milliseconds that can be represented either with a . or a , hh:mm:ss.mmm Or hh:mm:ss,mmm.

When writing a regex you’d be tempted to mash all the symbols together, test it and tweak it until it works, and get it done and over with asap. An alaternative approach is [ giving var to patterns and then combining those patterns this makes it easyer to read, easyer to write, and most importantly easyer to think about in a systematic way.

exploring the problem

For this example I’m using JavaScript, but most languages have support for regex with only minor tweaks to transpose them from one language to another.

We know that \d is equivalent to any digit caracther ` [0-9]`.

So just by looking at our timecode string 01:17:23.736 hh:mm:ss.mmm we can already describe it as:

/\d\d:\d\d:\d\d.\d\d\d/

This can be refactored using the notation {number} defines how many time the instance is expected to occur. so where writing \d{2} is equivalent to write \d\d, where we are expecting two digits one after the other.

/\d{2}:\d{2}:\d{2}.\d{3}/

Square brackets such as [something somethingElse] means any something or somethingElse. And any character in brackets loose it’s special value.

So if we want to see milliseconds with , and milliseconds with , we would write

/\d{2}:\d{2}:\d{2}[ . ,]\d{3}/

Which in JavaScript you would run by defining a variable with the regex pattern, and the method .test on that variable passing the string you want to test as argument. Last but not least using console.log to print out the outcome.

var timeCode =  /\d{2}:\d{2}:\d{2}[ . ,]\d{3}/;

console.log(timeCode.test("01:17:23.736"));

This is also equivalent to writing it without using the variable declaration if you want to have a one liner.

console.log( /\d{2}:\d{2}:\d{2}[ . ,]\d{3}/.test("01:17:23.736"));

a more sensible approach

Now you might have noticed that I got carried away and didn’t realy follow the approach I set out to show you and the code is not very readable especilly in the last one liner, I didn’t define variables with regex patterns and combined those to make it more manageable and readable, as well as flexible if the timecode I am looking for where to change.

So here’s how you’d go about following this more sensible approach.

Once again you got your timecode string 01:17:23.736 hh:mm:ss.mmm

Now you think, what have you got as smallest unit. We got hours,minutes, and milliseconds that are all 2 digit characters.

So I could start to write a pattern to match that

var twoDigit = /\d{2}/;

Then we see we got milliseconds which is a 3 digit characters.

var mmm = /\d{3}/;

which we could already write as

var timeCode = /twoDigit:twoDigit:twoDigit.mmm/;

which would look like:

// a simple readable approach to writing regex using variables.
var twoDigit = /\d{2}/;
var mmm = /\d{3}/;
var timeCode = /twoDigit:twoDigit:twoDigit.mmm/;
console.log(timeCode.test("01:17:23.736"));

but timecode come in pairs

but generally timecode, for example in subtitles files come in pairs, for example in an srt file, so here’s in an example of how we did a regex, but now need to expand on it to match a larger string.

00:00:06,500 --> 00:00:10,790
some text from a subtitle file srt 

it can be usefull

“A part of a regular expression that is surrounded in parentheses counts as a single element as far as the operators following it are concerned.” - Excerpt From: Haverbeke, Marijn. “Eloquent JavaScript.”

for this example if we just consider the string with the timecodes, let’s say we have parsed it and isolated it into its own string.

00:00:06,500 --> 00:00:10,790

considering the code written in the previous section, we can now add to it as follow:

// a simple readable approach to writing regex using variables.
var twoDigit = /\d{2}/;
var mmm = /\d{3}/;
var timeCode = /twoDigit:twoDigit:twoDigit,mmm/;

var twoTimeCodes = /timeCode --> timeCode/;

console.log(twoTimeCodes.test("00:00:06,500 --> 00:00:10,790"));

how much easyer was that? we just had to add this line:

var twoTimeCodes = /timeCode --> timeCode/;

and change what var we were using to run the test to this last one.

console.log(twoTimeCodes.test("00:00:06,500 --> 00:00:10,790"));