Let’s talk about REGEX.

Cloudflare, one of the biggest companies on the internet in terms of reach, expanse, control, and practicability out there. They specialize in internet security, safety, and general privacy. They’ve been named the leading DDoS Prevention source In fact, I use them for my own DDoS protection needs.

A couple weeks ago, for 30 minutes, a LARGE chunk of the internet was down. This is because CloudFlare serves as the “gateway” or access point for many such websites, including some cryptocurrency endevours. Needless to say, they have quite a bit of control over the internet.

Why did everything break? Because of a little something called REGEX. A CloudFlare engineer apparently screwed up with a “regex” rule.

What is REGEX?

It stands for “regular expression”, and is a sequence of characters that defines a search pattern. This search pattern can be used to compare against strings to find characters or sub-strings that match the said pattern. It was original created by a man named Stephen Cole Kleene. He was a mathematician in 1951, and he described what would eventually become early implementations of pattern matching.

REGEX is usually a bad idea, unless you’re attempting a pattern match, or you really know what you’re doing. Whenever working with regular expressions, be VERY careful that you double, and then triple check the logic of what you’re writing.

([A-Z])\w+

This is a pattern that matches strings of characters that start with a capital letter until it hits a space. Confusing, right?

\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b is a more complex pattern. It describes a series of letters, digits, dots, underscores, percentage signs and hyphens, followed by an at sign, followed by another series of letters, digits and hyphens, finally followed by a single dot and two or more letters. In other words: this pattern describes an email address.

Oh yeah. But not only are they confusing, they can be dangerous. Let me give you an example.

Suppose we have the following, very simple regex pattern:

(x+x+)+y

What this does is match any string with two or more “X”s followed by a “Y”. This would match these strings:

“xxxxxy”, “xxxxxxxxxxxxxxxxxxxy”, “xxy”, “xxxy”, and any number of “X”s followed by a Y you can imagine. It will NOT match “xy”. Okay, fair enough. Simple enough, right?

Let’s take a look at what’s going on when compared to this string:

xxxxxxxxxxy

The first X+ will match all 10 X characters. The second X fails. The first x+ then “backtracks” algorithmically to 9 matches, and the second one picks up the remaining x. The group has now matched once. The group repeats, but fails at the first X. Since one repetition was sufficient, the group matches. The Y character then matches and an overall match is found. The coder sees the correct return value, the regex is declared functional, the code is pushed into the wild, and our computers get just a little closer to exploding.

Except… what happens if the Y character didn’t match? Like, what if there wasn’t a Y character in that string? Then what would happen?

The regex engine backtracks. Hard. The group has one iteration it can backtrack into. The second X matched only one X, so there’s nothing it can do there. “But wait,” The program thinks. “What if the X+ gives up a matching X character?”

So it matches the first X+ to 8 Xs instead of 9. The second x+ promptly matches xx. The group again has one iteration, fails the next one, and the Y fails. Stay with me here. Backtracking again, the interpreter now realizes that the second X+ contains a position it can backtrack into, by removing one of the Xs from the second match and combining it with the first X+. It’s all very confusing, but keep in mind it’s basically just the computer trying all possible combinations to attempt to match the string.

The group tries a second iteration. The first X+ matches but the second X+ doesn’t. Backtracking again, the first X+ in the group’s first iteration reduces itself to 7 characters. The second X+ matches XXX. Now maybe it matches? No, there’s still no Y character.

Failing, the interpreter tries again, since there are still many more combinations it can try. The second X+ is reduced to XX, it tries again, fails, and then reduces further back into the original X. Now, the group can match a second iteration, with one X for each X+. But this wacky combination fails too.

Are you starting to see the problem here? This is off ONE expression. If you go to

https://regexr.com/

and put the same string in that I did without the Y, it will actually fail to match after adding a couple Xs because it will time out from taking so long.

With the Y, it completes within a millisecond.
Without the Y and a couple more Xs? Oh boy.

At least it tells us. If this exact situation happened inside a .NET application, it would crash, since a stack overflow would probably happen.

Now, the first thing you might think is: “Uh, just remove the parenthesis to fix  the nested quantifiers, genius”, but let’s replace each “X” with another, more complicated example. Something that might be coded in the workplace:

^(.*?,){19}A

Not too much worse than what we had above. Now some context: The person writing this regex has a text file and they were attempting to figure out where the 20th item on a line started with an A.

Doesn’t seem like that hard of a task, and the regex works. What’s it doing, though? Same as the above example, it’s just harder to tell.

The regex looks like it work fine. The lazy dot and comma match a single comma-delimited field, and the {19} skips the first 19 fields. Finally, the A checks if the 20th field indeed starts with A. In fact, this is exactly what will happen when the 20th field indeed starts with a A. Straightforward logic.

When the 20th field does not start with a A, that’s when the problems start. Let’s say the string is “1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20” At that point, the regex engine will backtrack. It will backtrack to the point where ^(.*?,){19} had consumed “19“, giving up the last match of the comma. The next token is again the dot. The dot matches a comma, since a dot is a wildcard character. The comma does not match the 1 in the 20th field, so the dot continues until the 19th iteration of .*?, has consumed “19, 20” You can already see the root of the problem: the part of the regex (the dot) matching the contents of the field also matches the delimiter (the comma). Because of the double repetition (star inside {19}), this leads to a catastrophic amount of backtracking. At this rate, it would take multiple seconds just to check one line of text which should take less than a tenth of a millisecond. This can lead to HUGE lagspikes. In Cloudflare’s case, their server’s CPUs were maxed out to 100% load instantly, causing the internet as we know it to vanish for a solid 30 minutes as a software engineering positioning was busy being posted on Glassdoor.

How do you fix this?

Like this:

^([^,\r\n]*,){19}A

The trick is to be more specific. The complexity of the original regex expression was exponentially difficult, and contained a complexity of O(2n), which is terrible. The time it takes to complete said RegEx searches are exponentially higher the more characters in the string, and leads to terrible workarounds.

If you’d like to learn more about complex RegEx expressions, pattern matching, and how they could be vastly improved with the Ken Thompson algorithm, read this.

Now excuse me while I go back to something normal and non-confusing, like JavaScript.


It helps me if you share this post

Published 2019-07-15 18:55:26

2 comments on “Let’s talk about REGEX.”

  1. There is noticeably a bundle to learn about this. I suppose you’ve made certain nice points in features also.

Leave a Reply

Your email address will not be published. Required fields are marked *