Java, poor regex performance with lazy expressions

The code is actually in Scala (Spark/Scala) but the library scala.util.matching.Regex, as per documentations, delegates to java.util.regex.

The code, essentially, reads a bunch of regex from a config file and then matches them against logs fed to the Spark/Scala app. Everything worked fine until I added a regex to extract strings separated by tabs where the tab has been flattened to “#011” (by rsyslog). Since the strings can have white-spaces, my regex looks like:
(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)

The moment I add this regex to the list, the app takes forever to finish processing logs. To give you an idea of the magnitude of delay, a typical batch of a million lines takes less than 5 seconds to match/extract on my Spark cluster. If I add the expression above, a batch takes an hour!

In my code, I have tried a couple of ways to match regex:

if ( (regex findFirstIn log).nonEmpty ) { do something } – (1)

val allGroups = regex.findAllIn(log).matchData.toList
if (allGroups.nonEmpty) { do something }
– (2)

if (regex.pattern.matcher(log).matches()){do something} – (3)

All three suffer from poor performance when the regex mentioned above it added to the list of regex. Any suggestions to improve regex performance or change the regex itself?


Source: java

Leave a Reply