Highlighting Search Results: RegEx Character Collation?

When I run a MySQL full-text query, the Unicode collation means I get results matching all of the following, whichever of them I query for: saka, sakā, śāka, ṣaka, etc.

Where I’m stuck is highlighting the matches in the search results. With a standard RegEx, I can only match and highlight the query word as typed, not all the collated matches.

How would one go about solving this? I’ve initially thought of these approaches:

  • Creating a RegEx pattern that tests the results against all possible variants. This would easily turn into one monster of a bloated pattern.
  • Creating a normalized version of the results, locating the matches there, and using the string positions as a basis for highlighting.
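To make the second approach concrete, here is a minimal Python sketch. It assumes the diacritic variants all decompose under Unicode NFD into a base letter plus combining marks, which I believe holds for IAST but have not checked against every extension. The names fold and highlight, and the <mark> tags, are placeholders of mine; HTML escaping is left out.

```python
import unicodedata

def fold(text):
    """Return (folded, index_map): the text lowercased with combining marks
    removed, plus for each folded character the index of the original
    character it came from."""
    folded = []
    index_map = []
    for i, ch in enumerate(text):
        for d in unicodedata.normalize("NFD", ch.lower()):
            if unicodedata.combining(d):
                continue  # drop the diacritic, keep the base letter
            folded.append(d)
            index_map.append(i)
    return "".join(folded), index_map

def highlight(text, query, open_tag="<mark>", close_tag="</mark>"):
    """Wrap every collation-style match of query in text with the tags."""
    folded_text, index_map = fold(text)
    folded_query, _ = fold(query)
    if not folded_query:
        return text
    out = []
    cursor = 0  # position in the ORIGINAL text already copied to out
    pos = folded_text.find(folded_query)
    while pos != -1:
        start = index_map[pos]
        end = index_map[pos + len(folded_query) - 1] + 1
        # If the text is stored decomposed, keep trailing marks with the match.
        while end < len(text) and unicodedata.combining(text[end]):
            end += 1
        if start >= cursor:
            out.append(text[cursor:start])
            out.append(open_tag + text[start:end] + close_tag)
            cursor = end
        pos = folded_text.find(folded_query, pos + len(folded_query))
    out.append(text[cursor:])
    return "".join(out)
```

For example, highlight("śāka and sakā", "saka") wraps both words while leaving their diacritics intact. If the folding is done one displayed snippet at a time rather than on the whole result set, the extra copy only ever covers a single snippet, which may ease the RAM concern.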

However, both of these approaches incur substantial processing overhead compared to regular search-result highlighting. The first would cost a lot of CPU; the second would probably use less CPU but need at least twice the RAM for the results. Any suggestions?

P.S. In case it’s relevant: the specific character set I’m dealing with (IAST for Sanskrit transliteration, with extensions) has three variants each of L and N; two variants each of M, R and S; and one variant each of A, D, E, H, I, T and U. In total that is A-Z plus 19 diacritic variants, plus uppercase (which poses no problem here).
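With a character set this small, the first approach may be less monstrous than feared if the pattern is generated from the query rather than written by hand: each query letter becomes one character class, so the pattern grows with the query length only. Below is a rough Python sketch. The VARIANTS table is my guess at the 19 variants from the counts above (the E variant in particular is assumed), and the target text is assumed to be stored precomposed (NFC).

```python
import re
import unicodedata

# Assumed variant table: base letter -> its diacritic variants.
# Adjust to the actual extended IAST set in use.
VARIANTS = {
    "a": "ā", "d": "ḍ", "e": "ē", "h": "ḥ", "i": "ī",
    "l": "ḷḹḻ", "m": "ṃṁ", "n": "ṅñṇ", "r": "ṛṝ",
    "s": "śṣ", "t": "ṭ", "u": "ū",
}
# Make sure every variant is a single precomposed code point.
VARIANTS = {b: unicodedata.normalize("NFC", v) for b, v in VARIANTS.items()}

# Reverse lookup: variant character -> base letter.
BASE_OF = {v: base for base, vs in VARIANTS.items() for v in vs}

def query_pattern(query):
    """Compile a regex matching query and all its diacritic variants."""
    parts = []
    for ch in unicodedata.normalize("NFC", query).lower():
        base = BASE_OF.get(ch, ch)
        if base in VARIANTS:
            # One character class per letter that has variants.
            parts.append("[" + base + VARIANTS[base] + "]")
        else:
            parts.append(re.escape(base))
    return re.compile("".join(parts), re.IGNORECASE)
```

Any of saka, sakā, śāka or ṣaka as the query yields the same pattern, with one class for the s, one for each a, and a literal k. It can then be applied with sub() to wrap matches in highlight tags.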
