When I run a MySQL full-text query, the Unicode collation makes it return rows matching all of the following, whichever of them I query for:
saka, sakā, śāka, ṣaka etc.
Where I’m stuck is highlighting the matches in the search results. With a standard RegEx, I can only match and highlight the original query word in the results, not all the collated matches.
How would one go about solving this? I’ve initially thought of these approaches:
- Creating a RegEx pattern that would test the target results against all possible variants. This would easily turn into one monster of a bloated pattern.
- Creating a normalized version of the results, locating the matches there, and using the string positions as a basis for highlighting.
However, both of these approaches incur a substantial processing overhead compared to regular search-result highlighting. The first approach would incur a mighty CPU overhead; the second would probably eat up less CPU but munch at least twice the RAM for the results. Any suggestions?
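To make the second approach concrete, here is a minimal sketch in Python (used only for illustration, not necessarily the language of the actual application). It assumes that "normalized" means lowercased with diacritics stripped via Unicode NFD decomposition, which only approximates what the MySQL collation does. The names `fold_with_map` and `highlight` and the `<mark>` tags are hypothetical.

```python
import unicodedata


def fold_with_map(text):
    """Return (folded, index_map); index_map[i] is the index in `text`
    of the character that produced folded[i]."""
    folded = []
    index_map = []
    for i, ch in enumerate(text):
        for part in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(part):
                continue  # drop combining marks (macron, dot below, ...)
            folded.append(part.lower())
            index_map.append(i)
    return "".join(folded), index_map


def highlight(text, query, open_tag="<mark>", close_tag="</mark>"):
    """Wrap every diacritic-insensitive occurrence of `query` in `text`."""
    folded, index_map = fold_with_map(text)
    needle, _ = fold_with_map(query)
    if not needle:
        return text
    out = []
    cursor = 0
    pos = folded.find(needle)
    while pos != -1:
        # Translate positions in the folded copy back to the original text.
        start = index_map[pos]
        end = index_map[pos + len(needle) - 1] + 1
        # If the original uses decomposed characters, keep trailing marks.
        while end < len(text) and unicodedata.combining(text[end]):
            end += 1
        out.append(text[cursor:start])
        out.append(open_tag + text[start:end] + close_tag)
        cursor = end
        pos = folded.find(needle, pos + len(needle))
    out.append(text[cursor:])
    return "".join(out)


print(highlight("saka, sakā, śāka, ṣaka", "śāka"))
```

The folded copy and the index map exist only while one result is being highlighted, so the extra memory is per snippet rather than for the whole result set. This sketch matches substrings, not whole words, so it would also highlight "saka" inside a longer word.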
P.S. In case it’s relevant: the specific character set I’m dealing with (IAST for Sanskrit transliteration, with extensions) has three variants of L and N; two variants of M, R and S; and one variant of A, D, E, H, I, T and U. In total that is A-Z plus 19 diacritic variants, plus uppercase (which poses no problem here).
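With a character set this small, the first approach can also be sketched as one character class per query letter. The following is a rough Python illustration under stated assumptions: `DIACRITIC_LETTERS` is only a guess at the 19 variants (chosen to fit the counts above) and should be replaced with the real set, the text is assumed to use precomposed (NFC) characters, and the names `build_classes` and `query_pattern` are hypothetical.

```python
import re
import unicodedata

# ASSUMPTION: an illustrative guess at the 19 diacritic letters in use;
# replace with the actual character set.
DIACRITIC_LETTERS = "āīūēḍṭḥśṣṃṁṛṝḷḹḻṅñṇ"


def base_of(ch):
    """Base letter of `ch`: first code point of its NFD form, lowercased."""
    return unicodedata.normalize("NFD", ch)[0].lower()


def build_classes(letters):
    """Group the diacritic letters by base letter, e.g. 's' -> {'ś', 'ṣ'}."""
    classes = {}
    for ch in unicodedata.normalize("NFC", letters):
        classes.setdefault(base_of(ch), set()).add(ch)
    return classes


def query_pattern(query, letters=DIACRITIC_LETTERS):
    """Compile a whole-word pattern matching every variant spelling of `query`."""
    classes = build_classes(letters)
    parts = []
    for ch in unicodedata.normalize("NFC", query):
        base = base_of(ch)
        variants = classes.get(base)
        if variants:
            # One character class per letter: the base plus its variants.
            parts.append("[" + base + "".join(sorted(variants)) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


pattern = query_pattern("śāka")
text = unicodedata.normalize("NFC", "saka, sakā, śāka, ṣaka, sakala")
print(pattern.sub(lambda m: "<mark>" + m.group(0) + "</mark>", text))
```

The pattern grows linearly with the query length (one class per letter), so it stays small for single-word queries. Whether it is actually cheaper than the normalized-copy approach would need measuring.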