Regular expression: finding two elements not surrounding another element in text

I need to find badly formatted HTML in some text. We let users add strong and em tags, but they don’t always close them correctly:

This is some <b>correct</b> formatting
This is some <b>incorrect<b> formatting

I would like to catch instances where the formatting is incorrect, i.e. where an opening tag is followed by another opening tag with no closing tag in between. I started using negative lookaheads but have not had much success so far:

  • <b> Get opening tag
  • (?! negative lookahead for
    • .*? anything, but not greedily
    • </b> the closing tag
    • .*? anything, but not greedily
  • ) closing the lookahead
  • <b> Another opening tag
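
For reference, here is a minimal sketch of the steps above assembled into a single pattern and run against the two sample lines, assuming Python's re module (the variable names are illustrative only). It suggests why the attempt finds nothing: the lookahead is checked right after the first <b>, and the second <b> must then match at that very same position, so only two directly adjacent opening tags can ever match.

```python
import re

# The attempt from the list above, assembled into one pattern.
ATTEMPT = re.compile(r"<b>(?!.*?</b>.*?)<b>")

good = "This is some <b>correct</b> formatting"
bad = "This is some <b>incorrect<b> formatting"

# The lookahead consumes no text, so the second <b> has to start
# immediately after the first one.
attempt_on_good = ATTEMPT.search(good)
attempt_on_bad = ATTEMPT.search(bad)

# Only directly adjacent opening tags (with no </b> later on) match.
attempt_on_adjacent = ATTEMPT.search("<b><b>")

print(attempt_on_good, attempt_on_bad, attempt_on_adjacent)
```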

Any idea how I could do that?

Addendum: I know about Tony the pony, but I feel it is not coming right now. The problem could be restated as “I want to find two occurrences of the word ‘zoinx’ with no occurrence of the word ‘palantir’ in between”, which is not HTML-related.
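
One possible way to express the generalized problem is to apply the negative lookahead before every character between the two words, rather than once after the first word. The sketch below assumes Python's re module. The function name unclosed_pairs and its parameters are hypothetical, chosen only for illustration, and this is not a general HTML validator.

```python
import re

def unclosed_pairs(text, word="zoinx", forbidden="palantir"):
    """Return spans that run from one `word` to the next `word`
    with no `forbidden` in between (hypothetical helper)."""
    pattern = re.compile(
        re.escape(word)
        # Consume one character at a time, but only where
        # `forbidden` does not start at the current position.
        + r"(?:(?!" + re.escape(forbidden) + r").)*?"
        + re.escape(word),
        re.DOTALL,  # let "." cross line breaks
    )
    return [m.group(0) for m in pattern.finditer(text)]

print(unclosed_pairs("zoinx a zoinx"))
print(unclosed_pairs("zoinx palantir zoinx"))
print(unclosed_pairs("This is some <b>incorrect<b> formatting", "<b>", "</b>"))
print(unclosed_pairs("This is some <b>correct</b> formatting", "<b>", "</b>"))
```

Note that finditer returns non-overlapping matches, so three consecutive occurrences of the word are reported as a single span covering the first two.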

Source: regex
