How can I extract single-line and multiline regex matches from an unpredictably formatted file and write each one on its own line in an output file?

I have a very large file that looks like this:




That is, its formatting is unpredictable. I need to extract each <b>...</b> item (it may span multiple lines) and put each one on its own line. At the same time, I need to replace every run of newlines and spaces with a single space.

Desired output:

<b>data4 data5 data6</b>

All I’ve found is a two-step approach:

gawk '{if ($0 != "") { printf "%s", gensub(/\s+/, " ", "g", gensub(/\s+$/, "", "g", $0)) } }' path/to/input.txt > path/to/single-line.txt

and then

grep -Pzo '(?s)<b>.*?</b>' path/to/single-line.txt > path/to/output.txt

But I don’t like it. Converting a multi-gigabyte text file into a single line first does not seem reasonable. Is it possible to solve this problem in a single pass, “on the fly”?

Source: unix
