How to extract single- or multi-line regex matches from an unpredictably formatted file and write each one on its own line of an output file?

I have a very large file that looks like this:

<a>text</a>text
blah


<b>data1</b><b>data2</b><b>data3</b>blahblah
    <c>text</c>
  <d>text</d>
<x>blahblah<b>data4
   data5

        data6</b><b>data7
</x>

That is, its formatting is unpredictable. I need to extract each <b>...</b> item (which may span multiple lines!) and put each one on its own line. At the same time, I need to collapse every run of newlines and spaces into a single space.

Desired output:

<b>data1</b>
<b>data2</b>
<b>data3</b>
<b>data4 data5 data6</b>

All I’ve found is a two-step approach:

gawk '{ if ($0 != "") { printf "%s", gensub(/\s+/, " ", "g", gensub(/\s+$/, "", "g", $0)) } }' path/to/input.txt > path/to/single-line.txt

and then

grep -Pzo '(?s)<b>.*?</b>' path/to/single-line.txt > path/to/output.txt

But I don’t like it! Having to convert a multi-GB text file into a single line first does not seem like a good idea. Is it possible to solve this problem in a single pass, “on the fly”?
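
For illustration, here is a minimal sketch of what a single-pass version might look like in plain awk. It reads the input line by line and keeps only the current unfinished <b> item in memory. The function name extract_b is hypothetical. The sketch assumes the tags appear literally as <b> and </b>, are never split across lines, and are not nested. An item that is never closed (like <b>data7 above) is silently dropped.

```shell
# Hypothetical helper: reads text on stdin, writes one <b>...</b> item per line.
extract_b() {
    awk '
    {
        line = $0
        while (1) {
            if (!inside) {
                # Not inside an item: look for the next opening tag.
                i = index(line, "<b>")
                if (i == 0) break
                line = substr(line, i + 3)
                item = ""
                inside = 1
            }
            # Inside an item: look for the closing tag.
            j = index(line, "</b>")
            if (j == 0) {
                # The item continues on the next line; the newline becomes a space.
                item = item line " "
                break
            }
            item = item substr(line, 1, j - 1)
            gsub(/[ \t\r]+/, " ", item)   # collapse each whitespace run
            print "<b>" item "</b>"
            line = substr(line, j + 4)
            inside = 0
        }
    }'
}
```

It could then be run as extract_b < path/to/input.txt > path/to/output.txt. Because matching uses index() on fixed strings rather than a regex, memory use should stay bounded by the longest line plus the longest single item, not by the size of the file.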


Source: unix
