How to find emails in many large files

Ubuntu 14.04 LTS 64-bit, minimal install, updated.

2x 6 core Xeon,
12 GB ECC memory,
Storage RAID 10 = 4 TB,
File system = ext4,

Above server is dedicated to this project.

Desired result:
Use grep more efficiently, get fewer false positives and “cleaner” results, and export only the email addresses to a txt file.
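
To make the desired output concrete, here is a minimal sketch of a command that prints only the matched addresses instead of whole lines. It assumes GNU grep. The pattern in EMAIL_RE is a simplified approximation of an email address (not a full RFC 5322 matcher), and the helper name extract_emails and the path /mnt/share are hypothetical.

```shell
# Simplified email pattern (an assumption, not a complete RFC 5322 matcher).
EMAIL_RE='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

# extract_emails DIR: hypothetical helper that prints the unique addresses under DIR.
#   -r  recurse into directories      -h  do not prefix matches with file names
#   -o  print only the matched text   -I  skip files that look binary
#   -E  extended regular expressions
extract_emails() {
    LC_ALL=C grep -rhoIE "$EMAIL_RE" "$1" | sort -u
}

# Example usage (the path is hypothetical):
# extract_emails /mnt/share > emails.txt
```

The -o flag is what removes most of the “junk”: without it grep prints every line that contains a match, not just the address.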

I have many large files in all kinds of formats: .csv, .excel, .txt, .sql, etc.
Some files are compressed (zip, rar, gz, etc.); I will be attempting zgrep next.
The files reside on a Windows 2012 server. I have mounted the share on the Ubuntu box, and I need to extract all email addresses to a txt file.
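
For the gzip-compressed files only, one possible approach is sketched below: decompress to a pipe and run the same kind of match on the stream. It assumes GNU grep, find, and gzip. The pattern is a simplified approximation, the helper name extract_emails_gz is hypothetical, and zip and rar archives are not covered.

```shell
# Simplified email pattern (an assumption, not a complete RFC 5322 matcher).
EMAIL_RE='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

# extract_emails_gz DIR: hypothetical helper that prints the unique addresses
# found inside *.gz files under DIR. gzip -dc writes the decompressed data to
# stdout; grep -a treats that stream as text even if it contains binary bytes.
extract_emails_gz() {
    find "$1" -type f -name '*.gz' -exec gzip -dc {} + |
        LC_ALL=C grep -aoE "$EMAIL_RE" |
        sort -u
}

# Example usage (the path is hypothetical):
# extract_emails_gz /mnt/share >> emails.txt
```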

I have done tons of research and played with various regexes, but cannot get it working 100% as expected.


First attempt:

grep -Rs .*@.* . >> emails.txt

Second attempt: (after research)

grep -e '^.*@.*..*' -r -n -h >> emails.txt

Third attempt: (for better performance)

LANG=C grep -e '^.*@.*..*' -r -n -h >> emails.txt

Fourth attempt: (even “better” performance, but this depends on hardware)

cat * */* */*/* | parallel --pipe -N 250 --round-robin "grep -e '^.*@.*..*' -r -n -h >> emails.txt"

The issue:

With the first, second, and third attempts, I am still getting a ton of “junk” exported.
With the fourth attempt, cat still complains about folders. I tried running it with find . instead, but then I only get the files that contain the mail accounts in the output.
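
One way to avoid the cat errors about folders is to let find list regular files only and hand those file names to several grep processes. The sketch below uses xargs -P rather than GNU parallel, which is a swapped-in technique and not one of the attempts above. It assumes GNU find, xargs, and grep. The batch size of 50 and the 4 workers are arbitrary values, and the helper name parallel_extract is hypothetical.

```shell
# Simplified email pattern (an assumption, not a complete RFC 5322 matcher).
EMAIL_RE='[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'

# parallel_extract DIR: hypothetical helper.
# find -type f emits regular files only, so no directory ever reaches grep.
# -print0 / -0 keep file names with spaces or newlines intact.
# xargs -n 50 -P 4 runs up to 4 grep processes with 50 files each (arbitrary values).
# --line-buffered makes each grep write one match at a time, so output from
# concurrent processes is not interleaved in the middle of a line.
parallel_extract() {
    find "$1" -type f -print0 |
        LC_ALL=C xargs -0 -r -n 50 -P 4 grep -hoIE --line-buffered "$EMAIL_RE" |
        sort -u
}

# Example usage (the path is hypothetical):
# parallel_extract /mnt/share > emails.txt
```

Because the files sit on a mounted network share, the bottleneck may be the network rather than the CPU, so more workers will not necessarily help.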

Any and all assistance will be greatly appreciated.

Kind Regards

Source: regex
