El blog de Juan Palómez

8 April 2011

Parallel grep on a single file

Filed under: Uncategorized | thisisoneball @ 12:24

I have a very large text file that I need to grep many times, and each search takes 4 minutes. grep only uses one core, so splitting the work across several cores should make it much faster.

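If you have GNU Parallel, this will do the trick (tested on Linux and Cygwin). The --pipe option cuts the input into blocks and feeds each block to its own grep, and -k keeps the output in the same order as the input: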

$ cat dump.sql | parallel -k --pipe grep -i pattern

Here is a different method using xargs, which is present in most Unices. The downside is that it needs temporary files (the pieces of the original file).

This command splits the big file into 6 smaller ones, since I want to use 6 cores in parallel. The -C switch sets the maximum size of each piece in bytes and ensures that no line is cut between one piece and the next: the split always falls at the end of a line. With no prefix given, split names the pieces xaa, xab, and so on:

$ split -C 320000000 dump.sql
$ ls -l xa*
 -rw-r--r--. 1 bd users 319999897 Apr 7 16:55 xaa
 -rw-r--r--. 1 bd users 319999939 Apr 7 16:55 xab
 -rw-r--r--. 1 bd users 319999801 Apr 7 16:55 xac
 -rw-r--r--. 1 bd users 319999988 Apr 7 16:55 xad
 -rw-r--r--. 1 bd users 319999595 Apr 7 16:55 xae
 -rw-r--r--. 1 bd users 308677345 Apr 7 16:55 xaf
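
Rather than picking the size by hand, the -C value can be derived from the file size and the number of pieces wanted. This is only a sketch: sample.txt and the variable names are made up for illustration, and the tiny generated file stands in for the real dump.

```shell
#!/bin/sh
# Hypothetical stand-in for the real dump: 1000 short lines in a temp dir.
workdir=$(mktemp -d)
file="$workdir/sample.txt"
awk 'BEGIN { for (i = 1; i <= 1000; i++) print i }' > "$file"

pieces=6                                    # one piece per core we want to use
size=$(wc -c < "$file")                     # total size in bytes
chunk=$(( (size + pieces - 1) / pieces ))   # size / pieces, rounded up

echo "size=$size chunk=$chunk"
# The real call would then be: split -C "$chunk" "$file"

rm -rf "$workdir"
```

Because -C never cuts a line, the pieces usually come out slightly under the limit, so an exact division can leave a small extra piece. Rounding the value up a little helps; the 320000000 used above is somewhat larger than one sixth of the file.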

And this command runs one grep process for each of the pieces, up to 6 at a time:

$ ls xa* | xargs -P 6 -n 1 grep -i 'pattern'
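
Since the grep processes run at the same time, their matches can reach the terminal in any order. If the order matters, one possible variation is to let each grep write to its own file and concatenate the results afterwards. The sketch below assumes GNU split and an xargs that supports -P; the sample data, the chunk. prefix and the .out suffix are invented for the example.

```shell
#!/bin/sh
# Work in a throwaway directory so the pieces do not clutter the current one.
workdir=$(mktemp -d)

# Hypothetical stand-in for dump.sql: 1000 lines, "row 1" .. "row 1000".
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "row " i }' > "$workdir/dump.sql"

# Split at line boundaries into pieces of at most 2000 bytes: chunk.aa, chunk.ab, ...
split -C 2000 "$workdir/dump.sql" "$workdir/chunk."

# One grep per piece, up to 6 at a time; each writes to its own <piece>.out file.
# "|| true" keeps a piece without matches (grep exit status 1) from making xargs report a failure.
ls "$workdir"/chunk.* | xargs -P 6 -n 1 sh -c 'grep -i "$1" "$2" > "$2.out" || true' sh 'row 99'

# The glob expands in sorted order, so the matches come back in file order.
matches=$(cat "$workdir"/chunk.*.out)
count=$(printf '%s\n' "$matches" | wc -l)
echo "$count matching lines"

# Remove the temporary pieces and results.
rm -rf "$workdir"
```

The price is a few more temporary files, but they all live in one directory that is removed at the end.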

For the differences between these kinds of programs, see the GNU Parallel man page.
