I'm trying to implement an algorithm in Python that searches for multiple keys across ten huge files (16 million rows each). I've got a sorted file with 62 million keys, and I scan each of the ten files in the dataset to look up every key and its respective value.
This is follow-up code based on feedback from "Scanning multiple huge files in Python". All files are UTF-8 encoded and may contain text in multiple languages.
Here is a little slice of my sorted key file:
en Mahesh_Prasad_Varma
en Mahesh_Saheba
en maheshtala
en Maheshtala_College
en Mahesh_Thakur
en Maheshwara_Institute_Of_Technology
en Maheshwar_Hazari
....
en Just_to_Satisfy_You_(song) 1
en Just_to_See_Her 2
en Just_to_See_You_Smile 2
en Just_Tricking 1
en Just_Tricking! 1
en Just_Tryin%27_ta_Live 1
en Just_Until... 1
en Just_Us 1
en Justus 2
en Justus_(album) 2
....
en Zsófia_Polgár 1
Here is an example of some lines from one of my dataset files:
en Mahesh_Prasad_Varma 1
en maheshtala 1
en Maheshtala_College 1
en Maheshwara_Institute_Of_Technology 2
en Maheshwar_Hazari 1
Here is an example of the output file displaying a given key, maheshtala, which only appears once in the first file of the dataset:
...    
1,maheshtala,1,0,0,0,0,0,0,0,0,0,en
...
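To make the output layout explicit, here is a minimal sketch of how one line is assembled: the total, the title, the ten per-file counts, and the language code last. The helper name format_row is hypothetical and only illustrates the format; it is not taken from the script.

```python
def format_row(key, counts):
    """Build one output line from a 'lang title' key and its per-file counts."""
    lang, title = key.split(" ")
    # total first, then the title, then one count per dataset file, then the language
    fields = [str(sum(counts)), title] + [str(c) for c in counts] + [lang]
    return ",".join(fields)


# maheshtala appears once, and only in the first of the ten dataset files
row = format_row("en maheshtala", [1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

With these inputs, row equals the maheshtala line shown in the example above.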
The sorted_keys file contains only unique keys, obtained by running the Unix commands cat and sort -u over all ten dataset files. A key can't appear more than once in a given dataset file, but it can appear in more than one of the data files; when a key is missing from a specific file, I have to output zero for that file.
I've improved my new solution with the multiprocessing module, and I'm now able to process each slice in 3 min, so producing the final result will take about 87 min (27 + 3*20 = 87 min, against the 447 min obtained before).
But due to a memory issue I'm not able to save results into a plain res dictionary. I'm sure that different processes have different address spaces, so each of them writes to its own local copy of the dictionary. I'm forced to use a Manager to share data between processes, which gives worse performance. Could I use a Queue instead?
Here is the Bash script I use to create the sorted keys file:
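A minimal sketch of the queue-style alternative asked about here, assuming each worker can build an ordinary local dict for one file and hand it back to the parent in a single transfer, rather than sending every entry through the Manager proxy. The names count_file, merge_counts and collect are hypothetical, and Pool.map stands in for an explicit Queue. Whether the returned dicts fit in memory for 16-million-row files is not something this sketch establishes.

```python
from multiprocessing import Pool


def count_file(task):
    """Parse one dataset file into a plain local dict (runs in a worker)."""
    path, column = task
    local = {}
    with open(path) as current_file:
        for line in current_file:
            parts = line.split()
            if len(parts) < 3:
                continue  # skip blank or malformed lines
            # same key layout as res: "<lang> <title>-<column>"
            local[parts[0] + " " + parts[1] + "-" + str(column)] = int(parts[2])
    return local


def merge_counts(partials):
    """Combine the per-file dicts inside the parent process."""
    res = {}
    for partial in partials:
        res.update(partial)
    return res


def collect(tasks, processes=4):
    """tasks is a list of (path, column) pairs, one per dataset file."""
    pool = Pool(processes)
    try:
        # each worker returns its whole dict once; results are pickled back
        return merge_counts(pool.map(count_file, tasks))
    finally:
        pool.close()
        pool.join()
```

A call such as collect([("dataset/pagecounts-20140601", 0), ("dataset/pagecounts-20140602", 1)]) would be made from the main script, ideally under an if __name__ == "__main__": guard so that it also works on platforms that spawn rather than fork.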
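#!/bin/bash
clear
BASEPATH="/home/process"
# Assumes the script is run from $BASEPATH, so the relative processed/ paths match
mkdir -p processed/slice
# Keep only the "lang title" columns and deduplicate them into one sorted key file
cat "$BASEPATH"/dataset/* | cut -d' ' -f1,2 | sort -u -k2 > "$BASEPATH/processed/sorted_keys"
# Cut the key file into slices of 3,000,000 keys each
split -d -l 3000000 processed/sorted_keys processed/slice/slice-
for filename in processed/slice/*; do
    python processing.py "$filename"
done
rm "$BASEPATH/processed/sorted_keys"
rm -rf "$BASEPATH/processed/slice"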
For each slice I launch processing.py. Here is my working code, using Manager:
import sys, datetime, time
from multiprocessing import Process, Manager
files_1 = ["20140601","20140602","20140603"]
files_2 = ["20140604","20140605","20140606"]
files_3 = ["20140607","20140608"]
files_4 = ["20140609","20140610"]
def split_to_elements(line):
    return line.split(" ")
def print_to_file():
    with open('processed/20140601','a') as output:
        for k in keys:
            splitted = split_to_elements(k)
            sum_count = 0
            clicks = ""
            j=0
            while j < 10:
                click = res.get(k+"-"+str(j), 0)
                clicks += str(click) + ","
                sum_count += click
                j+=1
            to_print = str(sum_count) + "," + splitted[1] + "," + clicks + splitted[0]+ "\n"
            output.write(to_print)
def search_window(files,length):
    n=length
    for f in files:
        with open("dataset/pagecounts-"+f) as current_file:
            for line in current_file:
                splitted = split_to_elements(line)
                res[splitted[0]+" "+splitted[1]+"-"+str(n)] = int(splitted[2].strip("\n"))
        n+=1
with open(sys.argv[1]) as sorted_keys:
    manager = Manager()
    res = manager.dict()
    keys = []
    print "STARTING POPULATING KEYS AT TIME: " + datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
    for keyword in sorted_keys:
        keys.append(keyword.strip("\n"))
    print "ENDED POPULATION AT TIME: " + datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
    print "STARTING FILES ANALYSIS AT TIME: " + datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
    procs = []
    procs.append(Process(target=search_window, args=(files_1,0,)))
    procs.append(Process(target=search_window, args=(files_2,3,)))
    procs.append(Process(target=search_window, args=(files_3,6,)))
    procs.append(Process(target=search_window, args=(files_4,8,)))
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print "ENDED FILES ANALYSIS AT TIME: " + datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
    print "START PRINTING AT TIME: " + datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
    # write the output between the two print timestamps so the log brackets the actual work
    print_to_file()
    print "ENDED PRINT AT TIME: " + datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
