
I am looking for a way to execute my loop faster. With the current code the calculations take forever, so I am looking for a way to make my code more efficient.

EDIT: I don't think I explained it well. I need to create a program that generates all possible combinations of 8 characters, not forgetting to include uppercase letters, lowercase letters and digits, then hashes each of these combinations with MD5 and saves them to a file. But I have new questions: would this process really take 63 years? How large would the resulting file be? How would the script finish? I was about to buy a VPS server for this task, but if it takes 63 years it's better not even to try, haha..

I am new to coding and all help is appreciated.

import hashlib
from random import choice

longitud = 8
valores = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def enc(string):
    m = hashlib.md5()
    m.update(string.encode('utf-8'))
    return m.hexdigest()

def code():
    p = ""
    p = p.join([choice(valores) for i in xrange(longitud)])
    text = p
    return text

i = 1
for i in xrange(2000000000000000000):
    cod = code()
    md = enc(cod)
    print cod
    print md
    i += 1
    print i
    f=open('datos.txt','a')
    f.write("%s " % cod)
    f.write("%s" % md)
    f.write('\n')
    f.close()
  • You write to a file in a loop? Don't. Collect the results and write them at the end, after the calculations are done! And, just by the way, isn't that i = 1 assignment superfluous? Commented Jan 1, 2015 at 11:26
  • Do you have to open and save to the file at each iteration? Also, why not divide the range and run it with Celery or multiprocessing? Commented Jan 1, 2015 at 11:27
  • Hehe, that's hard for me at this time. Commented Jan 1, 2015 at 11:28
  • You have 2 quintillion iterations. If the inner operation took one CPU operation (which is an... optimistic estimate), on a 2 GHz CPU, i.e. 2 billion operations per second, it will take you 1 billion seconds = ~32 years. Making the inner portion of the loop more efficient is a doomed strategy. Commented Jan 1, 2015 at 11:32
  • If one loop cycle takes only 1 ns, you are finished in about 63 years. Congratulations. Commented Jan 1, 2015 at 11:33
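
For reference, a quick back-of-envelope sketch (not part of the original thread) of where the 63-year and roughly 8 PB figures quoted in this thread come from, written as a small Python snippet:

# back-of-envelope sketch of the estimates quoted above
iterations = 2000000000000000000          # the loop count used in the question (2e18)
seconds = iterations * 1e-9               # assuming 1 ns per iteration
print "years at 1 ns/iteration: %.1f" % (seconds / (3600 * 24 * 365))

combos = 62 ** 8                          # actual number of 8-character strings
bytes_per_line = 8 + 1 + 32 + 1           # "word hash\n" as written by the script
print "combinations: %d" % combos
print "file size: %.2f PB" % (combos * bytes_per_line / float(2 ** 50))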

4 Answers


You're not utilizing the full power of modern computers, which have multiple central processing units! This is by far the best optimization you can make here, since this task is CPU-bound. Note: for I/O-bound operations, multithreading (using the threading module) is suitable.

So let's see how Python makes it easy to do so using the multiprocessing module (read the comments):

import hashlib
# you're sampling a string so you need sample, not 'choice'
from random import sample
import multiprocessing
# use a thread to synchronize writing to file
import threading

# open up to 4 processes per cpu
processes_per_cpu = 4
processes = processes_per_cpu * multiprocessing.cpu_count()
print "will use %d processes" % processes
longitud = 8
valores = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
# check on smaller ranges to compare before trying your range... :-)
RANGE = 200000
def enc(string):
    m = hashlib.md5()
    m.update(string.encode('utf-8'))
    return m.hexdigest()

# we synchronize the results to be written using a queue shared by processes
q = multiprocessing.Manager().Queue()

# this is the single point where results are written to the file
# the file is opened ONCE (you open it on every iteration, that's bad)
def write_results():
    with open('datos.txt', 'w') as f:
        while True:
            msg = q.get()
            if msg == 'close':
                break
            else:
                f.write(msg)

# this is the function each process uses to calculate a single result
def calc_one(i):
    s = ''.join(sample(valores, longitud))
    md = enc(s)
    q.put("%s %s\n" % (s, md))

# we start a process pool of workers to spread work and not rely on
# a single cpu
pool = multiprocessing.Pool(processes=processes)

# this is the thread that will write the results coming from
# other processes using the queue, so its execution target is write_results
t = threading.Thread(target=write_results)
t.start()
# we use 'map_async' to not block ourselves, this is redundant here,
# but it's best practice to use this when you don't HAVE to block ('pool.map')
pool.map_async(calc_one, xrange(RANGE))
# wait for completion
pool.close()
pool.join()
# tell result-writing thread to stop
q.put('close')
t.join()

There are probably more optimizations to be made in this code, but for any CPU-bound task like the one you present, the major optimization is using multiprocessing.

Note: a trivial optimization of the file writes would be to aggregate several results from the queue and write them together (useful if you have so many CPUs that they exceed the single writing thread's speed).
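
A minimal sketch of what that batching might look like, assuming the same q and output file name as the code above (the BATCH value of 1000 is arbitrary):

# drop-in variant of write_results that buffers results and writes
# them in batches instead of one queue message at a time
BATCH = 1000

def write_results_batched():
    buf = []
    with open('datos.txt', 'w') as f:
        while True:
            msg = q.get()
            if msg == 'close':
                break
            buf.append(msg)
            if len(buf) >= BATCH:
                f.write(''.join(buf))
                buf = []
        # flush whatever is left when the 'close' message arrives
        if buf:
            f.write(''.join(buf))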

Note 2: since the OP was looking to go over combinations/permutations of characters, it should be noted that there is a module for doing just that, and it's called itertools.


7 Comments

Wow, I think this is the best answer. Now I have a new question: would this process take 63 years to complete? How long might it take? Would the file at the end occupy 8 PB of space?
I'm not sure I understand your question, but the amount of calculation you're trying to pull off will take too long to complete on a single machine. If this is something you need, the only way to scale it so that it completes is to use many machines with a framework like Hadoop.
On a single VPS server, how long would it take to run?
You can try your range divided by (365 * 24) and check to see if it completes in an hour. Where are you going to store that much data anyway?
I understand, I did not think the final file would be so large.. :( Thank you very much anyway.

Note that you should use

for cod in itertools.product(valores, repeat=longitud):

rather than picking the strings via random.sample, as this will visit each possible string exactly once.

Also note that for your given values this loop has 62^8 = 218340105584896 iterations, and the output file will occupy 9170284434565632 bytes, or about 8 PB.
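
A minimal self-contained sketch of that loop, assuming the same valores, longitud and output file name as in the question (itertools.product yields tuples of characters, so they have to be joined back into strings):

import hashlib
import itertools

longitud = 8
valores = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

with open('datos.txt', 'w') as f:
    for letras in itertools.product(valores, repeat=longitud):
        cod = ''.join(letras)  # tuple of characters -> string
        md = hashlib.md5(cod.encode('utf-8')).hexdigest()
        f.write("%s %s\n" % (cod, md))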

5 Comments

8 PB? How many TB is that exactly?
That's 8192TB (1024TB per PB).
It's 9170284434565632 / 1024 / 1024 / 1024 / 1024 TB (~8340). That's why recovering a password from a hash this way isn't feasible for the average home user. And if a salt was used to create the MD5 value, you've lost even if you have the space.
@jcrashvzla: In case you don't know what a salt is, check Wikipedia.
Thank you very much, but there's no salt involved, because it is a temporary token that I am trying to decipher.

Profile your program first (with the cProfile module: https://docs.python.org/2/library/profile.html and http://ymichael.com/2014/03/08/profiling-python-with-cprofile.html), but I'm willing to bet your program is I/O-bound (if your CPU usage never reaches 100% on one core, it means your hard drive is too slow to keep up with the execution speed of the rest of the program).
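
For example, assuming the script is saved as datos.py (a hypothetical name), cProfile can be run directly from the command line, sorting the report by cumulative time:

python -m cProfile -s cumulative datos.py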

With that in mind, start by changing your program so that:

  • It opens the file once before the loop and closes it after the loop (opening and closing files is super slow).
  • It only makes one write call in each iteration (each of those translates to a syscall, which is expensive), like so: f.write("%s %s\n" % (cod, md)). A sketch combining both changes follows below.
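
A minimal sketch of the restructured loop with both changes applied, assuming the code() and enc() helpers from the question (RANGE is a placeholder; try a small value first):

RANGE = 200000  # placeholder, start small

# open the file once, outside the loop
with open('datos.txt', 'a') as f:
    for i in xrange(RANGE):
        cod = code()
        md = enc(cod)
        # one write call per iteration instead of three
        f.write("%s %s\n" % (cod, md))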



Although it helps with debugging, I have found that printing makes a program run slower, so maybe don't print quite as much. Also, I'd move the f = open('datos.txt', 'a') out of the loop, since opening the same file over and over again can cause timing issues, and move the f.close() out of the loop as well, to the end of the program.

CHANGED

1 Comment

code() creates a random word. You can't move it out of the loop.
