
I'm developing a service in Node.js that will create text files from images using a Node wrapper for the Tesseract OCR engine. I want it to be a constantly running service, started and restarted (on crash) by upstart.

I have the option of making the servers (the virtual machines on which this is going to run) multi-core machines with large RAM and disk space, or I have the option of creating 4 or 5 small VMs with one core each, 1 GB of RAM, and relatively little disk space.

With the first approach, I would have to fork various child processes to make use of all cores, which adds complexity to the code. On the other hand, I just have one VM to worry about. With the second approach, I don't have to worry about forking child processes, but I would have to create and configure multiple VMs.

Are there other pros and cons of each approach that I haven't thought of?

  • "Which approach is better and why?" sure sounds like you're asking for a bunch of opinions, which is strictly off-topic here. Can you reword that last sentence so that it asks for facts, not opinions, to make it clearly on-topic here? Commented Sep 2, 2016 at 17:35

1 Answer


I'd avoid partitioning into multiple VMs, since that means you'll likely end up wasting RAM and CPU: it's quite possible you'll find one VM using 100% of its resources while another sits idle. There's also non-trivial overhead involved in running 5 operating systems instead of one.

Why are you considering forking many processes? If you use the right library, this will be unnecessary.

Many of the tesseract libraries on npm are poorly written. They are ultra-simplistic bindings to the tesseract C++ code. In JS, you call the addon's recognize() function, which just calls tesseract's Recognize(), which does CPU-intensive work in a blocking fashion. This means you're doing the recognition on the main V8 thread, which we know is a no-no. I assume this is why you're considering forking processes, since each process would only be able to do a single blocking OCR operation at a time.

Instead, you want a library that does the OCR work on a separate thread. tesseract_native is an example. It is properly designed: it uses libuv to call into tesseract on a worker thread.

libuv maintains a worker thread pool, so you can have as many concurrent OCR operations as you have cores, all in a single process.


3 Comments

Just out of curiosity, are there a lot of popular/well-written npm libraries that use libuv to do processor intensive work in a non-blocking manner? If so, why is it that one has to fork processes (using the cluster module) when developing an express app? Could express also have been written to automatically handle requests in separate worker threads?
It's how many of the core modules are implemented. For example, crypto uses libuv to do expensive operations on a worker thread. There are many other npm modules that use libuv to do work (usually calling into a native C/C++ library) on a worker thread -- examples include bcrypt, serialport, edge (the .NET glue) and many others that provide bindings to native code.
However, calling into a native library on a worker thread is a different class of problem than implementing multi-threaded JavaScript code. V8 (node's JS engine) does not allow sharing memory between JS threads, so it is not possible for a pure JS library like express to automatically run route handler functions on separate threads. (node itself does not support spinning up new V8 threads -- there are modules that create new V8 threads, but you don't get access to require.) Thus, the way we distribute JS load is by spawning new processes.
