
Observing progress in a distributed system

In a distributed system, we need to observe the progress of small applications running on distributed computers (runtime 5 to 20 minutes each).

There is a web frontend, which currently only shows a list of these small applications (called jobs) together with the state of each one, such as preparing, running, or finished.

In the web UI, an administrator can see:

  • name
  • state
  • starting time
  • call parameters

from any computer on the network, potentially for all jobs across the whole system.

Each of these properties is stored in the database, so every state change results in a database write. There may be thousands of these jobs at a time.

Description of the distributed system:

Central components, served at one location only:

  • Database server (holds the job results used for statistics, such as an overview of the jobs run over the last three months)

  • Application server (GlassFish, Java; runs the central server software)

Distributed components, connected via the internet / WAN; each site has at least one of each (probably about 20 sites, each with 1 to 4 job controllers, each job controller running about 20 jobs in parallel):

  • Job controller component (Windows, C#, WCF; starts and observes the small jobs)
  • Small applications running tasks, started by the job controller (the jobs)

So, as a rough estimate:

20 sites * 4 Job-controllers * 20 jobs = 1600 jobs in parallel

Each job runs from 0 to 100 percent in about 5 minutes on average. Reporting every 1% step (300 s / 100 steps) means one progress update every 3 seconds per job.

That gives 1600 / 3 ≈ 533 progress updates per second (over the internet).
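The same estimate written as a single formula makes it easier to see which factor drives the load. The symbols are introduced here for illustration only: S is the number of progress steps each job reports (100 for 1% steps) and T is the average job runtime in seconds.

```latex
R = \frac{N_{\mathrm{sites}} \cdot N_{\mathrm{ctrl}} \cdot N_{\mathrm{jobs}} \cdot S}{T}
  = \frac{20 \cdot 4 \cdot 20 \cdot 100}{300\,\mathrm{s}}
  \approx 533\ \text{updates/s}
```

The rate grows linearly with S, so reporting fewer steps, or packing several jobs' progress into one message, is what actually reduces the number of messages.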

Now the customer wants to see something like a progress bar for each of these jobs.

My first concern was that this would generate heavy network traffic and a large load on the database server.

I do not think writing every progress step (1%, 2%, 3%, ...) to the database is a good idea. The runtime of these jobs is hard to estimate accurately in advance, but each job knows its own progress very well.
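One way to avoid per-percent database writes on the central (Java) server is sketched below, assuming progress values only need to be current and are not needed historically: keep the latest percentage per job in memory for the web UI, and write to the database only when a job's state changes. `ProgressRegistry`, `JobState`, and the `stateWriter` callback are hypothetical names; the callback stands in for whatever persistence code the application server already has.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

public class ProgressRegistry {
    // Hypothetical job states, mirroring "preparing, running, finished".
    public enum JobState { PREPARING, RUNNING, FINISHED }

    // Snapshot of one job as the web UI would display it.
    public record JobStatus(JobState state, int percent) {}

    // Latest known status per job id; held in memory only, never persisted here.
    private final Map<String, JobStatus> latest = new ConcurrentHashMap<>();

    // Hypothetical hook for the existing database write; called on state changes only.
    private final BiConsumer<String, JobState> stateWriter;

    public ProgressRegistry(BiConsumer<String, JobState> stateWriter) {
        this.stateWriter = stateWriter;
    }

    /** Records a progress report; returns true if it triggered a database write. */
    public boolean report(String jobId, JobState state, int percent) {
        int clamped = Math.max(0, Math.min(100, percent)); // keep the bar within 0..100
        JobStatus previous = latest.put(jobId, new JobStatus(state, clamped));
        boolean stateChanged = previous == null || previous.state() != state;
        if (stateChanged) {
            stateWriter.accept(jobId, state); // persist state transitions, not percentages
        }
        return stateChanged;
    }

    /** Latest status for the web UI, or null if the job has never reported. */
    public JobStatus get(String jobId) {
        return latest.get(jobId);
    }
}
```

With this split, the database sees only a handful of writes per job (one per state transition), while the web UI polls the in-memory map for progress bars. The trade-off is that progress is lost if the application server restarts, which seems acceptable for a purely visual indicator.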

What would be a good architectural approach for observing the progress of possibly thousands of these mini-jobs?

My current idea is that each job controller sends one combined update with the progress of all the jobs it controls every 10 seconds. Would that be an acceptable approach?
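A minimal sketch of such a batching job controller follows. It is written in Java (the central server's language) although the real job controllers are C#/WCF; the same structure maps onto a C# timer. `ProgressBatcher` and the `sender` callback are hypothetical, and the callback stands in for the actual transport to the central server. One deliberate variation: instead of sending every job each time, it coalesces reports and sends only the jobs whose progress changed since the last flush; sending all jobs would just mean copying the whole map instead of draining it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ProgressBatcher {
    // Latest percentage per job; a newer report overwrites an older unsent one.
    private final Map<String, Integer> pending = new ConcurrentHashMap<>();
    // Hypothetical transport: delivers one batch (jobId -> percent) to the central server.
    private final Consumer<Map<String, Integer>> sender;
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public ProgressBatcher(Consumer<Map<String, Integer>> sender) {
        this.sender = sender;
    }

    /** Called whenever a job reports progress; cheap, no network I/O. */
    public void onProgress(String jobId, int percent) {
        pending.put(jobId, percent);
    }

    /** Drains everything collected since the last flush and sends it as one message. */
    public void flush() {
        Map<String, Integer> batch = new HashMap<>();
        for (String jobId : pending.keySet()) {
            Integer percent = pending.remove(jobId); // returns the newest value, or null
            if (percent != null) {
                batch.put(jobId, percent);
            }
        }
        if (!batch.isEmpty()) {
            sender.accept(batch);
        }
    }

    /** Flushes every intervalSeconds (e.g. 10). */
    public void start(long intervalSeconds) {
        timer.scheduleAtFixedRate(() -> {
            try {
                flush();
            } catch (RuntimeException e) {
                // Keep the schedule alive: an uncaught exception would cancel future runs.
                // The failed batch is dropped; jobs that keep reporting appear in later batches.
                System.err.println("progress flush failed: " + e);
            }
        }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    /** Stops the timer and sends whatever is still pending. */
    public void stop() {
        timer.shutdown();
        flush();
    }
}
```

With the numbers from the estimate (20 sites × 4 controllers = 80 controllers) and a 10-second interval, the central server would receive about 80 / 10 = 8 messages per second, each carrying up to about 20 progress values, instead of about 533 individual updates.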