
I'm trying to run a script in a function and then call it:

filedetails ()
{
   # read TOTAL_DU < "/tmp/sizes.out"
    disksize=$(du -s "$1" | awk '{print $1}')
    let TOTAL_DU+=$disksize
    echo "$TOTAL_DU"
   # echo "$TOTAL_DU" > "/tmp/sizes.out"
}

I'm using the variable TOTAL_DU as a counter to keep a running total of the du output for all the files.

I'm running it using parallel or xargs:

find . -type f | parallel -j 8 filedetails

But the variable TOTAL_DU resets on every invocation and the count is not maintained, which is as expected since a new shell is used each time. I've also tried using a file to export and then read the counter, but because of parallel some invocations complete faster than others, so the writes are not sequential (as expected) and this is no good. The question is: is there a way to keep the count whilst using parallel or xargs?

  • This should output a handful of lines of total counts for the subset that invocation dealt with, no? In which case you just need to collect and sum those output lines, no? Commented Mar 13, 2015 at 21:38
  • So it just gives one output for every invocation; I just want to maintain the count, but I want to utilise the multicore feature of xargs/parallel. Commented Mar 13, 2015 at 21:40
  • How on Earth can parallel execute a Bash function? Functions are not commands; you cannot just pass them to xargs or parallel. Commented Mar 13, 2015 at 21:53
  • So use the parallelism then combine the output from parallel/xargs again at the end with another step. while IFS= read -r count; do let sum+=$count; done < <(find | parallel); echo "$sum" or something roughly like that. Commented Mar 13, 2015 at 21:58
  • 1
    @firegurafiku: You can't execute bash functions with xargs, but you can with parallel provided that you export -f them. Surprising but true. Commented Mar 13, 2015 at 22:17

1 Answer


Aside from learning purposes, this is not likely to be a good use of parallel, because:

  1. Calling du like that will quite possibly be slower than just invoking du in the normal way (a single-invocation baseline is sketched after this list). First, the information about file sizes can be extracted from the directory, so an entire directory can be computed in a single access. Effectively, directories are stored as a special kind of file object, whose data is a vector of directory entries ("dirents"), which contain the name and metadata for each file. What you are doing is using find to print these dirents, then getting du to parse each one (every file, not every directory); almost all of this second scan is redundant work.

  2. Insisting that du examine every file prevents it from avoiding double-counting multiple hard links to the same file. So you can easily end up inflating the disk usage this way. On the other hand, directories also take up disk space, and normally du will include this space in its reports. But you're never calling it on any directory, so you will end up understating the total disk usage.

  3. You're invoking a shell and an instance of du for every file. Normally, you would only create a single process for a single du. Process creation is a lot slower than reading a file size from a directory. At a minimum, you should use parallel -X and rewrite your shell function to invoke du on all the arguments, rather than just $1.

  4. There is no way to share environment variables between sibling shells. So you would have to accumulate the results in a persistent store, such as a temporary file or database table. That's also an expensive operation, but if you adopted the above suggestion, you would only need to do it once for each invocation of du, rather than for every file.
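
For comparison, the baseline alluded to in point 1 is just a single du invocation over the tree: one process walks the directories, includes the space the directories themselves occupy, and counts each hard-linked file only once. A minimal sketch (the --apparent-size variant is only needed if file lengths rather than allocated blocks are wanted):

# One du process for the whole tree: each directory is read once, directory
# blocks are counted, and hard links are only counted once.
du -sh .

# Apparent (file-length) sizes instead of allocated blocks, if that is the
# figure actually wanted.
du -sh --apparent-size .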

So, ignoring the first two issues, and just looking at the last two, solely for didactic purposes, you could do something like the following:

# Create a temporary file to store results
tmpfile=$(mktemp)
# Function which invokes du and safely appends its summary line
# to the temporary file
collectsizes() {
  # Get the name of the temporary file, and remove it from the args
  tmpfile=$1
  shift
  # Call du on all the parameters, and get the last (grand total) line
  size=$(du -c -s "$@" | tail -n1)
  # lock the temporary file and append the dataline under lock
  flock "$tmpfile" bash -c 'printf "%s\n" "$1" >> "$2"' _ "$size" "$tmpfile"
}
export -f collectsizes

# Find all regular files, and feed them to parallel taking care
# to avoid problems if files have whitespace in their names
find . -type f -print0 | parallel -0 -X -j8 collectsizes "$tmpfile"
# When all that's done, sum up the values in the temporary file
awk '{s+=$1}END{print s}' "$tmpfile"
# And delete it.
rm "$tmpfile"
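
If GNU parallel is not available, the parallel line above could be swapped for a rough xargs equivalent; this is a sketch assuming GNU xargs (for -0, -P and -n), and the rest of the script stays the same:

# xargs cannot call a shell function by name, so wrap the call in bash -c;
# the exported collectsizes function is visible inside that child shell.
# xargs appends each batch of file names after "$tmpfile", so inside bash -c
# "$@" expands to the temporary file followed by that batch of files.
find . -type f -print0 |
    xargs -0 -P 8 -n 64 bash -c 'collectsizes "$@"' _ "$tmpfile"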

5 Comments

This looks neat. I'll see if I can test this tomorrow and make some comments, but my example above was just a snippet of what I'm trying to achieve: the step involving a du on each file is just one of them; identifying the sparsity and calculating totals is the other. Performance is key. I've got this to work by just using a do loop, but it takes too long, and I have millions of files to get through. I'll give your suggestion above a try...
@AShah: Honestly, if you can get it to work by just letting du run through the directories itself (even if you have to do that twice, once with --apparent-size), you'll be better off. You can coordinate the data using an awk script, for example. Also, take a good look at stat to see if a custom format (maybe including %i, %b and %s) can help.
I only wanted to pipe my original loop through xargs or parallel to increase performance (hence the reason for creating the function), but it seems like this is more difficult than I'd thought... My impression is that xargs and parallel are only employed for the more basic processing loops... happy to be corrected...
@AShah: Whenever you are thinking about parallelizing, you have to think about the appropriate unit of execution. If you make it too small, you end up slowing things down because of setup and synchronization overhead. If you make it too big, you might not get as much parallelization as you wanted. bash imposes a lot of setup overhead, so you need to make the units bigger. The other side of the equation is collecting and aggregating the results, which parallel does not help you with; that's the second half of the map-reduce paradigm.
... in this case, you could improve (and simplify) parallelization by storing the data in an appropriate directory structure. If you have millions of files organized into, for example, a two-level structure with 1000 files in each second level directory (and no files in the first-level directories), then you can parallelize in units of first-level directories instead of file, which will make the du calls a lot more effective. Also, the name of the first-level directory gives you a convenient label for the results from that directory, avoiding locking.
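
To illustrate that last suggestion, a minimal sketch, assuming a two-level layout with all the files inside the second-level directories (the layout itself is hypothetical):

# One du per first-level directory; parallel serialises each job's output,
# so no locking is needed, and the directory name in column 2 labels each
# per-directory total. The awk at the end sums those totals.
find . -mindepth 1 -maxdepth 1 -type d -print0 |
    parallel -0 -j 8 du -s -- |
    awk '{sum += $1} END {print sum}'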
