8

I am having 104k string values , out of which 89k are unique. I would like to check a string is present in this list or not.

This is my class and its method which hold all these records.

public class TestClass {
    private static TestClass singletonObj = null;
    private List<String> stringList= null;

    public static synchronized TestClass getInstance() {
        if(singletonObj == null) {
            singletonObj = new TestClass();
        }
        return singletonObj;
    }


    public boolean isValidString(String token) {
        if(stringList == null) {
            init();
        }
        if(stringList != null && token != null && !token.isEmpty())
            return stringList.contains(token.toLowerCase());
        return false;
    }

    private init() {
     stringList = new ArrayList<String>();
     // put all 104k values in this data structure.
    }
}

My application tries to use this isValidString() method concurrently with around 20 requests per second. This works fine but when i tried to change the Data structure to HashSet, the CPU usage goes very high. As per my understanding Hashset should perform better[o(1)] than ArrayList[o(n)]. Can anyone explain me why is this happening ?

19
  • Shouldn't your init() method be in synchronized block? Commented Aug 6, 2015 at 8:04
  • 2
    Agree with @Codebender If your call is singleton then you can create it at the time of getInstance(), I believe it will always have only one list or set. Commented Aug 6, 2015 at 8:12
  • 1
    @Sthita - Your isValidString() should have a synchronized block which ensures init(); is called only once. Also you need to check for null twice. Commented Aug 6, 2015 at 8:14
  • 1
    @ankitkatiyar91, if you make inValidString() synchronized it defeats the whole purpose of multithreading, you should only make the if(stringList==null) condition synchronized. Commented Aug 6, 2015 at 8:20
  • 3
    What's the point of having a lazily instantiated singleton that has a lazily loaded stringList anyway? You've got to load the collection before it makes sense to query it, so you're not going to get any parallelism there. So why not initialize in the constructor? Commented Aug 6, 2015 at 8:38

3 Answers 3

2

I created a simple class to spawn 20 threads that access your dictionary checker every second as per the bottom of this post.

I cannot replicate your results - but this might be because of the input data that I have access to. I have used your TestClass implementation importing ~130,000 words from the English Open Word List (EOWL). No ongoing high CPU use is seen with either ArrayList or HashSet as the type of stringList.

My guess is that your problem is due to your input data. I tried adding my input dictionary twice to create duplication - obviously with the ArrayList this just makes the list twice as long, but with the HashSet, it means the code is throwing out duplicates. You note that about 1/5 of your input data is duplicate. With 1/2 duplicates in my tests, I do see a slight increased CPU for about 2 seconds and then it drops back to almost nothing once stringList is initialised.

This "blip" could be more prolonged if your input strings are more complex than the single words I'm using. So maybe that is your problem. Alternatively - maybe you have some other code that wraps this part that is hogging CPU.

N.B. I would caution you as others have in comments about your implementation of init. In my experiments, I saw that many threads could call the dictionary check before the dictionary had fully initialised, giving inconsistent results for the same test word. Why not call it from your constructor, if it is to be a singleton object?

Code listings

Your TestClass with some input data code:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Scanner;

public class TestClass {
    private static TestClass singletonObj = null;
    //private List<String> stringList = null;

    private HashSet<String> stringList = null;

    public static synchronized TestClass getInstance() {
        if (singletonObj == null) {
            singletonObj = new TestClass();
        }
        return singletonObj;
    }

    public boolean isValidString(String token) {
        if (stringList == null) {
            init();
        }
        if (stringList != null && token != null && !token.isEmpty())
            return stringList.contains(token.toLowerCase());
        return false;
    }

    private void init() {
        String dictDir = "C:\\Users\\Richard\\Documents\\EOWL_CSVs";
        File[] csvs = (new File(dictDir)).listFiles();
        stringList = new HashSet<String>();
        Scanner inFile = null;

        for (File f : csvs) {
            try {
                inFile = new Scanner(new FileReader(f));
            } catch (FileNotFoundException e) {
                e.printStackTrace();
                System.exit(1);
            }

            while (inFile.hasNext()) {
                stringList.add(inFile.next().toLowerCase()
                        .replaceAll("[^a-zA-Z ]", ""));
            }
            inFile.close();
        }

        System.out.println("Dictionary initialised with " + stringList.size()
                + " members");
    }
}

Thread to access it:

import java.io.FileNotFoundException;

public class DictChecker extends Thread {

    TestClass t = null;
    public static int classId = 0;
    String className = null;
    
    
    public void doWork()
    {
        String testString = "Baby";
        if (t.isValidString(testString))
        {
            System.out.println("Got a valid string " + testString + " in class " + className);
        }
        else
        {
            System.out.println(testString + " not in the dictionary");
        }
    }
    
    public void run()
    {
        while (true)
        {
            try {
                DictChecker.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            doWork();
        }
    }
    
    public DictChecker()
    {
        t = TestClass.getInstance();
        className = "dChecker" + classId;
        classId += 1;
        System.out.println("Initialised " + className + " in thread " + this.getName());
    }
    
    public static void main(String[] args) throws FileNotFoundException
    {
        for (int i = 0; i < 20; i++)
        {
             (new DictChecker()).start();
             try {
                DictChecker.sleep(50);//simply to distribute load over the second
            } catch (InterruptedException e) {
                e.printStackTrace();
            } 
        }
    }
}
Sign up to request clarification or add additional context in comments.

2 Comments

@J Richard Snape yes, i am initializing it from constructor and it is currently HashSet with initial size, Looks good till now.
@sthita glad you have got it working the way you want :)
1

My guess is that HashSet, being a hash-based structure, computes the hashCode of every String since the very instant of inserting it into the HashSet, i.e. in the method init. This may be the period along which the CPU goes high, and it's part of the price we pay for getting a better throughput when iterating the values of the structure.

If I'm right, after the method init ends, the CPU should drop off, and the speed of the program should increase hugely, and that's the benefit of using HashSet.

By the way: A sure way of optimization is pre-sizing the structure:

  • ArrayList should have an initial size equal to the maximum number of elements that will contain.
  • And HashSet an initial size 1.7 greater than the maximum.

BTW: The standard hash algorithm of String.hash computes all of the characters of the string. Maybe you could be content with computing just the first 100 characters, for example (depending on the nature of the data you are processing, of corse). Then, you could encapsulate your Strings into your own class, overriding the hashCode method with your own hashing algorithm, and overriding the equals method to perform a strict comparation.

4 Comments

No the CPU usage was not dropping at all and after some time the application stops responding if i am using HashSet. I am using ArrayList and given the initial size.
And are you using HashSet with an initial size? If not, that could be the cause: A small hash structure that gets overfull, turns unefficient because it becomes a set of long lists.
@Sthita - Check if GC is kicking in as well
@LittleSanti Setting HashSet with initial size looks good and working out for me i think. I will keep monitoring my application today and see the impact. Thanks
0

JDK HashSet is built on top of a HashMap<T, Object>, where value is a singleton ‘present’ object. It means that the memory consumption of a HashSet is identical to HashMap: in order to store SIZE values, you need 32 * SIZE + 4 * CAPACITY bytes (plus size of your values).

For ArrayList, it's the capacity of the java.util.ArrayList multiplied by the reference size (4 bytes on 32bit, 8bytes on 64bit) + [Object header + one int and one references].

So HashSet is definitely not a memory-friendly collection.

Depends if you're using a 32-bit or a 64-bit VM. That said, HashSet gets hurt worse by 8-byte references than ArrayList does -- adding an extra 4 bytes per reference, based on the linked memory consumption chart, brings ArrayList up to ~12 bytes per element, and HashSet up to ~52 bytes per element.)

ArrayList is implemented using an array of Objects. Figure below shows the memory usage and layout of an ArrayList on a 32-bit Java runtime:

Memory usage and layout of a ArrayList on a 32-bit Java runtime enter image description here

The figure above shows that when an ArrayList is created, the result is an ArrayList object using 32 bytes of memory, along with an Object array at a default size of 10, totaling 88 bytes of memory for an empty ArrayList. This means that the ArrayList is not accurately sized and therefore has a default capacity, which happens to be 10 entries.

Attributes of an ArrayList

Default capacity - 10

Empty size - 88 bytes

Overhead - 48 bytes plus 4 bytes per entry

Overhead for 10K collection - ~40K

Search/insert/delete performance- O(n) — Time taken is linearly dependent to the number of elements

A HashSet has fewer capabilities than a HashMap in that it cannot contain more than one null entry and cannot have duplicate entries. The implementation is a wrapper around a HashMap, with the HashSet object managing what is allowed to be put into the HashMap object. The additional function of restricting the capabilities of a HashMap means that HashSets have a slightly higher memory overhead.

Memory usage and layout of a HashSet on a 32-bit Java runtime enter image description here

The figure above shows the shallow heap (memory usage of the individual object) in bytes, along with the retained heap (memory usage of the individual object and its child objects) in bytes for a java.util.HashSet object. The shallow heap size is 16 bytes, and the retained heap size is 144 bytes. When a HashSet is created, its default capacity — the number of entries that can be put into the set — is 16 entries. When a HashSet is created at the default capacity and no entries are put into the set, it occupies 144 bytes. This is an extra 16 bytes over the memory usage of a HashMap.

Table below shows the attributes of a HashSet:

Attributes of a HashSet

Default capacity -16 entries

Empty size -144 bytes

Overhead -16 bytes plus HashMap overhead

Overhead for a 10K collection -16 bytes plus HashMap overhead

Search/insert/delete performance - O(1) — Time taken is constant time, regardless of the number of elements (assuming no hash collisions)

4 Comments

The reference shows the performance is better for iteration, here we just want to compare/ check for availability of the token in the list/set.
"If you don't care about ... the performance of contains" but contains is the method we're calling most often here, right?
@Sthita - See the capacity in reference 2 and the added capacity for arrayList above.
OP is talking about CPU usage, not memory. They obviously do care about performance of contains.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.