I wrote a simple string pool for my compiler in C++ and want to hear your thoughts on its design.
String pool has 2 functions: intern and internLiteral. The only difference is that after interning literal, its memory is used. So no allocation happens. If memory for the same string was already allocated, it will be deallocated.
Hence, I have 2 destruction behaviours differentiated by StringType enum.
I also want my InternedStrings to have interface of const std::string_view (basically because of size method).
EDIT:
- Added concept for char allocator
- Strings are always null-terminated
- Added tests (all passed)
- Fixed compilation
EDIT:
- Removed
shared_ptrandweak_ptr. Use raw pointer - Now, after interning literals, previously allocated memory for the same non-literal string will be deallocated
- Simplified logic
- Updated test
FAQ:
- How to use
InternedString?
InternedString is only meant to do 2 things:
- O(1) equality check (as it is a pointer to
const std::string_view) - read-only access to string
StringPool shouldn't be destroyed, while InternedString exists.
- Is
StringPoolworth it?
Well, it depends.
My use case is a compiler, where I have a lot of repeated strings for keywords, common functions names, etc.
Let's take while keyword for example.
Null-terminated "while" string contains 6 bytes.
StringPool uses additional sizeof(std::string_view) bytes for it.
After 4 while keywords we start gaining profit.
And comparing pointers is extremely fast
- Why care about literals?
I want to "preallocate" keywords, I already use in my program as literals. But it's a very small optimization, and I may delete it to simplify pool.
#pragma once
#include <string_view>
#include <unordered_map>
/**
* @file StringPool.hpp
* @brief Manage pool of strings.
* Pros:
* - No allocation for repeated strings
* - No allocation for literals
* - Strings have stable addresses -- fast to check for equality
*/
namespace ppl::common
{
/// Reference to string, interned in the pool
using InternedString = const std::string_view *;
namespace concepts
{
/// Concept for char allocator
template<typename T>
concept CharAllocator = requires(T t, char *str, size_t size)
{
{ t.allocate(size) } -> std::same_as<char *>;
{ t.deallocate(str, size) };
};
}
/// Pool of strings
template<concepts::CharAllocator Allocator = std::allocator<char>>
class StringPool
{
public:
/// Allocator, used by pool
Allocator allocator;
/// Perfect forward arguments to allocator
template<typename ...Args>
requires std::is_constructible_v<Allocator, Args...>
StringPool(Args &&... args) noexcept
: allocator(std::forward<Args>(args)...) {}
/// Forbid copy
StringPool(const StringPool &) = delete;
/// Forbid copy
StringPool &operator=(const StringPool &) = delete;
/// Allow move
StringPool(StringPool &&) = default;
/// Allow move
StringPool &operator=(StringPool &&) = default;
/// @brief Intern a string literal on the pool
/// @note Never allocates memory for string
/// @note Deallocates memory for previously interned non-literal string
InternedString internLiteral(const char *literal)
{
std::string_view view{literal};
auto [it, _] = strings.emplace(view, StringType::literal);
if (it->second == StringType::allocated)
{
// We can safely change it to be a view over literal,
// because it's exactly the same string.
//
// ...But I'm not sure
auto &oldView = const_cast<std::string_view &>(it->first);
deallocateViewContent(oldView);
oldView = view;
it->second = StringType::literal;
}
return &it->first;
}
/// @brief Intern a string on the pool
/// @note
/// If string is a literal,
/// use @c internLiteral() instead for memory efficiency
InternedString intern(std::string_view str)
{
if (auto it = strings.find(str); it != strings.end())
{
return &it->first;
}
auto *allocated = allocator.allocate(str.size() + 1);
if (!allocated) { return nullptr; }
std::uninitialized_copy(str.begin(), str.end(), allocated);
allocated[str.size()] = '\0';
return &strings.emplace(
std::string_view{allocated, str.size()},
StringType::allocated
).first->first;
}
/// Deallocate memory
~StringPool()
{
for (auto &[str, type] : strings)
{
if (type == StringType::allocated)
{
deallocateViewContent(str);
}
}
}
private:
/// Deallocate string that is viewed
void deallocateViewContent(std::string_view view)
{
allocator.deallocate(
const_cast<char *>(view.data()),
view.size() + 1
);
}
/// Type of string
enum class StringType
{
literal, // Doesn't need deallocation
allocated // Needs deallocation
};
/// Interned strings
std::unordered_map<
std::string_view,
StringType
> strings;
};
} // namespace ppl::common
Google tests:
#include <gtest/gtest.h>
#include "ppl/common/StringPool.hpp"
template<typename Allocator>
struct TrackAllocator
{
Allocator allocator;
char * allocate(size_t size)
{
++allocations;
return allocator.allocate(size);
}
void deallocate(char *str, size_t size)
{
++deallocations;
allocator.deallocate(str, size);
}
size_t allocations = 0;
size_t deallocations = 0;
};
/// Doesn't allocate memory
struct NullAllocator
{
char * allocate(size_t) { return nullptr; }
void deallocate([[maybe_unused]]char *ptr, size_t)
{
assert(!ptr);
}
};
using namespace ppl::common;
TEST(StringPool, internLiteralFirst)
{
auto pool =
StringPool<TrackAllocator<NullAllocator>> {
TrackAllocator<NullAllocator>{}
};
const char *literal = "Hello";
auto internedLiteral = pool.internLiteral(literal);
ASSERT_EQ(pool.allocator.allocations, 0);
ASSERT_EQ(pool.allocator.deallocations, 0);
ASSERT_EQ(internedLiteral->data(), literal);
auto internedLiteral1 = pool.internLiteral(literal);
ASSERT_EQ(pool.allocator.allocations, 0);
ASSERT_EQ(pool.allocator.deallocations, 0);
ASSERT_EQ(internedLiteral1, internedLiteral);
std::string str = literal;
auto internedStr = pool.intern(str);
ASSERT_EQ(pool.allocator.allocations, 0);
ASSERT_EQ(pool.allocator.deallocations, 0);
ASSERT_EQ(internedStr, internedLiteral);
}
TEST(StringPool, internFirst)
{
TrackAllocator<std::allocator<char>> trackAllocator;
{
StringPool<TrackAllocator<std::allocator<char>> &> pool(trackAllocator);
const char *literal = "Hello";
std::string str = literal;
auto internedStr = pool.intern(str);
ASSERT_EQ(pool.allocator.allocations, 1);
ASSERT_EQ(pool.allocator.deallocations, 0);
ASSERT_EQ(*internedStr, literal);
auto internedLiteral = pool.internLiteral(literal);
ASSERT_EQ(pool.allocator.allocations, 1);
// Note that internedStr was deallocated
ASSERT_EQ(pool.allocator.deallocations, 1);
ASSERT_EQ(internedLiteral, internedStr);
// Note that literal's memory is used instead
ASSERT_EQ(internedLiteral->data(), literal);
}
ASSERT_EQ(trackAllocator.deallocations, 1);
}
TEST(StringPool, allocationFailure)
{
TrackAllocator<NullAllocator> trackAllocator;
{
auto pool =
StringPool<TrackAllocator<NullAllocator> &> {
trackAllocator
};
std::string str = "hello";
auto internedStr = pool.intern(str);
ASSERT_FALSE(internedStr);
ASSERT_EQ(pool.allocator.allocations, 1);
ASSERT_EQ(pool.allocator.deallocations, 0);
}
ASSERT_EQ(trackAllocator.deallocations, 0);
}
P.S: don't know how to handle if InternedString outlives the pool. I guess, it's not an issue of string pool, but an incorrect usage.