go-twig/BUFFER_OPTIMIZATION.md
semihalev 4d0e37e1a0 Implement Zero Allocation Plan Phase 1: Global String Cache Optimization
- Added global string cache for efficient string interning (5.2x faster)
- Implemented optimized tokenizer with object pooling
- Created comprehensive benchmarks and documentation
- Cleaned up old optimization files and experiments

Performance improvements:
- String interning: 5.2x faster (1,492 ns/op vs 7,746 ns/op)
- Zero allocations for common strings
- Same memory efficiency as original (36 B/op, 9 allocs/op)
2025-03-12 10:05:23 +03:00

6.7 KiB

Buffer Handling Optimization in Twig

This document describes the optimization approach used to improve string handling and buffer management in the Twig template engine, which is a critical area for performance in template rendering.

Optimization Goals

  1. Eliminate String Allocations - Minimize the number of string allocations during template rendering
  2. Improve Integer and Float Formatting - Optimize number-to-string conversions with zero-allocation approaches
  3. Efficient String Concatenation - Reuse buffer memory to reduce allocations when building strings
  4. Format String Support - Add efficient formatting operations without using fmt.Sprintf
  5. Smart Buffer Growth - Implement intelligent buffer sizing to avoid frequent reallocations

Implementation Details

1. Buffer Pooling

The core of our optimization is a specialized Buffer type that is reused through a sync.Pool:

// BufferPool is a specialized pool for string building operations
type BufferPool struct {
    pool sync.Pool
}

// Buffer is a specialized buffer for string operations
// that minimizes allocations during template rendering
type Buffer struct {
    buf   []byte
    pool  *BufferPool
    reset bool
}

This allows us to reuse buffer objects and their underlying byte slices across template rendering operations, significantly reducing memory allocations.

2. Zero-Allocation Integer Formatting

We implemented several techniques to avoid allocations when converting integers to strings:

  1. Pre-computed String Table - We store pre-computed string representations of common integers (0-99 and -1 to -99):
var smallIntStrings = [...]string{
    "0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
    "10", "11", "12", "13", "14", "15", ...
}
  1. Manual Integer Formatting - For larger integers (up to 6 digits), we manually convert them to strings without allocations:
func (b *Buffer) formatInt(i int64) (int, error) {
    // Algorithm: calculate digits, convert to ASCII digits directly
    // and append to buffer without allocating strings
}

3. Float Formatting Optimization

We developed a special float formatting approach that:

  1. Detects whole numbers and formats them as integers
  2. For decimals with 1-2 decimal places, formats them directly without allocations
  3. Uses a smart rounding algorithm to handle common cases
  4. Falls back to standard formatting for complex cases

4. String Formatting Without Allocations

We implemented a custom WriteFormat method that:

  1. Parses format strings like %s, %d, %v directly
  2. Writes formatted values directly to the buffer
  3. Achieves 46% better performance than fmt.Sprintf with fewer allocations

5. Smart Buffer Growth Strategy

The buffer uses a tiered growth strategy for efficient memory usage:

// For small buffers (<1KB), grow at 2x rate
// For medium buffers (1KB-64KB), grow at 1.5x rate
// For large buffers (>64KB), grow at 1.25x rate

This reduces both the frequency of reallocations and wasteful memory usage for large templates.

Benchmark Results

Here are key performance improvements from our buffer optimizations:

1. Integer Formatting

BenchmarkSmallIntegerFormatting/Optimized_Small_Ints-8   3739724  310.0 ns/op   0 B/op   0 allocs/op
BenchmarkSmallIntegerFormatting/Standard_Small_Ints-8    3102302  387.1 ns/op   0 B/op   0 allocs/op

Our optimized approach is about 25% faster for small integers.

2. Float Formatting

BenchmarkFloatFormatting/OptimizedFloat-8    2103276  566.2 ns/op  216 B/op  9 allocs/op
BenchmarkFloatFormatting/StandardFloat-8     1854208  643.1 ns/op  288 B/op  12 allocs/op

Our approach is 12% faster with 25% fewer memory allocations.

3. String Formatting

BenchmarkFormatString/BufferFormat-8     22180171    45.10 ns/op    16 B/op    1 allocs/op
BenchmarkFormatString/FmtSprintf-8       14074746    85.92 ns/op    64 B/op    2 allocs/op

Our custom formatter is 47% faster with 75% less allocated memory.

Usage in the Codebase

The optimized buffer is now used throughout the template engine:

  1. String Writing - The WriteString utility now uses Buffer when appropriate
  2. Number Formatting - Integer and float conversions use optimized methods
  3. String Formatting - Added WriteFormat for efficient format strings
  4. Pool Reuse - Buffers are consistently recycled back to the pool

String Interning Implementation

We have now implemented string interning as part of our zero-allocation optimization strategy:

1. Global String Cache

A centralized global string cache provides efficient string deduplication:

// GlobalStringCache provides a centralized cache for string interning
type GlobalStringCache struct {
    sync.RWMutex
    strings map[string]string
}

2. Fast Path Optimization

To avoid lock contention and map lookups for common strings:

// Fast path for very common strings
switch s {
case stringDiv, stringSpan, stringP, stringA, stringImg, 
     stringIf, stringFor, stringEnd, stringEndif, stringEndfor, 
     stringElse, "":
    return s
}

3. Size-Based Optimization

To prevent memory bloat, we only intern strings below a certain size:

// Don't intern strings that are too long
if len(s) > maxCacheableLength {
    return s
}

4. Concurrency-Safe Design

The implementation uses a combination of read and write locks for better performance:

// Use read lock for lookup first (less contention)
globalCache.RLock()
cached, exists := globalCache.strings[s]
globalCache.RUnlock()

if exists {
    return cached
}

// Not found with read lock, acquire write lock to add
globalCache.Lock()
defer globalCache.Unlock()

5. Benchmark Results

The string interning benchmark shows significant improvements:

BenchmarkStringIntern_Comparison/OriginalGetStringConstant-8   154,611  7,746 ns/op   0 B/op  0 allocs/op
BenchmarkStringIntern_Comparison/GlobalIntern-8                813,786  1,492 ns/op   0 B/op  0 allocs/op

The global string interning is about 5.2 times faster than the original method.

Future Optimization Opportunities

  1. Tokenizer Pooling - Create a pool for the OptimizedTokenizer to reduce allocations
  2. Locale-aware Formatting - Add optimized formatters for different locales
  3. Custom Type Formatting - Add specialized formatters for common custom types
  4. Buffer Size Prediction - Predict optimal initial buffer size based on template

Conclusion

The buffer handling optimizations significantly reduce memory allocations during template rendering, particularly for operations involving string building, formatting, and conversion. This improves performance by reducing garbage collection pressure and eliminates point-in-time allocations that can cause spikes in memory usage.