
Reduce per-cell work in SheetParser (~13% speedup)#74

Open
HoangMinhBK wants to merge 1 commit into woahdae:master from HoangMinhBK:perf/precompute-cell-idx-cast-once

HoangMinhBK commented Apr 25, 2026

Two small changes to the hot path in SheetParser that together shave ~12-16% off parse time on sheets of 10,000+ rows (about 6% at 1,000 rows). All 82 existing tests pass.

Precompute column index at the 'c' element

cell_idx was doing a regex scan + column_letter_to_number on every end_element call for 'v'/'t'. We already have the cell ref when the 'c' starts, so just stash it as @cell_idx then. Also dropped the inject({}) hash build per cell in favour of a single each pass.
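
A minimal sketch of the idea, not the gem's actual code: the class name `CellIndexSketch`, the `start_cell` method, the `[name, value]` attribute pairs, and the 0-based index are all assumptions made for illustration. Only `cell_idx` and `column_letter_to_number` are names taken from the description above, and the base-26 conversion shown here is a guess at what that helper does.

```ruby
# Hypothetical stand-in for the parser; it only illustrates computing the
# column index once, when the <c> element starts.
class CellIndexSketch
  attr_reader :cell_idx, :type, :style

  # Convert column letters ("A", "Z", "AA") to a 1-based number (base 26).
  def self.column_letter_to_number(letters)
    letters.each_char.reduce(0) { |acc, ch| acc * 26 + (ch.ord - 'A'.ord + 1) }
  end

  # Called once per <c> start. A single pass over the attribute pairs
  # replaces building a throwaway Hash for every cell.
  def start_cell(attrs)
    attrs.each do |name, value|
      case name
      when 'r'
        letters = value[/\A[A-Z]+/] # the one regex scan for this cell
        @cell_idx = self.class.column_letter_to_number(letters) - 1
      when 't' then @type = value
      when 's' then @style = value
      end
    end
  end
end

sketch = CellIndexSketch.new
sketch.start_cell([['r', 'AB12'], ['s', '0']])
puts sketch.cell_idx # 27 (0-based index of column AB)
```

The later 'v'/'t' handlers can then read `@cell_idx` directly instead of re-deriving it from the cell reference.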

Accumulate raw chunks, cast once

characters was calling Loader.cast on each SAX chunk and concatenating the results. That's incorrect whenever a value gets split across chunks (e.g. a number like 42.5 arriving as "42" + ".5" would be cast as two separate values before being joined). Easier to just collect the raw chunks and cast once in end_element.
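
To make the difference concrete, here is a small sketch under stated assumptions: `cast_number` is a hypothetical stand-in for Loader.cast (it just calls `String#to_f`), and `cast_per_chunk` / `cast_once` are illustrative names, not methods from the gem.

```ruby
# Hypothetical cast used only for this illustration.
def cast_number(raw)
  raw.to_f
end

# Old shape: cast every SAX chunk, then concatenate the cast results.
def cast_per_chunk(chunks)
  chunks.map { |chunk| cast_number(chunk).to_s }.join
end

# New shape: accumulate the raw text, cast a single time at element end.
def cast_once(chunks)
  buffer = chunks.each_with_object(+'') { |chunk, buf| buf << chunk }
  cast_number(buffer)
end

chunks = ['42', '.5'] # one value delivered in two characters callbacks

puts cast_per_chunk(chunks).inspect # fragments were cast separately
puts cast_once(chunks).inspect      # 42.5
```

Casting once also means one cast call per cell rather than one per chunk, which is where the speedup would come from.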


Benchmark

Script (run from repo root with bundle exec ruby -I test bench.rb):

require 'benchmark'
require_relative 'test/test_helper'

# Build a worksheet XML string with rows x cols string cells.
def build_sheet(rows, cols)
  # Column names A..Z followed by AA..ZZ.
  letters = ('A'..'Z').to_a + ('A'..'Z').flat_map {|a| ('A'..'Z').map {|b| "#{a}#{b}" } }
  acc = +"<worksheet xmlns='http://schemas.openxmlformats.org/spreadsheetml/2006/main'><sheetData>"
  rows.times do |n|
    n += 1
    acc << "<row>"
    cols.times {|c| acc << "<c r='#{letters[c]}#{n}' s='0'><v>Hello #{letters[c]}#{n}</v></c>" }
    acc << "</row>"
  end
  acc << "</sheetData></worksheet>"
end

styles = '<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><cellXfs count="1"><xf numFmtId="0" /></cellXfs></styleSheet>'
# Build each archive up front so only parsing is timed.
xlsxs = [[1_000, 12], [10_000, 12], [50_000, 12]].map do |rows, cols|
  [rows, cols, TestXlsxBuilder.new(sheets: [build_sheet(rows, cols)], styles: styles).archive]
end

Benchmark.bm(24) do |x|
  xlsxs.each do |rows, cols, xlsx|
    x.report("#{rows}r × #{cols}c:") { SimpleXlsxReader.open(xlsx.path).sheets[0].rows.each {} }
  end
end

Results on M-series Mac (string-heavy workload, real time):

| Sheet size | master | this PR | delta |
| --- | --- | --- | --- |
| 1,000 rows × 12 cols | 0.053s | 0.050s | -6% |
| 10,000 rows × 12 cols | 0.474s | 0.418s | -12% |
| 50,000 rows × 12 cols | 2.419s | 2.037s | -16% |
Two hot-path optimizations in sheet_parser.rb:

1. Precompute column index at 'c' element start instead of recomputing it
   via regex + column_letter_to_number on every 'v'/'t' end element.
   Also replaces the inject({}) Hash allocation per cell with a single-pass
   each over the attrs array.

2. Accumulate raw SAX string chunks in characters() and call Loader.cast
   once in end_element_namespace, rather than casting each chunk and
   concatenating already-cast values. This is both faster and more correct
   (avoids partial-value casting for numeric/date types split across chunks).

Benchmarked at ~13% faster on a 10k-row x 12-column string workload
(0.495s -> 0.429s on M-series Mac). All 82 existing tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>