Reduce per-cell work in SheetParser (~13% speedup) by HoangMinhBK · Pull Request #74 · woahdae/simple_xlsx_reader

HoangMinhBK · 2026-04-25T02:51:40Z

Two small changes to the hot path in SheetParser that together shave roughly 6-16% off parse time in the benchmark below (12-16% on sheets of 10,000 rows or more). All 82 existing tests pass.

Precompute column index at the 'c' element

cell_idx was doing a regex scan + column_letter_to_number on every end_element call for 'v'/'t'. We already have the cell ref when the 'c' starts, so just stash it as @cell_idx then. Also dropped the inject({}) hash build per cell in favour of a single each pass.
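
For readers who want to see the shape of this change without opening the diff, here is a minimal sketch. It is not the actual SheetParser code: the class name `CellIndexSketch`, the `start_cell` method, the local `column_letter_to_number` helper, and the `[name, value]` attribute pairs are all stand-ins I made up for illustration. The idea is only that the column index is resolved once, in the same `each` pass that reads the other attributes of the 'c' element.

```ruby
# Hypothetical stand-in for the gem's helper: "A" -> 1, "Z" -> 26, "AA" -> 27.
def column_letter_to_number(letters)
  letters.each_char.reduce(0) { |acc, ch| acc * 26 + (ch.ord - 'A'.ord + 1) }
end

# Illustrative handler only; the real parser receives SAX callbacks instead.
class CellIndexSketch
  attr_reader :cell_idx, :type, :style

  # attrs mimics the attributes of a <c> element as [name, value] pairs.
  def start_cell(attrs)
    @cell_idx = @type = @style = nil
    # Single pass over the attributes, no intermediate Hash.
    attrs.each do |name, value|
      case name
      when 'r'
        # Resolve the zero-based column index once, when the cell opens,
        # so the later 'v'/'t' end handlers can just read @cell_idx.
        @cell_idx = column_letter_to_number(value[/\A[A-Z]+/]) - 1
      when 't' then @type = value
      when 's' then @style = value
      end
    end
  end
end

handler = CellIndexSketch.new
handler.start_cell([['r', 'AB12'], ['s', '0']])
puts handler.cell_idx # column AB is the 28th column, so this prints 27
```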

Accumulate raw chunks, cast once

characters was calling Loader.cast on each SAX chunk and concatenating the results. That's slightly wrong if a value ever gets split mid-string (e.g. a number like 42.5 arriving as "42" + ".5" would cast to two separate values before being joined). Easier to just collect the raw chunks and cast once in end_element.
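
The split-chunk problem is easy to reproduce in isolation. The sketch below is an assumption-laden illustration, not the gem's code: `cast` is a toy replacement for Loader.cast that only does `to_f`, and `ValueBufferSketch` with its `end_value` method is a hypothetical stand-in for the parser's characters/end_element pair.

```ruby
# Toy stand-in for Loader.cast; the real method handles many more types.
def cast(raw)
  raw.to_f
end

# Illustrative buffer: collect raw SAX chunks, cast once when the element ends.
class ValueBufferSketch
  def initialize
    @chunks = []
  end

  # Called once per SAX text chunk; stores the raw string untouched.
  def characters(chunk)
    @chunks << chunk
  end

  # Called when the value element closes; joins the chunks and casts the whole.
  def end_value
    raw = @chunks.join
    @chunks.clear
    cast(raw)
  end
end

chunks = ['42', '.5'] # one value, 42.5, delivered as two chunks

# Casting each chunk separately yields two unrelated numbers.
per_chunk = chunks.map { |c| cast(c) }

# Buffering first and casting once recovers the intended value.
buffer = ValueBufferSketch.new
chunks.each { |c| buffer.characters(c) }
once = buffer.end_value

puts per_chunk.inspect # [42.0, 0.5]
puts once              # 42.5
```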

Benchmark

Script (run from repo root with bundle exec ruby -I test bench.rb):

require 'benchmark'
require_relative 'test/test_helper'

# Build worksheet XML with rows x cols inline string cells (supports up to 702 columns).
def build_sheet(rows, cols)
  letters = ('A'..'Z').to_a + ('A'..'Z').flat_map {|a| ('A'..'Z').map {|b| "#{a}#{b}" } }
  acc = +"<worksheet xmlns='http://schemas.openxmlformats.org/spreadsheetml/2006/main'><sheetData>"
  rows.times do |n|
    n += 1 # cell references are 1-based
    acc << "<row>"
    cols.times {|c| acc << "<c r='#{letters[c]}#{n}' s='0'><v>Hello #{letters[c]}#{n}</v></c>" }
    acc << "</row>"
  end
  acc << "</sheetData></worksheet>"
end

styles = '<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><cellXfs count="1"><xf numFmtId="0" /></cellXfs></styleSheet>'
# Build the archives up front so only parsing is timed.
xlsxs = [[1_000, 12], [10_000, 12], [50_000, 12]].map do |rows, cols|
  [rows, cols, TestXlsxBuilder.new(sheets: [build_sheet(rows, cols)], styles: styles).archive]
end

Benchmark.bm(24) do |x|
  xlsxs.each do |rows, cols, xlsx|
    x.report("#{rows}r × #{cols}c:") { SimpleXlsxReader.open(xlsx.path).sheets[0].rows.each {} }
  end
end

Results on M-series Mac (string-heavy workload, real time):

Sheet size	master	this PR	delta
1,000 rows × 12 cols	0.053s	0.050s	-6%
10,000 rows × 12 cols	0.474s	0.418s	-12%
50,000 rows × 12 cols	2.419s	2.037s	-16%
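
The delta column can be re-derived from the two timing columns. A small sketch, assuming the delta is the relative change against master rounded to the nearest whole percent (`percent_delta` is a helper name I chose for illustration):

```ruby
# Relative change of `after` against `before`, as a rounded whole percent.
def percent_delta(before, after)
  ((after - before) / before * 100).round
end

# Timings copied from the table above: rows => [master, this PR] in seconds.
timings = {
  1_000  => [0.053, 0.050],
  10_000 => [0.474, 0.418],
  50_000 => [2.419, 2.037]
}

deltas = timings.transform_values { |master, pr| percent_delta(master, pr) }
deltas.each { |rows, delta| puts "#{rows} rows: #{delta}%" }
```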

Two hot-path optimizations in sheet_parser.rb:

1. Precompute column index at 'c' element start instead of recomputing it via regex + column_letter_to_number on every 'v'/'t' end element. Also replaces the inject({}) Hash allocation per cell with a single-pass each over the attrs array.
2. Accumulate raw SAX string chunks in characters() and call Loader.cast once in end_element_namespace, rather than casting each chunk and concatenating already-cast values. This is both faster and more correct (avoids partial-value casting for numeric/date types split across chunks).

Benchmarked at ~13% faster on a 10k-row x 12-column string workload (0.495s -> 0.429s on M-series Mac). All 82 existing tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

HoangMinhBK wants to merge 1 commit into woahdae:master from HoangMinhBK:perf/precompute-cell-idx-cast-once
