#841 Add support for file offsets in in-place processors (#844)
Conversation
… for in-place processing in Spark.
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Walkthrough

This PR threads file_start_offset and file_end_offset through the processing pipeline: FSStream now accepts offsets and enforces bounded reads; CobolProcessor and SparkCobolProcessor propagate readerParameters to per-file streamers; FileUtils adds a helper to compute Hadoop read sizes; tests and the README were updated.

Changes: File Offset Processing Support
Sequence Diagram

```mermaid
sequenceDiagram
    participant Builder as Builder
    participant CobolProc as CobolProcessor
    participant FSStream as FSStream
    participant FileStreamer as FileStreamer
    participant FS as FileSystem

    Builder->>CobolProc: build with file_start_offset,<br/>file_end_offset options
    CobolProc->>CobolProc: extract readerParameters (offsets)

    rect rgba(100,150,255,0.5)
        CobolProc->>FSStream: FSStream(file, startOffset, endOffset)
        FSStream->>FS: seek to startOffset (skipFully)
        FS-->>FSStream: byte chunks
        FSStream->>CobolProc: data until effectiveSize reached
    end

    rect rgba(150,200,100,0.5)
        CobolProc->>FileStreamer: compute maximumBytes via FileUtils
        FileStreamer->>FS: FileStreamer(file, startOffset, maximumBytes)
        FS-->>FileStreamer: bounded read
        FileStreamer->>CobolProc: data within bounds
    end
```
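The bounded-read arithmetic in the diagram can be sketched in a few lines. This is an illustrative helper, not the PR's actual code (the names `OffsetMath` and `effectiveSize` are made up here): the payload size is the file size minus both offsets, clamped to zero so oversized offsets never produce a negative length.

```scala
object OffsetMath {
  // Number of payload bytes left after skipping `startOffset` bytes at the
  // beginning of the file and ignoring `endOffset` bytes at the end.
  // Clamped to zero so offsets larger than the file never go negative.
  def effectiveSize(fileSize: Long, startOffset: Long, endOffset: Long): Long =
    math.max(0L, fileSize - startOffset - endOffset)
}
```

For example, a 100-byte file with a 10-byte header offset and a 20-byte trailer offset leaves 70 readable bytes, while pathological offsets (8 + 8 on a 10-byte file) yield 0 instead of -6.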
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 5 checks passed.
JaCoCo code coverage report - 'cobol-parser'

JaCoCo code coverage report - 'spark-cobol'
Actionable comments posted: 2
🧹 Nitpick comments (2)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala (1)
163-193: ⚡ Quick win: Add a Spark `InPlace` case too.

This regression test still runs the `ToVariableLength` branch, so it won't catch a break in the actual in-place Spark processor path that this PR is about. A matching `CobolProcessingStrategy.InPlace` case would close that gap.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala` around lines 163-193: the test currently exercises only CobolProcessingStrategy.ToVariableLength; add a parallel case that sets withProcessingStrategy(CobolProcessingStrategy.InPlace) (using the same SerializableRawRecordProcessor implementation, file_start_offset/file_end_offset options, and the same load/save and read/assert steps) so the in-place Spark processing path is covered; replicate the sequence that writes outputFile, reads binary bytes and JSON via spark.read (same options: copybook_contents, record_format "V", is_rdw_big_endian "true", pedantic "true") and assert the outputData and actual JSON equal the same expected values to ensure the InPlace branch is tested.

cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala (1)
94-112: ⚡ Quick win: Add an `InPlace` regression case alongside this one.

This new coverage only exercises `CobolProcessingStrategy.ToVariableLength`, while the feature request is specifically about the in-place processor path. A sibling assertion for `CobolProcessingStrategy.InPlace` would make sure `CobolProcessorInPlace` can't regress unnoticed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala` around lines 94-112: add a sibling test that exercises the InPlace processing path: duplicate the existing case that builds via CobolProcessor.builder but set .withProcessingStrategy(CobolProcessingStrategy.InPlace) and use the same RawRecordProcessor, options ("file_start_offset", "file_end_offset") and input/output files; then assert the returned count and the output binary contents match the expected bytes to ensure CobolProcessorInPlace is covered and cannot regress.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala`:
- Around line 21-36: The FSStream constructor currently uses
bytesStream.skip(fileStartOffset) which is unreliable and computes effectiveSize
that can be negative; change it to loop until the requested fileStartOffset
bytes are actually skipped (repeatedly calling bytesStream.skip(remaining) and
if skip returns 0, read and discard a single byte to advance or break on EOF) to
ensure the stream position is correct, and clamp effectiveSize to be
non-negative by computing effectiveSize = Math.max(0L, fileSize -
fileStartOffset - fileEndOffset); ensure size and totalSize return that clamped
effectiveSize and that next() uses the clamped value to avoid premature EOF
behavior.
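The loop this comment asks for can be sketched as follows — a standalone version over `java.io.InputStream`, not the actual FSStream patch: keep calling `skip` until the requested count is consumed, fall back to a one-byte `read` when `skip` returns 0, and stop early at EOF.

```scala
import java.io.InputStream

object SkipFully {
  // Skips up to `n` bytes, looping because InputStream.skip is allowed to
  // skip fewer bytes than requested (or none at all). Returns the number of
  // bytes actually skipped, which is less than `n` only when EOF is hit first.
  def skipFully(in: InputStream, n: Long): Long = {
    var remaining = n
    var eof = false
    while (remaining > 0 && !eof) {
      val skipped = in.skip(remaining)
      if (skipped > 0) remaining -= skipped
      else if (in.read() >= 0) remaining -= 1 // skip() stalled; advance one byte
      else eof = true                         // end of stream reached
    }
    n - remaining
  }
}
```

Pairing this with the clamped `effectiveSize` computation described above keeps the stream position correct even when `skip` under-delivers or the offsets exceed the file size.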
In `@README.md`:
- Around line 2014-2015: Update the changelog entry by increasing the heading
level from "#### 2.10.4 will be released soon." to "### 2.10.4 will be released
soon." to remove the MD001 warning, and correct the link target in the bullet
(currently "[`#841`](.../pull/841)") to point to the intended object—either change
to "[`#841`](.../issues/841)" if referencing the issue or to
"[`#844`](.../pull/844)" if referencing the PR that introduced the change; edit
the matching line containing the heading text and the link text in README.md
accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d4240e36-697e-4c0e-92ed-329177a75dfb
📒 Files selected for processing (8)
- README.md
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessor.scala
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala
- cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/FileUtils.scala
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala
> #### 2.10.4 will be released soon.
> - [#841](https://github.com/AbsaOSS/cobrix/pull/841) Added support for file start and end offset options for in-place processing of files without converting to dataframes.
Fix the changelog heading and reference target.
This entry introduces the MD001 warning (## Changelog jumps straight to ####), and the link points to pull/841 even though this change is PR 844 closing issue 841. Bump the heading to ### and point the link at the intended object (issues/841 or pull/844).
📝 Suggested edit

```diff
-#### 2.10.4 will be released soon.
-- [`#841`](https://github.com/AbsaOSS/cobrix/pull/841) Added support for file start and end offset options for in-place processing of files without converting to dataframes.
+### 2.10.4 will be released soon.
+- [`#844`](https://github.com/AbsaOSS/cobrix/pull/844) Added support for file start and end offset options for in-place processing of files without converting to dataframes.
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```markdown
### 2.10.4 will be released soon.
- [`#844`](https://github.com/AbsaOSS/cobrix/pull/844) Added support for file start and end offset options for in-place processing of files without converting to dataframes.
```
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 2014-2014: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4
(MD001, heading-increment)
…ace with CobolProcessor.
Closes #841
Summary by CodeRabbit
New Features
Documentation
Tests