
[WIP] Feat: Add video_calibration_mapper and video_split_by_frame_mapper #902

Open
1van2ha0 wants to merge 4 commits into datajuicer:main from 1van2ha0:feature/vlapipeline

Conversation


@1van2ha0 commented Feb 1, 2026

This PR introduces two new operators:

  • video_calibration_mapper: Extracts camera intrinsics from videos using DroidCalib.
  • video_split_by_frame_mapper: Splits videos into segments based on a specified frame count or duration (an illustrative usage sketch for both operators follows this list).
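
A hedged usage sketch of the two operators follows; the module paths and constructor arguments (e.g. frames_per_segment, overlap_frames) are illustrative assumptions rather than the final signatures from this PR:

    # Hypothetical usage sketch -- module paths and parameters are assumptions.
    from data_juicer.ops.mapper.video_calibration_mapper import VideoCalibrationMapper
    from data_juicer.ops.mapper.video_split_by_frame_mapper import VideoSplitByFrameMapper

    # Split videos into fixed-length, optionally overlapping segments.
    splitter = VideoSplitByFrameMapper(frames_per_segment=300, overlap_frames=30)

    # Estimate camera intrinsics with DroidCalib (DroidCalib must be installed
    # separately, e.g. via the provided install script).
    calibrator = VideoCalibrationMapper()

    # A Data-Juicer executor then applies each operator's process() to samples
    # whose 'videos' field points at the input video files.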

Core implementation logic for both operators has been completed, but significant validation and polish work remains before this can be considered production-ready.

Completed

  • Core logic implementation for both operators
  • Basic unit tests written for both operators
  • Minimal working demos
  • Shell installation script (demos/install_droidcalib.sh) created to install DroidCalib for video_calibration_mapper

Pending Work

| Item | Status | Details |
| --- | --- | --- |
| unittests | not validated | Tests are written but have not been executed; expected to fail. |
| demos | rough | Current demos are minimal stubs. |
| DroidCalib | blocker | Requires manual installation of DroidCalib before use. The included shell script (demos/install_droidcalib.sh) has not been tested and may fail. |

Next Steps

  • Test and fix demos/install_droidcalib.sh script
  • Verify unit tests pass in clean environment
  • Refine demos

@gemini-code-assist

Summary of Changes

Hello @1van2ha0, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the video processing capabilities of the Data-Juicer framework by introducing two powerful new mappers. One mapper focuses on extracting crucial camera intrinsic data from video streams, which is vital for computer vision tasks requiring precise spatial understanding. The other provides flexible video segmentation, enabling users to break down long videos into manageable, overlapping or non-overlapping clips based on frames or duration. These additions enhance the framework's utility for multimodal data preparation, particularly for video-centric applications.

Highlights

  • New Video Calibration Mapper: Introduced VideoCalibrationMapper which extracts camera intrinsics from videos using the DroidCalib library. This operator dynamically loads DroidCalib and handles video processing to determine camera parameters.
  • New Video Split by Frame Mapper: Added VideoSplitByFrameMapper to segment videos based on a specified frame count or duration, with support for overlapping segments. It leverages FFmpeg for video splitting (a small sketch of the overlap arithmetic follows this list).
  • Dependency Management for DroidCalib: The VideoCalibrationMapper includes logic for ensuring DroidCalib is importable at runtime, and a dedicated shell script (demos/video_calibration/install_droidcalib.sh) is provided for its installation.
  • Work in Progress (WIP) Status: The core logic for both new operators is complete, but significant validation, testing, and polish work is still pending, as indicated by the '[WIP]' tag and the PR description.
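
To make the overlap behaviour concrete, below is a minimal sketch of the boundary arithmetic a frame-based splitter needs; it illustrates the concept only and is not code from this PR:

    # Overlap-aware segment boundaries (illustration only).
    def segment_bounds(total_frames: int, seg_len: int, overlap: int = 0):
        """Yield (start_frame, end_frame) pairs covering [0, total_frames)."""
        step = max(seg_len - overlap, 1)
        start = 0
        while start < total_frames:
            end = min(start + seg_len, total_frames)
            yield start, end
            if end == total_frames:
                break
            start += step

    # A 10-frame video cut into 4-frame segments with 1 frame of overlap:
    print(list(segment_bounds(10, 4, 1)))  # [(0, 4), (3, 7), (6, 10)]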



@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces two new mappers, video_calibration_mapper and video_split_by_frame_mapper, along with associated demos and tests. The changes are generally well-structured, but there are a few critical issues and areas for improvement. A significant bug has been introduced in ray_dataset.py that breaks the handling of Mapper and Filter operators. Additionally, there are opportunities to improve resource management in video_calibration_mapper and error handling in video_split_by_frame_mapper. Please see the detailed comments for specific feedback.

  from data_juicer.core.data.schema import Schema
  from data_juicer.core.tracer import should_trace_op
- from data_juicer.ops import Deduplicator, Filter, Mapper, Pipeline
+ from data_juicer.ops import Deduplicator, Mapper, Pipeline

critical

The Filter operator is removed from imports, but its logic is still used (and needed) in _run_single_op. This should be re-added to fix the broken logic for handling Filter operators.

Suggested change
- from data_juicer.ops import Deduplicator, Mapper, Pipeline
+ from data_juicer.ops import Deduplicator, Filter, Mapper, Pipeline

                if tracer and should_trace_op(tracer, op._name) and original_process:
                    op.process = original_process
-           elif isinstance(op, Filter):
-               # Use cached_columns instead of self.data.columns() to avoid breaking pipeline

critical

Removing elif isinstance(op, Filter): merges the logic for Filter operators into the Mapper operator block. This will cause an AttributeError when a Mapper op is processed, as Mapper does not have methods like compute_stats which are specific to Filter. The separate logic for Mapper and Filter should be restored.

            elif isinstance(op, Filter):
                # Use cached_columns instead of self.data.columns() to avoid breaking pipeline
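
For context, here is an illustrative sketch of why the two branches must stay separate; this is not the actual ray_dataset.py code, just the Mapper/Filter contract the comment above relies on (a Mapper only exposes process(), while a Filter exposes compute_stats() plus a boolean process()):

    from data_juicer.ops import Filter, Mapper

    # Illustration only -- not the real _run_single_op implementation.
    def run_op_on_sample(op, sample):
        if isinstance(op, Mapper):
            # Mappers transform the sample and return it.
            return op.process(sample)
        elif isinstance(op, Filter):
            # Filters first fill the stats field, then decide keep/drop.
            sample = op.compute_stats(sample)
            return sample if op.process(sample) else None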

Comment on lines +239 to +273
        for i in range(1):
            for t, image, intrinsics, size_factor in stream:
                if droid is None:
                    # Update args with actual image size
                    self.droid_args.image_size = [image.shape[2], image.shape[3]]
                    droid = Droid(self.droid_args)

                droid.track(t, image, intrinsics=intrinsics)
                sf = size_factor

            if droid is not None:
                # Terminate and get results
                # We need to pass the stream again for terminate?
                # demo.py: droid.terminate(image_stream(...))
                # It seems terminate does a final BA pass using the stream?
                # Let's recreate stream
                stream_second_pass = self._image_stream(video_path)
                traj_est, intr_est = droid.terminate(stream_second_pass)

                # Rescale intrinsics back to original resolution
                if sf:
                    intr_est = intr_est.copy()
                    intr_est[0:4:2] /= sf[0]
                    intr_est[1:4:2] /= sf[1]

                intr_est_list = intr_est.tolist()

        # except Exception as e:
        #     # Log error or just skip
        #     print(f"Error processing video {video_path}: {e}")
        # finally:
        # Cleanup
        if droid:
            del droid
        torch.cuda.empty_cache()

high

The resource cleanup logic (e.g., del droid, torch.cuda.empty_cache()) is not guaranteed to run if an exception occurs during video processing. It's better to wrap the processing logic in a try...finally block to ensure resources are always released. The for i in range(1): loop is unnecessary and can be removed. Also, consider using the logger for error messages instead of print.

        try:
            for t, image, intrinsics, size_factor in stream:
                if droid is None:
                    # Update args with actual image size
                    self.droid_args.image_size = [image.shape[2], image.shape[3]]
                    droid = Droid(self.droid_args)

                droid.track(t, image, intrinsics=intrinsics)
                sf = size_factor

            if droid is not None:
                # Terminate and get results
                stream_second_pass = self._image_stream(video_path)
                _, intr_est = droid.terminate(stream_second_pass)

                # Rescale intrinsics back to original resolution
                if sf:
                    intr_est = intr_est.copy()
                    intr_est[0:4:2] /= sf[0]
                    intr_est[1:4:2] /= sf[1]

                intr_est_list = intr_est.tolist()
        except Exception as e:
            if self.verbose:
                logger.warning(f'Error processing video {video_path}: {e}')
        finally:
            # Cleanup
            if droid:
                del droid
            torch.cuda.empty_cache()

Comment on lines +286 to +290
        keys = samples_after_split[0].keys()
        res_samples = {}
        for key in keys:
            res_samples[key] = [s[key] for s in samples_after_split]
        return res_samples

high

If samples_after_split is empty, accessing samples_after_split[0] will raise an IndexError. This can happen if no samples are kept or generated. You should add a check to handle this case to prevent the operator from crashing.

Suggested change
-         keys = samples_after_split[0].keys()
-         res_samples = {}
-         for key in keys:
-             res_samples[key] = [s[key] for s in samples_after_split]
-         return res_samples
+         if not samples_after_split:
+             return {key: [] for key in samples.keys()}
+         keys = samples_after_split[0].keys()
+         res_samples = {}
+         for key in keys:
+             res_samples[key] = [s[key] for s in samples_after_split]
+         return res_samples
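
As a side note, the row-to-column reassembly this snippet performs is shown in isolation below, with made-up sample data:

    # Pivot a list of per-row dicts back into a column-oriented batch.
    rows = [{"videos": ["seg_0.mp4"], "text": "a"},
            {"videos": ["seg_1.mp4"], "text": "b"}]
    batch = {key: [row[key] for row in rows] for key in rows[0].keys()}
    # -> {'videos': [['seg_0.mp4'], ['seg_1.mp4']], 'text': ['a', 'b']}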

        # Default encoding args from user code
        # We can allow overriding via ffmpeg_extra_args
        if self.ffmpeg_extra_args:
            import shlex

medium

Imports should be at the top of the file to follow standard Python style guides (PEP 8). Please move import shlex to the top-level imports.
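
For reference, shlex.split is what turns a user-supplied flag string into argv-style tokens; the flag values below are only an example:

    import shlex  # would live with the other top-level imports per PEP 8

    extra = shlex.split("-c:v libx264 -crf 23")
    # -> ['-c:v', 'libx264', '-crf', '23'], ready to append to the ffmpeg command list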
