Faster Deep-Copy Proposal #659

buckcronk · 2026-01-15T20:58:37Z

buckcronk
Jan 15, 2026

Hi! In our application, we call XmlUtils.deepCopy repeatedly to duplicate parts of a DOCX document. This is slow, since XmlUtils.deepCopy is based on marshaling and unmarshaling the XML.

A much faster alternative is to implement a deep-copy method on each of docx4j's OpenXML object classes (Body, P, R, Text, etc.). That's a challenge, because there appear to be more than 2500 of those classes. Nevertheless, I've been able to create a proof of concept that, for my own particular test cases, is 25X faster for 10K table rows and 46X faster for 100K table rows.
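To make the contrast concrete, here is a minimal sketch of the two copy strategies. It does not use docx4j: `Para`, `Run`, and `TextNode` are hypothetical stand-ins for classes like P, R, and Text, and Java serialization stands in for the JAXB marshal/unmarshal round trip, so this only illustrates the shape of the idea, not the real implementation or its performance.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class DeepCopySketch {

    /** Hypothetical text leaf (not docx4j's Text class). */
    static class TextNode implements Serializable {
        String value;
        TextNode(String value) { this.value = value; }
        TextNode deepCopy() { return new TextNode(value); }
    }

    /** Hypothetical run holding text leaves (not docx4j's R class). */
    static class Run implements Serializable {
        final List<TextNode> content = new ArrayList<>();
        Run deepCopy() {
            Run copy = new Run();
            for (TextNode t : content) {
                copy.content.add(t.deepCopy());
            }
            return copy;
        }
    }

    /** Hypothetical paragraph holding runs (not docx4j's P class). */
    static class Para implements Serializable {
        final List<Run> content = new ArrayList<>();
        Para deepCopy() {
            Para copy = new Para();
            for (Run r : content) {
                copy.content.add(r.deepCopy());
            }
            return copy;
        }
    }

    /**
     * Generic round-trip copy: write the whole tree to bytes, then read it back.
     * Java serialization is used here only as an assumed analogue of
     * marshaling to XML and unmarshaling again.
     */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTripCopy(T source) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(source);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round-trip copy failed", e);
        }
    }

    /** Builds a one-run, one-text paragraph for the demo. */
    static Para sample(String text) {
        Para p = new Para();
        Run r = new Run();
        r.content.add(new TextNode(text));
        p.content.add(r);
        return p;
    }

    public static void main(String[] args) {
        Para original = sample("hello");
        Para direct = original.deepCopy();        // per-class copy, no serialization
        Para roundTrip = roundTripCopy(original); // generic, but pays for encode + decode

        // Mutating one copy must not affect the original or the other copy.
        direct.content.get(0).content.get(0).value = "changed";
        System.out.println(original.content.get(0).content.get(0).value);
        System.out.println(roundTrip.content.get(0).content.get(0).value);
    }
}
```

Both strategies produce independent trees. The per-class version skips the encode and decode steps entirely, which is where I'd expect the speedup to come from, at the cost of needing a method on every class.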

We plan to pursue this improvement ourselves, but have some questions:

Would there be any interest in accepting it as a contribution to docx4j itself?
If so, is there a preferred approach to implement it?

buckcronk · 2026-01-15T21:05:43Z

buckcronk
Jan 15, 2026
Author

For the implementation, I've thought of two approaches. Broadly:

One approach would be to automate the process of generating the OpenXML object classes with XJC, so that we could use an XJC plugin to generate deep-copy methods for all 2500+ classes. My understanding is that the existing classes were originally generated with XJC, but have accumulated many manual changes over the years.
Another approach would be to directly implement deep-copy methods in the existing classes. But since there are over 2500 classes, this would still need to be done via some kind of one-time automation... a script, OpenRewrite recipe, or AI.

Approach 1

The first approach is the one I used to create the POC. I created a Maven project and used the JAXB Maven Plugin from JAXB Tools to generate docx4j's classes from the XSDs. I added several XJC plugins to the process: the Copyable plugin to implement a copyTo method in each generated class, the Parent Pointer Plugin to implement the Child interface, the Inheritance plugin to implement ContentAccessor, and the Annotate plugin to add some @XmlRootElement annotations. I also tweaked the XSDs here and there and wrote a script to apply some minor transformations to the generated code.
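As a rough illustration of how a generated copy method and the parent pointer interact, here is a simplified sketch. The `Child` interface, `Row`, `Cell`, and the one-argument `copyTo` are all hypothetical and much simpler than what the XJC plugins actually generate; the point is only that a copied child should point at its new parent rather than the original one.

```java
import java.util.ArrayList;
import java.util.List;

public class ParentAwareCopySketch {

    /** Hypothetical parent-pointer interface, defined locally for this sketch. */
    interface Child {
        Object getParent();
        void setParent(Object parent);
    }

    /** Hypothetical table cell. */
    static class Cell implements Child {
        String text;
        private Object parent;

        Cell(String text) { this.text = text; }

        @Override public Object getParent() { return parent; }
        @Override public void setParent(Object parent) { this.parent = parent; }

        /** Copies field values into target; the parent is deliberately left alone. */
        Cell copyTo(Cell target) {
            target.text = text;
            return target;
        }
    }

    /** Hypothetical table row that owns its cells. */
    static class Row implements Child {
        final List<Cell> cells = new ArrayList<>();
        private Object parent;

        @Override public Object getParent() { return parent; }
        @Override public void setParent(Object parent) { this.parent = parent; }

        /** Adds a cell and points it back at this row. */
        void add(Cell cell) {
            cell.setParent(this);
            cells.add(cell);
        }

        /** Deep-copies each cell into target, re-parenting the copies to target. */
        Row copyTo(Row target) {
            for (Cell cell : cells) {
                target.add(cell.copyTo(new Cell(null)));
            }
            return target;
        }
    }

    public static void main(String[] args) {
        Row row = new Row();
        row.add(new Cell("a"));
        row.add(new Cell("b"));

        Row copy = row.copyTo(new Row());

        // The copied cells are new objects whose parent is the copied row.
        System.out.println(copy.cells.get(0) != row.cells.get(0));
        System.out.println(copy.cells.get(0).getParent() == copy);
        // The copied row itself has no parent until it is attached somewhere.
        System.out.println(copy.getParent() == null);
    }
}
```

All three lines print `true`. Leaving the copy's own parent unset seems like the safer default, since the caller decides where the copied subtree gets attached.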

But I only did enough to get it working at runtime for two particular test cases. Rather than compile these classes as part of docx4j, I built a separate JAR with them and used it in place of the original docx4j-openxml-objects* JARs when building our application. I think there would be a lot of work left to bring the generated classes into line with the existing classes, due to the manual changes they've accumulated.

The benefit of this approach is that it would become possible to use XJC plugins to make this deep-copy change to all 2500+ classes automatically, and in the future to make other changes/fixes automatically across all generated classes (including any new classes from new XSDs).

But the downsides are notable:

There's a risk that replacing/regenerating so much existing code would introduce bugs.
With full automation, it may no longer be practical to make modifications directly to the generated classes, or even to keep them in source control. (If XJC does not generate the classes exactly the same every time, it could lead to noisy diffs.) That would be a serious departure from how the project currently works.
This approach adds a dependency on JAXB Tools.

Approach 2

Directly implement deep-copy methods in the existing classes. This would avoid the work and risk of automating the changes via XJC and JAXB Tools. But that is replaced by the work and risk of coming up with a way to do this across all 2500+ classes using something like a script, OpenRewrite recipe, or AI, hopefully in a reliable and deterministic way. And by making the change directly, we might make automation or big changes harder in the future by introducing more custom code across all these files. (Rather than making big changes easier by establishing a repeatable process for generating the code from the XSDs.)

0 replies

plutext · 2026-01-17T01:23:39Z

plutext
Jan 17, 2026
Maintainer

Hi Buck

A faster deep copy as indicated by your POC results would be a very worthwhile improvement, so yes, there is interest in having this as a contribution.

Thank you for your analysis of available approaches.

Some initial notes:

Yes, the classes were originally generated using XJC, and yes, there have been manual changes in places, but not a huge number of them. It would be feasible to regenerate, then re-add the manual changes where necessary, though ideally, as you note, there would instead be a repeatable process that simply generates the code from the XSDs.
A repeatable process is a worthy goal, but if it turns out to be not entirely possible, we should give some thought to whether to take this opportunity to use any other XJC plugins (e.g. fluent interface?).
XJC has, at least in the past, typically generated classes with (at minimum) the methods ordered differently each time, which does indeed lead to noisy diffs. (This is one reason we've sometimes preferred manual changes.)
New XSDs are indeed published by Microsoft and added from time to time, leading to new generated classes.

My initial preference would be for some variant on approach 1. Approach 2 may be feasible using OpenRewrite, but it wouldn't be a one-off operation (owing to new XSDs), so it would be another moving part to maintain, which is another reason for favouring the XJC-based approach 1.

Were this to be implemented, we would release it in a docx4j 15.0 (or other N.0) as a signal that it is a ".0" release which may introduce new bugs.

1 reply

buckcronk Jan 17, 2026
Author

Hi Jason,

Sounds good, thank you for the detailed reply! I will plan to continue with Approach 1 and open a PR once I've got something closer to being merge-able.

Thanks,
Buck

plutext · 2026-01-19T01:00:03Z

plutext
Jan 19, 2026
Maintainer

Hi Buck

The just-committed https://github.com/plutext/docx4j/blob/VERSION_11_5_10/xsd/ROOT.xsd will generate pretty much all of docx4j-openxml-objects-* (sans manual edits).

There are a couple of small discrepancies I will look into later this week, but I hope that gives you a solid base to work from.

I also added https://github.com/plutext/docx4j/blob/VERSION_11_5_10/xsd/docx4j_jaxb_packages.xlsx as a bit of a guide as to which XSD files result in which Java packages.

Historically, docx4j was a monolithic project built using ant. When we converted to Maven and modules, it seemed like a good idea to have:

        <module>docx4j-openxml-objects</module>
        <module>docx4j-openxml-objects-pml</module>
        <module>docx4j-openxml-objects-sml</module>

that is, to split the pml and sml specific generated classes into separate modules from wml, dml etc.

I am not wedded to this. That is, if it is simpler to use ROOT.xsd (or equivalent) to build <module>docx4j-openxml-objects</module> and get rid of the -pml and -sml modules, then let's do that.

cheers .. Jason

0 replies

buckcronk · 2026-01-19T03:31:57Z

buckcronk
Jan 19, 2026
Author

Hi Jason,

Amazing, thank you again! I had been generating classes from a list of separate XSDs, but having one root XSD seems like a much better idea. And indeed, I hadn't even anticipated it yet, but it will probably be simpler to combine the modules. I will take a closer look at this tomorrow. (I'm in the US, Central time)

Cheers! Buck

0 replies

buckcronk · 2026-01-30T22:17:05Z

buckcronk
Jan 30, 2026
Author

Small update: I haven't given up on this, I just have competing priorities while I work through the necessary changes. I've got to a point where docx4j compiles with the generated classes, and have started working on failing tests. (Mostly missing @XmlRootElement annotations at this point.)

I'm working internally so far; I'll see if I can push code to GitHub next week so it's open for feedback.

2 replies

buckcronk Feb 3, 2026
Author

Need to work through internal processes / legal review before I can push anything. Not sure how long it will take.

plutext Feb 3, 2026
Maintainer

Makes sense. Thanks for the update.

buckcronk · 2026-04-10T20:27:04Z

buckcronk
Apr 10, 2026
Author

Hi @plutext, we now have a working version of this feature internally, and we'd still like to contribute it if possible. Would you need us to sign individual and/or corporate Contributor License Agreements? I see there is an individual CLA in the docx4j repo that looks to be based on Apache's CLAs, but I thought we'd better check with you about whether it's required, whether it's up to date, and whether a corporate CLA is required in place of or in addition to the individual agreement.

2 replies

plutext Apr 12, 2026
Maintainer

Hi @buckcronk thanks for the update. It would be great to have your contribution.

It would be preferable if you could get a corporate contributor agreement signed, please. See https://github.com/plutext/docx4j/blob/VERSION_11_5_13/legals/docx4j_CorporateContributor.pdf, which is also based on the Apache CLA.

If that is not feasible, we can accept the contribution under clause 5 of the ASL if in your PR you state that you are duly authorised to submit the contribution on behalf of your employer, Oracle Corp.

buckcronk Apr 13, 2026
Author

All right, I'll reach out to our legal team for that. Thank you!
