
Conversation

@user202729 (Contributor) commented Dec 29, 2025

Currently, arb_poly_mul uses a blockwise decomposition algorithm, which behaves like $O(n^2)$ unless both polynomials have log-coefficients close to linear with the same slope (an integral multiple of log(2)).

This pull request implements an algorithm that performs reasonably well for polynomials with convex (more precisely negative-convex, i.e. concave) log-coefficients.

In particular, with $f = (x+1)^{10000}$ and $g = (x+2)^{10000}$, arb_poly_mul(output, f, g, 128) takes 1.5s, and the new algorithm is 10× faster. The speedup is larger for larger polynomial degree. Benchmark code.
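For reference, a rough sketch of the setup being timed (the linked benchmark code is authoritative; this sketch only uses the public arb_poly API and wall-clock timing):

```c
/* Rough sketch of the benchmark setup (not the linked benchmark code):
   time arb_poly_mul on f = (x+1)^10000, g = (x+2)^10000 at 128-bit precision. */
#include <stdio.h>
#include <time.h>
#include <flint/arb_poly.h>

int main(void)
{
    arb_poly_t f, g, h, base;
    clock_t t0;

    arb_poly_init(f);
    arb_poly_init(g);
    arb_poly_init(h);
    arb_poly_init(base);

    arb_poly_set_coeff_si(base, 0, 1);        /* base = x + 1 */
    arb_poly_set_coeff_si(base, 1, 1);
    arb_poly_pow_ui(f, base, 10000, 128);     /* f = (x + 1)^10000 */

    arb_poly_set_coeff_si(base, 0, 2);        /* base = x + 2 */
    arb_poly_pow_ui(g, base, 10000, 128);     /* g = (x + 2)^10000 */

    t0 = clock();
    arb_poly_mul(h, f, g, 128);               /* current blockwise algorithm */
    printf("arb_poly_mul: %.3f s\n", (double) (clock() - t0) / CLOCKS_PER_SEC);

    arb_poly_clear(f);
    arb_poly_clear(g);
    arb_poly_clear(h);
    arb_poly_clear(base);
    return 0;
}
```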

Currently just a draft version (*). What do you (the maintainers) think of the idea (i.e. is there bandwidth to review the code + the algorithm)?

For more details and proof of correctness, refer to https://drive.proton.me/urls/RA07BXFPFW#aCldUtt21hLr .

Caveat: not (peer-)reviewed.

(*): Needed:

  • determine whether using slong for the exponent (and, more importantly, for the multiplication inside convex_hull) is okay, or whether fmpz should be used (fmpz is about 4% slower) [edit: I think this was a debug build, the actual overhead is less]
  • documentation
  • decide whether the method needs to be explicitly invoked (current status) or whether arb_poly_mul should use some heuristic to try it automatically. In particular, computing the convex hull of the log-coefficients takes time linear in the degree and does not depend on prec, so it is relatively cheap (see the sketch after this list).
  • consider whether the nice_powers trick is useful for arb_poly_mullow. It would help for polynomials whose log-coefficients are linear with a slope that is a non-integral multiple of log(2), but this may be an extremely specific case.
  • maybe more...
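For illustration, a rough sketch of such a convex-hull pass (it uses the standard mag/arb helpers and doubles for the log-magnitudes; the PR itself works with exact slong/fmpz exponents, so this is not the PR's code):

```c
/* Illustration only (not the PR's code): approximate upper convex hull of
   the points (i, log2 |f_i|), computed in one linear pass since the
   x-coordinates (the indices) are already sorted. */
#include <flint/arb_poly.h>

/* hull must have room for arb_poly_length(f) entries; returns the number
   of hull vertices written to hull[0..m-1].
   (Zero coefficients would need special handling in real code.) */
static slong
upper_hull_of_log_coeffs(slong * hull, const arb_poly_t f)
{
    slong i, m = 0, len = arb_poly_length(f);
    double * y = flint_malloc(sizeof(double) * len);
    mag_t b;

    mag_init(b);
    for (i = 0; i < len; i++)
    {
        arb_get_mag(b, arb_poly_get_coeff_ptr(f, i));  /* upper bound for |f_i| */
        y[i] = mag_get_d_log2_approx(b);               /* approximate log2 */
    }

    for (i = 0; i < len; i++)
    {
        /* pop vertices that end up on or below the chord to the new point */
        while (m >= 2 &&
               (y[i] - y[hull[m - 1]]) * (double) (hull[m - 1] - hull[m - 2]) >=
               (y[hull[m - 1]] - y[hull[m - 2]]) * (double) (i - hull[m - 1]))
            m--;
        hull[m++] = i;
    }

    mag_clear(b);
    flint_free(y);
    return m;
}
```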

Previously discussed in #2278

@albinahlback (Collaborator) commented:

Interesting! Could you give some benchmarks for a few cases (both @vneiger and @fredrik-johansson write great benchmarks, for example see #2437 and #2524)? For instance, cases where this algorithm works well compared to the current routines, and perhaps cases where it doesn't work too well? If this will be used in other routines, algorithm selection will be important.

@user202729 (Contributor, Author) commented:

The problem is not so much performance as overestimation of the error bounds.

But in order to determine whether overestimation happens, it is necessary to compute an (approximate) max-plus convolution of the log-coefficients, which is already something like $O(n \log^2 n)$. At least in terms of the constant factor, though, it should be much cheaper than computing the result itself (there is no factor of prec).

In practice, something like "if N ≥ threshold, run the new algorithm; if the new algorithm produces large errors, run the old algorithm" should be fine: if the new algorithm is at least, say, 3× faster than the old one, then this hybrid version is at most 33% slower than the old algorithm.
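As a sketch of what that hybrid dispatch could look like (arb_poly_mul_convex, LENGTH_THRESHOLD and the accuracy test are made-up placeholders, not part of this PR or of arb):

```c
/* Sketch only: arb_poly_mul_convex is a made-up name standing in for the
   new routine in this PR; LENGTH_THRESHOLD and the accuracy test are
   placeholder heuristics. */
#include <flint/arb_poly.h>

#define LENGTH_THRESHOLD 1000   /* would need tuning */

void arb_poly_mul_convex(arb_poly_t res, const arb_poly_t f,
                         const arb_poly_t g, slong prec);  /* hypothetical */

void
arb_poly_mul_hybrid(arb_poly_t res, const arb_poly_t f,
                    const arb_poly_t g, slong prec)
{
    slong n = arb_poly_length(f) + arb_poly_length(g) - 1;

    if (n >= LENGTH_THRESHOLD)
    {
        slong i, worst = WORD_MAX;

        arb_poly_mul_convex(res, f, g, prec);

        /* crude check: did some coefficient lose most of its relative accuracy? */
        for (i = 0; i < arb_poly_length(res); i++)
            worst = FLINT_MIN(worst,
                        arb_rel_accuracy_bits(arb_poly_get_coeff_ptr(res, i)));

        if (worst > prec / 2)
            return;                 /* accept the fast result */
    }

    arb_poly_mul(res, f, g, prec);  /* fall back to the existing algorithm */
}
```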

@fredrik-johansson (Collaborator) commented:

Just to let you know I'm aware of this PR. Superficially it looks good, but I don't know when I will have time to do a detailed review.

One concern I have is that it seems to reinvent instead of improving the existing _arb_poly_mullow_block with better scaling and pruning strategies. The obvious optimization that I never got around to implementing is to compute a quick bound for each output coefficient (or rather a bound for its radius); when a product of blocks contributing to a range $c_k, \ldots, c_l$ of the output would be small compared to the bound for that range, we can reduce its precision or skip it entirely. This seems to be roughly what you do here, in addition to somewhat different scaling and subdivision strategies. Am I missing something else?

I would not be surprised to see the same kind of speedup with appropriate modifications to _arb_poly_mullow_block, so it's not clear that there should be a completely different algorithm for log-convex coefficients.

The default arb_poly multiplication currently guarantees that error bounds are nearly optimal coefficient-wise; in particular, exact coefficients are preserved. This is one reason for scaling by a power of two even when some other factor would be more efficient.
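(For concreteness, and independent of this PR: scaling an arb by a power of two only shifts exponents and is always exact, whereas scaling by any other factor at finite precision generally introduces a radius.)

```c
/* Standard arb behavior, independent of this PR: power-of-two scaling
   is exact, other scalings generally round. */
#include <flint/arb.h>

void
scaling_example(void)
{
    arb_t x;
    arb_init(x);

    arb_set_ui(x, 7);
    arb_mul_2exp_si(x, x, -100);   /* x = 7 * 2^-100: still exact, radius 0 */
    /* arb_is_exact(x) == 1 */

    arb_div_ui(x, x, 3, 64);       /* now rounded: nonzero radius */
    /* arb_is_exact(x) == 0 */

    arb_clear(x);
}
```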

@user202729 (Contributor, Author) commented Jan 21, 2026

Thanks for the comment.

> One concern I have is that it seems to reinvent instead of improving the existing _arb_poly_mullow_block with better scaling and pruning strategies.

I tried to understand what _arb_poly_mullow_block and its helper functions do, and documented my understanding in #2481, but if I recall correctly, I didn't find much of it reusable.

It's true that the old code could be improved to use the nice_powers logic, though. (Although that might be a regression in other cases, since the scaling becomes more expensive; it isn't clear whether this is better or worse in the typical case.)

Afterwards, it should be possible to reuse more code by refactoring _arb_vec_get_fmpz_2exp_blocks to invoke a helper function that converts just one block (then the helper function can be reused).

> This seems to be roughly what you do here, in addition to somewhat different scaling and subdivision strategies.

I think the main improvement is that this PR allows each block to have a different slope. The old algorithm requires all blocks to have the same slope.

Another difference in the subdivision strategy: in general, even if both polynomials are convex, each polynomial may still be divided into $\Theta(n)$ blocks, so iterating over all pairs of blocks may still take $\Theta(n^2)$. Hence the divide-and-conquer.

I have an argument (see the linked PDF) that this algorithm gives $O(n \log^t n)$ for some $t$. I believe (but haven't proved) that if all blocks are required to have the same slope, then you need at least $\Omega(n^{1+\epsilon})$ in some (reasonably typical) cases.

Of course, in practice, it may be the case that $t$ is large enough or $\epsilon$ is small enough that it doesn't matter. And the $O(n^2)$ factor (looping through all blocks) may be negligible in practice compared to the actual arb multiplications.

Would need a benchmark though.

> we can reduce its precision or skip it entirely.

Note that to ensure correctness, it would be necessary to add a small error term to each affected element of the output, which takes linear time per discarded block. Alternatively, one can pre-add the error at the start or end (which is what I do here).
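Concretely, the bookkeeping for a skipped block might look like the following (the mag/arb calls are standard; the bound computation and block layout are schematic assumptions, not this PR's code):

```c
/* Sketch only (not this PR's code): when the product of two blocks with
   magnitude bounds xb and yb is discarded, its whole possible contribution
   must be absorbed into the radii of the affected output coefficients
   res[lo..hi]. */
#include <flint/arb.h>

void
skip_block_add_error(arb_ptr res, slong lo, slong hi,
                     const mag_t xb, const mag_t yb, slong block_len)
{
    slong k;
    mag_t err;

    mag_init(err);
    mag_mul(err, xb, yb);              /* bound on one product term */
    mag_mul_ui(err, err, block_len);   /* at most block_len terms per output index */

    for (k = lo; k <= hi; k++)         /* linear time per discarded block */
        arb_add_error_mag(res + k, err);

    mag_clear(err);
}
```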

@user202729 (Contributor, Author) commented Jan 23, 2026

Thinking about the comment above: with the current implementation of mullow_block, if we were to adapt it to just prune away the small blocks, one issue is the order in which the blocks should be added. We want to add the blocks from large to small, so that the errors coming from the larger ones allow us to skip the computations for the smaller ones.

I can think of a few ways:

  • estimate the magnitudes of the output coefficients by doing a max-plus convolution on the fmpz exponents (see the sketch after this list). Since this step does not depend on the precision, we may hope it is reasonably fast compared to the actual multiplication (with a sufficiently careful implementation).
  • do the convex convolution in order to determine, on each diagonal, which part should be computed first.
  • randomize the order in which the blocks are added. This one probably gives an additional log n factor.
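For reference, the operation in the first bullet, shown here in its naive quadratic form over slong exponents (an approximate $O(n \log^2 n)$ version is what would actually be wanted):

```c
/* Naive O(n^2) max-plus convolution of exponent sequences, for illustration
   only: c[k] = max over i+j=k of (a[i] + b[j]), an estimate of log2 of the
   k-th output coefficient when a and b hold log2-magnitudes of the inputs. */
#include <flint/flint.h>

void
maxplus_convolution_naive(slong * c, const slong * a, slong alen,
                          const slong * b, slong blen)
{
    slong i, j;

    for (i = 0; i < alen + blen - 1; i++)
        c[i] = WORD_MIN;

    for (i = 0; i < alen; i++)
        for (j = 0; j < blen; j++)
            if (a[i] + b[j] > c[i + j])
                c[i + j] = a[i] + b[j];
}
```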

I wonder if it's possible to tighten the analysis to shave off a log n factor. The original paper (Bringmann and Cassis, 2023) introduces a log n factor in the block decomposition (Lemma 24); when adapted to polynomial convolution, I think another log n factor is introduced because of the FFT.
