[FEA] Support bool based on bit, Timestamp more range and StringView #21782

@jinchengchenghh

Description

In the Velox and cuDF integration, converting the bool type is expensive: the source Velox bool type uses one bit per value, whereas cuDF uses one byte per value. It would be better for cuDF to have a bool type whose layout matches Velox's, so that we only need to copy one large buffer (the data buffer and the null buffer) instead of setting the values one by one. The same applies to Timestamp.
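
To make the cost difference concrete, here is a minimal sketch in plain Python, not the actual Velox or cuDF code. It assumes an LSB-first bit order within each byte (as in the Arrow validity bitmap layout); the function names `unpack_bits` and `copy_bits` are hypothetical. The first function does per-value work to produce a byte-per-value buffer, while the second is what a matching layout would allow: a single bulk copy.

```python
def unpack_bits(packed: bytes, num_values: int) -> bytes:
    """Expand a bit-packed bool buffer into one byte per value.

    Assumes LSB-first bit order within each byte (an assumption for
    this sketch). The work is proportional to the number of values.
    """
    out = bytearray(num_values)
    for i in range(num_values):
        out[i] = (packed[i >> 3] >> (i & 7)) & 1
    return bytes(out)


def copy_bits(packed: bytes, num_values: int) -> bytes:
    """Copy the bit-packed buffer as-is, which is all that is needed
    when source and destination share the same layout."""
    num_bytes = (num_values + 7) // 8  # round up to whole bytes
    return bytes(packed[:num_bytes])


# Ten values: bits 0, 2 and 8 are set.
packed = bytes([0b00000101, 0b00000001])
print(list(unpack_bits(packed, 10)))  # [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]
print(len(copy_bits(packed, 10)))     # 2 bytes instead of 10
```

The bit-packed form is also 8x smaller, so a matching layout would reduce the amount of data moved between host and device as well.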

With this approach, we can do the conversion at any time, taking memory, runtime statistics, and operator fallback into account. For example, if the final round of a join is under memory pressure and cannot be processed on the GPU without spilling, running the subsequent join on multiple CPU threads instead of spilling to disk may be a good trade-off.

| System goal | Timestamp type | Example systems / libraries | Notes |
|---|---|---|---|
| Maximum time range | int128 | Some database engines, experimental extensions in Apache Arrow, specialized scientific time libraries | Used when timelines may span thousands or millions of years, or when avoiding overflow is critical |
| High-performance analytics | int64 | RAPIDS cuDF, Pandas, Apache Arrow | Fast arithmetic, smaller memory footprint, and native support on CPUs/GPUs; covers ~1677–2262 for nanoseconds |
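
The ~1677–2262 figure in the table can be checked with a short calculation. The sketch below assumes a signed 64-bit tick count relative to the Unix epoch (1970-01-01) and an average Gregorian year of 365.2425 days; the names `span_years` and `nanosecond_bounds` are hypothetical helpers for this illustration only.

```python
from datetime import datetime, timedelta

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
EPOCH = datetime(1970, 1, 1)
SECONDS_PER_YEAR = 365.2425 * 86400  # average Gregorian year


def span_years(ticks_per_second: int) -> float:
    """Years representable on each side of the epoch by an int64 tick count."""
    return INT64_MAX / ticks_per_second / SECONDS_PER_YEAR


def nanosecond_bounds():
    """Earliest and latest datetimes for int64 nanoseconds since the epoch.

    datetime only has microsecond resolution, so the nanosecond count is
    truncated to microseconds before being added to the epoch.
    """
    lower = EPOCH + timedelta(microseconds=INT64_MIN // 1000)
    upper = EPOCH + timedelta(microseconds=INT64_MAX // 1000)
    return lower, upper


lower, upper = nanosecond_bounds()
print(lower.year, upper.year)        # 1677 2262
print(round(span_years(10**9)))      # about 292 years either side for nanoseconds
print(round(span_years(10**6)))      # about 292,000 years for microseconds
```

Coarser units widen the range by the same factor by which they reduce precision, which is the trade-off a wider timestamp type avoids.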

For StringView, the newest Arrow version also supports this format: strings of size <= 12 are inlined, and the remaining string data is stored in several contiguous buffers. This removes the need to allocate the whole data buffer at once, and concatenating vectors also benefits from it. As far as I know, a GPU-to-GPU copy is more expensive than a CPU-to-GPU copy; please correct me if I'm wrong.
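
As an illustration of the layout described above, the following is a minimal sketch modeled on the 16-byte view of the Arrow variable-size binary view layout: a 4-byte length, followed by either 12 inline bytes or a 4-byte prefix, a buffer index, and an offset. The helper names (`encode_view`, `decode_view`, `append_string`) and the tiny `block_size` are hypothetical, chosen only to show how data can be spread over several buffers rather than one large allocation.

```python
import struct

INLINE_MAX = 12  # strings up to this size live inside the 16-byte view


def encode_view(data: bytes, buffer_index: int, offset: int) -> bytes:
    """Build one 16-byte view for a string."""
    n = len(data)
    if n <= INLINE_MAX:
        # length + inline bytes ("12s" zero-pads shorter strings)
        return struct.pack("<i12s", n, data)
    # length + 4-byte prefix + buffer index + offset into that buffer
    return struct.pack("<i4sii", n, data[:4], buffer_index, offset)


def decode_view(view: bytes, buffers: list) -> bytes:
    """Recover the string that a view refers to."""
    (n,) = struct.unpack_from("<i", view, 0)
    if n <= INLINE_MAX:
        return view[4:4 + n]
    _prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    return bytes(buffers[buffer_index][offset:offset + n])


def append_string(views: list, buffers: list, data: bytes, block_size: int = 32) -> None:
    """Append a string, starting a new data buffer when the current one is full."""
    if len(data) <= INLINE_MAX:
        views.append(encode_view(data, 0, 0))
        return
    if not buffers or len(buffers[-1]) + len(data) > block_size:
        buffers.append(bytearray())  # no need to size one big buffer up front
    offset = len(buffers[-1])
    buffers[-1] += data
    views.append(encode_view(data, len(buffers) - 1, offset))


views, buffers = [], []
for s in [b"gpu", b"a string longer than twelve", b"exactly12byt", b"another long string value"]:
    append_string(views, buffers, s)

print([decode_view(v, buffers) for v in views])
print(len(buffers))  # the two long strings ended up in separate buffers
```

Under this layout, concatenating two vectors only needs to append the views and adjust buffer indices; the existing data buffers can be reused without copying string bytes.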
