Description
In the Velox and cuDF integration, converting the bool type is expensive: Velox stores one bit per value, while cuDF stores one byte per value. It would be better to have a bool type aligned with Velox, so that we only need to copy large buffers (the data buffer and the null buffer) instead of setting values one by one. The same applies to Timestamp.
With this approach, we can perform the conversion at any point, taking memory pressure, runtime statistics, and operator fallback into account. For example, if the final-round join is under memory pressure and cannot run on the GPU without spilling, running the remaining join on multiple CPU threads instead of spilling to disk may be a good trade-off.
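To illustrate the bit-vs-byte mismatch described above, here is a minimal NumPy sketch (illustrative only, not the Velox or cuDF API), assuming Arrow-style LSB-first bit packing:

```python
import numpy as np

# Velox (like Arrow) packs booleans as 1 bit per value, LSB-first.
values = np.array([1, 0, 1, 1, 0, 1, 0, 0], dtype=np.uint8)
bit_packed = np.packbits(values, bitorder="little")  # one byte holds 8 values

# cuDF stores booleans as 1 byte per value, so the conversion must widen
# every bit into its own byte instead of doing a single bulk buffer copy.
byte_per_value = np.unpackbits(bit_packed, bitorder="little", count=len(values))

assert byte_per_value.tolist() == values.tolist()
# If both sides used the same layout, this whole step would collapse into
# one memcpy of the data buffer (and one of the null buffer).
```

The same widening applies to the validity (null) buffer, which is why aligning the layouts removes per-value work from the conversion path entirely.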
| System goal | Timestamp type | Example systems / libraries | Notes |
|---|---|---|---|
| Maximum time range | int128 | Some database engines, experimental extensions in Apache Arrow, specialized scientific time libraries | Used when timelines may span thousands or millions of years, or when avoiding overflow is critical |
| High-performance analytics | int64 | RAPIDS cuDF, Pandas, Apache Arrow | Fast arithmetic, smaller memory footprint, and native support on CPUs/GPUs; covers ~1677–2262 for nanoseconds |
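The ~1677–2262 range quoted in the table for int64 nanoseconds can be verified with a quick back-of-the-envelope calculation:

```python
# int64 nanoseconds since the 1970 Unix epoch cover roughly +/- 292 years.
NS_MAX = 2**63 - 1
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, close enough here

years_each_side = NS_MAX / 1e9 / SECONDS_PER_YEAR  # about 292.3 years
earliest = 1970 - years_each_side  # about 1677
latest = 1970 + years_each_side    # about 2262
```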
For StringView, the newest Arrow version also supports this format: strings of size <= 12 bytes are inlined in the view, and longer strings are stored in several contiguous data buffers. This avoids having to allocate one large buffer up front, and concatenating vectors also benefits from it. As far as I know, GPU-to-GPU copies are more expensive than CPU-to-GPU copies; please correct me if I'm wrong.