-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Describe the bug
I have a query predicate left = right where left is a column with type Utf8View and right is a column with type Dictionary(UInt8, LargeUtf8). The SQL planner returns Err for this query because it cannot find a coercion path for these two types. However, a coercion pathway exists: modifying the query to use left = arrow_cast(right, 'LargeUtf8') allows it to succeed.
To Reproduce
I am unable to attach a .rs file for some reason, so below is the contents of my reproducer which can very simply be compiled with datafusion as the only dependency.
#[cfg(test)]
mod tests {
use std::sync::Arc;
use datafusion::catalog::MemTable;
use datafusion::common::Result as DataFusionResult;
use datafusion::common::arrow::array::{
DictionaryArray, Int64Array, LargeStringArray, StringViewArray, UInt8Array, UInt64Array,
};
use datafusion::common::arrow::datatypes::{DataType, Field, Schema};
use datafusion::common::arrow::record_batch::RecordBatch;
use datafusion::prelude::SessionContext;
const STATES: [&str; 50] = [
"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "IA",
"KS", "KY", "LA", "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "MY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT",
"VA", "WA", "WV", "WI", "WY",
];
async fn create_table_with_text() -> DataFusionResult<MemTable> {
let states_schema = Arc::new(Schema::new(vec![
Field::new("rank", DataType::UInt64, false),
Field::new("state", DataType::Utf8View, false),
]));
let (ranks, states) = STATES
.iter()
.enumerate()
.collect::<(Vec<usize>, Vec<&str>)>();
let states_batch = RecordBatch::try_new(
Arc::clone(&states_schema),
vec![
Arc::new(ranks.iter().map(|r| *r as u64).collect::<UInt64Array>()),
Arc::new(states.iter().map(Some).collect::<StringViewArray>()),
],
)?;
MemTable::try_new(states_schema, vec![vec![states_batch]])
}
async fn create_table_with_dictionary() -> DataFusionResult<MemTable> {
let gub_schema = Arc::new(Schema::new(vec![
Field::new("row_id", DataType::UInt64, false),
Field::new(
"state",
DataType::Dictionary(DataType::UInt8.into(), DataType::LargeUtf8.into()),
true,
),
]));
let (ids, state_keys) = (1..=1000)
.map(|i| (i as u64, ((i + 10 * i) % STATES.len()) as u8))
.collect::<(Vec<u64>, Vec<u8>)>();
let state_values = STATES.iter().map(Some).collect::<LargeStringArray>();
let batch = RecordBatch::try_new(
Arc::clone(&gub_schema),
vec![
Arc::new(UInt64Array::from(ids)),
Arc::new(DictionaryArray::new(
UInt8Array::from(state_keys),
Arc::new(state_values),
)),
],
)?;
MemTable::try_new(gub_schema, vec![vec![batch]])
}
/// Test evaluating equality of the `states.states` string column against the `gub.states` dictionary column.
///
/// We run the same query twice. First we provide the coercion pathway ourselves using the
/// `arrow_cast` function to demonstrate that there is a straightforward coercion.
/// Then we evaluate the query without the explicit coercion to demonstrate that the SQL
/// planner does not find this coercion.
#[tokio::test]
async fn join() -> DataFusionResult<()> {
let df: SessionContext = Default::default();
df.register_table("states", Arc::new(create_table_with_text().await?))?;
df.register_table("gub", Arc::new(create_table_with_dictionary().await?))?;
let expect_batch = vec![RecordBatch::try_new(
Arc::new(Schema::new(vec![Field::new(
"count(*)",
DataType::Int64,
false,
)])),
vec![Arc::new(Int64Array::from(vec![1000]))],
)?];
let query_cast = format!(
"SELECT count(*) FROM gub JOIN states ON states.rank = ((gub.row_id + 10 * gub.row_id) % {}) WHERE states.state = arrow_cast(gub.state, 'Utf8')",
STATES.len()
);
let frame_cast = df.sql(&query_cast).await.expect("query_cast error");
let batch_cast = frame_cast
.collect()
.await
.expect("query_cast evaluation error");
assert_eq!(batch_cast, expect_batch);
let query_nocast = format!(
"SELECT count(*) FROM gub JOIN states ON states.rank = ((gub.row_id + 10 * gub.row_id) % {}) WHERE states.state = gub.state",
STATES.len()
);
let frame_nocast = df.sql(&query_nocast).await.expect("query_nocast error");
let batch_nocast = frame_nocast
.collect()
.await
.expect("query_nocast evaluation error");
assert_eq!(batch_nocast, expect_batch);
Ok(())
}
}
Expected behavior
The SQL planner should find and deploy the coercion pathway on its own - the author of the query should not have to deploy the arrow_cast function.
Additional context
Arguably this is a feature request but I am filing it as a bug because I have experienced it as such.
The reproducer above is adapted from a test I have written for my application which creates the same data but using SQL rather than the MemTable. That is, we have CREATE TABLE states (rank INT, state TEXT) and then the other table uses a custom table provider which results in the dictionary type.
In DataFusion 44, the SQL created the state column with Arrow data type Utf8 or LargeUtf8.
In DataFusion 50, the SQL created the state column with Arrow data type Utf8View.
This has broken my test because there previously was a coercion pathway for comparing the state column against my Dictionary type column, but now there is no such coercion pathway.
This looks related to #12511.