Avro
Input | Output | Alias |
---|---|---|
✔ | ✔ |
Description
Apache Avro is a row-oriented serialization format that uses binary encoding for efficient data processing. The Avro
format supports reading and writing Avro data files. This format expects self-describing messages with an embedded schema. If you're using Avro with a schema registry, refer to the AvroConfluent
format.
Data type mapping
The table below shows all data types supported by the Apache Avro format, and their corresponding ClickHouse data types in INSERT
and SELECT
queries.
Avro data type INSERT | ClickHouse data type | Avro data type SELECT |
---|---|---|
boolean , int , long , float , double | Int(8\16\32), UInt(8\16\32) | int |
boolean , int , long , float , double | Int64, UInt64 | long |
boolean , int , long , float , double | Float32 | float |
boolean , int , long , float , double | Float64 | double |
bytes , string , fixed , enum | String | bytes or string * |
bytes , string , fixed | FixedString(N) | fixed(N) |
enum | Enum(8\16) | enum |
array(T) | Array(T) | array(T) |
map(V, K) | Map(V, K) | map(string, K) |
union(null, T) , union(T, null) | Nullable(T) | union(null, T) |
union(T1, T2, …) ** | Variant(T1, T2, …) | union(T1, T2, …) ** |
null | Nullable(Nothing) | null |
int (date) *** | Date, Date32 | int (date) *** |
long (timestamp-millis) *** | DateTime64(3) | long (timestamp-millis) *** |
long (timestamp-micros) *** | DateTime64(6) | long (timestamp-micros) *** |
bytes (decimal) *** | DateTime64(N) | bytes (decimal) *** |
int | IPv4 | int |
fixed(16) | IPv6 | fixed(16) |
bytes (decimal) *** | Decimal(P, S) | bytes (decimal) *** |
string (uuid) *** | UUID | string (uuid) *** |
fixed(16) | Int128/UInt128 | fixed(16) |
fixed(32) | Int256/UInt256 | fixed(32) |
record | Tuple | record |
* bytes
is default, controlled by setting output_format_avro_string_column_pattern
** The Variant type implicitly accepts null
as a field value, so for example the Avro union(T1, T2, null)
will be converted to Variant(T1, T2)
.
As a result, when producing Avro from ClickHouse, we have to always include the null
type to the Avro union
type set as we don't know if any value is actually null
during the schema inference.
Unsupported Avro logical data types:
time-millis
time-micros
duration
Format settings
Setting | Description | Default |
---|---|---|
input_format_avro_allow_missing_fields | Whether to use a default value instead of throwing an error when a field is not found in the schema. | 0 |
input_format_avro_null_as_default | Whether to use a default value instead of throwing an error when inserting a null value into a non-nullable column. | 0 |
output_format_avro_codec | Compression algorithm for Avro output files. Possible values: null , deflate , snappy , zstd . | |
output_format_avro_sync_interval | Sync marker frequency in Avro files (in bytes). | 16384 |
output_format_avro_string_column_pattern | Regular expression to identify String columns for Avro string type mapping. By default, ClickHouse String columns are written as Avro bytes type. | |
output_format_avro_rows_in_file | Maximum number of rows per Avro output file. When this limit is reached, a new file is created (if the storage system supports file splitting). | 1 |
Examples
Reading Avro data
To read data from an Avro file into a ClickHouse table:
The root schema of the ingested Avro file must be of type record
.
To find the correspondence between table columns and fields of Avro schema, ClickHouse compares their names. This comparison is case-sensitive and unused fields are skipped.
Data types of ClickHouse table columns can differ from the corresponding fields of the Avro data inserted. When inserting data, ClickHouse interprets data types according to the table above and then casts the data to the corresponding column type.
While importing data, when a field is not found in the schema and setting input_format_avro_allow_missing_fields
is enabled, the default value will be used instead of throwing an error.
Writing Avro data
To write data from a ClickHouse table into an Avro file:
Column names must:
- Start with
[A-Za-z_]
- Be followed by only
[A-Za-z0-9_]
The output compression and sync interval for Avro files can be configured using the output_format_avro_codec
and output_format_avro_sync_interval
settings, respectively.
Inferring the Avro schema
Using the ClickHouse DESCRIBE
function, you can quickly view the inferred format of an Avro file like the following example.
This example includes the URL of a publicly accessible Avro file in the ClickHouse S3 public bucket: