
Ingestion patterns

This guide covers how to write data into ClickHouse from a connector, ETL pipeline, or integration. It pairs with the consumption patterns guide and the integration development best practices index.

For language-specific insert APIs, see the client reference pages: Java/JDBC, Python, Go, JavaScript.

Choosing a write protocol

  • HTTP API: language-agnostic connectors, streaming ingestion, maximum control over format and compression. The recommended default for new integrations.
  • JDBC: Java connectors, BI tools, and frameworks that already depend on a JDBC DataSource.
  • Native client libraries: Go, Python, and JavaScript connectors where idiomatic code and maximum throughput matter. The native protocol avoids HTTP overhead.
ClickHouse Cloud

ClickHouse Cloud exposes only HTTPS (port 8443) and the secure native port (9440). Design your integration to require TLS. Plaintext connections are not accepted.

Batch insert fundamentals

Every INSERT statement creates a new on-disk data part. ClickHouse merges parts asynchronously in the background. If inserts arrive faster than merges complete, the active part count in a partition exceeds the default threshold (300 parts) and ClickHouse raises:

DB::Exception: Too many parts (300). Merges are processing significantly slower than inserts.

This is the most common production failure mode for connector-driven ingest. The fix is larger batches: batch client-side where possible, or use async inserts (covered below) when the client cannot.

Rules:

  • Minimum 1,000 rows per INSERT; target 10,000–100,000 rows
  • No more than 1–2 INSERTs per second per table from a single process
  • Never INSERT one row at a time from an ORM loop
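
The rules above usually mean buffering rows client-side and flushing on a row-count or age threshold, whichever fires first. A minimal sketch in Python (the `BatchBuffer` name, thresholds, and `flush_fn` callback are illustrative, not part of any client library):

```python
import time
from typing import Callable, List, Optional

class BatchBuffer:
    """Accumulates rows client-side and flushes them as one INSERT.

    `flush_fn` is whatever performs the actual INSERT (for example a
    client-library call); it receives the buffered rows as a list of tuples.
    """

    def __init__(self, flush_fn: Callable[[List[tuple]], None],
                 max_rows: int = 50_000, max_seconds: float = 1.0):
        self.flush_fn = flush_fn
        self.max_rows = max_rows
        self.max_seconds = max_seconds
        self.rows: List[tuple] = []
        self.first_row_at: Optional[float] = None

    def add(self, row: tuple) -> None:
        if self.first_row_at is None:
            self.first_row_at = time.monotonic()
        self.rows.append(row)
        # Flush on row count or buffer age, whichever triggers first.
        if (len(self.rows) >= self.max_rows or
                time.monotonic() - self.first_row_at >= self.max_seconds):
            self.flush()

    def flush(self) -> None:
        if self.rows:
            self.flush_fn(self.rows)
            self.rows = []
            self.first_row_at = None
```

Wire `flush_fn` to your client's insert call, and call `flush()` once more on shutdown so the final partial batch is not lost.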

HTTP (curl)

curl --user "user:password" \
     "https://host:8443/?query=INSERT+INTO+events+FORMAT+JSONEachRow" \
     --data '{"user_id":"alice","event":"login","ts":"2024-01-15 12:00:00"}
{"user_id":"bob","event":"signup","ts":"2024-01-15 12:01:00"}'

JDBC (Java)

try (PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO events (user_id, event, ts) VALUES (?, ?, ?)")) {
    for (Event e : batch) {
        ps.setString(1, e.userId());
        ps.setString(2, e.event());
        ps.setObject(3, e.ts());
        ps.addBatch();
    }
    ps.executeBatch();
}

Create a new PreparedStatement per batch cycle. In JDBC driver versions 0.8.6 and above, reusing a PreparedStatement across multiple executeBatch() calls re-inserts rows from all previous batches because the batch buffer is not cleared between executions.

Python

client.insert(
    "events",
    data=rows,           # list of dicts or list of tuples
    column_names=["user_id", "event", "ts"],
)

See the Python client reference for connection setup and driver options.

Go

batch, err := conn.PrepareBatch(ctx, "INSERT INTO events (user_id, event, ts)")
if err != nil {
    return err
}
for _, e := range rows {
    if err := batch.Append(e.UserID, e.Event, e.TS); err != nil {
        return err
    }
}
return batch.Send()

See the Go client reference for connection setup.

JavaScript

await client.insert({
    table: 'events',
    values: createReadStream('./events.ndjson'),
    format: 'JSONEachRow',
});

See the JavaScript client reference for streaming and connection setup.

Choosing an insert format

  • JSONEachRow: default for most connectors. One JSON object per line, easy to generate and stream, human-readable.
  • RowBinary / RowBinaryWithNamesAndTypes: high-volume pipelines where throughput is the bottleneck. Binary row format, type-safe, no JSON serialization overhead.
  • Native: highest performance. Binary columnar format. Use when calling the native client library directly; not practical over generic HTTP.
  • CSV / TSV: file-based imports and spreadsheet tools.
  • Parquet / Arrow: data lakehouse integrations; columnar exchange with Spark, dbt, DuckDB.

Use JSONEachRow as the default for new integrations. Switch to RowBinary when profiling shows serialization or network throughput is the bottleneck.

# JSONEachRow insert over HTTP
curl --user "user:password" \
     "https://host:8443/?query=INSERT+INTO+events+FORMAT+JSONEachRow" \
     --data-binary @events.ndjson

# RowBinaryWithNamesAndTypes — types are self-described in the header
curl --user "user:password" \
     "https://host:8443/?query=INSERT+INTO+events+FORMAT+RowBinaryWithNamesAndTypes" \
     --data-binary @events.rowbinary
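
Generating a JSONEachRow body client-side is just one JSON object per row, newline-separated. A minimal sketch (the `to_json_each_row` helper and the column names are illustrative):

```python
import json

def to_json_each_row(rows, columns):
    """Serialize row tuples into a JSONEachRow payload:
    one compact JSON object per line, keyed by column name."""
    lines = (json.dumps(dict(zip(columns, row)), separators=(",", ":"))
             for row in rows)
    return "\n".join(lines) + "\n"

payload = to_json_each_row(
    [("alice", "login", "2024-01-15 12:00:00")],
    ["user_id", "event", "ts"],
)
# Each line is a self-contained JSON object the server parses independently,
# so the payload can be streamed without framing.
```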

Async insert

async_insert=1 tells the server to buffer incoming inserts and flush them in bulk. This is useful when the client cannot form large batches — for example, many small producers writing individual events.

Enabling via HTTP

curl --user "user:password" \
     "https://host:8443/?async_insert=1&wait_for_async_insert=1" \
     --data "INSERT INTO events FORMAT JSONEachRow {\"user_id\":\"alice\",\"event\":\"login\"}"

Enabling via JDBC

Properties props = new Properties();
props.setProperty("async_insert", "1");
props.setProperty("wait_for_async_insert", "1");
Connection conn = DriverManager.getConnection("jdbc:ch:https://host:8443/db", props);
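
Over plain HTTP from any language, async insert is just two extra query parameters on the insert URL. A Python sketch of assembling that URL with only the standard library (the host, port, and `insert_url` helper are placeholders):

```python
from urllib.parse import urlencode

def insert_url(base, query, settings):
    """Build a ClickHouse HTTP insert URL; settings such as
    async_insert ride along as ordinary query parameters."""
    params = {"query": query, **settings}
    return f"{base}/?{urlencode(params)}"

url = insert_url(
    "https://host:8443",
    "INSERT INTO events FORMAT JSONEachRow",
    {"async_insert": 1, "wait_for_async_insert": 1},
)
# POST the JSONEachRow payload as the request body to this URL.
```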

Critical gotchas

SETTINGS placement. The SETTINGS clause must appear before VALUES in an INSERT statement:

-- Correct
INSERT INTO events SETTINGS async_insert=1 VALUES (?, ?)

-- Wrong — server does not detect async mode; parts accumulate rapidly
INSERT INTO events VALUES (?, ?) SETTINGS async_insert=1

wait_for_async_insert=0 silently drops bad data. With fire-and-forget mode, the server returns HTTP 200 immediately — before the buffer is validated or flushed. If a row has a type error, the entire buffered batch is silently dropped. No error is returned to the client.

Use wait_for_async_insert=1 in most connectors to get synchronous error feedback. Reserve wait_for_async_insert=0 for pipelines where you accept eventual delivery and are not doing per-insert validation.

Monitor for silent failures. Query system.asynchronous_insert_log to detect dropped batches:

SELECT
    event_time,
    table,
    rows,
    status,
    exception
FROM system.asynchronous_insert_log
WHERE status != 'Ok'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC;

Idempotent inserts with deduplication

Set insert_deduplication_token to a stable identifier per logical batch (for example, job_id + batch_number). ClickHouse records a checksum for each insert and silently skips re-inserts with the same token within the deduplication window (default: last 100 inserts).

curl --user "user:password" \
     "https://host:8443/?insert_deduplication_token=pipeline-job-42-batch-007" \
     --data "INSERT INTO events FORMAT JSONEachRow ..."

On retry after a network error, send the same token. If the original insert succeeded, the retry is a no-op. If it did not, ClickHouse processes it normally.
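
The retry loop itself can be sketched as follows. The token is derived once, before the first attempt, so every retry carries the same value; the `send` callable stands in for whatever performs the actual INSERT with `insert_deduplication_token` set (names and backoff constants are illustrative):

```python
import time

def insert_with_retry(send, job_id, batch_no, payload, max_attempts=5):
    """Retry an insert with a stable deduplication token so that a
    retry after an ambiguous failure can never double-write the batch."""
    token = f"{job_id}-batch-{batch_no:03d}"   # stable across all attempts
    for attempt in range(max_attempts):
        try:
            # `send` performs the INSERT, passing `token` as the
            # insert_deduplication_token setting.
            return send(payload, token)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt * 0.1)  # exponential backoff
```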

Edge cases

Deduplication is per-shard. On a distributed cluster, retrying on a different shard will insert a duplicate. Pin retries to the same shard, or use an idempotent downstream merge strategy.
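
One way to satisfy the same-shard requirement, assuming the connector routes inserts itself across a known shard list, is to derive the shard index from the deduplication token with a stable digest (Python's built-in hash() is randomized per process, so it is not suitable):

```python
import hashlib

def shard_for_token(token: str, num_shards: int) -> int:
    """Map a deduplication token to a shard index deterministically,
    so retries of the same batch always target the same shard."""
    digest = hashlib.sha256(token.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```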

ALTER TABLE DELETE does not clear deduplication checksums. If you delete rows and then retry an insert with the same original content and token, ClickHouse considers it already seen and silently drops it. Assign a new token when re-inserting after a deletion.

The deduplication window is finite. Only the last 100 inserts are tracked by default (configurable via replicated_deduplication_window). Retries beyond that window are not deduplicated.

Columns with non-deterministic defaults bypass deduplication. A column with DEFAULT rand() or DEFAULT now() produces a different row checksum on each attempt — the two inserts are not considered duplicates and both are written. Use an explicit, stable value instead if deduplication matters for that column.

Writing missing JSON fields and DEFAULT expressions

When inserting JSON payloads that omit some columns, input_format_skip_unknown_fields=1 drops fields that do not exist in the table schema. However, columns absent from the JSON receive the type default (0 for integers, empty string for strings) — not the DDL DEFAULT expression.

To invoke DDL defaults like DEFAULT now() or DEFAULT generateUUIDv4(), enable both settings together:

curl --user "user:password" \
     "https://host:8443/?input_format_skip_unknown_fields=1&input_format_defaults_for_omitted_fields=1" \
     --data "INSERT INTO events FORMAT JSONEachRow {\"user_id\":\"alice\",\"event\":\"login\"}"

With input_format_defaults_for_omitted_fields=1, a column with DEFAULT generateUUIDv4() that is absent from the JSON will receive a new UUID per row. Without it, that column receives the type default (the all-zero UUID).

Schema evolution

Adding columns

ALTER TABLE ... ADD COLUMN is a metadata-only operation — no data is rewritten. New rows receive the specified default; existing rows return the DEFAULT expression value on read.

ALTER TABLE events ADD COLUMN region LowCardinality(String) DEFAULT '';

Dropping columns

Dropping a column is safe but irreversible. Before dropping, check for dependent Materialized Views and dictionaries:

SELECT database, name, query
FROM system.tables
WHERE engine = 'MaterializedView'
  AND query LIKE '%my_column%';

Adding enum values

Adding a value to an Enum column that is not in the ORDER BY key is a zero-cost metadata change. If the Enum column is part of the ORDER BY or partition key, adding values requires a full table rebuild. For value sets that may grow over time, use LowCardinality(String) instead.

Changing column types

Type changes require a background mutation that rewrites affected parts:

ALTER TABLE events MODIFY COLUMN payload String;

Mutations run asynchronously. Poll system.mutations WHERE is_done = 0 to track progress. Run type changes during low-traffic windows — they compete for I/O with normal query workload.

CDC and upsert patterns

ClickHouse has no ACID transactions, and row-level updates are asynchronous mutations rather than in-place writes. Use the appropriate table engine for your update pattern.

ReplacingMergeTree

Append rows with the same primary key; the engine retains the latest version during background merges. Use this for CDC pipelines where the source emits full row snapshots.

CREATE TABLE users (
    user_id  UInt64,
    name     String,
    email    String,
    version  UInt64
) ENGINE = ReplacingMergeTree(version)
ORDER BY user_id;

Insert new and updated rows as normal appends:

INSERT INTO users (user_id, name, email, version) VALUES
    (1, 'Alice', 'alice@example.com', 1704067200),
    (1, 'Alice', 'alice@new.example.com', 1704153600);

Merges happen asynchronously. To read deduplicated results immediately, add FINAL:

SELECT * FROM users FINAL WHERE user_id = 1;

FINAL forces deduplication at query time and has a performance cost. For high-read workloads, prefer scheduling periodic OPTIMIZE TABLE users FINAL during off-peak hours instead.

CollapsingMergeTree

For CDC systems that emit discrete change events, CollapsingMergeTree uses a sign column: 1 for inserts and -1 for deletes. The engine cancels paired rows during merge.

CREATE TABLE user_events (
    user_id  UInt64,
    state    String,
    sign     Int8
) ENGINE = CollapsingMergeTree(sign)
ORDER BY user_id;

-- Insert
INSERT INTO user_events VALUES (1, 'active', 1);

-- Delete (cancel previous state)
INSERT INTO user_events VALUES (1, 'active', -1);

-- Insert new state
INSERT INTO user_events VALUES (1, 'churned', 1);
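
Translating a generic CDC change event into sign rows can be sketched as follows. The event shape, with op, before, and after fields, mirrors Debezium-style payloads and is an assumption, not part of the engine API:

```python
def change_to_sign_rows(event):
    """Turn one CDC change event into CollapsingMergeTree rows.

    'c' (create) emits one +1 row; 'd' (delete) emits one -1 row
    cancelling the prior state; 'u' (update) emits both."""
    op = event["op"]
    before, after = event.get("before"), event.get("after")
    rows = []
    if op in ("u", "d"):
        rows.append((before["user_id"], before["state"], -1))  # cancel old state
    if op in ("c", "u"):
        rows.append((after["user_id"], after["state"], 1))     # assert new state
    return rows
```

The -1 row must replay the prior state exactly; if the connector cannot reconstruct it, CollapsingMergeTree is the wrong engine and ReplacingMergeTree is the safer choice.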

Avoid ALTER TABLE DELETE for high-frequency changes

ALTER TABLE ... DELETE is an asynchronous mutation. It returns success immediately but rewrites affected parts in the background. A SELECT immediately after a DELETE may still return the deleted rows. Do not use it for per-row or per-event deletes — use ReplacingMergeTree or CollapsingMergeTree instead.

Materialized views and row counts

When a Materialized View exists on a source table, the rows_affected count returned by an INSERT includes rows written to both the source table and all MV target tables. Inserting 1,000 rows returns 2,000 rows affected with one MV attached.

Do not use server-reported row counts for ETL consistency validation. Use a post-insert count query instead:

SELECT count()
FROM events
WHERE ts >= {batch_start:DateTime64(3)}
  AND ts < {batch_end:DateTime64(3)};