Integration Development Best Practices
This section is for developers building integrations on top of ClickHouse — BI tool connectors, ETL/ELT pipelines, data catalog adapters, IDE plugins, or any software that connects to ClickHouse on behalf of end users.
If you are a developer writing an application that happens to use ClickHouse directly, the language client documentation is a better starting point.
Start here based on what you are building:
| I am building... | Start with |
|---|---|
| A data pipeline or ETL connector that writes to ClickHouse | Ingestion patterns |
| A BI tool, query builder, or data catalog that reads from ClickHouse | Consumption patterns |
| A Java/JDBC connector | JDBC connector guide |
| A Python integration | Python connector guide |
| A Go integration | Go connector guide |
| A JavaScript/TypeScript integration | JavaScript connector guide |
The rest of this page covers protocol selection, authentication, and a full gotcha reference and checklist that apply regardless of language.
Choosing a connectivity protocol
ClickHouse exposes several interfaces. Choose the one that best fits your integration's language and architecture:
| Protocol | Port | Best for |
|---|---|---|
| HTTP API | 8123 (HTTP), 8443 (HTTPS) | Custom connectors in any language, REST-oriented tools, maximum control over format and compression |
| JDBC | 8123 / 8443 | Java-based integrations, BI tools and frameworks that expect a standard JDBC DataSource |
| ODBC | 8123 / 8443 | Windows-native tools, Excel, legacy enterprise software |
| Native TCP | 9000 / 9440 | Internal tooling where maximum throughput is required; not recommended for external integrations as the protocol is not versioned for stability |
| MySQL wire protocol | 9004 | Tools with built-in MySQL support and no ClickHouse driver available |
| PostgreSQL wire protocol | 9005 | Tools with built-in PostgreSQL support and no ClickHouse driver available |
HTTP API is the recommended foundation for new integrations. It is stable, format-agnostic, works from any language, and exposes the full feature set of ClickHouse. JDBC is the best choice when your integration targets the Java ecosystem or needs to interoperate with JDBC-aware frameworks.
ClickHouse Cloud exposes only the HTTPS port (8443) and the secure native port (9440). Plaintext HTTP connections are not accepted. Design your integration to require TLS from the start.
Authentication
ClickHouse uses username and password authentication. Pass credentials either as HTTP Basic Auth headers or as URL parameters.
HTTP API
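A minimal sketch of authenticated access over the HTTP API using the Python requests library; the hostname is a placeholder and the credentials are read from the environment, as recommended below:

```python
import os
import requests

# Placeholders — read real values from env vars or a secrets manager.
url = os.environ.get("CLICKHOUSE_URL", "https://example-service.clickhouse.cloud:8443/")
user = os.environ["CLICKHOUSE_USER"]
password = os.environ["CLICKHOUSE_PASSWORD"]

# Credentials travel in headers (X-ClickHouse-User / X-ClickHouse-Key) rather than the URL,
# so they never appear in logged request strings.
response = requests.post(
    url,
    headers={"X-ClickHouse-User": user, "X-ClickHouse-Key": password},
    params={"query": "SELECT version()"},
    timeout=30,
)
response.raise_for_status()
print(response.text.strip())
```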
JDBC
Never embed credentials in connection URLs that may appear in logs or error messages. Read them from environment variables or a secrets manager.
Dedicated service accounts
Create a dedicated ClickHouse user for your integration with only the permissions it needs. Avoid connecting as default:
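A hedged sketch of creating a scoped user, issued from Python over the HTTP API; the user name, database, and admin credentials are placeholders, and the grants should be adjusted to whatever your integration actually needs:

```python
import requests

admin_auth = ("admin_user", "***")  # an administrative account; placeholder credentials
url = "https://example-service.clickhouse.cloud:8443/"

ddl = [
    "CREATE USER IF NOT EXISTS my_connector IDENTIFIED BY 'generate-a-strong-password'",
    "GRANT SELECT ON analytics.* TO my_connector",  # read-only access to one database
]
for statement in ddl:
    requests.post(url, auth=admin_auth, params={"query": statement}, timeout=30).raise_for_status()
```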
SSL/TLS
TLS is required for connections to ClickHouse Cloud and strongly recommended for self-managed production clusters. For HTTPS connections, sslmode=strict (the default) verifies the server certificate. Use sslmode=none only in isolated development environments — never in production or user-facing integrations.
Schema discovery
ClickHouse exposes rich metadata through system.* tables. Use these to enumerate databases, tables, columns, and other objects for features like schema browsers, column pickers, and query editors.
Listing databases
Listing tables
Listing columns
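The three listings above map onto system.databases, system.tables, and system.columns. A minimal sketch over the HTTP API; hostname, credentials, and the analytics database and events table are placeholders:

```python
import json
import requests

url = "https://example-service.clickhouse.cloud:8443/"
auth = ("my_connector", "***")  # placeholders

def run(query: str):
    resp = requests.post(url, auth=auth, params={"query": query + " FORMAT JSONEachRow"}, timeout=30)
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines() if line]

databases = run(
    "SELECT name FROM system.databases "
    "WHERE name NOT IN ('system', 'INFORMATION_SCHEMA', 'information_schema')"
)
tables = run("SELECT database, name, engine, total_rows FROM system.tables WHERE database = 'analytics'")
columns = run(
    "SELECT name, type, position, is_in_primary_key, is_in_sorting_key "
    "FROM system.columns WHERE database = 'analytics' AND table = 'events'"
)
```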
INFORMATION_SCHEMA
ClickHouse also implements INFORMATION_SCHEMA for compatibility with tools that use standard SQL introspection queries. It covers a subset of metadata and is suitable when portability across databases matters:
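For example, a portable column listing; the schema and table names are placeholders:

```python
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={"query": (
        "SELECT column_name, data_type, is_nullable "
        "FROM INFORMATION_SCHEMA.COLUMNS "
        "WHERE table_schema = 'analytics' AND table_name = 'events' "
        "ORDER BY ordinal_position FORMAT JSONEachRow"
    )},
    timeout=30,
)
resp.raise_for_status()
```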
Prefer system.columns over INFORMATION_SCHEMA.columns for ClickHouse-specific metadata like is_in_sorting_key and is_in_primary_key, which are important for generating efficient queries. Also note that INFORMATION_SCHEMA.TABLES.TABLE_ROWS is always NULL in ClickHouse — use system.tables.total_rows instead for row count estimates. INFORMATION_SCHEMA also has no foreign key data (KEY_COLUMN_USAGE and REFERENTIAL_CONSTRAINTS are always empty — ClickHouse has no foreign key constraints).
Data type mapping
ClickHouse has a richer type system than most databases your integration will encounter. The following sections cover the types that require special handling.
Numeric types
ClickHouse signed integers map predictably. Unsigned integers require care — they exceed the range of their same-width signed Java/SQL counterparts and must be promoted:
| ClickHouse Type | Recommended Java/SQL mapping |
|---|---|
| Int8, Int16, Int32, Int64 | Byte, Short, Integer, Long |
| Int128, Int256 | BigInteger |
| UInt8 | Short |
| UInt16 | Integer |
| UInt32 | Long |
| UInt64 | BigInteger |
| UInt128, UInt256 | BigInteger |
| Float32, Float64 | Float, Double |
| Decimal32/64/128/256 | BigDecimal |
String types
String in ClickHouse is a raw byte sequence with no enforced encoding (UTF-8 is conventional). FixedString(N) is zero-padded to N bytes when read — strip trailing null bytes (\0) before displaying values to users.
Date and time types
| ClickHouse Type | Notes |
|---|---|
| Date | Days since 1970-01-01. No timezone. Map to LocalDate. |
| Date32 | Extended date range. No timezone. Map to LocalDate. |
| DateTime | Unix timestamp in seconds, stored with optional server/session timezone. Map to Instant or ZonedDateTime. |
| DateTime64(n) | Sub-second precision (n = 0–9). Same timezone behavior as DateTime. |
DateTime and DateTime64 values are stored as UTC internally. String representation depends on the server timezone and any timezone defined on the column. When the column has no explicit timezone, the server timezone is used for both reading and writing string values.
session_timezone is an experimental setting that can invalidate partition pruning on DateTime columns. Do not use it in production. Convert timestamps to UTC in your application layer instead, or pass them as Unix epoch values using fromUnixTimestamp64Nano() to avoid any ambiguity.
Type modifiers
Two modifiers wrap other types and must be handled by your integration:
- Nullable(T) — the column may contain NULL. Affects the type name returned in metadata (e.g., Nullable(Int32)). Strip the wrapper when mapping to target types.
- LowCardinality(T) — a dictionary-encoded form of T, used for performance. Treat identically to the underlying type for all practical purposes.
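A small helper for stripping these wrappers from type names returned by system.columns; this is a sketch, and the regex only peels the two wrappers named above (including nested cases such as LowCardinality(Nullable(String))):

```python
import re

def unwrap_type(ch_type: str) -> tuple[str, bool]:
    """Return (base_type, nullable) for a ClickHouse type name."""
    nullable = False
    while True:
        match = re.fullmatch(r"(LowCardinality|Nullable)\((.*)\)", ch_type)
        if not match:
            return ch_type, nullable
        if match.group(1) == "Nullable":
            nullable = True
        ch_type = match.group(2)  # peel one wrapper and continue

print(unwrap_type("LowCardinality(Nullable(String))"))  # ('String', True)
print(unwrap_type("Nullable(Int32)"))                   # ('Int32', True)
print(unwrap_type("UInt64"))                            # ('UInt64', False)
```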
Complex types
| ClickHouse Type | Behavior |
|---|---|
| Array(T) | Nested arrays supported. Via JDBC, returned as java.sql.Array. Via HTTP, returned as JSON arrays. |
| Map(K, V) | Key-value pairs. Via HTTP, returned as a JSON object. |
| Tuple(T1, T2, ...) | Fixed-length heterogeneous sequence. Via HTTP, returned as a JSON array. |
| Enum8, Enum16 | Returned as strings by default. Can be read as their underlying integer. |
| UUID | Returned as a string in xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx format over HTTP. |
| IPv4, IPv6 | Returned as dotted-decimal / colon-hex strings over HTTP. |
| JSON | Returned as a JSON string. Supported in newer ClickHouse versions. |
AggregateFunction columns hold binary-encoded aggregate state. They are not human-readable and should generally be excluded from column listings exposed to end users.
Query patterns
Use parameterized queries
Never concatenate user input into SQL strings. ClickHouse supports named query parameters that prevent SQL injection and improve query plan reuse.
HTTP API:
Parameters are declared as {name:type} in the query and passed as param_name HTTP parameters.
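A sketch of this over HTTP with requests; hostname, credentials, and the table are placeholders:

```python
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={
        # {user_id:UInt64} and {since:DateTime} are declared in the query text...
        "query": (
            "SELECT count() FROM analytics.events "
            "WHERE user_id = {user_id:UInt64} AND event_time >= {since:DateTime} "
            "FORMAT JSONEachRow"
        ),
        # ...and bound via param_<name> HTTP parameters; the server handles escaping.
        "param_user_id": "42",
        "param_since": "2024-01-01 00:00:00",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.text)
```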
JDBC:
Assign a query_id to every query
Set a deterministic query_id on each request. This lets you trace queries in system.query_log, cancel runaway queries, and implement idempotent retry logic.
HTTP API:
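A sketch in Python; the query_id is generated client-side with a recognizable prefix (the prefix and names are illustrative) and passed as a URL parameter:

```python
import uuid
import requests

query_id = f"acme-connector-{uuid.uuid4()}"  # stable prefix + unique suffix

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={
        "query": "SELECT count() FROM analytics.events FORMAT JSONEachRow",
        "query_id": query_id,  # visible in system.query_log; usable with KILL QUERY
    },
    timeout=30,
)
resp.raise_for_status()
```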
JDBC:
If you retry a query after a timeout, reuse the same query_id: the shared id lets you correlate the original attempt and the retry in system.query_log, and ClickHouse rejects a second query with the same query_id while the first is still running, so the same work is not silently executed twice.
Choose the right output format
ClickHouse supports dozens of output formats. Choose based on what your integration needs:
| Format | Use when |
|---|---|
| JSONEachRow | Streaming row-by-row JSON; easy to parse incrementally |
| JSONCompact | Compact JSON with column names in a header; smaller than full JSON |
| CSV / TSV | Interoperability with spreadsheets and generic tools |
| Parquet | Columnar data exchange with data lakehouse tools |
| RowBinary / Native | Maximum throughput; binary format requiring type-aware deserialization |
| Arrow | In-memory columnar exchange with Apache Arrow-compatible tools |
For most connector use cases, JSONEachRow is the best default: it is easy to stream and parse without buffering the full response.
Enable compression
LZ4 compression over HTTP significantly reduces network transfer for large result sets and inserts, at negligible CPU cost. Enable it by sending the Accept-Encoding header:
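A sketch with requests. gzip is shown because the requests library decompresses it transparently; lz4 and zstd are also accepted if your HTTP stack can decode them. Note that response compression additionally requires the enable_http_compression setting:

```python
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    headers={"Accept-Encoding": "gzip"},
    params={
        "query": "SELECT * FROM analytics.events LIMIT 100000 FORMAT JSONEachRow",
        "enable_http_compression": 1,  # required for ClickHouse to compress the response body
    },
    timeout=120,
)
resp.raise_for_status()
rows = resp.text.splitlines()  # requests has already decompressed the body
```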
Paginate large result sets
ClickHouse does not have native cursor-based pagination. Use LIMIT / OFFSET for small datasets, or keyset pagination for large ones:
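A keyset-pagination sketch; pages are keyed on a monotonically orderable column (event_id here is a placeholder, as are host, credentials, and table):

```python
import json
import requests

url = "https://example-service.clickhouse.cloud:8443/"
auth = ("my_connector", "***")  # placeholders

all_rows = []
last_id = 0
while True:
    resp = requests.post(
        url, auth=auth, timeout=60,
        params={
            "query": (
                "SELECT event_id, user_id, event_time FROM analytics.events "
                "WHERE event_id > {last_id:UInt64} "
                "ORDER BY event_id LIMIT 10000 FORMAT JSONEachRow"
            ),
            "param_last_id": str(last_id),
        },
    )
    resp.raise_for_status()
    rows = [json.loads(line) for line in resp.text.splitlines() if line]
    if not rows:
        break
    all_rows.extend(rows)
    last_id = rows[-1]["event_id"]  # next page starts after the last key seen
```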
For very large exports, prefer streaming: consume JSONEachRow output incrementally rather than paginating.
Identifiers are case-sensitive
ClickHouse database names, table names, column names, and most function names are case-sensitive. A column defined as userId is different from userid. BI tools and ORMs that auto-capitalize identifiers will get Unknown column or Unknown function errors. Establish a lowercase or snake_case naming convention for all ClickHouse objects in integrations you control, and document this expectation for users creating tables.
64-bit integers in JSON output
If your integration consumes ClickHouse HTTP responses in JavaScript or TypeScript, be careful with Int64 and UInt64 values in JSON output: JavaScript's JSON.parse() silently loses precision for integers beyond 2^53. Set output_format_json_quote_64bit_integers=1 on every query so that 64-bit values arrive as quoted strings rather than bare numbers, then parse them with a BigInt-aware library:
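A sketch of the setting applied per query (Python is used here for consistency with the other examples); with it enabled, 64-bit values arrive as JSON strings that a JavaScript caller can hand to BigInt or a BigInt-aware JSON parser:

```python
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={
        "query": "SELECT toUInt64(9223372036854775807) AS big FORMAT JSONEachRow",
        "output_format_json_quote_64bit_integers": 1,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.text)  # {"big":"9223372036854775807"} — quoted, so no precision loss downstream
```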
Set query timeouts
Protect your integration from runaway queries. Pass max_execution_time (seconds) as a query setting:
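A sketch; note that the client-side HTTP timeout should stay above the server-side limit so the server error, not a socket timeout, is what surfaces:

```python
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={
        "query": "SELECT count() FROM analytics.events FORMAT JSONEachRow",
        "max_execution_time": 30,  # server aborts the query after 30 seconds
    },
    timeout=60,  # client-side timeout kept above the server-side one
)
```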
Writing data
Use batch inserts
Always batch rows into a single INSERT rather than sending one row per request. The HTTP API accepts INSERT data as the request body:
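A batch-insert sketch over HTTP; the rows are generated inline for illustration, and host, credentials, and table are placeholders:

```python
import json
import requests

rows = [
    {"event_id": i, "user_id": i % 100, "event_time": "2024-01-01 00:00:00"}
    for i in range(50000)
]
body = "\n".join(json.dumps(row) for row in rows)  # one JSON object per line

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={"query": "INSERT INTO analytics.events FORMAT JSONEachRow"},
    data=body.encode("utf-8"),
    timeout=120,
)
resp.raise_for_status()
```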
Aim for batches of 10,000–100,000 rows per request for optimal throughput. Undersized batches are the single most common production failure mode for connector-driven ingest: every INSERT creates a new on-disk data part, and ClickHouse merges parts asynchronously. If inserts arrive faster than merges complete, the part count crosses a threshold (default 300 active parts per partition) and ClickHouse raises Too many parts errors. Oversized batches, on the other hand, increase memory pressure. As a rule of thumb, never issue INSERT more than 1–2 times per second per table.
JDBC batch inserts
Async insert behavior
async_insert=1 buffers inserts server-side and flushes in bulk. Two important behaviors to understand:
- With wait_for_async_insert=0 (fire-and-forget), the server returns HTTP 200 immediately — before the data is validated or written. If a row has a type error, the entire buffered batch is silently dropped with no error surfaced to the client. Use wait_for_async_insert=1 in latency-tolerant pipelines to get synchronous error feedback.
- The SETTINGS clause must appear before VALUES in an INSERT statement. INSERT INTO t SETTINGS async_insert=1 VALUES (?,?) works; INSERT INTO t VALUES (?,?) SETTINGS async_insert=1 does not — the server fails to detect async-insert mode and parts accumulate rapidly.
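Over the HTTP API the same settings can travel as URL parameters, which sidesteps the SETTINGS-placement issue entirely. A minimal sketch (hostname, credentials, and table are placeholders):

```python
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={
        "query": "INSERT INTO analytics.events FORMAT JSONEachRow",
        "async_insert": 1,
        "wait_for_async_insert": 1,  # block until the buffered batch is flushed, so errors surface
    },
    data=b'{"event_id": 1}\n{"event_id": 2}\n',
    timeout=60,
)
resp.raise_for_status()
```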
JDBC PreparedStatement batches
In JDBC driver versions 0.8.6 and above, reusing a PreparedStatement across multiple executeBatch() calls re-inserts data from all previous batches due to a batch buffer not being cleared between executions. Create a new PreparedStatement instance for each batch cycle:
Materialized views inflate row counts
After creating a Materialized View on a source table, the rows_affected count returned by an INSERT includes rows written to both the source table and all MV target tables. An insert of 1,000 rows returns 2,000 rows affected with one MV. Row-count-based validation in ETL pipelines will fail. Use SELECT count() FROM source_table WHERE ... post-insert for validation instead.
Mutations are asynchronous
ALTER TABLE ... DELETE and ALTER TABLE ... UPDATE are asynchronous mutations. They return success immediately but run in the background as data parts are rewritten. A SELECT immediately after a DELETE may still return the deleted rows. For upsert patterns, use ReplacingMergeTree or CollapsingMergeTree instead of mutations. Poll system.mutations WHERE is_done = 0 if you need to wait for completion.
ALTER TABLE DELETE also does not clear deduplication checksums. If you delete rows and retry the same insert with the same content, ClickHouse will silently deduplicate and re-insert nothing. Use a new insert_deduplication_token value when re-inserting after deletions.
ClickHouse has no ACID transactions
Do not rely on BEGIN / COMMIT / ROLLBACK for data consistency. ClickHouse is append-optimized. Design writes to be idempotent:
- Use insert_deduplication_token to make retries safe — ClickHouse deduplicates inserts that carry the same token within a configurable window (see the sketch after this list).
- Use ReplacingMergeTree for upsert semantics — the engine merges duplicate primary keys, keeping the row with the highest version value.
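A sketch of the deduplication-token pattern from the first bullet; deriving the token from the batch content means a retry of the same batch automatically carries the same token (host, credentials, and table are placeholders):

```python
import hashlib
import json
import requests

rows = [{"event_id": 1, "value": 10}, {"event_id": 2, "value": 20}]
body = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

# Token derived from batch content: retries of the same batch are deduplicated server-side.
token = hashlib.sha256(body).hexdigest()

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={
        "query": "INSERT INTO analytics.events FORMAT JSONEachRow",
        "insert_deduplication_token": token,
    },
    data=body,
    timeout=120,
)
resp.raise_for_status()
```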
JSON inserts and DEFAULT expressions
When inserting JSON with input_format_skip_unknown_fields=1, columns absent from the payload receive the type default (0, empty string) — not the column's DDL DEFAULT expression. If your schema uses DEFAULT now() or DEFAULT generateUUIDv4(), those expressions are only invoked when you also set input_format_defaults_for_omitted_fields=1:
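A sketch combining both settings on one insert; the payload deliberately omits columns so their DDL DEFAULT expressions are evaluated server-side (host, credentials, and table are placeholders):

```python
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={
        "query": "INSERT INTO analytics.events FORMAT JSONEachRow",
        "input_format_skip_unknown_fields": 1,          # ignore extra keys in the payload
        "input_format_defaults_for_omitted_fields": 1,  # evaluate DDL DEFAULT expressions for missing keys
    },
    data=b'{"event_id": 1}\n{"event_id": 2}\n',
    timeout=60,
)
resp.raise_for_status()
```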
Identifying your integration
Always identify your integration in the HTTP User-Agent header and in per-query metadata. This makes queries attributable in system.query_log, which is invaluable for debugging customer issues and monitoring usage.
Set a User-Agent header
Use the format product/version (context):
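For example (the product name and context string are illustrative):

```python
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    headers={"User-Agent": "acme-bi-connector/2.3.1 (dashboard-refresh)"},  # product/version (context)
    params={"query": "SELECT 1"},
    timeout=30,
)
```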
Via JDBC:
Tag individual queries with log_comment
Attach operation context to each query so customer support and the customer themselves can filter system.query_log by feature or job:
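A sketch passing log_comment as a per-query setting over HTTP; the tag values are illustrative:

```python
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={
        "query": "SELECT count() FROM analytics.events FORMAT JSONEachRow",
        "log_comment": "feature=dashboard-refresh job=nightly-sync",  # free-form tag
    },
    timeout=30,
)
```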
Query your tagged queries in system.query_log
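A sketch of the lookup, assuming the log_comment tags used above:

```python
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={"query": (
        "SELECT event_time, query_id, query_duration_ms, read_rows, log_comment "
        "FROM system.query_log "
        "WHERE type = 'QueryFinish' AND log_comment LIKE 'feature=dashboard-refresh%' "
        "ORDER BY event_time DESC LIMIT 50 FORMAT JSONEachRow"
    )},
    timeout=30,
)
```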
Error handling
HTTP 200 does not mean success
ClickHouse begins streaming results immediately and sends the 200 OK status header before it knows whether the query will succeed. If an error occurs mid-stream, the error is appended to the response body — but the HTTP status remains 200.
This also affects inserts: a large POST that triggers a server-side timeout returns HTTP 200 OK with X-ClickHouse-Exception-Code: 209 in the headers and no data committed.
Your integration must:
- Check the X-ClickHouse-Exception-Code response header on every response, not just non-200s
- Scan the response body for Code: NNN. DB::Exception: if you are processing it as a stream
- For DDL and short queries where latency is acceptable, add wait_end_of_query=1 to buffer the response server-side and get a reliable error status
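A sketch of these checks for a short DDL statement, with wait_end_of_query used so the status can be trusted (host, credentials, and table are placeholders):

```python
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={
        "query": "CREATE TABLE IF NOT EXISTS analytics.events_tmp (id UInt64) ENGINE = MergeTree ORDER BY id",
        "wait_end_of_query": 1,  # buffer server-side so the status code is trustworthy
    },
    timeout=60,
)

exception_code = resp.headers.get("X-ClickHouse-Exception-Code")
if exception_code or resp.status_code != 200:
    raise RuntimeError(f"ClickHouse error {exception_code or resp.status_code}: {resp.text[:500]}")
```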
HTTP status codes
| Status | Meaning |
|---|---|
| 200 | Query reached the server — check body and headers for errors |
| 400 | Bad request (malformed query, invalid parameter) |
| 401 | Authentication failed |
| 403 | Permission denied |
| 404 | Database or table not found |
| 500 | Server-side error (query execution failure, OOM, etc.) |
ClickHouse returns error details in the response body and as X-ClickHouse-Exception-Code and X-ClickHouse-Summary HTTP headers. Error format:
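A representative example of the shape (the exact message text varies by version):

```
Code: 60. DB::Exception: Table analytics.evnts does not exist. (UNKNOWN_TABLE)
```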
The numeric code (60) is stable across releases and suitable for programmatic handling.
Retry strategy
Retry on:
- Network-level errors (connection refused, timeout)
- HTTP 500 where the exception code indicates a transient condition (e.g., server overload)
Do not retry on:
- HTTP 400 (bad query — retrying won't help)
- HTTP 403 (wrong permissions — retrying won't help)
Use exponential backoff with jitter. Reuse the same query_id on retries for INSERT operations so ClickHouse can deduplicate.
Handling streaming errors
When using streaming output formats (e.g., JSONEachRow), ClickHouse may have already started writing rows before encountering an error. Errors in this case are appended at the end of the response stream rather than returned as an HTTP 500. Always read the full stream and check for a trailing error block:
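A streaming sketch: every line is parsed as JSON, and a non-JSON tail containing DB::Exception is treated as the appended error block (host, credentials, and table are placeholders):

```python
import json
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("my_connector", "***"),  # placeholders
    params={"query": "SELECT * FROM analytics.events FORMAT JSONEachRow"},
    stream=True,
    timeout=300,
)

rows = []
for raw_line in resp.iter_lines():
    if not raw_line:
        continue
    line = raw_line.decode("utf-8")
    try:
        rows.append(json.loads(line))
    except json.JSONDecodeError:
        # A non-JSON tail in a JSONEachRow stream is an appended error block.
        if "DB::Exception" in line:
            raise RuntimeError(f"Query failed mid-stream: {line}")
        raise
```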
ClickHouse Cloud considerations
Connection endpoints
ClickHouse Cloud service endpoints follow this pattern:
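The exact hostname comes from the service's Connect dialog in the Cloud console; a sketch of the general shape (component order is illustrative):

```
https://<service-id>.<region>.<cloud>.clickhouse.cloud:8443
```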
Where cloud is aws, gcp, or azure. Always use port 8443 (HTTPS) — port 8123 is not available on Cloud.
Auto-pause and connection retries
ClickHouse Cloud services on the development tier may auto-pause after a period of inactivity. An initial connection after a pause may take a few seconds to respond. Design your integration to retry connection attempts with a short backoff before surfacing an error to the user.
ClickHouse Cloud API for programmatic management
If your integration needs to enumerate or provision Cloud services (rather than query data), use the ClickHouse Cloud API. It is separate from the query interface and uses API key authentication.
Testing your integration
Test against both OSS and ClickHouse Cloud
Behavioral differences between self-managed ClickHouse and ClickHouse Cloud are minimal for most integrations, but test both. Specifically:
- ClickHouse Cloud always requires TLS
- Some system table columns may differ between versions
- Auto-pause behavior is Cloud-only
Cover edge-case data types
Most integration bugs are found with edge cases. Explicitly test:
- Nullable columns — verify NULLs round-trip correctly
- UInt64 values near Long.MAX_VALUE
- FixedString — verify zero-byte stripping
- DateTime64 with sub-second precision and non-UTC timezones
- Arrays and Maps — verify nested types survive serialization
- Empty result sets and single-row result sets
Use system.query_log to verify behavior
system.query_log is written asynchronously with a flush interval of approximately 7.5 seconds. Do not query it immediately after a test — add a small delay or poll with retry before expecting entries to appear.
After running your integration's test suite, inspect system.query_log to verify:
- Queries are attributed to your integration's User-Agent
- No unexpected full-table scans (check read_rows and whether the primary key is used)
- Insert queries have the expected written_rows
Implementation examples
HTTP API — Python connector skeleton
A minimal pattern for a Python-based connector that queries ClickHouse and streams results:
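A hedged skeleton, not a published API: class, method, and environment-variable names are illustrative, and the connector simply combines the patterns described above (header auth, User-Agent, query_id, server-side parameter binding, streamed JSONEachRow, and error checks):

```python
import json
import os
import uuid
from typing import Any, Dict, Iterator, Optional

import requests

class ClickHouseConnector:
    """Illustrative skeleton — names and defaults are placeholders."""

    def __init__(self, url: str, user: str, password: str, product: str = "acme-connector/1.0.0"):
        self.url = url
        self.session = requests.Session()
        self.session.headers.update({
            "X-ClickHouse-User": user,
            "X-ClickHouse-Key": password,
            "User-Agent": product,
        })

    def query(self, sql: str, params: Optional[Dict[str, Any]] = None) -> Iterator[Dict[str, Any]]:
        """Stream a SELECT as JSONEachRow, yielding one dict per row."""
        http_params: Dict[str, Any] = {
            "query": f"{sql} FORMAT JSONEachRow",
            "query_id": f"acme-connector-{uuid.uuid4()}",
            "max_execution_time": 60,
        }
        for name, value in (params or {}).items():
            http_params[f"param_{name}"] = value  # server-side parameter binding

        resp = self.session.post(self.url, params=http_params, stream=True, timeout=120)
        if resp.headers.get("X-ClickHouse-Exception-Code") or resp.status_code != 200:
            raise RuntimeError(resp.text[:1000])

        for line in resp.iter_lines():
            if not line:
                continue
            text = line.decode("utf-8")
            if text.startswith("Code:") and "DB::Exception" in text:
                raise RuntimeError(f"Query failed mid-stream: {text}")
            yield json.loads(text)

connector = ClickHouseConnector(
    os.environ.get("CLICKHOUSE_URL", "https://example-service.clickhouse.cloud:8443/"),
    os.environ["CLICKHOUSE_USER"],
    os.environ["CLICKHOUSE_PASSWORD"],
)
for row in connector.query(
    "SELECT user_id, count() AS events FROM analytics.events "
    "WHERE event_date >= {d:Date} GROUP BY user_id LIMIT 10",
    {"d": "2024-01-01"},
):
    print(row)
```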
JDBC — Java BI connector skeleton
A minimal pattern for a JDBC-based BI connector with connection pooling and query tagging:
Remote MCP integrations
Model Context Protocol (MCP) is an open standard that lets AI agents and LLM-powered tools discover and invoke capabilities exposed by a server. For ClickHouse integrations, MCP is the right surface when your integration is AI-native — meaning an agent or LLM is driving the queries — rather than a traditional BI tool, ETL pipeline, or application making programmatic requests.
| Integration type | Recommended surface |
|---|---|
| BI tool / query builder | JDBC or HTTP API |
| ETL / data pipeline | HTTP API or JDBC |
| AI agent / LLM assistant | Remote MCP server |
| IDE coding assistant | Local MCP server (stdio) or remote MCP |
ClickHouse Cloud built-in remote MCP server
ClickHouse Cloud includes a fully managed remote MCP server. It requires no infrastructure to deploy and authenticates via OAuth 2.0.
Endpoint:
Transport: Streamable HTTP (MCP standard)
Authentication: OAuth 2.0 — the MCP client initiates a browser-based OAuth flow using ClickHouse Cloud credentials on first connect.
Capabilities: 13 read-only tools across querying, schema discovery, service management, backups, ClickPipes, and billing. All tools are annotated with readOnlyHint: true.
Enable it per service in the ClickHouse Cloud console under Connect → MCP. Once enabled, point any MCP client at the endpoint above. See the remote MCP setup guide for IDE-specific configuration steps.
If your integration targets ClickHouse Cloud users, this is the fastest path to MCP connectivity — recommend it over building your own server.
Open-source MCP server for self-hosted ClickHouse
For integrations that must support self-hosted ClickHouse instances, use the mcp-clickhouse open-source server as a starting point. It exposes three core tools (run_query, list_databases, list_tables) and supports both local stdio and remote HTTP/SSE transports.
Building your own remote MCP server
Build a custom MCP server when you need tools or behavior that neither the Cloud server nor the open-source server provides — for example, domain-specific query templates, write operations under controlled conditions, or integration with your own auth system.
Choose the right transport
| Transport | Use when |
|---|---|
| stdio | Local dev tools (Claude Code, Cursor, VS Code Copilot). No network exposure. |
| Streamable HTTP | Cloud-deployed servers, multi-tenant services, load-balanced deployments. The MCP standard transport for remote servers. |
| SSE (legacy) | Legacy MCP clients that predate Streamable HTTP. Prefer HTTP for new servers. |
For remote servers, Streamable HTTP is the current standard. Design stateless request handlers — do not store session state in memory, as load balancers will distribute requests across instances.
Authentication
| Scenario | Recommended approach |
|---|---|
| User-facing (human authenticates) | OAuth 2.0 with PKCE — consistent with ClickHouse Cloud's own approach |
| Service-to-service (agent authenticates) | Static Bearer token via Authorization header; rotate regularly |
| Development / local-only | Disable auth (CLICKHOUSE_MCP_AUTH_DISABLED=true); never in production |
Always require authentication for any remotely accessible MCP server. The /health endpoint is the only route that should remain unauthenticated (for orchestrator probes).
Design read-only tools by default
Annotate every tool with readOnlyHint: true in its MCP metadata unless writes are explicitly part of your design. This signals to MCP hosts that the tool has no side effects, enabling better agent planning.
readOnlyHint is advisory only — it informs the LLM host but is not enforced. Apply server-side enforcement as well:
- Connect to ClickHouse with a read-only user (no INSERT, ALTER, DROP grants)
- Validate that query strings begin with SELECT or WITH before execution
- Use ClickHouse's readonly setting (SET readonly = 1) at the session level as a secondary guard:
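A sketch of both guards together; the validation is deliberately conservative, and the user and host names are placeholders:

```python
import requests

def run_read_only(sql: str) -> str:
    if not sql.lstrip().upper().startswith(("SELECT", "WITH")):
        raise ValueError("Only read queries are allowed through this tool")
    resp = requests.post(
        "https://example-service.clickhouse.cloud:8443/",
        auth=("mcp_readonly_user", "***"),  # a user with SELECT-only grants; placeholders
        params={
            "query": sql,
            "readonly": 1,  # secondary guard: the server rejects writes and DDL for this query
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.text
```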
If your server must support writes (e.g., agent-driven INSERT), expose write tools under a separate, explicitly named operation, require an additional confirmation parameter, and log every invocation.
Apply resource limits to all queries
Agents can generate unbounded queries. Always enforce limits to prevent runaway execution:
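One way to enforce this at the user level is a settings profile, sketched here as DDL issued from Python; profile, user, and limit values are illustrative:

```python
import requests

admin_auth = ("admin_user", "***")  # placeholders
url = "https://example-service.clickhouse.cloud:8443/"

requests.post(url, auth=admin_auth, timeout=30, params={"query": (
    "CREATE SETTINGS PROFILE IF NOT EXISTS mcp_limits SETTINGS "
    "max_execution_time = 30, max_result_rows = 10000, max_bytes_to_read = 10000000000 "
    "TO mcp_readonly_user"
)}).raise_for_status()
```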
Or pass them per query via HTTP parameters:
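A per-query sketch; the limits and overflow mode shown are illustrative defaults:

```python
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("mcp_readonly_user", "***"),  # placeholders
    params={
        "query": "SELECT user_id, count() FROM analytics.events GROUP BY user_id FORMAT JSONEachRow",
        "max_execution_time": 30,
        "max_result_rows": 10000,
        "max_bytes_to_read": 10_000_000_000,
        "result_overflow_mode": "break",  # truncate instead of erroring when the row cap is hit
    },
    timeout=60,
)
```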
10,000 rows is a reasonable ceiling for agent-readable results — LLMs cannot usefully process millions of rows. Design tool descriptions to guide agents toward aggregating queries rather than full table scans.
Write tool descriptions that constrain agent behavior
Tool descriptions are read by the LLM to decide how and when to invoke a tool. Vague descriptions lead to over-use and inefficient queries. Be specific:
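An illustrative tool definition, written as a plain dict rather than any particular MCP SDK's API, contrasting a vague description with a specific one:

```python
# Too vague — invites full-table scans:
#   "Runs a SQL query against ClickHouse."

run_query_tool = {
    "name": "run_query",
    "description": (
        "Run a single read-only ClickHouse SELECT query and return at most 10,000 rows as JSON. "
        "Prefer aggregations (GROUP BY, count(), avg()) over selecting raw rows. "
        "Use ClickHouse syntax: count() rather than COUNT(*), toDate('2024-01-01') for date literals."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "sql": {"type": "string", "description": "A single SELECT or WITH statement, no semicolons."},
        },
        "required": ["sql"],
    },
    "annotations": {"readOnlyHint": True},
}
```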
Include parameter descriptions that tell the agent what valid input looks like, including ClickHouse SQL syntax specifics (e.g., count() not COUNT(*), toDate() for date literals).
Guard against prompt injection
Query results returned by your MCP server flow back into the LLM's context. If a ClickHouse table contains user-generated text, that text could carry adversarial instructions targeting the agent.
Mitigations:
- Limit result size — small result sets reduce the attack surface
- Return structured data — prefer JSONEachRow and parse it server-side; avoid returning raw string columns that could contain markdown or instruction-like text directly into the agent's context
- Sanitize schema names — when returning database, table, or column names, strip or escape characters that could be interpreted as markdown formatting or instructions
Identify your MCP server in query logs
Set a User-Agent and log_comment on all queries issued by your MCP server, exactly as you would for any integration. This makes it possible to distinguish agent-driven queries from human queries in system.query_log:
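A sketch combining both, with illustrative product and tag names:

```python
import uuid
import requests

resp = requests.post(
    "https://example-service.clickhouse.cloud:8443/",
    auth=("mcp_readonly_user", "***"),  # placeholders
    headers={"User-Agent": "acme-mcp-server/0.4.0"},  # illustrative product name
    params={
        "query": "SELECT count() FROM analytics.events FORMAT JSONEachRow",
        "query_id": f"acme-mcp-{uuid.uuid4()}",
        "log_comment": "mcp_tool=run_query agent_session=abc123",  # illustrative tags
    },
    timeout=60,
)
```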
Integration development checklist
- Integration uses HTTPS / TLS for all connections
- Credentials are read from environment variables or a secrets manager, not hardcoded
- A dedicated ClickHouse user with minimal permissions is used
- User-Agent header identifies the integration by name and version
- Every query carries a query_id for traceability and idempotent retries
- User-facing inputs pass through parameterized queries, not string concatenation
- Schema discovery queries exclude the system and information_schema databases
- Nullable and LowCardinality type wrappers are stripped before type mapping
- UInt64 and larger unsigned integers are mapped to BigInteger or equivalent
- FixedString values are stripped of trailing null bytes before display
- Bulk inserts use batches of 10,000–100,000 rows
- insert_deduplication_token is set for retry-safe inserts
- X-ClickHouse-Exception-Code header is checked on every response (HTTP 200 does not guarantee success)
- Streaming responses are fully consumed and checked for trailing error blocks
- Integration does not rely on JDBC transactions for consistency
- Mutations (DELETE/UPDATE) are not used for high-frequency updates; ReplacingMergeTree used for upsert patterns instead
- If using JavaScript/TypeScript: output_format_json_quote_64bit_integers=1 set to prevent Int64 precision loss
- Table and column names use lowercase/snake_case (identifiers are case-sensitive)
- JDBC: new PreparedStatement created per batch (reuse causes duplicate inserts in driver 0.8.6+)
- Integration tested against both ClickHouse OSS and ClickHouse Cloud
- Edge-case types (Nullable, UInt64, DateTime64, Array, Map) covered in tests
- If building an MCP server:
  - Remote server uses Streamable HTTP transport and stateless request handlers
  - All tools annotated with readOnlyHint: true unless writes are explicitly required
  - Server-side enforcement: read-only ClickHouse user + SQL validation, not just readOnlyHint
  - max_execution_time, max_result_rows, and max_bytes_to_read enforced on the ClickHouse user or per-query
  - Tool descriptions guide agents toward aggregating queries; warn against full table scans
  - query_id and log_comment set on every MCP-issued query for observability
  - Prompt injection mitigations in place for result data flowing back to the LLM