Delta Lake Read Connector Partitioning Strategy – Ascend.io

Introduction

If you work with Delta Lake, it helps to understand the built-in partitioning strategy of the Delta Lake read connector. That strategy determines how your data is organized after it is read and how efficiently it can be read and processed.

Built-in Partitioning Strategy

The Delta Lake read connector reuses the partitioning of the underlying Delta Lake files. If the Delta Lake table is already partitioned on a date-time value, that partitioning is used. If the table is partitioned according to some other, generic scheme, that scheme is used instead.
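
As a rough illustration of reusing the table's existing layout, the sketch below groups data file paths by the partition directories they already sit in. It assumes a Hive-style column=value directory layout, and the function name group_files_by_partition and the example paths are hypothetical. It is not the connector's actual implementation.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_files_by_partition(paths):
    """Group data file paths by their column=value partition directories.

    Assumption: partitions appear as directory segments such as
    event_date=2024-01-01. Files with no such segment land in one
    group keyed by the empty tuple.
    """
    groups = defaultdict(list)
    for path in paths:
        directories = PurePosixPath(path).parts[:-1]  # drop the file name
        # Keep only segments that look like column=value.
        key = tuple(
            tuple(segment.split("=", 1))
            for segment in directories
            if "=" in segment
        )
        groups[key].append(path)
    return dict(groups)


files = [
    "event_date=2024-01-01/part-0.parquet",
    "event_date=2024-01-01/part-1.parquet",
    "event_date=2024-01-02/part-0.parquet",
]
partitions = group_files_by_partition(files)
```

Because the grouping key is taken from the paths themselves, the same function handles a date-time partition column and any other partition column without special cases.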

Fingerprint Calculation

The fingerprint of a Delta Lake file is calculated from its metadata, including the partitioning scheme. The fingerprint is compared with the one recorded the last time the file was read to determine whether the file has changed.
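
One way to picture a metadata-based fingerprint is a hash over a canonical serialization of the metadata. The sketch below assumes SHA-256 and a made-up set of metadata fields (path, size, modification_time, partition_columns). The article does not specify which fields or hash function the connector really uses.

```python
import hashlib
import json


def fingerprint(metadata):
    """Return a stable hex digest for a metadata dictionary.

    Keys are sorted before hashing so that the same metadata always
    produces the same fingerprint, regardless of key order.
    """
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical metadata fields, for illustration only.
before = {
    "path": "event_date=2024-01-01/part-0.parquet",
    "size": 1024,
    "modification_time": 1704067200,
    "partition_columns": ["event_date"],
}
after = dict(before, size=2048)  # same file, different size

unchanged = fingerprint(before) == fingerprint(dict(reversed(list(before.items()))))
changed = fingerprint(before) != fingerprint(after)
```

Including the partitioning scheme in the hashed metadata means that a change in partitioning alone is enough to produce a different fingerprint.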

Incremental Processing

If the only change to the Delta Lake is that new partitions have been added, those partitions can be processed incrementally. The Delta Lake read connector can use the manifest file to determine which partitions have been added and read only those. This avoids re-reading unchanged partitions, which reduces both processing time and the amount of data read.
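
The decision described above can be sketched as a comparison between the partitions seen on the previous read and those listed now. The dictionaries mapping a partition name to a fingerprint, and the function names diff_partitions and only_new_partitions, are assumptions made for this example. How the connector reacts to changed or removed partitions is not covered here, so the sketch only reports them.

```python
def diff_partitions(previous, current):
    """Compare two {partition_name: fingerprint} mappings."""
    return {
        "new": sorted(name for name in current if name not in previous),
        "changed": sorted(
            name
            for name in current
            if name in previous and current[name] != previous[name]
        ),
        "removed": sorted(name for name in previous if name not in current),
    }


def only_new_partitions(diff):
    """True when partitions were added and nothing else changed."""
    return bool(diff["new"]) and not diff["changed"] and not diff["removed"]


previous = {"event_date=2024-01-01": "a1", "event_date=2024-01-02": "b2"}
current = dict(previous, **{"event_date=2024-01-03": "c3"})

diff = diff_partitions(previous, current)
to_read = diff["new"] if only_new_partitions(diff) else sorted(current)
```

In this example only the partition for 2024-01-03 ends up in to_read, so the two partitions that were already processed are not read again.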

Full Resync

If you choose a full resync, the original partitioning of the Delta Lake is not maintained. Instead, the data collapses into a single partition on each refresh. This can be useful when you need a full rebuild of your data, but it can be less efficient than incremental processing.

Conclusion

Understanding the built-in partitioning strategy of the Delta Lake read connector helps you optimize your data processing workflows. By relying on incremental processing and keeping the original partitioning scheme, you can reduce processing time and improve efficiency.