In Ascend, we offer flexibility in how data partitioning is handled for optimal data reads. Users have the option to manually specify parameters such as the minimum (min), maximum (max), and the number of parts for data partitioning. But what happens if some or all of these aren't specified? This article walks through our dynamic data partitioning process and how it interacts with Spark.
Manual Specification of Min, Max, and Number of Parts
Users can manually define 'min', 'max', and the number of parts. We recommend not specifying 'min' and 'max' unless you have a specific reason to do so. If you leave these parameters unset, Ascend.io queries the table for the 'min' and 'max' values and uses those for data partitioning.
In the absence of a specified number of parts, Ascend.io counts the total number of records and divides by 100,000 to determine the partition count. However, this doesn't guarantee that every read will be exactly 100,000 records: the value range between 'min' and 'max' is simply divided into evenly spaced chunks, so if the data isn't evenly distributed across that range, some partitions will hold far more rows than others and hotspots can remain.
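Sketched in Python, the dynamic sizing described above might look like the following (the function name, parameters, and 100,000-row target are illustrative, not Ascend's actual internals):

```python
import math

def plan_partitions(record_count, min_val, max_val, target_rows=100_000):
    """Illustrative sketch of dynamic partition planning (not Ascend's real code)."""
    # Partition count: total records divided by the 100,000-row target.
    num_parts = max(1, math.ceil(record_count / target_rows))
    # The value range is split into evenly *spaced* chunks -- even spacing of
    # values, not an even number of rows per chunk, hence possible hotspots.
    stride = (max_val - min_val) / num_parts
    return num_parts, stride

parts, stride = plan_partitions(record_count=1_000_000, min_val=0, max_val=500_000)
# 1,000,000 rows at a 100,000-row target -> 10 partitions, each spanning 50,000 values
```

Note that if most of the 1,000,000 rows clustered near one end of the 0-500,000 range, each partition would still span 50,000 values, and the dense end would become a hotspot.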
Interaction with Spark
Once these settings are determined, they are passed to Spark. The documentation for the Spark version we're currently using (3.2) doesn't go into much detail on the specifics, but essentially these settings serve as hints to Spark on how to break the data into smaller chunks for more efficient reads.
The queries sent to the database are likely of the form SELECT ... WHERE [partition_column] > x AND [partition_column] < y. In some cases there may be paging instead, where the query takes the form SELECT ... WHERE [partition_column] > x ORDER BY [partition_column] LIMIT z.
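For illustration, here is a hypothetical Python sketch that builds both query shapes (the exact SQL Spark emits varies by dialect and version; these helpers are assumptions, not Spark's code):

```python
def range_query(table, col, lower, upper):
    # Bounded range read: one such query per partition.
    return f"SELECT * FROM {table} WHERE {col} > {lower} AND {col} < {upper}"

def paged_query(table, col, cursor, page_size):
    # Keyset-paging read: advance the cursor past the last key seen.
    return f"SELECT * FROM {table} WHERE {col} > {cursor} ORDER BY {col} LIMIT {page_size}"

print(range_query("events", "id", 0, 100_000))
print(paged_query("events", "id", 100_000, 1_000))
```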
Remember that if you decide to specify these options, all of them must be set: partitionColumn, lowerBound, upperBound, and numPartitions. They are used to partition the table when reading in parallel from multiple workers.
partitionColumn must be a numeric, date, or timestamp column from the table in question.
lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. Therefore, all rows in the table will be partitioned and returned. This option applies only to reading.
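To see why all rows are still returned, here is a simplified Python sketch of the per-partition WHERE clauses Spark's JDBC source derives from these options (modeled loosely on Spark's behavior; this is not the actual implementation, and it assumes numPartitions > 1):

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Loose sketch of Spark's JDBC partition predicates (illustrative only)."""
    stride = (upper_bound - lower_bound) // num_partitions
    preds = []
    bound = lower_bound + stride
    for i in range(num_partitions):
        if i == 0:
            # First partition is open-ended below, so rows under lowerBound
            # (and NULLs) are still returned -- the bounds only set the stride.
            preds.append(f"{column} < {bound} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended above for the same reason.
            preds.append(f"{column} >= {bound - stride}")
        else:
            preds.append(f"{column} >= {bound - stride} AND {column} < {bound}")
        bound += stride
    return preds

for pred in jdbc_partition_predicates("id", 0, 300, 3):
    print(pred)
```

With lowerBound=0, upperBound=300, and numPartitions=3, the first and last predicates have no lower or upper cutoff respectively, so rows with ids below 0 or above 300 still land in a partition rather than being filtered out.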
Resync and Data Partitioning
It's important to clarify that a resync still generates only a single partition. These settings affect only how Spark reads the data in, not how Ascend partitions it.
Understanding how Ascend handles data partitioning can help you optimize your data operations. Whether you manually specify your partitioning parameters or leave it to Ascend.io's dynamic partitioning, the system is designed to handle the partitioning process intelligently, balancing performance and workload to keep data reads efficient.