Introduction
When working with PySpark transforms in Ascend, it is important to understand how Ascend partitions data and how that partitioning affects downstream transforms. In some cases a transform produces more partitions than expected, some of them empty, which can cause confusion and even incidents. This article describes one specific scenario in which this happens and explains how to handle it.
Scenario
Suppose you have an upstream transform, transform1, which contains many records and uses the Full Reduction partition strategy. The transform has a date/timestamp column with four distinct dates: 9/1/2022, 9/2/2022, 9/6/2022, and 9/7/2022. Downstream from transform1 is another transform, transform2, which repartitions by the date/timestamp column using a granularity of “Day”. When you run transform2, it generates seven Ascend partitions instead of the expected four, and three of them are empty (0 records).
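The count of seven matches the number of calendar days from the earliest date to the latest date, inclusive. The short sketch below reproduces that arithmetic in plain Python. It only illustrates the counting and is not Ascend's implementation; the function name daily_partitions is hypothetical.

```python
from datetime import date, timedelta


def daily_partitions(dates):
    """Return every calendar day from the earliest to the latest date, inclusive."""
    start, end = min(dates), max(dates)
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


# The four distinct dates present in transform1 in the scenario above.
observed = {date(2022, 9, 1), date(2022, 9, 2), date(2022, 9, 6), date(2022, 9, 7)}

partitions = daily_partitions(observed)
# Days inside the range that have no records upstream.
empty = [d for d in partitions if d not in observed]

print(len(partitions))                 # 7
print([d.isoformat() for d in empty])  # ['2022-09-03', '2022-09-04', '2022-09-05']
```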
Solution
This behavior is expected and by design. When the upstream transform is a Full Reduction and the date column covers a range of dates with gaps, Ascend interpolates the missing dates and generates an empty partition for each of them. In the scenario above, the missing dates are 9/3/2022, 9/4/2022, and 9/5/2022, which accounts for the three empty partitions. User code in downstream transforms may need to handle these empty partitions explicitly.
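One way to handle empty partitions in user code is to check for rows before doing any real work. The sketch below assumes that an empty Ascend partition reaches your transform as a PySpark DataFrame with zero rows, and that passing the empty DataFrame through unchanged is acceptable for your pipeline. The helper names is_empty and run_on_partition are hypothetical; only DataFrame.take is a PySpark call.

```python
def is_empty(df):
    """Return True when the DataFrame has no rows.

    take(1) fetches at most one row, so it avoids a full count().
    """
    return len(df.take(1)) == 0


def run_on_partition(df, process):
    """Apply `process` only when the partition has rows.

    Assumption: for an empty partition the input is returned as-is. If your
    transform changes the schema, build an empty DataFrame with the output
    schema here instead.
    """
    if is_empty(df):
        return df
    return process(df)
```

Inside a transform, you would wrap the existing logic in a function and call run_on_partition with the input DataFrame and that function.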
This issue does not occur when the upstream component's data is already Ascend-partitioned and each partition contains only one value in the timestamp column.
This problem can become much more severe if the range of values is very wide. For example, repartitioning by hour over a date/timestamp range that spans 10 years could make Ascend generate an extremely large number of empty partitions. If you encounter this situation and it causes a problem, consider filing an incident or contacting @reactive in #internal-product-help.
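To get a rough sense of scale before repartitioning, you can estimate how many partitions a given range and granularity would produce. The sketch below is an assumption-based estimate in plain Python, not an Ascend API: the function estimated_partitions and the GRANULARITY table are hypothetical, and it assumes one partition per granularity bucket between the minimum and maximum timestamps.

```python
from datetime import datetime, timedelta

# Hypothetical lookup for illustration; only "Hour" and "Day" are modeled.
GRANULARITY = {"Hour": timedelta(hours=1), "Day": timedelta(days=1)}


def floor_to(ts, granularity):
    """Truncate a timestamp to the start of its hour or day bucket."""
    if granularity == "Day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return ts.replace(minute=0, second=0, microsecond=0)


def estimated_partitions(min_ts, max_ts, granularity):
    """Count the buckets from min_ts to max_ts, inclusive of both ends."""
    step = GRANULARITY[granularity]
    span = floor_to(max_ts, granularity) - floor_to(min_ts, granularity)
    return span // step + 1


# Ten years of data, from the first hour of 2013 to the last hour of 2022.
start = datetime(2013, 1, 1, 0, 0)
end = datetime(2022, 12, 31, 23, 0)

print(estimated_partitions(start, end, "Hour"))  # 87648
print(estimated_partitions(start, end, "Day"))   # 3652
```

If only a few of those buckets actually contain data, nearly all of the resulting partitions would be empty.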
Conclusion
In summary, when working with PySpark transforms in Ascend, it is crucial to understand how partitioning works and how it affects downstream transforms. In the scenario discussed above, the unexpected partition count and the empty partitions were caused by Ascend interpolating the dates missing from the range. To deal with this, you may need to update user code so that it handles empty partitions. If you encounter any issues or have any questions, reach out to Ascend support for assistance.