If you're working with a large amount of data in blob storage, it can be time-consuming and resource-intensive to pull everything. Fortunately, there are several ways to filter and select only the data you need, based on the date or timestamp information embedded in the file names or prefixes.
Here are some strategies you can use to optimize your blob read connector and only ingest the latest data:
Filter Data by Date/Time Information in File Names or Prefixes
One workaround to get only the latest data in a blob read connector is to filter out older data based on the date or timestamp information embedded in the file names or prefixes. You can use object pattern matching techniques like glob, regex, or match to identify only the partitions that you need, and skip the ones you don't. For example, you could use a regex to match only the directories for 2015 and every later year.
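As a minimal sketch of the regex approach, the snippet below filters a list of object keys so that only year partitions from 2015 onward survive. The key layout (`logs/<year>/<month>/...`) is an assumption for illustration; in practice the keys would come from your storage client's list operation.

```python
import re

# Hypothetical object keys; real code would get these from a
# list-objects / list-blobs call against the bucket.
keys = [
    "logs/2014/12/events.json",
    "logs/2015/01/events.json",
    "logs/2023/06/events.json",
    "archive/2010/07/events.json",
]

# Match year partitions 2015-2019, plus 2020 and any later year.
pattern = re.compile(r"^logs/(201[5-9]|20[2-9]\d)/")

recent = [k for k in keys if pattern.match(k)]
print(recent)  # ['logs/2015/01/events.json', 'logs/2023/06/events.json']
```

Because the filter runs on key names alone, no object data is downloaded for the partitions you exclude.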
Build a Custom Python Read Connector
If you want more control over how your data is filtered and ingested, you can build a custom Python read connector that has logic to filter prefixes dynamically based on date or time information. For instance, you can create a connector that pulls only files created today or in the last X hours. If you need to keep track of new files after a certain date, you can persist that state, such as the timestamp of the most recently ingested file, to a checkpoint file or small database so the next run picks up where the last one left off.
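A minimal sketch of the "last X hours" filter might look like the following. The `(name, last_modified)` pair shape is an assumption standing in for whatever metadata your storage SDK returns for listed objects:

```python
from datetime import datetime, timedelta, timezone

def select_recent(blobs, hours=24, now=None):
    """Return names of blobs modified within the last `hours` hours.

    `blobs` is an iterable of (name, last_modified) pairs, where
    last_modified is a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    return [name for name, modified in blobs if modified >= cutoff]

# Hypothetical listing; real code would query the storage API.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
blobs = [
    ("events-old.json", now - timedelta(days=3)),
    ("events-new.json", now - timedelta(hours=2)),
]
print(select_recent(blobs, hours=24, now=now))  # ['events-new.json']
```

To make the connector incremental, you could store the `cutoff` (or the newest `last_modified` seen) as the checkpoint and pass it back in on the next run.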
Structure Your Data in Your Bucket Differently
Another strategy is to structure your data in your bucket differently by having a separate prefix or folder for archive or cold storage log files, and another prefix for recent logs. You can instruct your blob read connector to only index the recent logs prefix, and ignore the archive prefix.
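With that layout, the connector's filter reduces to a simple prefix check. The `recent/` and `archive/` prefix names below are illustrative assumptions, not required names:

```python
# Hypothetical bucket layout:
#   recent/2024-01-02-events.json   <- indexed by the connector
#   archive/2020-07-14-events.json  <- ignored
keys = [
    "recent/2024-01-02-events.json",
    "recent/2024-01-01-events.json",
    "archive/2020-07-14-events.json",
]

INDEX_PREFIX = "recent/"
to_ingest = [k for k in keys if k.startswith(INDEX_PREFIX)]
print(to_ingest)
```

Many storage APIs also accept a prefix directly in the list call, so the archive objects are never enumerated at all, which is cheaper than listing everything and filtering client-side.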
Automate Archiving with a Custom Python Read Connector
You can also automate archiving with a custom Python read connector that reads files in a directory and periodically moves older files into an archive prefix that your blob read connector no longer reads.
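As a sketch, the helper below plans which objects to relocate: it picks blobs under a `recent/` prefix older than a cutoff and computes their destination keys under `archive/`. The prefix names and the `(key, last_modified)` shape are assumptions; executing the moves (in blob storage, typically a copy followed by a delete of the original) is left to your storage SDK of choice.

```python
from datetime import datetime, timedelta, timezone

def plan_archive_moves(blobs, max_age_days=30, now=None):
    """Return (source, destination) key pairs for blobs older than max_age_days.

    `blobs` is an iterable of (key, last_modified) pairs with
    timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    moves = []
    for key, modified in blobs:
        if key.startswith("recent/") and modified < cutoff:
            # Rewrite recent/<name> -> archive/<name>.
            moves.append((key, "archive/" + key[len("recent/"):]))
    return moves

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
blobs = [
    ("recent/2024-05-30-events.json", now - timedelta(days=2)),
    ("recent/2024-04-01-events.json", now - timedelta(days=61)),
]
print(plan_archive_moves(blobs, max_age_days=30, now=now))
# [('recent/2024-04-01-events.json', 'archive/2024-04-01-events.json')]
```

Running this on a schedule keeps the `recent/` prefix small, so the read connector's listing stays fast no matter how much history accumulates.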
Getting only the latest data in a blob read connector can help you optimize your data ingestion and reduce resource usage. By filtering your data by date or timestamp information, building a custom Python read connector, or structuring your data differently, you can ensure that you're only ingesting the data you need, without wasting time and resources on older or unnecessary data.