How to load a very large table by splitting it into multiple tasks (with CDC)

Let's say I have an EmployeeDetails table with 1.2 billion records. Loading the table into my target takes a very long time, and any issue that interrupts the load forces me to start the whole full load over again.

In this scenario, I want to load the table in segments and save progress after each segment completes, which can't easily be accomplished with one big full load in a single task. Below is a diagram of how I want to split my table: only the records from 1950 - 1975, loaded in 3 segments.


To accomplish this, I will split the table into 3 full load tasks, followed by one task that captures changes on the table. The architecture of this setup is as follows: 


  • One CDC-only task - this will capture all the changes. 
  • Multiple full-load-only tasks - these will each perform a full load of one segment, splitting the table using filters.


Here are the steps:

1. Create your full load tasks according to your number of segments. I have three segments, so I've created a task for each segment as shown in the image above, named accordingly: "EE_FL_Task_TaskNumber".

2. Next, disable Apply Changes on each of the full load tasks.


3. For each of your full load tasks, set these Full Load settings accordingly:
  • Task 1 should be set to "DROP and CREATE table". This is so that if any issues occur, Task 1 can be used to recreate the table and start over.

  • Each subsequent full load task after Task 1 should be set to "Do nothing". This is so that the table created by Task 1 is not deleted by the subsequent full load tasks.

4. Set your filters on each task so that each task loads only the records belonging to its segment. Our advice is to filter on a column that is never updated; we are using a date column.


You can choose to use a Full Load Passthru filter, which can be more performant than a regular table filter because it filters the records on the source rather than pulling them into Replicate and filtering there. Alternatively, you can opt to use regular table filters. Below is an example of what our regular filter looks like. This can be configured under Table Settings -> Filter.
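To illustrate the segmentation, here is a minimal sketch that generates one non-overlapping date-range filter expression per full load task. The column name HireDate and the cutoff dates are assumptions for illustration, not values from the actual setup:

```python
from datetime import date

def segment_filters(column, boundaries):
    """Build one non-overlapping filter expression per full load task.

    `boundaries` is an ordered list of dates; segment i covers
    [boundaries[i], boundaries[i+1]), so no record is loaded twice
    and no record falls between segments.
    """
    return [
        f"{column} >= '{lo}' AND {column} < '{hi}'"
        for lo, hi in zip(boundaries, boundaries[1:])
    ]

# Three illustrative segments over 1950 - 1975 (hypothetical cutoffs).
cutoffs = [date(1950, 1, 1), date(1959, 1, 1), date(1967, 1, 1), date(1976, 1, 1)]
for i, expr in enumerate(segment_filters("HireDate", cutoffs), start=1):
    print(f"EE_FL_Task_{i}: {expr}")
```

Using half-open ranges (>= lower, < upper) is the safest way to guarantee the segments neither overlap nor leave gaps.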


If you require assistance with this step please don't hesitate to reach out to us at helpdesk@ibt.com.au

5. Once you've set the full load task settings correctly, ensure that your table has either a primary key or a combination of columns defined as keys that together guarantee uniqueness. This is essential for the CDC part of this setup. In my case, there was no PK column, so I set keys on three columns to ensure uniqueness. I made sure to also add these keys in every full load task so that they are all consistent.
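Before relying on a column combination as a key, it is worth verifying that it really is unique. A small sketch of that check, over sample rows represented as dicts (the column names are hypothetical):

```python
def is_unique_key(rows, columns):
    """Return True if the given column combination uniquely
    identifies every row in the sample."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False  # duplicate key value found
        seen.add(key)
    return True

# Hypothetical sample: emp_no alone is not unique, but together
# with region it is.
rows = [
    {"emp_no": 1001, "region": "AU", "hire_date": "1962-03-04"},
    {"emp_no": 1001, "region": "NZ", "hire_date": "1963-07-21"},
    {"emp_no": 1002, "region": "AU", "hire_date": "1959-11-02"},
]
```

In practice you would run the equivalent check on the source database (e.g. comparing COUNT(*) against COUNT(DISTINCT key columns)), but the logic is the same.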


6. Run your full load tasks, starting from Task 1 and going up sequentially, ensuring only one full load runs at a time. 


Pro tip: After each full load segment completes, compare the row count of the segment on the source with the same segment on the target to ensure the correct number of records was replicated. The counts may differ slightly between source and target if the source table is live.
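A sketch of that comparison, assuming you have already gathered per-segment counts (e.g. via SELECT COUNT(*) with each task's filter predicate) into dicts; the segment names and counts here are made up:

```python
def compare_segment_counts(source_counts, target_counts, tolerance=0):
    """Compare per-segment row counts from source and target.

    A small tolerance allows for drift on a live source table;
    use tolerance=0 for a static source.
    """
    report = {}
    for segment, src in source_counts.items():
        tgt = target_counts.get(segment, 0)
        report[segment] = {"source": src, "target": tgt,
                           "ok": abs(src - tgt) <= tolerance}
    return report

# Hypothetical counts for two completed segments.
source = {"segment_1": 400_120_033, "segment_2": 399_874_551}
target = {"segment_1": 400_120_033, "segment_2": 399_874_540}
report = compare_segment_counts(source, target, tolerance=100)
```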

7. Create a new task for capturing changes, and ensure that Full Load Processing is OFF.


8. Next, set these Error Handling settings to prevent duplicates. Because we will start the task from a timestamp to capture all the changes since the first full load task was started, some changes will be replayed; without these settings, and without keys defined on the table for uniqueness, that replay could produce duplicates.
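The reason this is safe comes down to upsert semantics: with keys defined and duplicate-key errors handled as updates, replaying the same change twice leaves the target in the same state. A toy sketch of that idempotent apply, using a dict as a stand-in for the keyed target table (key values are hypothetical):

```python
def apply_change(table, key, row):
    """Apply a replayed INSERT idempotently (upsert semantics).

    `table` maps key tuples to rows. Inserting an existing key
    overwrites the row instead of raising a duplicate-key error,
    so replaying a change is a no-op on the final state.
    """
    table[key] = row

target = {}
change = (("1962-03-04", 1001, "AU"), {"name": "A. Smith", "dept": "Sales"})
apply_change(target, *change)
apply_change(target, *change)  # replayed change: no duplicate, same state
```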


9. Ensure that the table key columns in this CDC task are consistent with all your full load tasks.


10. Once you have done this, go to Advanced Run Options and start the task from a point slightly earlier than the time at which full load Task 1 was started. You can go back 5 minutes, 30 minutes, or an hour before Task 1 started to avoid missing any changes; how far back depends on how active your table is and whether there are long-running transactions. The error handling settings and keys we enabled above should prevent any duplicates from being applied to the target.

Note: It is highly recommended that you test this first. Try inserting duplicates at the source, and try replaying changes to see whether any duplicates make it through to the target.


To find the correct timestamp at which to start your CDC task, you can check the logs of the first full load task (Task 1) for the first instance of the line "Start loading table", as shown in this image below.


You will see a hexadecimal-encoded timestamp. Convert it to decimal. (See below)


Then convert that decimal value to a readable timestamp using an epoch converter. (See below)


To reiterate: from this timestamp, go back and start the CDC task AT MINIMUM 5 minutes earlier, or 30 minutes or more to be extra safe. This ensures that no long-running transactions on the source tables are missed, which would result in missing data.
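The conversion steps above can be sketched in a few lines, assuming the logged value is a Unix epoch in seconds (UTC); the hex value here is illustrative, not taken from a real log:

```python
from datetime import datetime, timedelta, timezone

def cdc_start_time(hex_timestamp, safety_margin_minutes=30):
    """Turn a hexadecimal-encoded epoch timestamp from the task log
    into a CDC start time, backed off by a safety margin."""
    epoch_seconds = int(hex_timestamp, 16)  # hex -> decimal
    loaded_at = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return loaded_at - timedelta(minutes=safety_margin_minutes)

# Illustrative hex value, not from a real log.
start = cdc_start_time("63f1a2b0", safety_margin_minutes=30)
print(start.isoformat())
```

Plug the resulting time into Advanced Run Options as the "start from timestamp" value.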

If your table is very active, you will see changes and latency as the task works back through all the changes that occurred while the full loads were running. As the changes are applied, latency should come back down.


If you have any questions or need assistance with this, please don't hesitate to reach out to us at helpdesk@ibt.com.au