We currently take an automated snapshot of our Redshift data every hour. If a cluster goes down and the data is restored from a snapshot, I also want to restore the data that is not present in that snapshot.
P.S.: the complete data is present in S3 before we move it to Redshift.
How can I approach this problem, so that the remaining data is loaded from S3 into Redshift after the snapshot restore?
You would need some indicator in both Redshift and S3 so that you know which data has already been loaded.
For example, if your data on S3 is partitioned by Year, Month, Day and Hour like so:
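Then you want to have Year, Month, Day and Hour columns in your Redshift tables so you can find the latest partition that was loaded. Note that taking the maximum of the concatenated value only gives the right answer if month, day and hour are stored as zero-padded strings ('05', not '5'), because the comparison is done character by character:
SELECT MAX(year||month||day||hour) FROM my_table;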
You can then reload any partitions not currently present in Redshift.
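As a rough illustration of that last step, here is a minimal Python sketch that takes the value returned by the query above and lists the hourly S3 prefixes that come after it. The bucket name, the `year=/month=/day=/hour=` path layout, the IAM role and the function names are all assumptions made up for this example; adapt them to however your data is actually laid out.

```python
from datetime import datetime, timedelta
from typing import List

# Hypothetical values: replace with your own bucket/prefix and role.
S3_ROOT = "s3://my-bucket/my_table"
IAM_ROLE = "arn:aws:iam::123456789012:role/my-redshift-role"


def parse_partition(key: str) -> datetime:
    """Parse a 'YYYYMMDDHH' string, e.g. the result of the MAX(...) query."""
    return datetime.strptime(key, "%Y%m%d%H")


def missing_partitions(last_loaded: str, latest_available: str) -> List[str]:
    """Return the S3 prefixes for every hour after last_loaded,
    up to and including latest_available."""
    current = parse_partition(last_loaded) + timedelta(hours=1)
    end = parse_partition(latest_available)
    prefixes = []
    while current <= end:
        prefixes.append(
            f"{S3_ROOT}/year={current:%Y}/month={current:%m}"
            f"/day={current:%d}/hour={current:%H}/"
        )
        current += timedelta(hours=1)
    return prefixes


def copy_statement(prefix: str, table: str = "my_table") -> str:
    """Build a basic Redshift COPY statement for one partition prefix.
    Format options (CSV, JSON, ...) are omitted and depend on your files."""
    return f"COPY {table} FROM '{prefix}' IAM_ROLE '{IAM_ROLE}';"


if __name__ == "__main__":
    for p in missing_partitions("2023010122", "2023010201"):
        print(copy_statement(p))
```

The sketch starts from the hour after the max partition. If the snapshot could have been taken while that last hour was still being loaded, you may prefer to delete and reload that partition as well.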