
Introduce new native backup provider (KNIB) #12758

Open
JoaoJandre wants to merge 3 commits into apache:main from scclouds:new-native-backup-provider

Conversation

@JoaoJandre
Contributor

Description

This PR adds a new native incremental backup provider for KVM. The design document, which details the implementation, can be found at https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406622120.

The validation process which is detailed in the design document will be added to this PR soon.
The file extraction process will be added in a later PR.

This PR adds a few new APIs:

  • The createNativeBackupOffering API has the following parameters:

    | Parameter | Description | Default Value | Required |
    |---|---|---|---|
    | name | Specifies the name of the offering | - | Yes |
    | compress | Specifies whether the offering supports backup compression | false | No |
    | validate | Specifies whether the offering supports backup validation | false | No |
    | allowQuickRestore | Specifies whether the offering supports quick restore | false | No |
    | allowExtractFile | Specifies whether the offering supports file extraction from backups | false | No |
    | backupchainsize | Backup chain size for backups created with this offering. If this is set, it overrides the `backup.chain.size` setting | - | No |
    | compressionlibrary | Specifies the compression library for offerings that support compression. Accepted values are `zstd` and `zlib`. By default, `zstd` is used for images that support it; if the image only supports `zlib`, it will be used regardless of this parameter | zstd | No |
  • The deleteNativeBackupOffering API has the following parameter:

    | Parameter | Description | Required |
    |---|---|---|
    | id | Identifier of the native backup offering | Yes |

    A native backup offering can only be removed if it is not currently imported.

  • The listNativeBackupOfferings API has the following parameters:

    | Parameter | Description | Required |
    |---|---|---|
    | id | Identifier of the offering | No |
    | compress | Lists only offerings that support backup compression | No |
    | validate | Lists only offerings that support backup validation | No |
    | allowQuickRestore | Lists only offerings that support quick restore | No |
    | allowExtractFile | Lists only offerings that support file extraction from backups | No |
    | showRemoved | Also lists offerings that have already been removed (default: false) | No |
  • The listBackupCompressionJobs API has the following parameters:

    | Parameter | Description |
    |---|---|
    | id | List only the job with the specified ID |
    | backupid | List jobs associated with the specified backup |
    | hostid | List jobs associated with the specified host. When this parameter is provided, the executing parameter is implicit |
    | zoneid | List jobs associated with the specified zone |
    | type | List jobs of the specified type. Accepts Starting or Finalizing |
    | executing | List jobs that are currently executing |
    | scheduled | List jobs scheduled to run in the future |

    By default, lists all offerings that have not been removed.
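The new APIs above are regular CloudStack API commands. As a quick illustration, here is a hedged sketch of building the query string for a createNativeBackupOffering call, using only the parameters documented in the tables above. The offering name and parameter values are hypothetical, and request signing/endpoint handling is omitted:

```python
# Hedged sketch: query string for the new createNativeBackupOffering command.
# Only parameters documented above are used; values are illustrative.
from urllib.parse import urlencode

params = {
    "command": "createNativeBackupOffering",
    "name": "knib-compressed",      # required
    "compress": "true",             # offering supports backup compression
    "validate": "true",             # offering supports backup validation
    "compressionlibrary": "zstd",   # zstd (default) or zlib
    "backupchainsize": "8",         # overrides the backup.chain.size setting
    "response": "json",
}

query = urlencode(params)
print(query)
```

The same pattern applies to deleteNativeBackupOffering, listNativeBackupOfferings, and listBackupCompressionJobs with their respective parameters.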

It also adds parameters to the following APIs:

  • The isolated parameter was added to the createBackup and createBackupSchedule APIs
  • The quickRestore parameter was added to the restoreBackup, restoreVolumeFromBackupAndAttachToVM and createVMFromBackup APIs
  • The hostId parameter was added to the restoreBackup and restoreVolumeFromBackupAndAttachToVM APIs, which can only be used by root admins and only when quick restore is true.
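To make the hostId restriction above concrete, here is a minimal sketch (my interpretation, not the PR's actual code) of the validation it describes: hostId is accepted only for root admins and only together with quick restore:

```python
# Hedged sketch of the constraint described above for restoreBackup and
# restoreVolumeFromBackupAndAttachToVM: hostId may only be used by root
# admins, and only when quick restore is enabled.
def validate_restore_params(quick_restore, host_id, is_root_admin):
    if host_id is not None:
        if not is_root_admin:
            raise ValueError("hostId may only be specified by root admins")
        if not quick_restore:
            raise ValueError("hostId requires quickRestore=true")
    return True
```

For example, `validate_restore_params(True, "host-uuid", True)` passes, while passing a host ID without quick restore raises an error.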

New settings were also added:

| Configuration | Description | Default Value |
|---|---|---|
| backup.chain.size | Determines the max size of a backup chain. If cloud admins set it to 1, all backups will be full backups. With values lower than 1, the backup chain will be unlimited, unless it is stopped by another process. Please note that unlimited backup chains have a higher chance of getting corrupted, as new backups will be dependent on all of the older ones | 8 |
| knib.timeout | Timeout, in seconds, to execute KNIB commands. After the command times out, the Management Server will still wait for another `knib.timeout` seconds to receive a response from the Agent | 43200 |
| backup.compression.task.enabled | Determines whether the task responsible for scheduling compression jobs is active. If not, compression jobs will not run | true |
| backup.compression.max.concurrent.compressions.per.host | Maximum number of concurrent compression jobs. Compression finalization jobs ignore this setting | 5 |
| backup.compression.max.job.retries | Maximum number of attempts for executing compression jobs | 2 |
| backup.compression.retry.interval | Interval, in minutes, between attempts to run compression jobs | 60 |
| backup.compression.timeout | Timeout, in seconds, for running compression jobs | 28800 |
| backup.compression.minimum.free.storage | Minimum required available storage to start the backup compression process. This setting accepts a real number that is multiplied by the total size of the backup to determine the necessary available space. By default, the storage must have the same amount of available space as the space occupied by the backup | 1 |
| backup.compression.coroutines | Number of coroutines used for the compression process; each coroutine has its own thread | 1 |
| backup.compression.rate.limit | Compression rate limit, in MB/s. Values less than 1 disable the limit | 0 |
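As an illustration of the `backup.compression.minimum.free.storage` semantics, here is a minimal sketch (not the actual implementation) of the check it describes: the setting's value is multiplied by the total backup size to determine the free space required before compression may start:

```python
def has_enough_free_storage(free_bytes, backup_size_bytes, multiplier=1.0):
    """Sketch of the backup.compression.minimum.free.storage check:
    the setting (a real number, default 1) is multiplied by the total
    backup size to determine the available space required before the
    compression job is allowed to start."""
    return free_bytes >= multiplier * backup_size_bytes
```

With the default of 1, the storage must have at least as much free space as the backup occupies; a value of 2 would require twice that.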

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

Tests related to disk-only VM snapshots

| N | Test | Result |
|---|---|---|
| 1 | Take disk-only VM snapshot | OK |
| 2 | Take disk-only VM snapshot again | OK |
| 3 | Stop VM, revert to snapshot 2, start VM | Correct deltas found in the VM volume chain |
| 4 | Stop VM, revert to snapshot 1, start VM | Correct deltas found in the VM volume chain |
| 5 | Take disk-only VM snapshot | OK |
| 6 | Remove disk-only VM snapshot 1 | Marked as removed, not removed from storage |
| 7 | Remove disk-only VM snapshot 3 | Merged with current volume |
| 8 | Remove disk-only VM snapshot 2 | Removed; snap 1 merged with the current volume |

Basic tests with backup

Using backup.chain.size=3

| N | Test | Result |
|---|---|---|
| 1 | With the VM stopped, created a backup (b1) | Full backup created |
| 2 | Started VM, wrote data, created a second backup (b2) | Incremental backup created |
| 3 | Stopped the VM, went back to backup 1, started | OK, VM without the data |
| 4 | Stopped the VM, went back to backup 2, started | OK, data from the backup created in test 2 present |
| 5 | Created 4 backups (b3, b4, b5, b6) | b3 was a full, b4 and b5 incremental, b6 full |
| 6 | Removed the last backup (b6) | The delta on the primary was merged with the volume |
| 7 | Removed backups b4 and b5 | They were marked as removed, but not deleted from storage |
| 8 | Batch removed the remaining backups | OK, all removed |
| 9 | Created a new backup | OK |
| 10 | Detached the VM from the offering | Deltas were merged on primary |
| 11 | Removed this last backup | OK |
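The full/incremental pattern observed above follows the `backup.chain.size` semantics. A minimal sketch of that decision (my interpretation of the described behavior, not the PR's actual code):

```python
def next_backup_type(chain_length, chain_size):
    """Decide whether the next backup is full or incremental.
    chain_length counts backups since the full backup that started
    the current chain; chain_size is the backup.chain.size value."""
    if chain_size == 1:
        return "full"                 # every backup is a full backup
    if chain_size < 1:                # unlimited chain
        return "full" if chain_length == 0 else "incremental"
    return "full" if chain_length % chain_size == 0 else "incremental"

# With backup.chain.size=3, as in the tests above: a full backup, two
# incrementals, then a new full backup starts the next chain.
types = [next_backup_type(i, 3) for i in range(4)]
```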

Interactions with other functionalities

I created a new VM with a root disk and a data disk for the tests below.

| N | Test | Result |
|---|---|---|
| 1 | Took a new backup and migrated the VM | OK |
| 2 | Migrated the VM + one of the volumes | OK; the migrated volume had no delta on the primary, the other volume still had a delta |
| 3 | Took a new backup | For the volume that was not migrated the backup was incremental; for the migrated volume, it was a full backup |
| 4 | Took 2 backups | OK, they finished normally |
| 5 | Tried restoring one of the backups from before the migration | OK |
| 6 | Created file 1, created backup b1 | OK |
| 7 | Created file 2, created VM snap s1 | OK |
| 8 | Created file 3, created VM snap s2 | OK |
| 9 | Created file 4, created backup b2 | OK |
| 10 | Created file 5, created backup b3 | OK |
| 11 | Stopped the VM, restored VM snap s1, started | Files 1 and 2 present |
| 12 | Stopped the VM, restored VM snap s2, started | Files 1, 2 and 3 present |
| 13 | Removed VM snapshots | OK |
| 14 | Restored backup b1, started | File 1 present |
| 15 | Restored backup b2, started | Files 1, 2, 3 and 4 present |
| 16 | Restored backup b3, started | Files 1, 2, 3, 4 and 5 present |
| 17 | Took a new backup b4 | OK |
| 18 | Attached a new volume, wrote data, took a backup b5 | OK |
| 19 | Stopped the VM, restored backup b4, started the VM | The new volume was not affected by the restoration |
| 20 | Detached the volume, restored backup b5 | A new volume was created and attached to the VM; the files were there |
| 21 | Created a backup | OK |
| 22 | Created a volume snapshot | OK |
| 23 | Reverted volume snapshot | Verified that the delta on the primary left by the last backup was removed |

Configuration Tests

  • I changed the value of the backup.compression.task.enabled setting and verified that no new jobs were started. I verified that when returned to true, they were executed.
  • I changed the value of the backup.compression.max.concurrent.compressions.per.host setting and verified that the number of jobs executed simultaneously for each host was relative to the value of the setting. I also verified that the value -1 does not limit the number of jobs executed by the host.
  • I verified that the number of retries respects the backup.compression.max.job.retries setting.
  • I verified that the time between retries respects the backup.compression.retry.interval setting.
  • I changed the value of the backup.compression.minimum.free.storage setting and verified that the job failed if there was not enough free space.
  • I changed the value of the backup.compression.coroutines setting and verified that the value passed to qemu-img was reflected.
  • I changed the value of the backup.compression.rate.limit setting and verified that the value was passed to qemu-img.

Compression Tests

Tests performed with an offering that supports compressed backups

| Test | Result |
|---|---|
| Create full backup | Backup created and compressed |
| Create incremental backup | Backup created and compressed |
| Create 10 backups of the same machine | Backups created sequentially, but compressed in parallel |

Tests with restoreVolumeFromBackupAndAttachToVM

| N | Test | Result |
|---|---|---|
| 1 | Restore volume A from VM 1 to the same VM while it is stopped | New volume created, restored, and attached |
| 2 | Restore volume B from VM 1 to the same VM while it is running | New volume created, restored, and attached |
| 3 | Restore volume B from VM 1 to VM 2 while it is running | New volume created, restored, and attached |
| 4 | Restore volume A from VM 1 to VM 2 while it is running, even though the VM was deleted | New volume created, restored, and attached |
| 5 | Restore a volume to a VM using quickrestore | New volume created, attached, and consolidated with the backup |
| 6 | Restore a volume to a stopped VM using quickrestore and specifying the hostId | New volume created, attached, VM started on the specified host, and volume consolidated |

Tests with restoreBackup

| N | Test | Result |
|---|---|---|
| 1 | Restore VM without quickrestore | OK |
| 2 | Restore VM with quickrestore | Volumes restored, VM started, and volumes consolidated |
| 3 | Restore VM with quickrestore and specifying hostId | Volumes restored, VM started on the specified host, and volumes consolidated |
| 4 | Detach a volume from the VM and repeat test 3 | Detached volume duplicated, attached to the VM, restored, VM started on the host, and volumes consolidated |

@codecov

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 4.94580% with 2806 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.84%. Comparing base (74af9b9) to head (0cc9b33).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...e/wrapper/LibvirtTakeKnibBackupCommandWrapper.java | 0.49% | 202 Missing ⚠️ |
| ...he/cloudstack/backup/BackupCompressionService.java | 0.00% | 201 Missing ⚠️ |
| ...ervisor/kvm/resource/LibvirtComputingResource.java | 4.56% | 188 Missing ⚠️ |
| ...che/cloudstack/backup/NativeBackupServiceImpl.java | 0.00% | 134 Missing ⚠️ |
| ...apache/cloudstack/storage/backup/BackupObject.java | 0.00% | 91 Missing ⚠️ |
| ...napshot/KvmFileBasedStorageVmSnapshotStrategy.java | 0.00% | 82 Missing ⚠️ |
| ...tack/engine/orchestration/StorageOrchestrator.java | 0.00% | 80 Missing ⚠️ |
| ...e/wrapper/LibvirtCompressBackupCommandWrapper.java | 1.40% | 70 Missing ⚠️ |
| ...ache/cloudstack/backup/BackupCompressionJobVO.java | 0.00% | 68 Missing ⚠️ |
| ...cloudstack/backup/dao/NativeBackupJoinDaoImpl.java | 0.00% | 68 Missing ⚠️ |
| ... and 101 more | | |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #12758      +/-   ##
============================================
- Coverage     17.92%   17.84%   -0.09%     
- Complexity    16175    16226      +51     
============================================
  Files          5949     6001      +52     
  Lines        534058   537981    +3923     
  Branches      65301    65650     +349     
============================================
+ Hits          95742    95983     +241     
- Misses       427560   431217    +3657     
- Partials      10756    10781      +25     
| Flag | Coverage Δ |
|---|---|
| uitests | 3.66% <ø> (-0.01%) ⬇️ |
| unittests | 18.93% <4.94%> (-0.11%) ⬇️ |


@yadvr
Member

yadvr commented Mar 8, 2026

@JoaoJandre just a heads up - my colleagues have been working on an incremental backup feature for NAS B&R (using nbd/qemu bitmap tracking & checkpoints). We're also working on a new Veeam-KVM integration for CloudStack, whose PR may be out soon. My colleagues can further help review and advise on this.

/cc @weizhouapache @abh1sar @shwstppr @sureshanaparti @DaanHoogland @harikrishna-patnala

Just my 2cents on the design & your comments - NAS is more than just NFS, but any (mountable) shared storage such as CephFS, cifs/samba etc. Enterprise users usually don't want to mix using secondary storage with backup repositories, which is why NAS B&R introduced a backup-provider agnostic concept of backup repositories which can be explored by other backup providers.

@JoaoJandre
Contributor Author

Just my 2cents on the design & your comments - NAS is more than just NFS, but any (mountable) shared storage such as CephFS, cifs/samba etc.

At the time of writing that part, I believe it was only NFS that was supported. I'll update the relevant part.

Enterprise users usually don't want to mix using secondary storage with backup repositories, which is why NAS B&R introduced a backup-provider agnostic concept of backup repositories which can be explored by other backup providers.

The secondary storage selector feature (introduced in 2023 by #7659) allows you to specialize secondary storages. This PR extended the feature so that you may also create selectors for backups.

@abh1sar
Contributor

abh1sar commented Mar 9, 2026

Hi Joao,

This looks promising. Incremental backups, quick restore and file restore features have been missing from CloudStack KVM.

I am having trouble understanding some of the design choices though:

  1. What’s the reason behind strong coupling with secondary storage?

    • I am wondering if the Backup Repository will provide a more flexible alternative. The user would be free to add an external storage server, or use the secondary storage, by simply adding it as a backup repository. It would be very easy for users to have multiple backup repositories attached to multiple backup offerings, which can be assigned to instances as required.

      This will also be consistent with other backup providers like Veeam and NAS which have the concept of backup repository.

      The backup repository feature also comes with a separate capacity tracking and email alerts.

    • If a secondary storage is needed just for backup’s purpose, how will it be ensured that templates and snapshots are not copied over to it?

  2. About Qemu compression

    • Have you measured / compared the performance of qemu-img compression with other compression methods?
    • As I understand, qemu-img compresses the qcow2 file at a cluster granularity (usually 64kb). That might not fare well when compared to storage level compression. In production environments, the operator might choose to have compression at the storage layer if they are using an enterprise storage like NetApp. Even something open source like ZFS might perform better than qemu-img compress due to the granularity limitation that qemu compression has.
    • I am making this point because the compression part is introducing a fair bit of complexity due to the interaction with SSVM, and I am just wondering if the gains are worth the trouble and should compression be offloaded to the storage completely.
  3. Do we need a separate backup offering table and api?

    • Why not add column or details to backup_offering or backup_offering_details? Other offerings can also benefit from these settings.
  4. What’s the reason behind using virDomainSnapshotCreate to create backup files and not virDomainBackupBegin like incremental volume snapshots and NAS backup?

    • Did you face any issues with checkpoints and bitmaps?

@JoaoJandre
Contributor Author

Hi Joao,

Hello, @abh1sar

This looks promising. Incremental backups, quick restore and file restore features have been missing from CloudStack KVM.

I am having trouble understanding some of the design choices though:

1. What’s the reason behind strong coupling with secondary storage?
   
   * I am wondering if the Backup Repository will provide a more flexible alternative. The user would be free to add an external storage server, or use the secondary storage, by simply adding it as a backup repository. It would be very easy for users to have multiple backup repositories attached to multiple backup offerings, which can be assigned to instances as required.

I don't see why we should force the coupling of backup offerings with backup repositories, what is the benefit?

     This will also be consistent with other backup providers like Veeam and NAS which have the concept of backup repository.
     The backup repository feature also comes with a separate capacity tracking and email alerts.

Secondary storage also has both of those features, although the capacity is not currently reported to users.

   * If a secondary storage is needed just for backup’s purpose, how will it be ensured that templates and snapshots are not copied over to it?

The secondary storage selectors feature (introduced in 2023 through #7659) allows you to specialize secondary storages. Quoting from the PR description: "This PR aims to add the possibility to direct resources (Volumes, Templates, Snapshots and ISOs) to a specific secondary storage through rules written in JavaScript that will only affect new allocated resources". For a few years it has been possible to have secondary storages that only receive snapshots or templates for example. This PR introduces the possibility to add selectors for backups, so that you have secondary storages that are specific for backups.

Furthermore, my colleagues are working on a feature to allow using alternative secondary storage solutions, such as CephFS, iSCSI and S3, while preserving compatibility with features destined to NFS storages. This feature may be extended in the future to allow essentially any type of secondary storage. Thus, the flexibility for secondary storages will soon grow.

2. About Qemu compression
   
   * Have you measured / compared the performance of qemu-img compression with other compression methods?

Using any other type of backup-level compression would be worse than using qemu-img compression. This is because, when restoring a backup, we must have access to the whole backing chain. With other types of compression, we would have to decompress the entire chain before restoring. With qemu-img, the compressed backing files remain valid QCOW2 images and never need to be decompressed. This is the great benefit of using qemu-img.

In any case, here is a brief comparison of using qemu-img with the zstd library and 8 threads and using the pigz implementation of multi-threaded compression, also using 8 threads. The original file is the root volume of a VM that I use.

| Command | Time | Original file size | Final file size |
|---|---|---|---|
| `qemu-img convert -c -p -W -m 8 -f qcow2 -O qcow2 -o compression_type=zstd` | real 3m51.944s, user 16m11.970s, sys 4m14.987s | 43G | 35G |
| `pigz -p8` | real 6m13.799s, user 44m33.300s, sys 1m54.801s | 43G | 34G |
| `pigz --zip -p8` | real 6m2.729s, user 44m38.401s, sys 1m47.663s | 43G | 34G |

Compression using qemu-img was much faster, with a slightly lower compression ratio. Furthermore, we have to consider that the qemu-img compressed image can be used as-is, while the other images must first be decompressed, further adding to the processing time of backing up or restoring a backup.
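For reference, the wall-clock numbers above translate to roughly the following throughput and compression ratios (simple arithmetic on the benchmark table, not new measurements; 1 GB is approximated as 1024 MB):

```python
# Rough throughput/ratio arithmetic from the benchmark table above
# (43 GB input; times are the "real" wall-clock values).
size_gb = 43

qemu_seconds = 3 * 60 + 51.944    # 231.944 s
pigz_seconds = 6 * 60 + 13.799    # 373.799 s

qemu_mb_s = size_gb * 1024 / qemu_seconds   # ~190 MB/s
pigz_mb_s = size_gb * 1024 / pigz_seconds   # ~118 MB/s

qemu_ratio = 35 / 43   # ~0.81 of original size
pigz_ratio = 34 / 43   # ~0.79 of original size
```

So qemu-img compressed roughly 1.6x faster here, at the cost of a marginally larger output, while producing an image that stays directly usable.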

   * As I understand, qemu-img compresses the qcow2 file at a cluster granularity (usually 64kb). That might not fare well when compared to storage level compression. In production environments, the operator might choose to have compression at the storage layer if they are using an enterprise storage like NetApp. Even something open source like ZFS might perform better than qemu-img compress due to the granularity limitation that qemu compression has.

The compression feature is optional, if you are using storage-level compression, you probably will not use backup-level compression. However, many environments do not have storage-level compression, thus having the possibility of backup-level compression is still very interesting.

   * I am making this point because the compression part is introducing a fair bit of complexity due to the interaction with SSVM, and I am just wondering if the gains are worth the trouble and should compression be offloaded to the storage completely.

The compression does not add any interaction with the SSVM.

3. Do we need a separate backup offering table and api?
   
   * Why not add column or details to backup_offering or backup_offering_details? Other offerings can also benefit from these settings.

I did not want to add dozens of parameters to the import backup offering API which are only really going to be used for one provider. This way, the original design of the API is preserved.

Furthermore, you may note that the APIs are intentionally not called createKnibBackupOffering, but createNativeBackupOffering. If other native providers want to use these offerings, they may do so by extending their implementations.

4. What’s the reason behind using virDomainSnapshotCreate to create backup files and not virDomainBackupBegin like incremental volume snapshots and NAS backup?
   
   * Did you face any issues with checkpoints and bitmaps?

There are two main issues with using bitmaps:

  1. They are prone to corruption. While this can be mitigated in some ways, since the incremental volume snapshot feature was added we have noticed multiple cases of bitmap corruption with different causes. It is possible to detect and delete corrupt bitmaps, but this would add more complexity to the feature.
  2. Using bitmaps is not compatible with the file-based incremental VM snapshot feature added in #10632 (File-based disk-only VM snapshot with KVM as hypervisor). After some internal discussion and feedback from users, we have come to the conclusion that being able to use both the incremental VM snapshot and backup features at the same time is very interesting.

At the end of the day, this PR adds a new backup provider option for users, who will be free to choose the provider that best fits their needs. This is one of the reasons it was done as a new backup provider: KNIB and other backup providers do not have to cancel each other out.
