The Wayback Machine - https://web.archive.org/web/20250927201025/https://github.com/rclone/rclone/pull/4566

Conversation

buengese
Member

@buengese buengese commented Sep 3, 2020

Let's give this another try. In the last few days I've once again reworked the experimental compression remote started by id01 and continued by me a while ago. I've never been quite happy with the original - it just seemed too complicated and tried to do too many things. As a result, this third attempt is much smaller in scale.
Differences from the previous attempts:

  • Only gzip compression. The entire compression interface is now gone, though I might refactor this a bit further to at least retain the ability to add other compression backends.
  • My own custom gzip library, sgzip, forked from klauspost's excellent pgzip, which allows for seeking in the compressed data and takes care of multithreading. All of this was previously done inside rclone, but I think a dedicated library with its own rather extensive test suite is better suited to this task.
  • The integration tests actually fully pass this time.

Like the previous attempts, this is still at the work-in-progress stage. But I'm opening this now in case someone is willing to test it.

@ncw
Member

ncw commented Sep 8, 2020

Let's give this another try. In the last few days I've once again reworked the experimental compression remote started by id01 and continued by me a while ago. I've never been quite happy with the original - it just seemed too complicated and tried to do too many things. As a result, this third attempt is much smaller in scale.

That seems like a great idea to me!

Differences from the previous attempts:

  • Only gzip compression. The entire compression interface is now gone, though I might refactor this a bit further to at least retain the ability to add other compression backends.

Is it 100% compatible gzip?

  • My own custom gzip library, sgzip, forked from klauspost's excellent pgzip, which allows for seeking in the compressed data and takes care of multithreading. All of this was previously done inside rclone, but I think a dedicated library with its own rather extensive test suite is better suited to this task.

Nice - yes that seems like a good solution.

You might want to upstream your changes to pgzip? Did you discuss with Klaus?

  • The integration tests actually fully pass this time.

Yipee!

Like the previous attempts, this is still at the work-in-progress stage. But I'm opening this now in case someone is willing to test it.

Promote it on the forum?

@buengese
Member Author

buengese commented Sep 8, 2020

Is it 100% compatible gzip?

Yes! The compression creates 100% standard gzip files, no special sauce.

Nice - yes that seems like a good solution. You might want to upstream your changes to pgzip? Did you discuss with Klaus?

Not yet. The changes are not completely final; I consider them beta right now. More importantly, I don't believe these changes are upstreamable. pgzip has a very clear goal: being a faster (multithreaded) drop-in replacement for compress/gzip. While the changes technically retain this ability, and the compression part barely changed at all, they still introduce some overhead that I'd see as contrary to the goal of pgzip.
The decompression code is still designed to use an external file storing block boundaries to achieve seeking (and, more importantly, reading from an offset). This design is very much tailored to rclone's use case. If you are not limited to an HTTP stream and can freely jump around in the compressed file, as is the case with a file on disk, there are potentially better ways to do this (and people have done so) that, for example, don't require an additional file. So even if one were to extend pgzip's scope to seekable gzip compression, this isn't really a great fit.

Promote it on the forum?

Yes! As soon as I've done a final cleanup pass on this.

One thing I forgot to mention in the first post is that this implementation obviously still retains one very ugly limitation in that the uncompressed file sizes are stored in the names of the compressed files. I really, really hate this, but it's the only way to have accurate file sizes without requiring List to download and parse metadata files.

@ncw
Member

ncw commented Sep 9, 2020

One thing I forgot to mention in the first post is that this implementation obviously still retains one very ugly limitation in that the uncompressed file sizes are stored in the names of the compressed files. I really, really hate this, but it's the only way to have accurate file sizes without requiring List to download and parse metadata files.

Yes, I see what you mean. It would be possible (but a long way from ideal!) to report the size as unknown (-1).

Hmm, unless we implement a general metadata service for rclone... which is certainly something I've been putting background cycles into thinking about. Just about all the backends can store metadata of some type (even if it is only a description field).

@buengese buengese changed the title [WIP] Experimental compression remote - third attempt Experimental compression remote - third attempt Sep 14, 2020
@buengese
Member Author

buengese commented Sep 15, 2020

I'm pretty much done with this for now (of course I still consider this experimental). What do you think - should I make a beta and post on the forum?

@buengese buengese requested a review from ncw September 20, 2020 19:15
@ncw
Member

ncw commented Sep 26, 2020

I'm pretty much done with this for now (of course I still consider this experimental). What do you think - should I make a beta and post on the forum?

Yes, let's get some real-world testing - if you push it to a branch on the main repo then it will make a beta.

I was thinking of making all pull requests by rclone team members make a beta automatically. If I can figure that out, then it would be quite useful, I think. So if an rclone team member sends PR #4566 then this will make a beta with tag pr-4566 - what do you think? I think that would help with the workflow of fixing an issue and getting the user to test it.

I'll give your code a look now.

Member

@ncw ncw left a comment


I've given that a first pass review - see comments inline.

I think it is looking nice!

Next thing is for me to play with it some more :-)

return dst
}

// Sum returns the specified hash from the multihasher
Member


Great idea - should have done this ages ago! This commit needs to stay on its own, but the others can be squashed.

The title of this commit should probably be hash: add MultiHasher.Sum() to retrieve a single specific hash.

@buengese buengese force-pushed the press-new-v2 branch 2 times, most recently from e24e890 to 6b40a53 Compare September 28, 2020 21:44
@buengese
Member Author

@ncw I've addressed all your comments now and also fixed a few more things I noticed while looking over the code again.

  • Now using the same 1 MiB of data we use to determine the MIME type to check if the file is compressible. Currently I've hard-coded 1.1 as the minimum compression ratio - not sure about that value yet. Maybe it should be a config option.
  • Switched to JSON for the metadata and also dropped the gzip compression of the metadata. Not entirely convinced, but it seems to perform pretty well.
  • Also added one more experimental commit that changes the file name format to use base64 (URL encoding without padding) for the file size. That should save a few more characters, and I don't think it should cause any problems.

@segator
Contributor

segator commented Nov 7, 2020

This feature is pretty cool!! I would love to see the metadata service @ncw mentioned. That metadata service would also help future chunking backends, allowing inline dedup/snapshots and other cool features.

Member

@ncw ncw left a comment


I think this is ready to be merged - what do you think?

Are you happy with the metadata format? Is it extensible enough for future needs?

If you are ready, can you squash it into 2 commits (one for hash one for press) and rebase?

I'm very shortly going to add more context.Context into the backend interface (into NewFs and ConfigFs) so depending on the order of things it might need an update.

@buengese buengese force-pushed the press-new-v2 branch 3 times, most recently from d720886 to 1f2acfd Compare November 11, 2020 20:09
@buengese
Member Author

Hopefully final rebase done.

@ncw
Member

ncw commented Nov 11, 2020

That looks good to me - well done!

I'll leave it to you to press the merge button :-)

buengese and others added 2 commits November 13, 2020 02:02
, rclone#1356, rclone#675

This remote implements transparent compression using gzip. It uses JSON for storing metadata.

Co-authored-by: id01 <gaviniboom@gmail.com>
@buengese buengese merged commit aaa8b77 into rclone:master Nov 13, 2020
@segator
Contributor

segator commented Nov 13, 2020

congratulations!

@PrivatePuffin

Interesting work, shame you guys have gone for gzip instead of the way more efficient zstd algorithm.

@buengese
Member Author

We're always open to improvements. I did consider using zstd, but the performance of the Go implementation was still somewhat lacking compared to the gain in compression ratio, so I decided against it for now.

@buengese buengese deleted the press-new-v2 branch March 11, 2021 14:49