The Wayback Machine - https://web.archive.org/web/20250927201025/https://github.com/rclone/rclone/pull/4566

Conversation

buengese
Member

@buengese buengese commented Sep 3, 2020

Let's give this another try. In the last few days I've once again reworked the experimental compression remote started by id01 and continued by me a while ago. I've never been quite happy with the original - it just seemed too complicated and tried to do too many things. As a result, this third attempt is much smaller in scale.
Differences from the previous attempts:

  • Only gzip compression. The entire compression interface is now gone, though I might refactor this a bit further to at least retain the ability to add other compression backends.
  • My own custom gzip library, sgzip, forked from klauspost's excellent pgzip, which allows for seeking in the compressed data and takes care of multithreading. All of this was previously done inside rclone, but I think a dedicated library with its own rather extensive test suite is better suited to this task.
  • The integration tests actually fully pass this time.

Like the previous attempts, this is still at the work-in-progress stage. But I'm opening this now in case someone is willing to test it.

@ncw
Member

ncw commented Sep 8, 2020

Let's give this another try. In the last few days I've once again reworked the experimental compression remote started by id01 and continued by me a while ago. I've never been quite happy with the original - it just seemed too complicated and tried to do too many things. As a result, this third attempt is much smaller in scale.

That seems like a great idea to me!

Differences from the previous attempts:

  • Only gzip compression. The entire compression interface is now gone, though I might refactor this a bit further to at least retain the ability to add other compression backends.

Is it 100% compatible gzip?

  • My own custom gzip library, sgzip, forked from klauspost's excellent pgzip, which allows for seeking in the compressed data and takes care of multithreading. All of this was previously done inside rclone, but I think a dedicated library with its own rather extensive test suite is better suited to this task.

Nice - yes that seems like a good solution.

You might want to upstream your changes to pgzip? Did you discuss with Klaus?

  • The integration tests actually fully pass this time.

Yipee!

Like the previous attempts, this is still at the work-in-progress stage. But I'm opening this now in case someone is willing to test it.

Promote it on the forum?

@buengese
Member Author

buengese commented Sep 8, 2020

Is it 100% compatible gzip?

Yes! The compression creates 100% standard gzip files, no special sauce.

Nice - yes that seems like a good solution. You might want to upstream your changes to pgzip? Did you discuss with Klaus?

Not yet. The changes are not completely final; I consider them beta right now. More importantly, I don't believe these changes are upstreamable. pgzip has a very clear goal: being a faster (multithreaded) drop-in replacement for compress/gzip. While the changes technically retain this ability, and the compression part barely changed at all, they still introduce some overhead that I'd see as contrary to the goal of pgzip.
The decompression code is still designed to use an external file storing block boundaries to achieve seeking (and, more importantly, reading from an offset). This design is very much tailored to rclone's use case. If you are not limited to an HTTP stream and can freely jump around in the compressed file, as is the case with a file on disk, there are potentially better ways to do this (and people have done so) that, for example, don't require an additional file. So even if one were to extend pgzip's scope to seekable gzip compression, this isn't really a great fit.

Promote it on the forum?

Yes! As soon as I've done a final cleanup pass on this.

One thing I forgot to mention in the first post is that this implementation obviously still retains one very ugly limitation in that the uncompressed file sizes are stored in the names of the compressed files. I really, really hate this, but it's the only way to have accurate file sizes without requiring List to download and parse metadata files.

@ncw
Member

ncw commented Sep 9, 2020

One thing I forgot to mention in the first post is that this implementation obviously still retains one very ugly limitation in that the uncompressed file sizes are stored in the names of the compressed files. I really, really hate this, but it's the only way to have accurate file sizes without requiring List to download and parse metadata files.

Yes, I see what you mean. It would be possible (but a long way from ideal!) to report the size as unknown (-1).

Hmm, unless we implement a general metadata service for rclone... which is certainly something I've been putting background cycles into thinking about. Just about all the backends can store metadata of some type (even if it is only a description field).

@buengese buengese changed the title [WIP] Experimental compression remote - third attempt Experimental compression remote - third attempt Sep 14, 2020
@buengese
Member Author

buengese commented Sep 15, 2020

I'm pretty much done with this for now (of course I still consider this experimental). What do you think - should I make a beta and post on the forum?

@buengese buengese requested a review from ncw September 20, 2020 19:15
@ncw
Member

ncw commented Sep 26, 2020

I'm pretty much done with this for now (of course I still consider this experimental). What do you think - should I make a beta and post on the forum?

Yes, let's get some real-world testing - if you push it to a branch on the main repo then it will make a beta.

I was thinking of making all pull requests by rclone team members make a beta automatically. If I can figure that out, then it would be quite useful, I think. So if an rclone team member sends PR #4566 then this will make a beta with tag pr-4566 - what do you think? I think that would help with the workflow of fixing an issue and getting the user to test it.

I'll give your code a look now.

Member

@ncw ncw left a comment


I've given that a first pass review - see comments inline.

I think it is looking nice!

Next thing is for me to play with it some more :-)

return dst
}

// Sum returns the specified hash from the multihasher
Member


Great idea - should have done this ages ago! This commit needs to stay on its own, but the others can be squashed.

The title of this commit should probably be hash: add MultiHasher.Sum() to retrieve a single specific hash.

@buengese buengese force-pushed the press-new-v2 branch 2 times, most recently from e24e890 to 6b40a53 Compare September 28, 2020 21:44
@buengese
Member Author

@ncw I've addressed all your comments now and also fixed a few more things I noticed while looking over the code again.

  • Now using the same 1 MiB of data we use to determine the MIME type to check if the file is compressible. Currently I've hard-coded 1.1 as the minimum compression ratio - not sure about that value yet. Maybe it should be a config option.
  • Switched to JSON for the metadata and also dropped the gzip compression of the metadata. Not entirely convinced, but it seems to perform pretty well.
  • Also added one more experimental commit that changes the file name format to use base64 (URL encoding without padding) for the file size. That should save a few more characters, and I don't think it should cause any problems.

@segator
Contributor

segator commented Nov 7, 2020

This feature is pretty cool!! I would love to see the metadata service @ncw mentioned. That metadata service would also help future chunking backends, allowing inline dedup/snapshots and other cool features.

Member

@ncw ncw left a comment


I think this is ready to be merged - what do you think?

Are you happy with the metadata format? Is it extensible enough for future needs?

If you are ready, can you squash it into 2 commits (one for hash one for press) and rebase?

I'm very shortly going to add more context.Context into the backend interface (into NewFs and ConfigFs) so depending on the order of things it might need an update.

@buengese buengese force-pushed the press-new-v2 branch 3 times, most recently from d720886 to 1f2acfd Compare November 11, 2020 20:09
@buengese
Member Author

Hopefully final rebase done.

@ncw
Member

ncw commented Nov 11, 2020

That looks good to me - well done!

I'll leave it to you to press the merge button :-)

buengese and others added 2 commits November 13, 2020 02:02
, rclone#1356, rclone#675

This remote implements transparent compression using gzip. It uses JSON for storing metadata.

Co-authored-by: id01 <gaviniboom@gmail.com>
@buengese buengese merged commit aaa8b77 into rclone:master Nov 13, 2020
@segator
Contributor

segator commented Nov 13, 2020

congratulations!

@PrivatePuffin

Interesting work, shame you guys have gone for gzip instead of the way more efficient zstd algorithm.

@buengese
Member Author

We're always open to improvements. I did consider using zstd, but the performance of the Go implementation was still somewhat lacking compared to the gain in compression ratio, so I decided against it for now.

@buengese buengese deleted the press-new-v2 branch March 11, 2021 14:49