The Wayback Machine - https://web.archive.org/web/20190417191939/https://github.com/EFForg/https-everywhere/issues/3999
Overhaul repo rule storage #3999

Open
semenko opened this Issue Jan 25, 2016 · 7 comments

3 participants
Contributor

semenko commented Jan 25, 2016

Our ruleset storage is chaotic. Let's ponder overhauling it. (cc @fuglede)

Member

jsha commented Jan 27, 2016

I agree! Do you mean storage in the repo, or the XML format, or the SQLite format, or ...? :-)

I was thinking recently that it's probably time to move the rulesets to a new repo that gets included into the main one. Submodules would be one way, but I would also like to have the property that if you check out the latest https-everywhere and build, you get the latest rulesets, without needing to constantly update the submodule head in the code repo. If we do this, I think it makes sense to start the new repo as a fork of https-everywhere, deleting everything but the XML files, and moving the rulesets higher in the tree. That way we retain the commit history for the rulesets, which is important. We would also need to make sure that the ruleset tests get run in the ruleset repo in addition to the main repo.

It may also be time to split up the rulesets directory by first letter.
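One way the first-letter split could look is sketched below. This is only an illustration: the `rules` root directory, the `_` bucket for names that do not start with a letter or digit, and the helper names are all assumptions, not anything the project has decided.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def bucket_for(filename: str) -> str:
    """Return the subdirectory name for a ruleset file.

    The bucket is the first character, lowercased. Names starting with
    anything other than a letter or digit go to "_" (an assumed convention).
    """
    if not filename:
        raise ValueError("empty ruleset filename")
    first = filename[0].lower()
    return first if first.isalnum() else "_"


def new_path(filename: str, root: str = "rules") -> str:
    """Return the proposed location of a ruleset file under `root`."""
    return str(PurePosixPath(root) / bucket_for(filename) / filename)


def plan_layout(filenames):
    """Group ruleset filenames by bucket, keeping a stable sorted order."""
    layout = defaultdict(list)
    for name in sorted(filenames):
        layout[bucket_for(name)].append(name)
    return dict(layout)
```

Lowercasing the bucket keeps `EFF.xml` and `eBay.xml` in the same directory, which avoids surprises on case-insensitive filesystems.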

Contributor Author

semenko commented Jan 27, 2016

I was pondering ... all of the above! But primarily the repo storage / XML format.

I think we should:

  • Move away from XML
  • Preserve separate files per-rule (so git merges don't become insane)
  • Develop a simpler way for people to add "trivial rule" sites
  • Preserve a history of "broken" sites, rather than a complex git log

I've been pondering some YAML style rules, or perhaps simple, empty files for non-complex rules.

What if someone could touch somedomain.tld && git add somedomain.tld and have their rule added as a simple rule?
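A rough sketch of how a build step might expand such an empty file into a full ruleset follows. It assumes the existing XML layout of a `ruleset` element holding `target` and `rule` elements, and that a "trivial" rule simply rewrites `^http:` to `https:`; the function names are hypothetical.

```python
from xml.sax.saxutils import quoteattr


def trivial_ruleset(domain: str) -> str:
    """Build a minimal ruleset that upgrades every request for `domain`."""
    return (
        f"<ruleset name={quoteattr(domain)}>\n"
        f"\t<target host={quoteattr(domain)} />\n"
        '\t<rule from="^http:" to="https:" />\n'
        "</ruleset>\n"
    )


def expand(filename: str, content: str) -> str:
    """Treat an empty file as a trivial rule named after the file.

    Files that already have content are assumed to be hand-written
    rulesets and are returned unchanged.
    """
    if content.strip():
        return content
    return trivial_ruleset(filename)
```

With this approach, `touch somedomain.tld` would be enough to produce a ruleset whose only target is `somedomain.tld`; anything more complex would still need a hand-written file.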

semenko changed the title from "Overhaul rule storage" to "Overhaul repo rule storage" on Jan 27, 2016

Member

jsha commented Jan 28, 2016

> Move away from XML

I disagree on this one. I've thought about it a number of times, but I think there's nothing fundamental about the XML format that makes it unsuitable, and programmatically reformatting all our XML into YAML or JSON would lose all history. I am not a huge fan of XML, but it's not bad enough to kill.

That said, I'm totally into improving our build-time transforms of the XML to improve stored size, in-memory size, and speed of access.

> Develop a simpler way for people to add "trivial rule" sites

Yes. Caveat: This will increase the total volume of submissions greatly. We need to do two things first:

  1. Decide on a scope for ruleset inclusion in HTTPS Everywhere. Let's Encrypt has issued for ~450,000 unique domains. Even if we assume all those domains work perfectly, it's clearly not suitable to include them all in HTTPS Everywhere. It's just too much data.
  2. Improve our fetch testing so that it runs at the time a PR is submitted, and checks for mixed content both on the root page of a domain and on a small set of pages crawled from the root.
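The mixed-content part of such a fetch test could be approximated as below. This is a sketch under assumptions, not the project's test code: it inspects already-fetched HTML, only looks at `script`, `img`, and `iframe` sources (a deliberately incomplete list), and leaves fetching and crawling out entirely.

```python
from html.parser import HTMLParser

# Tag/attribute pairs treated as subresource loads (assumed, not exhaustive).
SUBRESOURCE_ATTRS = {("script", "src"), ("img", "src"), ("iframe", "src")}


class MixedContentFinder(HTMLParser):
    """Collect subresources that a page served over HTTPS loads over HTTP."""

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        for name, value in attrs:
            if (tag, name) in SUBRESOURCE_ATTRS and value:
                if value.lower().startswith("http://"):
                    self.insecure.append((tag, value))


def find_mixed_content(html: str):
    """Return a list of (tag, url) pairs for insecure subresources."""
    finder = MixedContentFinder()
    finder.feed(html)
    finder.close()
    return finder.insecure
```

Plain links (`a href`) are ignored on purpose, since navigating to an HTTP page is not mixed content. A PR-time check would run this on the root page and on the handful of pages crawled from it.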

> Preserve a history of "broken" sites, rather than a complex git log

I'm not sure what you mean here.

Contributor Author

semenko commented Jan 28, 2016

> > Preserve a history of "broken" sites, rather than a complex git log
>
> I'm not sure what you mean here.

e.g. #3983 -- we sometimes git rm very broken rules. Since rule .xml names aren't always equal to the domain, the history can be hard to find.
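One possible aid here is an index from target host to ruleset file, so a domain can be looked up even when the file name differs. The sketch below assumes rulesets list their hosts in `target` elements with a `host` attribute; the function name and the in-memory input format are hypothetical.

```python
import xml.etree.ElementTree as ET


def index_targets(rulesets):
    """Map each target host to the ruleset files that mention it.

    `rulesets` is a dict of filename -> XML text. Returns a dict of
    host -> sorted list of filenames.
    """
    index = {}
    for filename, xml_text in rulesets.items():
        root = ET.fromstring(xml_text)
        for target in root.iter("target"):
            host = target.get("host")
            if host:
                index.setdefault(host, []).append(filename)
    return {host: sorted(files) for host, files in index.items()}
```

If such an index were saved whenever a rule is removed, the file name could be recovered from the domain and then passed to `git log -- <path>` to find the history.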


vgturtle127 commented Feb 24, 2016

@semenko @jsha Has any work been done on this? I was just curious. This seems cool.

Member

jsha commented Mar 13, 2016

Only one small part of it: Now we ship the extension with a .json file containing XML rulesets, instead of a .sqlite file containing XML rulesets.


vgturtle127 commented Mar 14, 2016

@jsha That is a step in the right direction, I would say. 😄
