X Tutup
The Wayback Machine - https://web.archive.org/web/20221225055329/https://github.com/github/linguist/pull/4936
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for generic file extensions #4936

Merged
merged 16 commits into from Aug 30, 2020
Merged

Add support for generic file extensions #4936

merged 16 commits into from Aug 30, 2020

Conversation

Alhadis
Copy link
Collaborator

@Alhadis Alhadis commented Jul 29, 2020

Description

This pull-request adds non-trivial enhancements to Linguist's Heuristic and Extension strategies:

  1. Overly-common file extensions can now be identified as such.
    Generic extensions require files to satisfy a heuristic in order to match a candidate language.
  2. Heuristics can no longer target extensions not listed in a candidate's languages.yml entry.
    Linguist's tests will now fail if such a rule is found in heuristics.yml (see #4904 for rationale).
  3. Two redundant heuristics were modified to comply with point Nimrods, the whole lotta them. #2:
    • Open-ended fallback to Text removed from .builds heuristic.
      It's assumed the author intended for it to return nothing.
    • INI removed from .props heuristic.
      Only 32 files are affected by this change, so it shouldn't have been added in the first place.
  4. .sol has been added as a generic extension.
    Affected languages are Gerber Images and Solidity. Support for the latter was the original motivator for this pull-request; see #4904 for prior discussion.

/cc @alcuadrado, @pchaigno

@Alhadis
Copy link
Collaborator Author

Alhadis commented Jul 29, 2020

  1) Failure:
TestFileBlob#test_language [/home/runner/work/linguist/linguist/test/test_file_blob.rb:671]:
No language for /home/runner/work/linguist/linguist/test/fixtures/Solidity/ignored2.sol

 2) Failure:
TestBlob#test_language [/home/runner/work/linguist/linguist/test/test_blob.rb:276]:
No language for /home/runner/work/linguist/linguist/test/fixtures/Solidity/ignored2.sol

I'd like to know why these tests didn't fail when I ran them locally...

@Alhadis Alhadis mentioned this pull request Jul 29, 2020
3 tasks
@Alhadis Alhadis requested review from pchaigno and lildude Jul 29, 2020
Copy link
Collaborator

@pchaigno pchaigno left a comment

I think I'm okay with this approach, assuming:

  1. We don't hardcode the list of "generic" extensions.
  2. We add a lot more tests in fixtures/, with files from other languages using .sol, and actual Solidity samples.

I'd also like to better understand where we use the samples/ files, apart from end-to-end tests and training of the Bayesian classifier. Those two are fine because we use fixture files for end-to-end tests and we don't use the Bayesian classifier. If we usually rely on the samples/ files for other things, we might have problems...

lib/linguist/language.rb Outdated Show resolved Hide resolved
lib/linguist/heuristics.rb Outdated Show resolved Hide resolved
lib/linguist/heuristics.rb Show resolved Hide resolved
lib/linguist/strategy/extension.rb Show resolved Hide resolved
test/test_heuristics.rb Outdated Show resolved Hide resolved
lib/linguist/language.rb Outdated Show resolved Hide resolved
test/test_blob.rb Show resolved Hide resolved
test/test_file_blob.rb Show resolved Hide resolved
@Alhadis
Copy link
Collaborator Author

Alhadis commented Jul 30, 2020

I'd also like to better understand where we use the samples/ files, apart from end-to-end tests and training of the Bayesian classifier. Those two are fine because we use fixture files for end-to-end tests and we don't use the Bayesian classifier. If we usually rely on the samples/ files for other things, we might have problems...

We currently don't have any Solidity "samples", only test "fixtures" (our nomenclature really needs improvement…). Adding .sol files to other sample directories would actually break the tests, as there's a check to make sure only extensions listed in languages.yml appear in the samples/* directories.

That being said, we can certainly add more real-world examples to the test/fixtures/Solidity directory, but not as samples/.

@pchaigno
Copy link
Collaborator

pchaigno commented Jul 31, 2020

I'd also like to better understand where we use the samples/ files, apart from end-to-end tests and training of the Bayesian classifier. Those two are fine because we use fixture files for end-to-end tests and we don't use the Bayesian classifier. If we usually rely on the samples/ files for other things, we might have problems...

We currently don't have any Solidity "samples", only test "fixtures" (our nomenclature really needs improvement…). Adding .sol files to other sample directories would actually break the tests, as there's a check to make sure only extensions listed in languages.yml appear in the samples/* directories.

I understand that. My point was that we don't and can't have Solidity files in samples/ and I'm therefore wondering if that can have a negative impact. I.e., are there tests (or, less likely, features) we will be missing because of the lack of samples/ Solidity files.

That being said, we can certainly add more real-world examples to the test/fixtures/Solidity directory, but not as samples/.

Yes, that's my point 2 above.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 1, 2020

I've also added .sol as a Gerber Image extension. In-the-wild usage checks out:

Plus, adding this proves that more than one language can share a generic extension.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 5, 2020

@pchaigno Think we'd be able to include what #4258 attempted to fix? Though we already have a dedicated Manpage strategy, it makes more sense to mark numeric file-extensions as "generic" instead.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 13, 2020

@lildude Any input regarding @pchaigno's suggested change? To me, it makes more sense to have generic extensions added to a YAML database named generic.yml (a la vendored filenames and extensions). What do you think?

@lildude
Copy link
Member

lildude commented Aug 13, 2020

Sorry @Alhadis. Things are crazy busy for me at the moment as we're reading the next release of GitHub Enterprise. I'll try take a look when I've got a chance when things quieten down a bit.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 19, 2020

I understand.

@pchaigno What if I added a generic.yml file and modified strategy/extension.rb to use that instead of an array hardcoded into the strategy? I can always extend the scope of this PR to include other generic file extensions:

# generic.yml
# (Some concise but informative description of what this file does and how to update it)

extensions:
- ".1"
- ".2"
- ".3"
- ".4"
- ".5"
- ".6"
- ".7"
- ".8"
- ".9"
- ".in"
- ".mod"
- ".sol"

filenames:
- unoriginal_but_common_filename

@alcuadrado
Copy link

alcuadrado commented Aug 24, 2020

Hey @lildude, do you have a rough estimation about when you'll be able to review this? Not to pressure you, but just asking so we can better coordinate with @Alhadis. Thanks!

@lildude
Copy link
Member

lildude commented Aug 24, 2020

I'm looking at it right now. It might take me a few days to process it and respond, but I've got time this week for Linguist stuff.

Copy link
Member

@lildude lildude left a comment

Some initial thoughts from a quick review, but I don't have any major objections from my initial quick review.

Additionally, do we really need quite so many fixtures as they're only used for testing the heuristics and aren't used for generating the samples.json used by the classifier?

I'll ponder this a bit more this week.

lib/linguist/strategy/extension.rb Outdated Show resolved Hide resolved
lib/linguist/strategy/extension.rb Show resolved Hide resolved
@alcuadrado
Copy link

alcuadrado commented Aug 25, 2020

Thanks for the quick answer, @lildude. This is great news :)

Co-authored-by: Colin Seymour <colin@github.com>
@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 28, 2020

Excluding the numbered files @Alhadis has mentioned,

I'll leave those for a future PR. At the moment, I'm just anxious to get this under the door in time for the next Linguist release.

@lildude
Copy link
Member

lildude commented Aug 29, 2020

I've pondered this for the week, and I comfortable with the approach taken here.

This approach means we lose any further assistance from the XML and classifier strategies but I think this is OK as this we already expect the heuristics to be pretty specific to reduce the chances of falling through to the classifier. I think this will also be a rarely used "exception" (famous last words 😉) so I'm happy to experiment and iterate as needed.

My only request for this PR is to remove the unneeded examples from the test fixtures. I think we only need fixtures to confirm each of the heuristic scenarios, eg we only need one for "ignored" and Gerber and a few to match the various combos of the Solidity regex.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 29, 2020

My only request for this PR is to remove the unneeded examples from the test fixtures. I think we only need fixtures to confirm each of the heuristic scenarios, eg we only need one for "ignored" and Gerber and a few to match the various combos of the Solidity regex.

That was my line of thinking as well, but @pchaigno requested the extra samples for some reason, so… 🤷 Getting some mixed signals here.

Will nuke 'em, hang 5.

@Alhadis Alhadis changed the title Add Solidity support Add support for generic file extensions Aug 29, 2020
@Alhadis Alhadis requested a review from pchaigno Aug 29, 2020
@lildude
Copy link
Member

lildude commented Aug 30, 2020

I'm not sure why @lildude wanted to see these examples removed

Only cos they weren't really testing the heuristic so I didn't see much need for them.

can be please please please not squash all commits (🙏 @lildude).

Only if you say please once more 😜

It'll mean I have to do a little more work curating the release notes, but that's fine in this case.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 30, 2020

It'll mean I have to do a little more work curating the release notes, but that's fine in this case.

@pchaigno requested this before I renamed and rewrote the PR description to place more emphasis on the generic-extension feature, rather than the reasons for adding it (Solidity). I feel what I've written nicely encapsulates everything that's been changed here… I'm still waiting on @pchaigno to give his approval.

@pchaigno
Copy link
Collaborator

pchaigno commented Aug 30, 2020

In general not a big fan of squashing non-trivial PRs, but should be okay here with the new description.

Copy link
Member

@lildude lildude left a comment

🎉 Lets do this!! I'll make a release next week once all the other PRs I've approved have been checked off by you two too.

@Alhadis Alhadis merged commit 69e77e2 into master Aug 30, 2020
@Alhadis Alhadis deleted the generic-extensions branch Aug 30, 2020
@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 30, 2020

Awesome. 🎉 Linguist finally has Solidity support.

Now all that's left to restore balance to the universe is getting rid of that f\*(&#g colour proximity requirement. 😜

@lildude
Copy link
Member

lildude commented Aug 30, 2020

Oh. You merged it already. I just allowed merge commits and was going to then merge. Oh well.

@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 30, 2020

Uh, sorry, I mistook the sudden silence as "go ahead, you merge it". Sorry. 😅

@alcuadrado
Copy link

alcuadrado commented Aug 30, 2020

Great news!

🎉 Lets do this!! I'll make a release next week once all the other PRs I've approved have been checked off by you two too.

Does this mean that in about a week it will be live on github?

@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 30, 2020

Does this mean that in about a week it will be live on GitHub?

You bet your sweet arse it does.

@lildude
Copy link
Member

lildude commented Aug 30, 2020

Does this mean that in about a week it will be live on github?

Indeed. There are a few more PRs waiting reviews from @pchaigno and @Alhadis that I'd like to include but once approved, I'll get cracking with the release. I'm AFK tomorrow so won't start until Tuesday at the earliest.

@alcuadrado
Copy link

alcuadrado commented Aug 30, 2020

Awesome! 🥳🥳🥳

@alcuadrado
Copy link

alcuadrado commented Aug 31, 2020

Will this have any other impact apart from automatically associating the .sol extension to Solidity? I wonder if Linguist is used for other things at GitHub

@Alhadis
Copy link
Collaborator Author

Alhadis commented Aug 31, 2020

No, it won't. I was careful to ensure the existing filename logic wasn't affected.

As for what else Linguist is used for… I'm not staff, so I can't tell you, but I do know that Linguist is used by GitLab, and possibly other VCS-hosting services too.

@alcuadrado
Copy link

alcuadrado commented Aug 31, 2020

Thanks for your reply, @Alhadis

sambacha pushed a commit to freight-trust/linguist that referenced this pull request Sep 4, 2020
@pchaigno
Copy link
Collaborator

pchaigno commented Sep 5, 2020

Yep, GitLab is the only other consumer of Linguist I'm aware of, but they tend to lack a bit behind in terms of releases. So it may take some time before the change is visible on gitlab.com and related software.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants
X Tutup