Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #9507 Describe accurately acceptable package names #9508

Merged
merged 1 commit into from
Jan 15, 2025
Merged

Conversation

mpilgrem
Copy link
Collaborator

@mpilgrem mpilgrem commented Dec 10, 2023

Copy link
Collaborator

@ulysses4ever ulysses4ever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Terrific, thank you! Only soft cosmetic suggestions from me.

doc/cabal-package-description-file.rst Outdated Show resolved Hide resolved
doc/cabal-package-description-file.rst Outdated Show resolved Hide resolved
doc/cabal-package-description-file.rst Outdated Show resolved Hide resolved
@Bodigrim
Copy link
Collaborator

Could someone explain how Unicode package names are supposed to work?

I’d expect that the issue is in parser, not in documentation.

@mpilgrem
Copy link
Collaborator Author

@Bodigrim, the short answer is that Unicode package names work like other package names with the exception that Hackage will not accept them for 'publication'. Looking at Cabal's code, it seems to have a long history of accepting Unicode package names (by using isAlphaNum as the core test). Also, looking at the oldest base on Hackage (4.0.0.0, from 2009):

-- | Selects alphabetic or numeric digit Unicode characters.
--
-- Note that numeric digits outside the ASCII range are selected by this
-- function but not by 'isDigit'.  Such digits may be part of identifiers
-- but are not used by the printer and reader to represent numbers.
isAlphaNum              :: Char -> Bool

@Bodigrim
Copy link
Collaborator

@mpilgrem Does cabal sdist or cabal install work on packages with non-ASCII names?

@Bodigrim
Copy link
Collaborator

Also Unicode package names are a terrible idea from security perspective: base, basе, bаse and bаsе are four different packages.

@mpilgrem
Copy link
Collaborator Author

mpilgrem commented Dec 10, 2023

@Bodigrim, on sdist and Unicode package names, I think the answer is 'Yes'. stack sdist is built on top of the Cabal function and it does not complain. EDIT: That is, I think sdist is Cabal-centric (what Cabal will accept) not Hackage-centric (what Hackage will accept). EDIT: I now think I was mistaken - my better understanding is below.

@mpilgrem
Copy link
Collaborator Author

I should say, I have no skin in the game on what the Cabal package description format specification should say and/or what the Cabal library/tool should do (given the specification). My objective is only that the Cabal User Guide should describe accurately what, in fact, Cabal accepts. If that is a 'bug' (perhaps a long-standing one) the Guide should warn users accordingly.

@ulysses4ever
Copy link
Collaborator

I agree with @mpilgrem here: the docs should explain what is the current state of affairs faithfully. If you don't like that state, that would be another ticket to open.

On the other hand. The docs may gently steer people away from Unicode names by, for instance, mentioning that you can't publish them on Hackage and they are not widely used and may have security implications.

Note that this patch clarifies the docs in several ways, not only w.r.t. the Unicode business.

@mpilgrem mpilgrem force-pushed the fix9507 branch 2 times, most recently from 4aa738e to fdb8ad4 Compare December 10, 2023 14:16
@mpilgrem
Copy link
Collaborator Author

I have put through the suggestions of @ulysses4ever. I have also done two other things:

  • In section 3.2.2, I have replaced 'only numbers' with 'only the digits 0 to 9'. That is because the earlier reference to 'numbers' in the same sentence means more characters than just the European digits; and

  • The definition of unqual-name is wrong. It refers to 'alphabetic character' and alpha (in the formula), when what it actually intends is an alphanumeric character that is not a digit 0 to 9. In the formula, I have tried to 'borrow' the notation that is used in the definition of version and express "alpha-num that is a member of the complement of ["0" ... "9"]" to replace "alpha". However, I can't test that what I have done actually renders to express that.

@ulysses4ever
Copy link
Collaborator

fyi, docs from this pr are rendered here: https://cabal--9508.org.readthedocs.build/en/9508/

@mpilgrem
Copy link
Collaborator Author

@ulysses4ever, many thanks for the pointer. I can confirm that the unqual-name formula is rendering as I intended.

@mpilgrem
Copy link
Collaborator Author

@Bodigrim, returning to cabal sdist, I think I was wrong, previously. I've now tried cabal sdist directly. For my test package (test١٢٣), it creates test١٢٣\dist-newstyle\sdist\test١٢٣-0.1.0.0.tar.gz, but inside that archive file and the tar, the files and directories have been renamed testabc (eg testabc.cabal). Stack does something different to force UTF8 encoding (extract):

let tarPath isDir fp =
        case Tar.toTarPath isDir (forceUtf8Enc (pkgIdName FP.</> fp)) of
          Left e -> prettyThrowIO $ ToTarPathException e
          Right tp -> pure tp
      -- convert a String of proper characters to a String of bytes in UTF8
      -- encoding masquerading as characters. This is necessary for tricking the
      -- tar package into proper character encoding.
      forceUtf8Enc = S8.unpack . T.encodeUtf8 . T.pack  

In my reading, I came accross open issue: #3758 (Lack of support for non-ASCII characters in cabal sdist) and haskell/tar#6 (Paths with unicode characters in them?).

@mpilgrem mpilgrem force-pushed the fix9507 branch 2 times, most recently from ba6d1c4 to 05305db Compare December 10, 2023 17:59
@mpilgrem
Copy link
Collaborator Author

mpilgrem commented Dec 10, 2023

I have improved/corrected the formula for unqual-name. I had not understood how to express correctly "an alpha-num that is a member of the complement of ["0" ... "9"]". I now do that using 'intersection'. EDIT: In passing, I fixed the lack of smart quotes ( “ ” ) in a couple of other formulas.

@ulysses4ever
Copy link
Collaborator

Hey @mpilgrem ! Do you still want to merge this?

@mpilgrem
Copy link
Collaborator Author

@ulysses4ever, thanks for the ping. I'll return to this and see what is the current position.

@mpilgrem mpilgrem force-pushed the fix9507 branch 4 times, most recently from 3c1b3b2 to a329900 Compare January 12, 2025 17:18
@mpilgrem
Copy link
Collaborator Author

mpilgrem commented Jan 12, 2025

I've run into a hurdle. Cabal's CI has a step where it generates a template for that part of the manual, and then compares it to the actual manual contents. The step that generates the template assumes that each item will be described with some narrative followed by a single 'formula'. That is no longer the case for my suggested correction, which was formatted to be useful to people and has two 'formulas':

  • the 'formula' that describes alpha-num-not-digit; and
  • the varied original 'formula' that replaces the incorrect alpha with alpha-num-not-digit.

Part of the template-generating code is in a package Cabal-described which is not published on Hackage. The relevant part is in module Distribution.Described:

reUnqualComponent :: GrammarRegex a
reUnqualComponent = RENamed "unqual-name" $
    REMunch1 (reChar '-') component
  where
    component
        = REMunch reEps (RECharSet csAlphaNum)
        -- currently the parser accepts "csAlphaNum `difference` "0123456789"
        -- which is larger set than CS.alpha
        --
        -- Hackage rejects non ANSI names, so it's not so relevant.
        <> RECharSet CS.alpha
        <> REMunch reEps (RECharSet csAlphaNum)

The code comment seems to reference the problem with the manual that my changes to it are seeking to fix.

Perhaps the path of least resistence is to refactor my two 'formulas' into one 'formula', even if that makes them more difficult for people to read. However, that means I will have to work out what is the grammar used in Distribution.Described. That may take a little time.

EDIT: I took a different path, and moved the alpha-num-not-digit 'formula' to the introduction to the relevant section of the manual page.

@mpilgrem mpilgrem force-pushed the fix9507 branch 3 times, most recently from 05d055a to 0ab62e1 Compare January 12, 2025 20:28
@mpilgrem mpilgrem force-pushed the fix9507 branch 4 times, most recently from 7a8a994 to d02ed81 Compare January 12, 2025 22:13
@mpilgrem
Copy link
Collaborator Author

@ulysses4ever, from my perspective, this is good to be merged.

@ulysses4ever
Copy link
Collaborator

Terrific and thank you! Do you want to do the honor and put the merge-me label on it @mpilgrem ?

EDIT: I took a different path, and moved the alpha-num-not-digit 'formula' to the introduction to the relevant section of the manual page.

Incidentally, that's what I was going to suggest.

@mpilgrem mpilgrem added the merge me Tell Mergify Bot to merge label Jan 13, 2025
@mergify mergify bot added the ready and waiting Mergify is waiting out the cooldown period label Jan 13, 2025
@mergify mergify bot added the merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days label Jan 15, 2025
Also extinguishes false positive in typo detection.

Also aims to make corresponding changes to buildinfo-reference-generator, including the `Cabal-described` package. Adds `alphanumNotDigit` and now specifies `alphanum` by inserting digits into `alphanumNotDigit`.

The description of `alpha-num-not-digit` is added to the introduction to the section on 'Non-terminals' to accommodate the limitation of buildinfo-reference-generator that only a single 'formula' can follow a single paragraph of narrative for each non-terminal production.
@mergify mergify bot merged commit 9b8540d into master Jan 15, 2025
47 checks passed
@mergify mergify bot deleted the fix9507 branch January 15, 2025 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days merge me Tell Mergify Bot to merge ready and waiting Mergify is waiting out the cooldown period
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants