Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

maybe_string dealing with non-unicode strings #1329

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

wschuell
Copy link

Hi,

I am cloning and analyzing a big bunch of repositories, and therefore stumble on rare edge cases.
In this repo: https://github.com/LSSTDESC/DeblenderVAE
One of the refs turned out to be non-unicode (a branch name, '0xc3master' ), and that causes even clone to fail.

python3 -c "import pygit2; pygit2.clone_repository(path='temp_repo',url='https://github.com/LSSTDESC/DeblenderVAE')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pygit2/__init__.py", line 217, in clone_repository
    payload.check_error(err)
  File "pygit2/callbacks.py", line 97, in check_error
    raise self._stored_exception
  File "pygit2/callbacks.py", line 424, in wrapper
    return f(*args)
           ^^^^^^^^
  File "pygit2/callbacks.py", line 565, in _update_tips_cb
    s = maybe_string(refname)
        ^^^^^^^^^^^^^^^^^^^^^
  File "pygit2/utils.py", line 37, in maybe_string
    return ffi.string(ptr).decode('utf8')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 20: invalid continuation byte

I tracked the function being wrapped: <function _update_tips_cb at 0x75b8ebb17420>

I found this, about refs names not being necessarily unicode:
https://stackoverflow.com/questions/69174955/what-character-encoding-is-used-in-git-symbolic-refs-especially-on-windows

No error if in .utils, I catch the error like this:

def maybe_string(ptr):
    if not ptr:
        return None

    try:
        return ffi.string(ptr).decode("utf8")
    except BaseException as e:
        return ffi.string(ptr).decode("latin1")

But of course there are other encodings than utf8 and latin1. What could be a generic solution? Does the output of maybe_string need to be a text string and not byte string?

If not, I suggest:

def maybe_string(ptr):
    if not ptr:
        return None

    try:
        return ffi.string(ptr).decode("utf8")
    except UnicodeDecodeError:
        return ffi.string(ptr)

which also clones without error. But it creates a case where the output is only in rare cases a byte string...

@jdavid
Copy link
Member

jdavid commented Nov 27, 2024

maybe_string(...) is used in a number of places, I would keep it returning a text string.

There are other solutions, these work for me:

  1. os.fsdecode(...)
  2. .decode("utf8", errors="replace")
  3. .decode("utf8", errors="surrogateescape")

I think I prefer number 3, could you try it?

Also, please add a unit test, it can simply clone this DeblenderVAE repo.

Thanks!

@wschuell
Copy link
Author

I implemented the surrogate solution, and a unit test using the testrepo. However I had to create the new branch with subprocess (hope git CLI is available though for CI), because the surrogate policy is not applied at encoding either.

I can look into it, but maybe you can tell me if what I did so far complies.

Also, I just created a cloned folder next to the testrepo folder, is it cleaned up automatically, or should I clean it up manually?

@wschuell
Copy link
Author

Just saw that black auto-formatted all the quotes, I can revert it if it's a problem

@jdavid
Copy link
Member

jdavid commented Nov 27, 2024

Yes please, don't make format changes

Everything should be cleaned up automatically, this is handled by pytest. For example when I tried your changes locally the cloned repo is at /tmp/pytest-of-jdavid/pytest-current/.../testrepo/test_nonunicode_repo

The test fails on Windows (AppVeyor), https://ci.appveyor.com/project/jdavid/pygit2/builds/51066441/job/ueawtoexy0bqussu

Better to create the \xc3master branch once in testrepo (test/data/testrepo.zip), then git won't be required when running the tests, and it may fix the tests in AppVeyor.

@wschuell
Copy link
Author

Commit title self-explanatory. Here the pytest output:

___________________________________ test_nonunicode_branchname[\xc3master] ___________________________________

testrepo = pygit2.Repository('/tmp/pytest-of-wschuell/pytest-18/test_nonunicode_branchname__xc0/testrepo/.git/')
bstring = b'\xc3master'

    def test_nonunicode_branchname(testrepo, bstring):
>       testrepo.branches.local.create(bstring.decode("utf8", errors="surrogateescape"),commit=testrepo.head.target)

test/test_nonunicode.py:44: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pygit2.branches.Branches object at 0x78d035376e70>, name = '\udcc3master'
commit = 2be5719152d4f82c7302b1c0932d8e5f0a4a0e98, force = False

    def create(self, name: str, commit, force=False):
>       return self._repository.create_branch(name, commit, force)
E       UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

/home/wschuell/.conda/envs/base_conda_12/lib/python3.12/site-packages/pygit2-1.15.1-py3.12-linux-x86_64.egg/pygit2/branches.py:79: UnicodeEncodeError
========================================== short test summary info ===========================================
FAILED test/test_nonunicode.py::test_nonunicode_branchname[\xc3master] - UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed
============================================= 1 failed in 0.04s ==============================================

I tried to track it down, but this goes into the C code.

@wschuell
Copy link
Author

Alternatively, it works with subprocess using

subprocess.check_output(cmd.decode('utf8',errors='surrogateescape').split(" "), cwd=testrepo.workdir)

The issue is then that non-unicode bytestrings cannot be passed to create branches, but at least my case is solved: cloning a repository with an already existing non-unicode branch.

@jdavid
Copy link
Member

jdavid commented Nov 27, 2024

Windows and macOS tests fail:

What I meant before was to add a new test repo, for example test/data/repo_notutf.zip which already has the non-utf8 branch \xc3master. However, I've tried this and somehow zip screws the file, the reference \xc3master becomes something else.

Alernatively what I can do is to add such a repo to https://github.com/pygit2/

@jdavid
Copy link
Member

jdavid commented Nov 27, 2024

Just did that, see https://github.com/pygit2/test_branch_notutf

Then:

python3 -c "import pygit2; pygit2.clone_repository(path='temp_repo', url='https://github.com/pygit2/test_branch_notutf.git')"

Could you change the unit test to clone https://github.com/pygit2/test_branch_notutf.git ?
The test should be simpler, and maybe it will work with macOS/Windows

@jdavid
Copy link
Member

jdavid commented Nov 29, 2024

Thanks!

The tests for macOS and Windows still fail though. I've created a branch just to try, but regardless the method to decode the errors are the same, see https://github.com/libgit2/pygit2/actions/workflows/tests.yml and https://ci.appveyor.com/project/jdavid/pygit2/history

There may be an issue with libgit2 clone in Windows and macOS, this needs some research..

@wschuell
Copy link
Author

Looking at the error it seems that there is a lockfile written with the name of the ref/branch, and the string is used not as bytestring, which apparently causes it to fail because it does not recognize the surrogates. I guess it could be able to write the files with filenames not utf8 if the string is passed as bytes, but one would have to track down where this happens. I'll have a look in a few days (unless you do before).

@jdavid
Copy link
Member

jdavid commented Dec 31, 2024

Hi (and happy new year).

Could you rebase? I've upgraded to libgit2 1.9, it would be nice to run the CI again, maybe we're lucky and the bug got fixed.

@wschuell
Copy link
Author

Hi,

Sorry for the delay... I thought I would find time before

Still not working but: I fixed the linting, and had a closer look. Did not find anything specific, just figured there might be something in _pygit2.pyi, with the various name and raw_name being bytes and str. But I also thought maybe it could be FS-related. I ll try on a mac tomorrow on an ext/FAT.

Feliz año nuevo tb :)

@wschuell
Copy link
Author

NB: I saw in the macos logs that the declared ARCH is arm64, but the workflow has in its name x86_64, is it a typo, or just me missing something?

@jdavid
Copy link
Member

jdavid commented Dec 31, 2024

NB: I saw in the macos logs that the declared ARCH is arm64, but the workflow has in its name x86_64, is it a typo, or just me missing something?

Good catch, I've clarified it, https://github.com/libgit2/pygit2/actions/runs/12561450033

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants