Not all buckets listed, remove mass appropriation buckets #94

Open
ltguillaume opened this issue Nov 7, 2024 · 10 comments

Comments

@ltguillaume

ltguillaume commented Nov 7, 2024

Example: searching for betterbird-nl shows 3 results, all of them copies (often outdated) of the original manifest from my bucket, which is not shown. It also doesn't show betterbird-nl, but only betterbird-future-nl.

The same holds for more of the original manifests from my bucket ltguillaume/schep.

This can cause significant problems for end users, who get outdated manifests from these big secondary buckets that copy original manifests from other buckets and fail to update when they change.

  • Any manual changes by the original author to fix the manifest are ignored
  • The version checks themselves aren't regular at all, sometimes resulting in manifests that are over a year old, but more often about 2 months old, which causes the user to miss out on several updates.

I consider it bad practice to give these "mass appropriation buckets" a platform. They should be tagged as untrustworthy, or removed from scoop.sh completely.

Examples:
okibcn/ScoopMaster by @okibcn (with often very outdated versions)
samuelshi/ScoopBuckets-Unofficial by @samuelshi
cmontage/scoopbucket-third by @cmontage
anderlli0053/DEV-tools by @anderlli0053

@anderlli0053

@ltguillaume Bite me!

I also have my own manifests in my bucket; the bots and other scripts do the rest. While it is true that some manifests/apps are outdated and the bot simply ignores them, that doesn't mean I should delete my own bucket. (I can't speak for the other buckets you listed alongside mine.)

In simple terms: if you don't like it (my bucket), don't add it to Scoop and don't use it, simple as that.

Also, you could use this, which handles any problems from the bucket: https://github.com/winpax/sfsu

@ltguillaume
Author

ltguillaume commented Nov 7, 2024

@anderlli0053 Ah, I see the mistake I made here: I never meant to say that the buckets (the repos themselves) should be deleted: it's perfectly fine if you use it for your own personal benefit. What I meant to say is that they should be tagged as untrustworthy on scoop.sh or removed (read: unlisted) from there. This issue has been created within the context of scoop.sh, not of Scoop itself.

I wasn't aware of sfsu and the Sprinkles library, thanks for the tip.

@gpailler
Collaborator

gpailler commented Nov 8, 2024

Hello @ltguillaume,

Your bucket is properly indexed, but according to the indexer logs, some of your manifests are malformed and cannot be parsed. Replacing the TAB control characters with spaces should resolve the issue, and all your manifests should appear within a few hours after this change.

[22:30:36 ERR 15] Unable to parse manifest bucket/betterbird-future-nl.json from https://github.com/ltguillaume/schep
[22:30:36 ERR 15] Unable to parse manifest bucket/betterbird-future.json from https://github.com/ltguillaume/schep
[22:30:36 ERR 18] Unable to parse manifest bucket/betterbird-nl.json from https://github.com/ltguillaume/schep
[22:30:36 ERR 18] Unable to parse manifest bucket/betterbird.json from https://github.com/ltguillaume/schep
[22:30:36 ERR 18] Unable to parse manifest bucket/firefox-esr-nl.json from https://github.com/ltguillaume/schep
[22:30:36 ERR 18] Unable to parse manifest bucket/firefox-esr.json from https://github.com/ltguillaume/schep
[22:30:36 ERR 18] Unable to parse manifest bucket/librewolf.json from https://github.com/ltguillaume/schep
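The fix suggested above (replacing TAB characters with spaces) can be sketched as a small script. This is only an illustration: the indexer's parser is not shown in this thread, the function names are hypothetical, and the width of 4 spaces per tab is an assumption. Note that it replaces every literal TAB, including any that sit inside a string value.

```python
from pathlib import Path
from typing import List


def replace_tabs(text: str, spaces_per_tab: int = 4) -> str:
    """Replace every literal TAB character with spaces (width is an assumption)."""
    return text.replace("\t", " " * spaces_per_tab)


def fix_bucket(bucket_dir: str) -> List[str]:
    """Rewrite manifests that contain TAB characters; return their file names."""
    changed = []
    for path in sorted(Path(bucket_dir).glob("*.json")):
        original = path.read_text(encoding="utf-8")
        fixed = replace_tabs(original)
        if fixed != original:
            path.write_text(fixed, encoding="utf-8")
            changed.append(path.name)
    return changed
```

Calling `fix_bucket("bucket")` from the root of a bucket repository would rewrite only the manifests that actually contain tabs and report which ones were touched.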

Regarding the meta-buckets and the duplicate entries they produce, we have added a "Distinct manifests only" filter option (#58), which should address this issue. Once your manifest is indexed, it should be the only result when this option is selected.

We previously discussed completely removing meta-buckets from the search results. However, as @anderlli0053 mentioned, there are legitimate uses for meta-buckets. Some may contain a mix of original and copied manifests, and certain users might prefer a single meta-bucket instead of multiple smaller ones. We also want to avoid maintaining a manual list of "Untrustworthy" buckets, as it may not be accurate and/or even misleading for the users.

@ltguillaume
Author

ltguillaume commented Nov 10, 2024

@gpailler Thank you so much for your complete response. I see some tabs had indeed snuck in since my last change. I'll make sure to run formatjson.ps1 after making any changes from now on.

As for the Distinct manifests only filter, I'm not really sure what the conditions are, but when I enable it, I see results that only prove my point:

  1. My librewolf manifest isn't listed at all, even though it differs from all the ones that are listed, and 22 out of 28 "distinct" results are significantly outdated, by up to a year or more. This is simply dangerous for the end user, especially when it comes to a browser.
  2. When searching for e.g. firefox-nl or firefox-esr-nl, the first result (why first?) is from anderlli0053/DEV-tools, which is a year old and offers a very old version, even if it were the bucket's intention to offer v115 for Windows 7/8 (it offers 115.2.1 while the latest is 115.17.0). The whole point of Scoop is to easily keep software up-to-date, and these buckets evidently prevent that.
  3. The same thing happens with simplex-desktop, where a two-month-old version from okibcn/ScoopMaster is listed first.
  4. For gajim, my manifest is unique, but it's not listed. Only 2 out of 8 listed are up-to-date.

All this proves that scoop.sh is made completely unreliable by indexing these mass appropriation buckets:

  • The search results are unnecessarily complicated by (1) too many results and (2) the lack of guarantee that the manifests will be updated at all
  • Necessary changes to the original manifests won't be applied, so the applications might stop working completely
  • Applications won't be updated for years in many cases, which can be very dangerous for the end user in terms of security, especially when it comes to browsers

As such, it undermines the very reason Scoop even exists: keeping your software up-to-date.

@gpailler
Collaborator

Hello @ltguillaume,

To detect duplicates, used by the Distinct manifests only filter, I follow this approach:

  • A hash is created from the app's name and version. If multiple manifests share the same hash, they are treated as duplicates.

  • In the case of duplicates, manifests are sorted based on criteria: Official repository, Commit date, and Number of stars. The first result in this sorted list is considered the "original" manifest. You can find the logic here.
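The two rules above can be sketched roughly as follows. This is not the scoop.sh implementation (the linked logic is not reproduced in this thread): the `Manifest` fields, the choice of SHA-1, and the sort directions (official first, then earliest commit, then most stars) are assumptions inferred from the examples discussed in this issue.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby


@dataclass(frozen=True)
class Manifest:
    bucket: str
    name: str
    version: str
    official: bool       # comes from an official bucket such as extras
    committed: datetime  # commit date of the manifest
    stars: int           # stars of the bucket repository


def duplicate_key(m: Manifest) -> str:
    """Hash of name and version; manifests with equal keys count as duplicates."""
    return hashlib.sha1(f"{m.name}\n{m.version}".encode("utf-8")).hexdigest()


def sort_key(m: Manifest):
    # Assumed directions: official first, earliest commit next, most stars last.
    return (not m.official, m.committed, -m.stars)


def pick_originals(manifests):
    """Keep one "original" manifest per group of duplicates."""
    ordered = sorted(manifests, key=duplicate_key)
    return [min(group, key=sort_key)
            for _, group in groupby(ordered, key=duplicate_key)]
```

Under these assumptions, a manifest in an official bucket wins over an identical name and version committed a minute earlier elsewhere, while between two non-official buckets the earlier commit wins.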

Regarding your examples:

  1. Your librewolf manifest is considered a duplicate of librewolf from the official extras bucket. Your version was committed one minute before the one from the extras bucket but the result from the official bucket is considered as the "original" one based on the rules described above.

  2. For firefox-esr-nl, the version from anderlli0053/DEV-tools appears first when sorted by Best match since, with this sorting, the number of stars in a repository is a primary factor. You pointed out a specific case where anderlli0053/DEV-tools has a very old version, but generally, these meta buckets tend to have recent versions, and the algorithm tends to prioritize popular buckets.

  3. Same as (2). With Best match sorting, the version from a popular bucket is shown first. The algorithm assumes that a popular bucket likely offers a stable version, whereas a less popular bucket might host a nightly build / a less popular version.

  4. Your gajim manifest is detected as a duplicate of gajim from mgziminsky/scoop-bucket/, which was committed about an hour before yours.

That said, I understand your concerns, but I don’t see anything particularly problematic here. https://scoop.sh is simply a search engine, and I trust users are capable of choosing the most up-to-date version or selecting from the most popular repository if they prefer a single bucket. I think https://scoop.sh offers a variety of sorting and filtering options that provide enough flexibility to meet users' needs and allow them to prioritize the results they’re looking for.

@ltguillaume
Author

ltguillaume commented Nov 11, 2024

Thanks again for the extensive reply. I'll get around to commenting on the specific examples, but for now, may I propose an improvement for the hashing method? Some manifests may have the same name and version while working differently from the others, and that difference will be mentioned in the description.

If you include the description in the calculation of the hash, this will make sure these distinct variants won't be discarded as being the same.
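A minimal sketch of this proposal, assuming the hash is built by joining the fields and hashing them with SHA-1 (the actual hashing code is not shown in this thread, and `manifest_hash` is a hypothetical name):

```python
import hashlib
from typing import Optional


def manifest_hash(name: str, version: str, description: Optional[str] = None) -> str:
    """Hash name and version, optionally including the description."""
    parts = [name, version]
    if description is not None:
        # Trim surrounding whitespace so trivial differences don't split duplicates.
        parts.append(description.strip())
    return hashlib.sha1("\n".join(parts).encode("utf-8")).hexdigest()
```

With the description included, two manifests for the same name and version only collapse into one result when their descriptions also match, so a variant that documents a different setup stays visible. The trade-off is that copies with a slightly reworded description would no longer be detected as duplicates.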

As for repo popularity vs. up-to-dateness, I think it's pretty clear which one should win in the ordering.

@gpailler
Collaborator

It's a fair point. I need to check if adding the description to the hash doesn't show too many real duplicates, but it's worth a try.

@ltguillaume
Author

ltguillaume commented Nov 11, 2024

  1. Your librewolf manifest is considered a duplicate of librewolf from the official extras bucket. Your version was committed one minute before the one from the extras bucket but the result from the official bucket is considered as the "original" one based on the rules described above.

My manifest does not use the portable launcher (ironic, since I created it), while keeping the profile inside scoop\persist\librewolf. This makes it possible to use direct shortcuts to the LibreWolf executable (e.g. in the taskbar) and to use LibreWolf's native shell associations.

The same goes for my firefox-esr* and betterbird* manifests.

  2. For firefox-esr-nl, the version from anderlli0053/DEV-tools appears first when sorted by Best match since, with this sorting, the number of stars in a repository is a primary factor.

I think it may be a mistake to let "popularity" be a factor in this, especially when it may "win" against recency. The whole reason for Scoop's existence is to keep your software up-to-date, so any "popular" bucket whose manifests are months or years old, while many new versions of the software have been released, is de facto "violating" the very goal of the software (if the offering of old versions is unintended and undocumented).

You pointed out a specific case where anderlli0053/DEV-tools has a very old version, but generally, these meta buckets tend to have recent versions, and the algorithm tends to prioritize popular buckets.
3. Same as (2). With Best match sorting, the version from a popular bucket is shown first. The algorithm assumes that a popular bucket likely offers a stable version, whereas a less popular bucket might host a nightly build / a less popular version.

The general rule doesn't apply in my sampling at all, which at least indicates that there is an issue. I propose to let go of the assumption of (3) and to address the issue by listing the manifests with the latest software versions first before applying a popularity contest to the filtering 😛

That said, I understand your concerns, but I don’t see anything particularly problematic here. https://scoop.sh is simply a search engine, and I trust users are capable of choosing the most up-to-date version or selecting from the most popular repository if they prefer a single bucket. I think https://scoop.sh offers a variety of sorting and filtering options that provide enough flexibility to meet users' needs and allow them to prioritize the results they’re looking for.

I really think it's unnecessarily messy to say the least. With the example of 28 results for the same software, no one will look at the second page. And considering you use the term "Best match" in your filters, it is highly likely that people choose a horribly outdated manifest and then trust their software is kept up-to-date.

This can be improved significantly, though, by letting newer versions be leading, instead of the repo's popularity.
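As a rough illustration of "newer versions leading", results could be sorted by version first and by stars only as a tie-breaker. This is a sketch under assumptions: versions are treated as dotted numbers, each result is a hypothetical `(bucket, version, stars)` tuple, and the function names are made up for the example.

```python
import re


def version_key(version: str) -> tuple:
    """Turn '115.17.0' into (115, 17, 0); non-numeric parts are ignored."""
    return tuple(int(n) for n in re.findall(r"\d+", version))


def newest_first(results):
    """Sort (bucket, version, stars) tuples by version, then stars, descending."""
    return sorted(results, key=lambda r: (version_key(r[1]), r[2]), reverse=True)
```

Comparing parsed numbers matters here: as plain strings, "115.2.1" sorts after "115.17.0", which would rank the older release higher.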

@gpailler
Collaborator

  1. Your librewolf manifest is considered a duplicate of librewolf from the official extras bucket. Your version was committed one minute before the one from the extras bucket but the result from the official bucket is considered as the "original" one based on the rules described above.

My manifest does not use the portable launcher (ironic, since I created it), while keeping the profile inside scoop\persist\librewolf. This makes it possible to use direct shortcuts to the LibreWolf executable (e.g. in the taskbar) and to use LibreWolf's native shell associations.

Your suggestion to add descriptions in the hash to differentiate duplicates could indeed help in this situation. Out of curiosity, why not propose a PR to extras with your version? What are the advantages and disadvantages of both approaches?


2. For firefox-esr-nl, the version from anderlli0053/DEV-tools appears first when sorted by Best match since, with this sorting, the number of stars in a repository is a primary factor.

I think it may be a mistake to let "popularity" be a factor in this, especially when it may "win" against recency. The whole reason for Scoop's existence is to keep your software up-to-date, so any "popular" bucket whose manifests are months or years old, while many new versions of the software have been released, is de facto "violating" the very goal of the software (if the offering of old versions is unintended and undocumented).

You pointed out a specific case where anderlli0053/DEV-tools has a very old version, but generally, these meta buckets tend to have recent versions, and the algorithm tends to prioritize popular buckets.
3. Same as (2). With Best match sorting, the version from a popular bucket is shown first. The algorithm assumes that a popular bucket likely offers a stable version, whereas a less popular bucket might host a nightly build / a less popular version.

The general rule doesn't apply in my sampling at all, which at least indicates that there is an issue. I propose to let go of the assumption of (3) and to address the issue by listing the manifests with the latest software versions first before applying a popularity contest to the filtering 😛

When I search for an app, I tend to choose a well-known or popular bucket, even if the version is slightly outdated. I use Scoop to maintain and update my applications easily, and stability is an important factor. A popular bucket is more likely to be maintained over time than a random one.
That said, I agree that the best-match scoring should be adjusted to decrease the score for very outdated manifests when newer versions exist.
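One possible shape for such an adjustment, purely as a sketch: the thread does not define "very outdated", so the 180-day lag threshold, the 0.5 penalty factor, and all names below are assumptions rather than the scoring scoop.sh uses.

```python
import re
from datetime import date


def version_key(version: str) -> tuple:
    """Turn '115.17.0' into (115, 17, 0); non-numeric parts are ignored."""
    return tuple(int(n) for n in re.findall(r"\d+", version))


def adjusted_score(score: float, version: str, committed: date,
                   newest_version: str, newest_committed: date,
                   max_lag_days: int = 180, penalty: float = 0.5) -> float:
    """Lower the score of a manifest that is behind the newest known version
    and was committed much earlier than that newest manifest."""
    behind = version_key(version) < version_key(newest_version)
    lag_days = (newest_committed - committed).days
    if behind and lag_days > max_lag_days:
        return score * penalty
    return score
```

A manifest that is only slightly behind keeps its score, so a popular bucket that lags by a few weeks would still rank well, while one that is a year behind would drop.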


I will implement the mentioned changes, and we’ll see if it produces more accurate results. But please keep in mind that with over 170,000 manifests indexed, it’s challenging to always deliver precisely the results each user is looking for—this isn't Google 😛

@ltguillaume
Author

ltguillaume commented Nov 12, 2024

Out of curiosity, why not propose a PR to extras with your version? What are the advantages and disadvantages of both approaches?

I did try to, but there's a serious catch: with every update via Scoop, Firefox/Thunderbird and derivatives create a new profile in %Appdata% instead of staying with the one in scoop\persist. Most likely the app acts as if it is newly installed because of how Scoop recreates the apps\appname\current directory junction with every update. This leads to users (1) thinking they've lost their profile and potentially (2) assuming the active profile is still in scoop\persist, which could lead to data loss.

I added a workaround that starts the Profile Manager after the update is complete, so the user just needs to press Enter and the Scoop profile is set as default again, without another one being created. This isn't pretty, but once you're used to it, it is entirely acceptable.

When I search for an app, I tend to choose a well-known or popular bucket, even if the version is slightly outdated.

I don't think I can follow that reasoning.

  1. So you're choosing a bucket based on the sole reason that lots of people starred it. Stars are a totally random metric: for only some people, starring means "hey this is cool, and I know that because I use it", for many it means "this is probably cool, I dunno", and for others "hey this may be cool, gotta check that out later". That's pretty much all it entails.
  2. If you use scoop.sh and end up choosing that popular bucket, you're choosing that bucket despite the knowledge that it is not up-to-date and won't serve you the most recent version of the app. If the bucket isn't up-to-date now, what reason is there to assume it will be up-to-date in the future? Because it has stars?

A popular bucket is more likely to be maintained over time than a random one.

A popular bucket like main or extras, certainly, because people contribute to those buckets directly. I would argue that a "mass appropriation bucket" is, in fact, less likely to properly maintain all the manifests, because they are NOT put there by their original authors. I would even go further than that and assume that all these outdated manifests in such buckets will go entirely unnoticed. If that weren't the case, we would see only the smaller buckets with outdated versions, while it is in fact oftentimes the mass appropriation buckets that show this.

Apart from that, it's probably easier to spot problems with smaller buckets than with those massive ones.

Come to think of it, it may be a great idea to add a field to the manifest that allows the author to specify the "source bucket" of the manifest.

I will implement the mentioned changes, and we’ll see if it produces more accurate results.

That's great, thank you!

But please keep in mind that with over 170,000 manifests indexed, it’s challenging to always deliver precisely the results each user is looking for—this isn't Google 😛

Well, I don't think my argument here is based on personal taste (like Google's personalized search results). It's really about how to guarantee that working with 3rd party buckets is more reliable for everyone.


3 participants