[Bug]: label field does not point to a root cid and no DAG present in the car #446

Open
parkan opened this issue Aug 3, 2024 · 6 comments
Labels: bug, triage

parkan commented Aug 3, 2024

Description

The label field on deals is not defined in the protocol but has been conventionally used to store the root content CID. Singularity appears to follow this pattern, but in fact the label seems to point to a (random?) raw leaf.

For example:

Client Peer ID	12D3KooWREabaGZ5qkktoAyVRwavuEyq2M24Tv6PeMGdrioKBjS8
Signed Proposal CID	bafyreiezibqnjyqqkpbdnbuc3zysyttkikrfrnuazwokwuqqz5jrwx4r5e
Label	bafkreigq6iofoqaqhzqbatk2xdrisnmdfviq3r4ugwabqlhoaidxbzia4y

yields a 1MiB block: https://explore.ipld.io/#/explore/bafkreigq6iofoqaqhzqbatk2xdrisnmdfviq3r4ugwabqlhoaidxbzia4y

It's not clear whether a root block is created at all and simply not selected for the label field.
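
One quick way to tell a raw leaf from a DAG node without fetching the block is to decode the CID prefix. The sketch below is illustrative only and not part of Singularity: `cidCodec` is a hypothetical helper that assumes a base32 CIDv1 string and knows just three multicodec codes (raw 0x55, dag-pb 0x70, dag-cbor 0x71). It sticks to the Go standard library; real code would more likely use the go-cid package.

```go
package main

import (
	"encoding/base32"
	"encoding/binary"
	"errors"
	"fmt"
)

// Lowercase RFC 4648 base32 without padding, as used by multibase prefix 'b'.
var b32 = base32.NewEncoding("abcdefghijklmnopqrstuvwxyz234567").WithPadding(base32.NoPadding)

// A few multicodec codes; deliberately not an exhaustive table.
var codecNames = map[uint64]string{0x55: "raw", 0x70: "dag-pb", 0x71: "dag-cbor"}

// cidCodec (hypothetical helper) returns the content codec of a base32 CIDv1 string.
func cidCodec(s string) (string, error) {
	if len(s) < 2 || s[0] != 'b' {
		return "", errors.New("not a base32 (multibase 'b') CID")
	}
	raw, err := b32.DecodeString(s[1:])
	if err != nil {
		return "", err
	}
	// A binary CIDv1 starts with two varints: the CID version, then the codec.
	version, n := binary.Uvarint(raw)
	if n <= 0 || version != 1 {
		return "", errors.New("not a CIDv1")
	}
	codec, m := binary.Uvarint(raw[n:])
	if m <= 0 {
		return "", errors.New("truncated CID")
	}
	if name, ok := codecNames[codec]; ok {
		return name, nil
	}
	return fmt.Sprintf("0x%x", codec), nil
}

func main() {
	for _, c := range []string{
		"bafkreigq6iofoqaqhzqbatk2xdrisnmdfviq3r4ugwabqlhoaidxbzia4y", // deal label from this report
		"bafyreiezibqnjyqqkpbdnbuc3zysyttkikrfrnuazwokwuqqz5jrwx4r5e", // signed proposal CID
	} {
		name, err := cidCodec(c)
		fmt.Println(c, name, err)
	}
}
```

Run against the two CIDs quoted above, this should classify the label as a raw block, which is consistent with the 1MiB block shown in the explorer.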



### Steps to Reproduce

Push deal as normal.


### Version

v0.5.16

### Operating System

Linux

### Database Backend

PostgreSQL

### Additional context

_No response_
parkan added the bug and triage labels Aug 3, 2024

parkan commented Aug 3, 2024

this is confusing because we do in fact seem to try to set it from car.RootCID:

label, err := market.NewLabelFromString(cid.Cid(car.RootCID).String())

this is what the root node should look like, from an earlier deal (prepared using the anelace, carlet, stream-commp toolchain): https://explore.ipld.io/#/explore/bafybeifb7kgebegpnd4kxqvfjiuduxs6ey7ctenisk2voytcge56tqjrmm

parkan commented Aug 3, 2024

ok I can verify this locally:

parkan@toolbox:~/singularity$ ./singularity prep create --name local_test2 --local-source docs --local-output cars
ID  Name         DeleteAfterExport  MaxSize      PieceSize    NoInline  NoDag  
2   local_test2  false              33822867456  34359738368  false     false  
    Source Storages:
        ID  Name       Type   Path                               
        3   docs-a9a9  local  /var/home/parkan/singularity/docs  
    Output Storages:
        ID  Name       Type   Path                               
        2   cars-8eec  local  /var/home/parkan/singularity/cars  
parkan@toolbox:~/singularity$ ./singularity prep start-scan local_test2 docs-a9a9 && singularity run dataset-worker
...
parkan@toolbox:~/singularity$ car inspect cars/baga6ea4seaqac3suuohjekbtrz63xm6dv5lopcgcxfix5hc7m5e3l4ydvbpicea.car
Version: 1
Roots: bafkreiailpffdn7ycykojblw2k2ybzq5thcyeyvuxlxfmiuo5nppdz7zo4
Root blocks present in data: Yes
Block count: 947
Min / average / max block length (bytes): 220 / 9308 / 630634
Min / average / max CID length (bytes): 36 / 36 / 36
Block count per codec:
	raw: 947
CID count per multihash:
	sha2-256: 947

only raw blocks, whereas properly DAGed output looks like this:

parkan@toolbox:~/singularity$ car create --file cars/wrapped.car docs
parkan@toolbox:~/singularity$ car create --no-wrap --file cars/nowrapped.car docs
parkan@toolbox:~/singularity$ car inspect cars/wrapped.car 
Version: 2
Characteristics: 00000000000000000000000000000000
Data offset: 51
Data (payload) length: 8916955
Index offset: 8917006
Index type: car-multihash-index-sorted
Roots: bafybeiausqezcykeu2numuame3vu6abthzucc6whf3vh3uqmkzcyxb7agm
Root blocks present in data: Yes
Block count: 1074
Min / average / max block length (bytes): 55 / 8264 / 262144
Min / average / max CID length (bytes): 36 / 36 / 36
Block count per codec:
	raw: 952
	dag-pb: 122
CID count per multihash:
	sha2-256: 1074
parkan@toolbox:~/singularity$ car inspect cars/nowrapped.car 
Version: 2
Characteristics: 00000000000000000000000000000000
Data offset: 51
Data (payload) length: 8916863
Index offset: 8916914
Index type: car-multihash-index-sorted
Roots: bafybeidtzluhd7xl25565ojwz4konjx3jueisy3sv6bqt6c7rx6xe767qu
Root blocks present in data: Yes
Block count: 1073
Min / average / max block length (bytes): 56 / 8271 / 262144
Min / average / max CID length (bytes): 36 / 36 / 36
Block count per codec:
	raw: 952
	dag-pb: 121
CID count per multihash:
	sha2-256: 1073
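
The comparison above can be automated. The following is a minimal sketch, assuming the `car inspect` text layout shown in this thread (a `Block count per codec:` header followed by indented `name: count` lines); `codecCounts` and `hasDAG` are hypothetical helper names, not part of any existing tool.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// codecCounts extracts the "Block count per codec:" section from the text
// printed by `car inspect`, in the layout seen in this thread.
func codecCounts(inspectOutput string) map[string]int {
	counts := map[string]int{}
	inSection := false
	sc := bufio.NewScanner(strings.NewReader(inspectOutput))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "Block count per codec:") {
			inSection = true
			continue
		}
		if !inSection {
			continue
		}
		// Entries are indented; the first unindented line ends the section.
		if !strings.HasPrefix(line, "\t") && !strings.HasPrefix(line, " ") {
			break
		}
		name, value, ok := strings.Cut(strings.TrimSpace(line), ":")
		if !ok {
			continue
		}
		if n, err := strconv.Atoi(strings.TrimSpace(value)); err == nil {
			counts[name] = n
		}
	}
	return counts
}

// hasDAG reports whether any dag-pb blocks were counted.
func hasDAG(counts map[string]int) bool { return counts["dag-pb"] > 0 }

func main() {
	// Excerpts of the two inspect outputs quoted above.
	singularity := "Block count: 947\nBlock count per codec:\n\traw: 947\nCID count per multihash:\n\tsha2-256: 947\n"
	wrapped := "Block count: 1074\nBlock count per codec:\n\traw: 952\n\tdag-pb: 122\nCID count per multihash:\n\tsha2-256: 1074\n"
	fmt.Println("singularity CAR has DAG:", hasDAG(codecCounts(singularity)))
	fmt.Println("car create CAR has DAG:", hasDAG(codecCounts(wrapped)))
}
```

A check like this only says whether dag-pb blocks exist in the file; it does not verify that they form a complete tree over the raw leaves.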

parkan changed the title from "[Bug]: deal label points to random raw leaf instead of root CID" to "[Bug]: daggen never runs in preparations" Aug 3, 2024
parkan changed the title from "[Bug]: daggen never runs in preparations" to "[Bug]: label field does not point to a root cid and no DAG present in the car" Aug 3, 2024

parkan commented Aug 3, 2024

for the record, the desired behavior is output identical to that of car create --no-wrap

daggen appears to run (I initially thought it did not, as it does not produce any info output, but I can confirm that it does), but no directory information seems to make it to the output CAR

parkan commented Aug 3, 2024

ok it appears that explicitly running prep start-daggen emits a second car file that contains only the DAG with no leaves, so this may be a documentation issue (https://data-programs.gitbook.io/singularity/data-preparation/get-started makes no mention of this command)

I have no idea what happens with this DAG CAR when not using --local-output

it's also very unclear how it should be used in deals? furthermore if we are supposed to retain the DAG and only ask for leaves then the label behavior as "root" CID makes no sense

lastly, I don't understand what no-dag actually does given that the daggen needs to be started separately?

Sankara-Jefferson (Collaborator) commented

@parkan In some cases, the label may seem to point to a raw leaf rather than the entire content. This can happen for several reasons:

Single-File CARs or Minimal DAGs:
If the CAR file or the data structure is minimal, containing only a single file or a small set of files, the root CID may directly correspond to a single block of data (a raw leaf). In such cases, the label derived from the root CID would point to this single piece of content.

Inconsistent or Incorrect Data Preparation:
During the preparation process, if the DAG is not correctly formed or if the data is not properly linked, the resulting root CID might only represent a part of the expected data structure. This could lead to a label pointing to a raw leaf instead of a representation of the entire dataset.

Inline Preparation and Direct Leaf Handling:
With inline preparation, the system may deal directly with individual files or pieces of data, generating a root CID that corresponds to a single item. This is especially likely if the data structure is not complex and does not require a full DAG to represent hierarchical content.

Documentation update and 'daggen':
I am working on updating the documentation. It can be very confusing to know which steps or commands to run and which to skip. For instance, the 'daggen' process should not be run when you are using inline preparation, because no intermediate files are expected to be generated.

parkan commented Aug 5, 2024

hmm ok I don't think any of the above map to our experience:

Single-File CARs or Minimal DAGs:
all of these are normal directories with multiple files

Inconsistent or Incorrect Data Preparation:
this happens with every single singularity data prep including the minimal example I provided above

Inline Preparation and Direct Leaf Handling
this one I am least sure about, we do use inline prep but it operates over directories, I can see it update the tree if I trace the program, but the final car output has no dag

Documentation update and 'daggen':
this is indeed confusing, I was able to produce a dag car in addition to the raw blocks car by manually using daggen as per my last update above but I don't see how it would fit into the workflow

the blocks are also always plain raw leaves of 1MiB, not well-formed UnixFS or dag-pb objects

I will also note that running ezprep on a directory will always output 2 (or more) car files, one with only raw leaves and one with the dag but no leaves

to be clear, I am actually very attracted to the possibility of manipulating DAGs separately from the raw data, as it could give us a lot of flexibility, but the way it fits into the current workflow or what the minimal set of steps is to get a retrievable tree is not clear
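
To make "retrievable tree" concrete: a tree is retrievable when every CID reachable from the root is present in some CAR that is served. Below is a rough sketch of that check, under loud assumptions: `missingBlocks` is a hypothetical helper, the `links` map stands in for already-decoded DAG nodes from the DAG-only CAR, and the short strings are placeholders rather than real CIDs. It does not reflect how Singularity itself tracks blocks.

```go
package main

import (
	"fmt"
	"sort"
)

// missingBlocks walks the DAG from root and returns every CID that is linked
// to but not present in any of the supplied CARs.
// links maps a DAG node CID to its child CIDs (hypothetical, already decoded);
// present is the union of CIDs found in the leaf CAR and the DAG CAR.
func missingBlocks(root string, links map[string][]string, present map[string]bool) []string {
	var missing []string
	seen := map[string]bool{}
	stack := []string{root}
	for len(stack) > 0 {
		c := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[c] {
			continue
		}
		seen[c] = true
		if !present[c] {
			// Cannot descend into a block we do not have.
			missing = append(missing, c)
			continue
		}
		stack = append(stack, links[c]...)
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Placeholder identifiers standing in for CIDs.
	links := map[string][]string{"dir": {"fileA", "fileB"}, "fileB": {"chunk1", "chunk2"}}
	dagCAR := []string{"dir", "fileB"}     // DAG-only CAR: directory and file nodes
	leafCAR := []string{"fileA", "chunk1"} // leaf CAR: raw blocks only
	present := map[string]bool{}
	for _, c := range append(dagCAR, leafCAR...) {
		present[c] = true
	}
	fmt.Println("missing:", missingBlocks("dir", links, present))
}
```

Under this framing, a deal whose label is the DAG root is only useful if both the DAG CAR and every leaf CAR it links to end up retrievable somewhere.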
