Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrap all opaque types as interfaces #112

Merged
merged 21 commits into from
Apr 12, 2024

Conversation

klueska
Copy link
Contributor

@klueska klueska commented Apr 10, 2024

No description provided.

@klueska klueska requested a review from elezar April 10, 2024 12:17
@klueska klueska force-pushed the wrap-type-as-interfaces branch from f85ef81 to 7e28d72 Compare April 10, 2024 12:21
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial change look good. The main question I have is whether to top-level library functions that operate ON other types make sense, or whether we should always use the functions off the types?

pkg/nvml/lib.go Show resolved Hide resolved
pkg/nvml/api.go Outdated
}

// Define package level methods as aliases to Interface methods of libnvml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have toolking to generate this automatically?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at the moment. It was fairly easy to do though. I just grepped for func (l* library) copied that in and did some search and replace on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tool has been added in the latest revision.

var DeviceGetComputeRunningProcesses = deviceGetComputeRunningProcesses_v1
var DeviceGetGraphicsRunningProcesses = deviceGetGraphicsRunningProcesses_v1
var DeviceGetMPSComputeRunningProcesses = deviceGetMPSComputeRunningProcesses_v1
var deviceGetComputeRunningProcesses = deviceGetComputeRunningProcesses_v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a silly question: Can you remind me why we don't have an nvml prefix here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nvml "functions" in this block come from #defines in nvml.h. The underlying versioned functions that get assigned to these variables are generated as part of the bindings with the v1, v2, v3, etc. prefixes and then assigned in updateVersionedSymbols below.

The difference with these functions is that we need to wrap the underlying nvml*_v1, nvml*_v2, etc. functions with some extra, version specific logic. Because of this, we create a level of indirection before assinging them back out to the "unversioned" variable.

pkg/nvml/api.go Outdated Show resolved Hide resolved
pkg/nvml/gpm.go Outdated Show resolved Hide resolved
pkg/nvml/api.go Outdated Show resolved Hide resolved
pkg/nvml/api.go Outdated
@@ -279,7 +279,7 @@ type Interface interface {
DeviceSetVgpuSchedulerState(Device Device, PSchedulerState *VgpuSchedulerSetState) Return
DeviceGetVgpuSchedulerCapabilities(Device Device) (VgpuSchedulerCapabilities, Return)
GpuInstanceGetComputeInstancePossiblePlacements(GpuInstance GpuInstance, Info *ComputeInstanceProfileInfo) ([]ComputeInstancePlacement, Return)
GpuInstanceCreateComputeInstanceWithPlacement(GpuInstance GpuInstance, Info *ComputeInstanceProfileInfo, Placement *ComputeInstancePlacement, ComputeInstance *ComputeInstance) Return
GpuInstanceCreateComputeInstanceWithPlacement(GpuInstance GpuInstance, Info *ComputeInstanceProfileInfo, Placement *ComputeInstancePlacement) (ComputeInstance, Return)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bug? Does it make sense to do this in a separate PR so as to not have it go "missing" in the changelog?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is a bug. However, if I do it as a separate PR, then I can't make the updates in the commit that follows to turn ComputeInstance into an Interface. I'd rather have all the interface conversion in a single PR. I could look at maybe updating it in a "pre" PR, but I don't know how hard it will be to rebase if I do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to leave this in this PR as long as we remember to call it out in the release notes once we tag a new version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will make sure and call it out in the release notes

pkg/nvml/api.go Outdated Show resolved Hide resolved
pkg/nvml/api.go Outdated Show resolved Hide resolved
@klueska klueska force-pushed the wrap-type-as-interfaces branch from 7e28d72 to 81e362c Compare April 10, 2024 13:58
@klueska klueska changed the title Draft: Wrap all opaque types as interfaces Wrap all opaque types as interfaces Apr 10, 2024
@klueska
Copy link
Contributor Author

klueska commented Apr 11, 2024

Initial change look good. The main question I have is whether to top-level library functions that operate ON other types make sense, or whether we should always use the functions off the types?

I'd be fine returning concrete types for everything (including the top level library) and moving the API definitions to the testing package. I went back and forth on this and don't really have a preference, so long as we have a way to pull in a mock version of the library (and its other types) from somewhere.

@elezar
Copy link
Member

elezar commented Apr 11, 2024

Initial change look good. The main question I have is whether to top-level library functions that operate ON other types make sense, or whether we should always use the functions off the types?

No, that's not what I meant -- although it is a valid question. I meant that we have, as an example, three functions that do the same thing:

DeviceGetComputeMode(device Device)  (ComputeMode, Return)
library.DeviceGetComputeMode(device Device) (ComputeMode, Return)
Device.GetComputeMode()  (ComputeMode, Return)

My question was whether we would consider moving towards requiring the use of Device.GetComputeMode() and at least deprecate the other usages? This would drastically reduce the size of the top-level Interface.

@klueska
Copy link
Contributor Author

klueska commented Apr 11, 2024

Initial change look good. The main question I have is whether to top-level library functions that operate ON other types make sense, or whether we should always use the functions off the types?

No, that's not what I meant -- although it is a valid question. I meant that we have, as an example, three functions that do the same thing:

DeviceGetComputeMode(device Device)  (ComputeMode, Return)
library.DeviceGetComputeMode(device Device) (ComputeMode, Return)
Device.GetComputeMode()  (ComputeMode, Return)

My question was whether we would consider moving towards requiring the use of Device.GetComputeMode() and at least deprecate the other usages? This would drastically reduce the size of the top-level Interface.

I would be inclined to keep them since that is what matches the actual NVML API. That said, we should add tooling to autogenerate the top level Interface as well as the package level aliases from the functions hanging off the library type.

@klueska klueska force-pushed the wrap-type-as-interfaces branch 3 times, most recently from 8cfa643 to c4bbc89 Compare April 12, 2024 09:40
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments on the generation toolking to start with.

gen/nvml/generateapi.go Outdated Show resolved Hide resolved

func getWriter(outputFile string) (io.Writer, func() error, error) {
if outputFile == "-" {
return os.Stdout, nil, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of nil as the "Closer" can we return:

func() error { return nil }

then we don't need to check for nil at the close site?

Alternatively, we can return a io.WriteCloser and implement a noopCloser that we combine with os.Stdout:

type noopCloser struct {
   io.Writer
}

func (n *noopCloser) Close() error {
   return nil
}

And then our return statement becomes:

return &noopCloser{Writer: os.Stdout}, nil

(and we can just return os.Create(outputFile) directly in the file case).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the first suggestion

Copy link
Contributor Author

@klueska klueska Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you know, I am not a fan of "noop code" and prefer to be explicit in the caller (which is why I hesitated even changing this at all), but this is minor, so its fine.

return nil
})
if err != nil {
return nil, fmt.Errorf("walking the embed.FS path: %w\n", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

embed.FS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover from an old iteration, removing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

gen/nvml/generateapi.go Show resolved Hide resolved
@klueska klueska force-pushed the wrap-type-as-interfaces branch 3 times, most recently from 217cd99 to b02ebca Compare April 12, 2024 10:23
elezar
elezar previously approved these changes Apr 12, 2024
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment about where we put our Mock sytems implementations.

go.mod Show resolved Hide resolved
gen/nvml/generateapi.go Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a thought. Does it make sense to have the "implementations" under something like mock/system or mock/platform?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What value does the extra level of nesting bring? To me it seems better to say mock.DGXA100Server or mock.A100Device over system.A100Device or platform.*.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy with a dgxa100 folder though. Then we could change the names to dgxa100.Server and dgxa100.Device, dgxa100.GpuInstance, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this change -- let me know what you think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What value does the extra level of nesting bring?
It's more about organisation than anything else. The mocks themselves are all generated wherease the systems are implemented.

klueska added 3 commits April 12, 2024 12:40
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
As part of this, redefine all package level methods as methods hanging off of
the 'library' type and create aliases for all package level methods to those
methods hanging off of the default 'libnvml' instance of the 'library' type.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
@klueska klueska force-pushed the wrap-type-as-interfaces branch from 0d26ac8 to 8d282c0 Compare April 12, 2024 12:40
elezar
elezar previously approved these changes Apr 12, 2024
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @klueska.

I have some local changes in the device plugin that could benefit from the "Mock" implementations already!

pkg/nvml/mock/dgxa100/dgxa100.go Outdated Show resolved Hide resolved
elezar
elezar previously approved these changes Apr 12, 2024
@klueska klueska force-pushed the wrap-type-as-interfaces branch 2 times, most recently from 7b57959 to abf8e8e Compare April 12, 2024 17:44
@@ -85,6 +113,10 @@ func (l *library) load() (rerr error) {
return fmt.Errorf("error opening %s: %w", l.path, err)
}

// Update the errorStringFunc to point to nvml.ErrorString
errorStringFunc = nvmlErrorString
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to set this to the default once we close the library?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Fixed.

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Looks good.

Minor comment regarding the error string.

}
return nvmlErrorString(r)
func (l *library) ErrorString(r Return) string {
return r.Error()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We either need to keep the lookup logic as before or reset the default if the library is unloaded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't leave it as before because in the r.Error() and r.String() methods I don't have access to l. I will update it reset the function on unload.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

pkg/nvml/lib.go Outdated
l.dl = dl.New(o.path, o.flags)
}

func (l *library) GetExtendedInterface() ExtendedInterface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sold on GetExtendedInterface and would prefer GetExtensions(), but this is not a blocker. I'm happy to keep it as is and revisit this later once we have some usage examples.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as suggested.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I went for a hybrid -- the interface is called ExtendedInterface to be symmetrical with Interface, but the method call is Extensions() (which is the only actual symbol anyone will see in practice), e.g.:

nvml.Extensions().LookupSymbol("symbol")

or

envml := nvml.Extensions()
envml.LookupSymbol("symbol")

Its possible that the type itself might be embedded in a struct, but I only see that happening in testing (if at all) and I like the symmetry of that with Interface for the core API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds fine. I'm not too concerned about the type name, since as you mention that isn't visible at the call site.

klueska added 18 commits April 12, 2024 20:48
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
A new ComputeInstance should be returned, not passed in by reference

Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
A new GpmSample should be returned, not passed in by reference. As part of
this, add methods to hang of of the GpmSample type (which were missing
previously).

Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Signed-off-by: Kevin Klues <kklues@nvidia.com>
We should eventually expand this package to include a unified mock server like
we have in the go-nvlib testing and mig-parted testing.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
This code was pulled over (mostly) directly from 'mig-parted'

Signed-off-by: Kevin Klues <kklues@nvidia.com>
The methods in this interface represent extensions to the core NVML API that
are only accessible through calling GetExtensions() against the Interface in
use (or at the package level for the default interface).

Signed-off-by: Kevin Klues <kklues@nvidia.com>
@klueska klueska force-pushed the wrap-type-as-interfaces branch from abf8e8e to 1fa43fd Compare April 12, 2024 20:55
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now!

@klueska klueska merged commit a97d07c into NVIDIA:main Apr 12, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants