Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XAFS/pre_edge.py _finde0 recieves numpy arrays and pandas series #544

Open
jonvanbuskirk0 opened this issue Jan 1, 2025 · 6 comments
Open

Comments

@jonvanbuskirk0
Copy link

Hi,

Thanks for curating such a great set of tools. I've been using the XAFS package, and running into an error with the find_e0 function, which redirects to the _finde0 function, and hits the error on line 80 of pre_edge.py, which is "mu = mu_input[ordered]". It seems that I was accidentally passing pandas data series, and the find_e0 function was only converting these to numpy arrays some of the time. The function accepts either a numpy array (seemingly the intended data type) or a pandas series, which causes a particularly undescriptive error. I've found that importing pandas into pre_edge.py and adding an if statement to check for pandas series and convert to numpy array fixes the error. The if statement that works for me is:

import pandas as pd # placed at the front of the file, and quite slow, unfortunately

if isinstance(mu_input, pd.core.series.Series): # check if pandas series
mu_input = np.array(mu_input.values) # extract values and convert to numpy array

placed directly before line 80 in pre_edge.py

Of course it would be best to simply only pass numpy arrays, but this addition could be very useful for minimizing the debugging others might have to do.

Thanks!

@newville
Copy link
Member

newville commented Jan 1, 2025

@jonvanbuskirk0 yeah, Pandas Series work as Numpy arrays until they don't. ;)

Larch does have pandas as a dependency, but it is currently only used in one place, and not for working with Series or Dataframes, but for Styling output... Pandas is not anywhere in the xafs module. So, I would probably rather use

if not isinstance(mu_input, np.ndarray): 
     mu_input = np.array(mu_input) 

which might also work with other "almost-like-a-numpy-array" objects, including xarray and plain lists.

The challenge is how many functions need to do these tests. I'll guess that pre_edge() is the most likely to need this.

@maurov
Copy link
Member

maurov commented Jan 2, 2025

@jonvanbuskirk0, as already said by Matt, we do not really support Pandas objects as input, so the quick solution to your issue is to simply pass a Numpy array to _finde0 (or any Larch function).

@newville I think we should remove the Pandas dependency, but this could be discussed at a next developers meeting.

I would probably rather use

if not isinstance(mu_input, np.ndarray): 
     mu_input = np.array(mu_input) 

Personally, I am reluctant in manually type checking into the code because there is an elegant solution already provided by Python, which is the use of type hinting. We should go for it and convert as much as possible Larch code with this coding style.

For example, in the particular case of _finde0:

import numpy as np
from typing import Union, Any
def _finde0(energy: np.ndarray[np.float64, Any], mu_input: np.ndarray[np.float64, Any], estep: Union[float, None] = None, use_smooth: bool = True):

@newville what do you think?

@newville
Copy link
Member

newville commented Jan 2, 2025

@maurov Working toward dropping Pandas (it is used only in xrd/struct2xas.py and math/deglitch.py) is OK. I don't find it too much trouble to have. I think that "generally improving CSV support" would be good too -- pandas might be acceptable for that, though csvfile is probably sufficient.

I am OK with using type hints here and other places. I am not opposed - they just don't come that naturally to me, and I've been a slow adopter. Maybe using them more is a good New Year's goal.

Many of the "xafs functions" allow a Group or arrays. This could also be handled with type hints or a more complex decorator.

I am also OK with saying that pre_edge() and find_e0() are the most likely functions to be called first, and can be overly cautious about data types, NaNs, and out-of-order data. The remove_nans2() function in math/utils.py was already used in pre_edge(), but not in find_e0() -- using that should fix this problem of passing in a pandas Series: that function does coerce to ndarrays.

So, I would say "all of the above".

@maurov
Copy link
Member

maurov commented Jan 3, 2025

@newville I also agree on all. I will add here only few comments, but for the sake of coordination and clarity, I will open specific issues. I will assign to myself those I can handle by next release:

@maurov Working toward dropping Pandas (it is used only in xrd/struct2xas.py and math/deglitch.py) is OK. I don't find it too much trouble to have. I think that "generally improving CSV support" would be good too -- pandas might be acceptable for that, though csvfile is probably sufficient.

I am OK with using type hints here and other places. I am not opposed - they just don't come that naturally to me, and I've been a slow adopter. Maybe using them more is a good New Year's goal.

Many of the "xafs functions" allow a Group or arrays. This could also be handled with type hints or a more complex decorator.

Yes, type hints will be very beneficial in validation and testing. I would push for it. Furthermore, we could use Group (or inherited versions) as base data containers for all Larch functions instead of accepting Numpy arrays too. IMHO, this would make our developer life much easier. I propose to discuss this at the next developers meeting.

I am also OK with saying that pre_edge() and find_e0() are the most likely functions to be called first, and can be overly cautious about data types, NaNs, and out-of-order data. The remove_nans2() function in math/utils.py was already used in pre_edge(), but not in find_e0() -- using that should fix this problem of passing in a pandas Series: that function does coerce to ndarrays.

Yes, remove_nans is very useful function that we could as a setter decorator for Group instead of repeating it in the specific functions. This is also part of revisiting the Group data class architecture and may be discussed elsewhere.

@newville
Copy link
Member

newville commented Jan 3, 2025

@maurov I like all of that. I would be OK with moving toward a more formal "XASGroup" (or similar name), and also a separate "XRFGroup" and/or "XESGroup" (where, for example, the array called "energy" might mean "emitted" instead of "incident"), and have "the xafs functions" work only on XASGroup.

I would still want to keep plain functions over methods for back-comppatibility: call autobk on an XASGroup rather than create an XASGroup and call its autobk method. Having both would be OK.

And, right, "remove_nans2()" sort of became "clean input arrays" -- maybe that should be clearer and either a decorator or method of XASGroup....

@maurov
Copy link
Member

maurov commented Jan 8, 2025

@maurov I like all of that. I would be OK with moving toward a more formal "XASGroup" (or similar name), and also a separate "XRFGroup" and/or "XESGroup" (where, for example, the array called "energy" might mean "emitted" instead of "incident"), and have "the xafs functions" work only on XASGroup.

I would still want to keep plain functions over methods for back-comppatibility: call autobk on an XASGroup rather than create an XASGroup and call its autobk method. Having both would be OK.

Yes, this is what I have in mind too. The Group object is the data container. It should have fixed attributes names, well defined by a @dataclass decorator. More specific *Group* will inherit from it. This will permit simple validation and test. The Larch functions will take as input a specific Group data object (plus arguments), act on it and return it. This should be backward compatible, simply forces to use specific Group.

I may propose closing this very specific issue and discuss this in a separate thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants