-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYNPY-1551] Tables refactor #1151
base: develop
Are you sure you want to change the base?
Conversation
Hello @BryanFauble! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2025-01-10 22:36:03 UTC |
My only comment for now: should we have the concept of "snapshot" which just mints a version. |
Good call! I implemented that logic in this commit: cf2c69a Changes are shown on the doc site here: https://synapsepythonclient--1151.org.readthedocs.build/en/1151/reference/oop/table_refactor/#synapseclient.models.Table.snapshot I also added some logic to clean the bearer token out of debug log messages (I saw this while looking into this logic). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some quesions/discussion points. Overall I like the feel of the new interface and I think it's much more straightforward and user-friendly.
I'm curious how you are thinking about the interface for storing/deleting/alterring rows in this new framework. Will there be an in-memory representation of the contents of the table stored on the Table
object? What would an operation look like if a user wanted to change the name of a column and alter row content in between store
calls?
of this interface. For example, if you wish to rename a column you may do so by | ||
changing the name attribute of the Column object. The key in the OrderedDict does | ||
not need to be changed. The next time you store the table the column will be updated | ||
in Synapse with the new name and the key in the OrderedDict will be updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we can't update the key when the name of the column is changed in the Column object it is mapped to? Or do you see it as advantageous to still be using the old key name?
It could be a little awkward if a user wants to perform multiple operations on the Table/Column before store
-ing it again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason we can't update the key when the name of the column is changed in the Column object it is mapped to?
This is the issue I see that I couldn't figure out a graceful way to resolve:
my_table.columns["old_name"].name = "new_name"
When we do this there is no way to go into the parent object and update the keys like:
column_instance ["new_name"] = column_instance .pop("old_name")
We cannot do this because it calls the setter function on the name attribute of the selected Column
instance. It has no context about the parent class that contained the selected Column
.
The only way I could think of is providing something like a .update_column_name
function on the Table
instance. But that also seems a little clunky too.
With this approach (I'm open to suggestions on changes) what we're able to do is during the table.store()
operation we'll be able to loop over the Column
objects and make sure the .columns
attribute on the table instance that is being stored is updated. That is, since we are starting at your table
instance and then acting on all columns it has.
Or do you see it as advantageous to still be using the old key name?
I'd rather these always be kept in sync so they don't differ.
It could be a little awkward if a user wants to perform multiple operations on the Table/Column before store-ing it again
I agree. There are possible issues with name collisions too if they are swapping columns around for some reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I couldn't think of a clean way to do it with a single operation. You could add a new dictionary entry at the end with the new key, pop the old entry while also making note of its position within the OrderedDict, and then move the new entry to that position or something along these lines but it certainly wouldn't be graceful.
I was not planning to store a reference to the content on the I had set
We need to duplicate the
I had considered something like being able to use a pre-defined string within the query that gets replaced with the ID of the table, ie:
The problem with this thought is that you couldn't ever take this query and execute it in the Synapse UI, or other client that doesn't contain this exact behavior.
The syntax should look something like this:
Question: Do you think we should expect that the user is updating the name of the column in the pandas dataframe? Should we provide a |
) -> str: | ||
"""Takes in a path to a CSV and stores the rows to Synapse. | ||
# TODO: Finish implementation | ||
async def store_rows_async( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the upsert
functionality be part of this class or would it be elsewhere? (Essentially partial updates)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great question. I ended up putting this into a new method for that specific logic since partial updates are quite a bit different for how it needs to be implemented. It's in a upsert_rows
method detailed here:
I also got an implementation working (At least the partial update portion) to test out how Synapse does it.
Thanks for the explanations on the row operations and querying. That all makes sense.
This gets at the feeling that I was getting which is that having the Column objects completely divorced from the table contents in the OOP model might cause some problems. I can imagine confusion for some users if they were only able to make updates to the Columns in the Table object but then only able to make updates to table contents/rows with the dataframe interface. I think the If multiple ways of updating columns, rows, and/or table contents in general are to be supported, that definitely makes the |
Problem:
Solution:
The main functions that exist on the Table to interact with required functionality include:
Convenience functions (Functions that smooth out setting attributes, but require a call to
.store()
to persist to Synapse)EntityView
andDataset
concepts in Synapse. They're similar in concept to a Table, but serve different purposes for users. When we start to implement those 2 models we will be able to extract out common logic and re-use it across all 3 models.TODO:
store
andstore_rows
that prints out the changes to the schema that will be applied (Determine if we could do a diff of the data, if it's feasible to implement - If not, do not implement)delete_rows
logic, or document the process well. Current behavior requires both a row_id and version_number to delete sourced from query results. As querying for data changed, so will the delete logic._columns_to_delete
concept within the models to prevent unintentional deletion of columns and require that deleting a column go through thedelete_column
function.