
[SYNPY-1551] Tables refactor #1151

Draft · wants to merge 19 commits into develop
Conversation

@BryanFauble (Contributor) commented Jan 7, 2025

Problem:

  1. Working with tables in the current implementation is difficult. Many different functions must be strung together to accomplish basic tasks, features like renaming columns are missing, and all data must pass through a CSV written to disk when interacting with Synapse.

Solution:

  1. Start the process of updating the initial Tables OOP model to bring forward a number of changes:

The main functions that exist on the Table model include:

  • store
  • store_rows
  • get
  • delete
  • delete_rows
  • query

Convenience functions (functions that smooth out setting attributes, but require a call to .store() to persist the change to Synapse; see the sketch after this list):

  • delete_column
  • add_column
  • reorder_column
  • set_columns
  2. Once the logic for these functions is nailed down, the same ideas may be applied to both the EntityView and Dataset concepts in Synapse. They're similar in concept to a Table, but serve different purposes for users. When we start to implement those 2 models we will be able to extract the common logic and re-use it across all 3 models.
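
For a sense of how these pieces might fit together, here is a rough usage sketch. The exact signatures, and helper names like Column and ColumnType, are assumptions based on the method list above rather than the final API:

```python
import pandas as pd

from synapseclient import Synapse
from synapseclient.models import Column, ColumnType, Table

syn = Synapse()
syn.login()

# Create or update the table schema with a single store() call
table = Table(
    name="My Table",
    parent_id="syn999",
    columns=[Column(name="foo", column_type=ColumnType.STRING)],
).store()

# Convenience functions stage schema changes in memory; store() persists them
table.add_column(Column(name="bar", column_type=ColumnType.INTEGER))
table.store()

# Append rows from a pandas DataFrame
table.store_rows(values=pd.DataFrame({"foo": ["a"], "bar": [1]}))
```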

TODO:

  • Get feedback on proposed changes and smooth out rough edges
  • Add additional examples and docstrings to all functions
  • Implement logic to support loading results from Synapse directly into a pandas DataFrame (without requiring an intermediate CSV file written to disk)
  • Implement querying and returning results via a DataFrame
  • Add verbose integration tests for all logic
  • Add verbose unit tests for all logic
  • Swap the submission of data to Synapse to use a single transaction via this API: https://rest-docs.synapse.org/rest/POST/entity/id/table/transaction/async/start.html
  • Implement a dry-run feature for both store and store_rows that prints out the changes to the schema that will be applied (determine whether a diff of the data is feasible to implement; if not, do not implement it)
  • Refactor the delete_rows logic, or document the process well. The current behavior requires both a row_id and a version_number, sourced from query results, to delete a row. Since querying for data has changed, the delete logic will change as well.
  • Implement storing of rows for each of the formats we are going to let users submit data in: CSV, list of lists, dictionary, pandas DataFrame
  • Implement a _columns_to_delete concept within the models to prevent unintentional deletion of columns and require that deleting a column go through the delete_column function (sketched below)
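
As a sketch of that last item (the method body here is an assumption, not the PR's implementation):

```python
# Hypothetical sketch of the _columns_to_delete guard:
def delete_column(self, name: str) -> None:
    """Stage a column for deletion; the deletion is applied on the next .store()."""
    column = self.columns.pop(name)
    self._columns_to_delete[column.id] = column
    # During store(), only columns staged in _columns_to_delete are deleted in
    # Synapse; a column that is merely missing from .columns is left untouched.
```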

@pep8speaks commented Jan 7, 2025

Hello @BryanFauble! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 43:89: E501 line too long (212 > 88 characters)

Line 96:89: E501 line too long (218 > 88 characters)
Line 97:89: E501 line too long (102 > 88 characters)
Line 110:89: E501 line too long (229 > 88 characters)
Line 170:89: E501 line too long (218 > 88 characters)
Line 171:89: E501 line too long (102 > 88 characters)
Line 189:89: E501 line too long (120 > 88 characters)
Line 198:89: E501 line too long (125 > 88 characters)
Line 240:89: E501 line too long (115 > 88 characters)
Line 278:89: E501 line too long (92 > 88 characters)
Line 297:89: E501 line too long (116 > 88 characters)
Line 410:89: E501 line too long (142 > 88 characters)
Line 435:89: E501 line too long (134 > 88 characters)

Line 272:89: E501 line too long (172 > 88 characters)
Line 273:89: E501 line too long (176 > 88 characters)
Line 361:89: E501 line too long (94 > 88 characters)
Line 362:89: E501 line too long (95 > 88 characters)
Line 363:89: E501 line too long (96 > 88 characters)
Line 366:89: E501 line too long (101 > 88 characters)
Line 367:89: E501 line too long (103 > 88 characters)
Line 369:89: E501 line too long (94 > 88 characters)
Line 370:89: E501 line too long (91 > 88 characters)
Line 381:89: E501 line too long (109 > 88 characters)
Line 384:89: E501 line too long (100 > 88 characters)
Line 389:89: E501 line too long (94 > 88 characters)
Line 390:89: E501 line too long (93 > 88 characters)
Line 391:89: E501 line too long (92 > 88 characters)
Line 395:89: E501 line too long (94 > 88 characters)
Line 396:89: E501 line too long (91 > 88 characters)
Line 397:89: E501 line too long (94 > 88 characters)
Line 398:89: E501 line too long (91 > 88 characters)
Line 399:89: E501 line too long (93 > 88 characters)
Line 400:89: E501 line too long (96 > 88 characters)
Line 476:89: E501 line too long (107 > 88 characters)
Line 527:89: E501 line too long (117 > 88 characters)
Line 532:89: E501 line too long (110 > 88 characters)
Line 637:89: E501 line too long (93 > 88 characters)
Line 638:89: E501 line too long (95 > 88 characters)
Line 652:89: E501 line too long (101 > 88 characters)
Line 655:89: E501 line too long (92 > 88 characters)
Line 675:89: E501 line too long (173 > 88 characters)
Line 790:89: E501 line too long (91 > 88 characters)
Line 885:89: E501 line too long (218 > 88 characters)
Line 886:89: E501 line too long (102 > 88 characters)
Line 899:89: E501 line too long (229 > 88 characters)
Line 944:89: E501 line too long (121 > 88 characters)
Line 992:89: E501 line too long (104 > 88 characters)
Line 1001:89: E501 line too long (111 > 88 characters)
Line 1068:89: E501 line too long (218 > 88 characters)
Line 1069:89: E501 line too long (102 > 88 characters)
Line 1087:89: E501 line too long (120 > 88 characters)
Line 1097:89: E501 line too long (125 > 88 characters)
Line 1139:89: E501 line too long (89 > 88 characters)
Line 1182:89: E501 line too long (131 > 88 characters)
Line 1246:89: E501 line too long (110 > 88 characters)
Line 1253:89: E501 line too long (116 > 88 characters)
Line 1257:89: E501 line too long (174 > 88 characters)
Line 1267:89: E501 line too long (145 > 88 characters)
Line 1351:89: E501 line too long (108 > 88 characters)
Line 1373:89: E501 line too long (132 > 88 characters)
Line 1498:89: E501 line too long (203 > 88 characters)
Line 1545:89: E501 line too long (89 > 88 characters)
Line 1610:89: E501 line too long (89 > 88 characters)
Line 1763:89: E501 line too long (98 > 88 characters)
Line 1764:89: E501 line too long (101 > 88 characters)
Line 1842:89: E501 line too long (117 > 88 characters)
Line 1885:89: E501 line too long (120 > 88 characters)
Line 1886:89: E501 line too long (116 > 88 characters)
Line 1887:89: E501 line too long (146 > 88 characters)
Line 1888:89: E501 line too long (241 > 88 characters)
Line 1892:89: E501 line too long (92 > 88 characters)
Line 1954:89: E501 line too long (142 > 88 characters)
Line 1984:89: E501 line too long (134 > 88 characters)

Comment last updated at 2025-01-10 22:36:03 UTC

@thomasyu888 (Member) commented:
My only comment for now: should we have the concept of a "snapshot", which just mints a version?

@BryanFauble (Contributor, Author) replied:
> My only comment for now: should we have the concept of a "snapshot", which just mints a version?

Good call! I implemented that logic in this commit: cf2c69a

Changes are shown on the doc site here: https://synapsepythonclient--1151.org.readthedocs.build/en/1151/reference/oop/table_refactor/#synapseclient.models.Table.snapshot
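
For reference, a usage sketch based on that doc page (the comment/label arguments shown here are assumed):

```python
from synapseclient import Synapse
from synapseclient.models import Table

syn = Synapse()
syn.login()

# Mint a new version of the table
Table(id="syn1234").snapshot(comment="nightly freeze", label="v1.0")
```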

I also added some logic to clean the bearer token out of debug log messages (I saw this while looking into this logic).

@BWMac self-requested a review January 9, 2025 15:43
@BWMac (Contributor) left a comment
Left some questions/discussion points. Overall I like the feel of the new interface and I think it's much more straightforward and user-friendly.

I'm curious how you are thinking about the interface for storing/deleting/altering rows in this new framework. Will there be an in-memory representation of the contents of the table stored on the Table object? What would an operation look like if a user wanted to change the name of a column and alter row content in between store calls?

Comment on lines +398 to +401 of synapseclient/models/table.py:

> of this interface. For example, if you wish to rename a column you may do so by changing the name attribute of the Column object. The key in the OrderedDict does not need to be changed. The next time you store the table the column will be updated in Synapse with the new name and the key in the OrderedDict will be updated.
@BWMac (Contributor) commented:
Is there a reason we can't update the key when the name of the column is changed in the Column object it is mapped to? Or do you see it as advantageous to still be using the old key name?

It could be a little awkward if a user wants to perform multiple operations on the Table/Column before store-ing it again

@BryanFauble (Contributor, Author) replied:

> Is there a reason we can't update the key when the name of the column is changed in the Column object it is mapped to?

This is the issue I see that I couldn't figure out a graceful way to resolve:

```python
my_table.columns["old_name"].name = "new_name"
```

When we do this there is no way to go into the parent object and update the keys like:

```python
column_instance["new_name"] = column_instance.pop("old_name")
```

We cannot do this automatically because the rename goes through the setter function on the name attribute of the selected Column instance, which has no context about the parent class that contains the selected Column.

The only way I could think of is providing something like a .update_column_name function on the Table instance. But that seems a little clunky too.


With this approach (I'm open to suggestions on changes), what we're able to do during the table.store() operation is loop over the Column objects and make sure the .columns attribute on the table instance being stored is updated. That works because we start at your table instance and then act on all of the columns it has.
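
Roughly, that store()-time pass could be as simple as rebuilding the dict from the Column objects (a sketch, not the actual implementation):

```python
from collections import OrderedDict

def _sync_column_keys(columns: "OrderedDict[str, Column]") -> "OrderedDict[str, Column]":
    """Rebuild the dict so each key matches its Column's current .name."""
    return OrderedDict((column.name, column) for column in columns.values())
```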

> Or do you see it as advantageous to still be using the old key name?

I'd rather these always be kept in sync so they don't differ.

> It could be a little awkward if a user wants to perform multiple operations on the Table/Column before store-ing it again

I agree. There are possible issues with name collisions too if they are swapping columns around for some reason.

@BWMac (Contributor) replied:

Yeah, I couldn't think of a clean way to do it with a single operation. You could add a new dictionary entry at the end with the new key, pop the old entry while making note of its position within the OrderedDict, and then move the new entry to that position, or something along those lines, but it certainly wouldn't be graceful.
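
Rebuilding the whole OrderedDict would sidestep the position bookkeeping, at the cost of an O(n) copy (a sketch of that alternative):

```python
from collections import OrderedDict

def rename_key_in_place(d: OrderedDict, old_key: str, new_key: str) -> OrderedDict:
    """Rename a key while preserving its position by rebuilding the dict."""
    return OrderedDict(
        (new_key if key == old_key else key, value) for key, value in d.items()
    )
```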

@BryanFauble (Contributor, Author) commented:
@BWMac

> I'm curious how you are thinking about the interface for storing/deleting/altering rows in this new framework.

  • store_rows (also used for updates): Pass in the values to store and insert them into Synapse. In the case of altering rows we would query for the rows to update first, and then add the row_id and row_version columns to the data. We would be doing the same kind of logic as this:

```python
def from_data_frame(
    cls,
    schema,
    df,
    filepath=None,
    etag=None,
    quoteCharacter='"',
    escapeCharacter="\\",
    lineEnd=str(os.linesep),
    separator=",",
    header=True,
    includeRowIdAndRowVersion=None,
    headers=None,
    **kwargs,
):
    # infer columns from data frame if not specified
    if not headers:
        cols = as_table_columns(df)
        headers = [SelectColumn.from_column(col) for col in cols]

    # if the schema has no columns, use the inferred columns
    if isinstance(schema, Schema) and not schema.has_columns():
        schema.addColumns(cols)

    # convert row names in the format [row_id]_[version] or
    # [row_id]_[version]_[etag] back to columns
    # etag is essentially a UUID
    etag_pattern = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}"
    row_id_version_pattern = re.compile(r"(\d+)_(\d+)(_(" + etag_pattern + r"))?")

    row_id = []
    row_version = []
    row_etag = []
    for row_name in df.index.values:
        m = row_id_version_pattern.match(str(row_name))
        row_id.append(m.group(1) if m else None)
        row_version.append(m.group(2) if m else None)
        row_etag.append(m.group(4) if m else None)

    # include row ID and version, if we're asked to OR if it's encoded in row names
    if includeRowIdAndRowVersion or (
        includeRowIdAndRowVersion is None and any(row_id)
    ):
        df2 = df.copy()
        cls._insert_dataframe_column_if_not_exist(df2, 0, "ROW_ID", row_id)
        cls._insert_dataframe_column_if_not_exist(df2, 1, "ROW_VERSION", row_version)
        if any(row_etag):
            cls._insert_dataframe_column_if_not_exist(df2, 2, "ROW_ETAG", row_etag)
        df = df2
        includeRowIdAndRowVersion = True

    f = None
    try:
        if not filepath:
            temp_dir = tempfile.mkdtemp()
            filepath = os.path.join(temp_dir, "table.csv")
        f = io.open(filepath, mode="w", encoding="utf-8", newline="")

        test_import_pandas()
        import pandas as pd

        if isinstance(schema, Schema):
            for col in schema.columns_to_store:
                if col["columnType"] == "DATE":

                    def _trailing_date_time_millisecond(t):
                        if isinstance(t, str):
                            return t[:-3]

                    df[col.name] = pd.to_datetime(
                        df[col.name], errors="coerce"
                    ).dt.strftime("%s%f")
                    df[col.name] = df[col.name].apply(
                        lambda x: _trailing_date_time_millisecond(x)
                    )

        df.to_csv(
            f,
            index=False,
            sep=separator,
            header=header,
            quotechar=quoteCharacter,
            escapechar=escapeCharacter,
            lineterminator=lineEnd,
            na_rep=kwargs.get("na_rep", ""),
            float_format="%.12g",
        )
        # NOTE: reason for float_format='%.12g':
        # pandas automatically converts int columns into float64 columns when some cells in the column have no
        # value. If we write the whole number back as a decimal (e.g. '3.0'), Synapse complains that we are writing
        # a float into a INTEGER (synapse table type) column. Using the 'g' will strip off '.0' from whole number
        # values. pandas by default (with no float_format parameter) seems to keep 12 values after decimal, so we
        # use '%.12g'.
        # see SYNPY-267.
    finally:
        if f:
            f.close()

    return cls(
        schema=schema,
        filepath=filepath,
        etag=etag,
        quoteCharacter=quoteCharacter,
        escapeCharacter=escapeCharacter,
        lineEnd=lineEnd,
        separator=separator,
        header=header,
        includeRowIdAndRowVersion=includeRowIdAndRowVersion,
        headers=headers,
    )
```

  • delete_rows: Lets you run a query and delete all matching rows

> Will there be an in-memory representation of the contents of the table stored on the Table object?

I was not planning to store a reference to the content on the Table object. The content of the table depends on what you query for. My intention is that any time you want data you'd run a fresh query and pass the results around. My concern with storing the results on the Table object is that the garbage collector wouldn't clean them up until the reference to the Table was removed. I can also see cases where you might want to query a table multiple times for different things: get results and act on them, then query again and act on those results. I'd rather the data be intentionally passed to whatever needs it.

I had set query to be a staticmethod on the Table class. The thing that would prevent me from wanting to make it a non-staticmethod is that you would still need to provide the table ID in the query you execute (i.e. the select * from syn123 examples). Meaning it'd look like this in code if that were the case:

```python
my_table = Table(id="syn1234")
query_results = my_table.query(query="select * from syn1234")
```

We need to write syn1234 twice in this case. This issue is present in my proposed delete_rows method too:

```python
from synapseclient import Synapse
from synapseclient.models import Table

syn = Synapse()
syn.login()

Table(id="syn1234").delete_rows(query="SELECT ROW_ID, ROW_VERSION FROM syn1234 WHERE foo = 'asdf'")
```

I had considered something like being able to use a pre-defined string within the query that gets replaced with the ID of the table, e.g.:

```python
from synapseclient import Synapse
from synapseclient.models import Table

syn = Synapse()
syn.login()

# Current expectation
query_results = Table.query(query="SELECT ROW_ID, ROW_VERSION FROM syn1234 WHERE foo = 'asdf'")

# 'pre-defined' string idea
query_results = Table(id="syn1234").query(query="SELECT ROW_ID, ROW_VERSION FROM :id WHERE foo = 'asdf'")
```

The problem with this idea is that you couldn't ever take the query and execute it in the Synapse UI, or in any other client that doesn't implement this exact behavior.
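
For clarity, the substitution itself would be trivial; the objection is about query portability, not implementation. A sketch of the idea:

```python
def _resolve_query(query: str, table_id: str) -> str:
    """Replace the ':id' placeholder with the table's Synapse ID."""
    return query.replace(":id", table_id)

assert _resolve_query(
    "SELECT ROW_ID, ROW_VERSION FROM :id WHERE foo = 'asdf'", "syn1234"
) == "SELECT ROW_ID, ROW_VERSION FROM syn1234 WHERE foo = 'asdf'"
```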

> What would an operation look like if a user wanted to change the name of a column and alter row content in between store calls?

The syntax should look something like this:

```python
# Query for the rows to update
content_to_update = Table.query(query="select * from syn1234 where foo = 'bar'")

# Update those rows
content_to_update["some_content"] = ["some-new-values"]

# include_columns to rename an existing column
my_table = Table(id="syn1234").get(include_columns=True)
my_table.columns["some_content"].name = "some_new_column_name"

# The `store_rows` function could look for the new column in the pandas dataframe;
# if present, use the data from that column; if not, look for the previous column name
my_table.store_rows(values=content_to_update)
```

Question: Do you think we should expect the user to update the name of the column in the pandas DataFrame? Or should we provide a rename_column function that takes old_column_name: str, new_column_name: str, rows: pd.DataFrame = None and would 1) update the .columns attribute dictionary key, 2) update the Column class instance's .name attribute, and 3) update the name of the column in the passed-in DataFrame?
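
To make the question concrete, a hypothetical sketch of that helper:

```python
import pandas as pd

def rename_column(
    self,
    old_column_name: str,
    new_column_name: str,
    rows: pd.DataFrame = None,
) -> None:
    """Hypothetical helper keeping the dict key, Column.name, and DataFrame in sync."""
    column = self.columns.pop(old_column_name)  # 1) update the .columns dictionary key
    column.name = new_column_name               # 2) update the Column instance's .name
    self.columns[new_column_name] = column      # note: this moves the column to the end of the ordering
    if rows is not None:
        # 3) update the passed-in DataFrame
        rows.rename(columns={old_column_name: new_column_name}, inplace=True)
```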

@BWMac self-requested a review January 9, 2025 21:44
An inline comment was left on this diff excerpt:

```python
) -> str:
    """Takes in a path to a CSV and stores the rows to Synapse.
# TODO: Finish implementation
async def store_rows_async(
```
A project member commented:

Would the upsert functionality be part of this class or would it be elsewhere? (Essentially partial updates)

@BryanFauble (Contributor, Author) replied:

Great question. I ended up putting this into a new method for that specific logic, since partial updates are quite a bit different in how they need to be implemented. It's in an upsert_rows method detailed here:

https://synapsepythonclient--1151.org.readthedocs.build/en/1151/reference/oop/table_refactor/#synapseclient.models.Table.upsert_rows

I also got an implementation working (at least the partial update portion) to test out how Synapse does it.
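
A usage sketch based on those docs (the primary_keys argument shown here is an assumption):

```python
import pandas as pd

from synapseclient import Synapse
from synapseclient.models import Table

syn = Synapse()
syn.login()

# Update rows whose key columns match; insert the rest
Table(id="syn1234").upsert_rows(
    values=pd.DataFrame({"foo": ["asdf"], "some_content": ["updated"]}),
    primary_keys=["foo"],
)
```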

@BWMac (Contributor) commented Jan 10, 2025

Thanks for the explanations on the row operations and querying. That all makes sense.

> Do you think we should expect the user to update the name of the column in the pandas DataFrame?

This gets at the feeling I was getting, which is that having the Column objects completely divorced from the table contents in the OOP model might cause some problems. I can imagine confusion for some users if they were only able to update Columns through the Table object, but only able to update table contents/rows through the DataFrame interface. I think the rename_column functionality you described could be valuable. I also think that, in general, our users are likely familiar with pandas DataFrames, so interfaces using that data structure will probably be easy to understand.

If multiple ways of updating columns, rows, and/or table contents in general are to be supported, that definitely makes the dry_run functionality you have planned for the store function even more important so users can see how their changes will resolve in Synapse before making their operations final.
