
[SYNPY-1551] Tables refactor #1151

Draft · wants to merge 19 commits into develop
Conversation

@BryanFauble (Contributor) commented Jan 7, 2025

Problem:

  1. Working with tables in the current implementation is difficult. Many different functions must be strung together to accomplish basic tasks, features like renaming columns are missing, and all data must pass through a CSV written to disk when interacting with Synapse.

Solution:

  1. Start the process of updating the initial Tables OOP model to bring forward a number of changes:

The main functions that exist on the Table model include:

  • store
  • store_rows
  • get
  • delete
  • delete_rows
  • query

Convenience functions (functions that smooth out setting attributes, but require a call to .store() to persist the change to Synapse; see the sketch after this list):

  • delete_column
  • add_column
  • reorder_column
  • set_columns
  2. Once the logic for these functions is nailed down, the same ideas may be applied to both the EntityView and Dataset concepts in Synapse. They're similar in concept to a Table, but serve different purposes for users. When we start to implement those 2 models we will be able to extract the common logic and re-use it across all 3 models.
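
For a sense of how these pieces might fit together, here is a rough usage sketch. The exact signatures, and helper names like Column and ColumnType, are assumptions based on the method list above rather than the final API:

```python
import pandas as pd

from synapseclient import Synapse
from synapseclient.models import Column, ColumnType, Table

syn = Synapse()
syn.login()

# Create or update the table schema with a single store() call
table = Table(
    name="My Table",
    parent_id="syn999",
    columns=[Column(name="foo", column_type=ColumnType.STRING)],
).store()

# Convenience functions stage schema changes in memory; store() persists them
table.add_column(Column(name="bar", column_type=ColumnType.INTEGER))
table.store()

# Append rows from a pandas DataFrame
table.store_rows(values=pd.DataFrame({"foo": ["a"], "bar": [1]}))
```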

TODO:

  • Get feedback on proposed changes and smooth out rough edges
  • Add additional examples and docstrings to all functions
  • Implement logic to support loading results from Synapse directly into a pandas DataFrame (without requiring an intermediate CSV file written to disk)
  • Implement querying and returning results via a DataFrame
  • Add verbose integration tests for all logic
  • Add verbose unit tests for all logic
  • Swap the submission of data to Synapse to use a single transaction via this API: https://rest-docs.synapse.org/rest/POST/entity/id/table/transaction/async/start.html
  • Implement a dry-run feature for both store and store_rows that prints out the changes to the schema that will be applied (determine whether a diff of the data is feasible to implement; if not, do not implement it)
  • Refactor the delete_rows logic, or document the process well. The current behavior requires both a row_id and a version_number, sourced from query results, to delete a row. Since querying for data has changed, the delete logic will change as well.
  • Implement storing of rows for each of the formats we are going to let users submit data in: CSV, list of lists, dictionary, pandas DataFrame
  • Implement a _columns_to_delete concept within the models to prevent unintentional deletion of columns and require that deleting a column go through the delete_column function (sketched below)
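
As a sketch of that last item (the method body here is an assumption, not the PR's implementation):

```python
# Hypothetical sketch of the _columns_to_delete guard:
def delete_column(self, name: str) -> None:
    """Stage a column for deletion; the deletion is applied on the next .store()."""
    column = self.columns.pop(name)
    self._columns_to_delete[column.id] = column
    # During store(), only columns staged in _columns_to_delete are deleted in
    # Synapse; a column that is merely missing from .columns is left untouched.
```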

@pep8speaks commented Jan 7, 2025

Hello @BryanFauble! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 43:89: E501 line too long (212 > 88 characters)

Line 96:89: E501 line too long (218 > 88 characters)
Line 97:89: E501 line too long (102 > 88 characters)
Line 110:89: E501 line too long (229 > 88 characters)
Line 170:89: E501 line too long (218 > 88 characters)
Line 171:89: E501 line too long (102 > 88 characters)
Line 189:89: E501 line too long (120 > 88 characters)
Line 198:89: E501 line too long (125 > 88 characters)
Line 240:89: E501 line too long (115 > 88 characters)
Line 278:89: E501 line too long (92 > 88 characters)
Line 297:89: E501 line too long (116 > 88 characters)
Line 410:89: E501 line too long (142 > 88 characters)
Line 435:89: E501 line too long (134 > 88 characters)

Line 272:89: E501 line too long (172 > 88 characters)
Line 273:89: E501 line too long (176 > 88 characters)
Line 361:89: E501 line too long (94 > 88 characters)
Line 362:89: E501 line too long (95 > 88 characters)
Line 363:89: E501 line too long (96 > 88 characters)
Line 366:89: E501 line too long (101 > 88 characters)
Line 367:89: E501 line too long (103 > 88 characters)
Line 369:89: E501 line too long (94 > 88 characters)
Line 370:89: E501 line too long (91 > 88 characters)
Line 381:89: E501 line too long (109 > 88 characters)
Line 384:89: E501 line too long (100 > 88 characters)
Line 389:89: E501 line too long (94 > 88 characters)
Line 390:89: E501 line too long (93 > 88 characters)
Line 391:89: E501 line too long (92 > 88 characters)
Line 395:89: E501 line too long (94 > 88 characters)
Line 396:89: E501 line too long (91 > 88 characters)
Line 397:89: E501 line too long (94 > 88 characters)
Line 398:89: E501 line too long (91 > 88 characters)
Line 399:89: E501 line too long (93 > 88 characters)
Line 400:89: E501 line too long (96 > 88 characters)
Line 476:89: E501 line too long (107 > 88 characters)
Line 527:89: E501 line too long (117 > 88 characters)
Line 532:89: E501 line too long (110 > 88 characters)
Line 637:89: E501 line too long (93 > 88 characters)
Line 638:89: E501 line too long (95 > 88 characters)
Line 652:89: E501 line too long (101 > 88 characters)
Line 655:89: E501 line too long (92 > 88 characters)
Line 675:89: E501 line too long (173 > 88 characters)
Line 790:89: E501 line too long (91 > 88 characters)
Line 885:89: E501 line too long (218 > 88 characters)
Line 886:89: E501 line too long (102 > 88 characters)
Line 899:89: E501 line too long (229 > 88 characters)
Line 944:89: E501 line too long (121 > 88 characters)
Line 992:89: E501 line too long (104 > 88 characters)
Line 1001:89: E501 line too long (111 > 88 characters)
Line 1068:89: E501 line too long (218 > 88 characters)
Line 1069:89: E501 line too long (102 > 88 characters)
Line 1087:89: E501 line too long (120 > 88 characters)
Line 1097:89: E501 line too long (125 > 88 characters)
Line 1139:89: E501 line too long (89 > 88 characters)
Line 1182:89: E501 line too long (131 > 88 characters)
Line 1246:89: E501 line too long (110 > 88 characters)
Line 1253:89: E501 line too long (116 > 88 characters)
Line 1257:89: E501 line too long (174 > 88 characters)
Line 1267:89: E501 line too long (145 > 88 characters)
Line 1351:89: E501 line too long (108 > 88 characters)
Line 1373:89: E501 line too long (132 > 88 characters)
Line 1498:89: E501 line too long (203 > 88 characters)
Line 1545:89: E501 line too long (89 > 88 characters)
Line 1610:89: E501 line too long (89 > 88 characters)
Line 1763:89: E501 line too long (98 > 88 characters)
Line 1764:89: E501 line too long (101 > 88 characters)
Line 1842:89: E501 line too long (117 > 88 characters)
Line 1885:89: E501 line too long (120 > 88 characters)
Line 1886:89: E501 line too long (116 > 88 characters)
Line 1887:89: E501 line too long (146 > 88 characters)
Line 1888:89: E501 line too long (241 > 88 characters)
Line 1892:89: E501 line too long (92 > 88 characters)
Line 1954:89: E501 line too long (142 > 88 characters)
Line 1984:89: E501 line too long (134 > 88 characters)

Comment last updated at 2025-01-10 22:36:03 UTC

@thomasyu888 (Member) commented:
My only comment for now: should we have the concept of a "snapshot", which just mints a version?

@BryanFauble (Contributor, Author) replied:
> My only comment for now: should we have the concept of a "snapshot", which just mints a version?

Good call! I implemented that logic in this commit: cf2c69a

Changes are shown on the doc site here: https://synapsepythonclient--1151.org.readthedocs.build/en/1151/reference/oop/table_refactor/#synapseclient.models.Table.snapshot
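
For reference, a usage sketch based on that doc page (the comment/label arguments shown here are assumed):

```python
from synapseclient import Synapse
from synapseclient.models import Table

syn = Synapse()
syn.login()

# Mint a new version of the table
Table(id="syn1234").snapshot(comment="nightly freeze", label="v1.0")
```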

I also added some logic to clean the bearer token out of debug log messages (I saw this while looking into this logic).

@BWMac self-requested a review January 9, 2025 15:43
@BWMac (Contributor) left a comment
Left some questions/discussion points. Overall I like the feel of the new interface and I think it's much more straightforward and user-friendly.

I'm curious how you are thinking about the interface for storing/deleting/altering rows in this new framework. Will there be an in-memory representation of the contents of the table stored on the Table object? What would an operation look like if a user wanted to change the name of a column and alter row content in between store calls?

Comment on lines +398 to +401 of synapseclient/models/table.py:

> of this interface. For example, if you wish to rename a column you may do so by changing the name attribute of the Column object. The key in the OrderedDict does not need to be changed. The next time you store the table the column will be updated in Synapse with the new name and the key in the OrderedDict will be updated.
@BWMac (Contributor) commented:
Is there a reason we can't update the key when the name of the column is changed in the Column object it is mapped to? Or do you see it as advantageous to still be using the old key name?

It could be a little awkward if a user wants to perform multiple operations on the Table/Column before store-ing it again

@BryanFauble (Contributor, Author) replied:

> Is there a reason we can't update the key when the name of the column is changed in the Column object it is mapped to?

This is the issue I see that I couldn't figure out a graceful way to resolve:

```python
my_table.columns["old_name"].name = "new_name"
```

When we do this there is no way to go into the parent object and update the keys like:

```python
column_instance["new_name"] = column_instance.pop("old_name")
```

We cannot do this automatically because the rename goes through the setter function on the name attribute of the selected Column instance, which has no context about the parent class that contains the selected Column.

The only way I could think of is providing something like a .update_column_name function on the Table instance. But that seems a little clunky too.


With this approach (I'm open to suggestions on changes), what we're able to do during the table.store() operation is loop over the Column objects and make sure the .columns attribute on the table instance being stored is updated. That works because we start at your table instance and then act on all of the columns it has.
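
Roughly, that store()-time pass could be as simple as rebuilding the dict from the Column objects (a sketch, not the actual implementation):

```python
from collections import OrderedDict

def _sync_column_keys(columns: "OrderedDict[str, Column]") -> "OrderedDict[str, Column]":
    """Rebuild the dict so each key matches its Column's current .name."""
    return OrderedDict((column.name, column) for column in columns.values())
```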

> Or do you see it as advantageous to still be using the old key name?

I'd rather these always be kept in sync so they don't differ.

> It could be a little awkward if a user wants to perform multiple operations on the Table/Column before store-ing it again

I agree. There are possible issues with name collisions too if they are swapping columns around for some reason.

@BWMac (Contributor) replied:

Yeah, I couldn't think of a clean way to do it with a single operation. You could add a new dictionary entry at the end with the new key, pop the old entry while making note of its position within the OrderedDict, and then move the new entry to that position, or something along those lines, but it certainly wouldn't be graceful.
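
Rebuilding the whole OrderedDict would sidestep the position bookkeeping, at the cost of an O(n) copy (a sketch of that alternative):

```python
from collections import OrderedDict

def rename_key_in_place(d: OrderedDict, old_key: str, new_key: str) -> OrderedDict:
    """Rename a key while preserving its position by rebuilding the dict."""
    return OrderedDict(
        (new_key if key == old_key else key, value) for key, value in d.items()
    )
```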

@BryanFauble (Contributor, Author) commented:
@BWMac

> I'm curious how you are thinking about the interface for storing/deleting/altering rows in this new framework.

  • store_rows (also used for updates): Pass in the values to store and insert them into Synapse. In the case of altering rows we would query for the rows to update first, and then add the row_id and row_version columns to the data. We would be doing the same kind of logic as this:

```python
def from_data_frame(
    cls,
    schema,
    df,
    filepath=None,
    etag=None,
    quoteCharacter='"',
    escapeCharacter="\\",
    lineEnd=str(os.linesep),
    separator=",",
    header=True,
    includeRowIdAndRowVersion=None,
    headers=None,
    **kwargs,
):
    # infer columns from data frame if not specified
    if not headers:
        cols = as_table_columns(df)
        headers = [SelectColumn.from_column(col) for col in cols]

    # if the schema has no columns, use the inferred columns
    if isinstance(schema, Schema) and not schema.has_columns():
        schema.addColumns(cols)

    # convert row names in the format [row_id]_[version] or
    # [row_id]_[version]_[etag] back to columns
    # etag is essentially a UUID
    etag_pattern = r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}"
    row_id_version_pattern = re.compile(r"(\d+)_(\d+)(_(" + etag_pattern + r"))?")

    row_id = []
    row_version = []
    row_etag = []
    for row_name in df.index.values:
        m = row_id_version_pattern.match(str(row_name))
        row_id.append(m.group(1) if m else None)
        row_version.append(m.group(2) if m else None)
        row_etag.append(m.group(4) if m else None)

    # include row ID and version, if we're asked to OR if it's encoded in row names
    if includeRowIdAndRowVersion or (
        includeRowIdAndRowVersion is None and any(row_id)
    ):
        df2 = df.copy()
        cls._insert_dataframe_column_if_not_exist(df2, 0, "ROW_ID", row_id)
        cls._insert_dataframe_column_if_not_exist(df2, 1, "ROW_VERSION", row_version)
        if any(row_etag):
            cls._insert_dataframe_column_if_not_exist(df2, 2, "ROW_ETAG", row_etag)
        df = df2
        includeRowIdAndRowVersion = True

    f = None
    try:
        if not filepath:
            temp_dir = tempfile.mkdtemp()
            filepath = os.path.join(temp_dir, "table.csv")
        f = io.open(filepath, mode="w", encoding="utf-8", newline="")

        test_import_pandas()
        import pandas as pd

        if isinstance(schema, Schema):
            for col in schema.columns_to_store:
                if col["columnType"] == "DATE":

                    def _trailing_date_time_millisecond(t):
                        if isinstance(t, str):
                            return t[:-3]

                    df[col.name] = pd.to_datetime(
                        df[col.name], errors="coerce"
                    ).dt.strftime("%s%f")
                    df[col.name] = df[col.name].apply(
                        lambda x: _trailing_date_time_millisecond(x)
                    )

        df.to_csv(
            f,
            index=False,
            sep=separator,
            header=header,
            quotechar=quoteCharacter,
            escapechar=escapeCharacter,
            lineterminator=lineEnd,
            na_rep=kwargs.get("na_rep", ""),
            float_format="%.12g",
        )
        # NOTE: reason for float_format='%.12g':
        # pandas automatically converts int columns into float64 columns when some cells in the column have no
        # value. If we write the whole number back as a decimal (e.g. '3.0'), Synapse complains that we are writing
        # a float into a INTEGER (synapse table type) column. Using the 'g' will strip off '.0' from whole number
        # values. pandas by default (with no float_format parameter) seems to keep 12 values after decimal, so we
        # use '%.12g'.
        # see SYNPY-267.
    finally:
        if f:
            f.close()

    return cls(
        schema=schema,
        filepath=filepath,
        etag=etag,
        quoteCharacter=quoteCharacter,
        escapeCharacter=escapeCharacter,
        lineEnd=lineEnd,
        separator=separator,
        header=header,
        includeRowIdAndRowVersion=includeRowIdAndRowVersion,
        headers=headers,
    )
```

  • delete_rows: Lets you run a query and delete all matching rows

> Will there be an in-memory representation of the contents of the table stored on the Table object?

I was not planning to store a reference to the content on the Table object. The content of the table depends on what you query for. My intention is that any time you want data you'd run a fresh query and pass the results around. My concern with storing the results on the Table object is that the garbage collector wouldn't clean them up until the reference to the Table was removed. I can also see cases where you might want to query a table multiple times for different things: get results and act on them, then query again and act on those results. I'd rather the data be intentionally passed to whatever needs it.

I had set query to be a staticmethod on the Table class. The thing that would prevent me from wanting to make it a non-staticmethod is that you would still need to provide the table ID in the query you execute (i.e. the select * from syn123 examples). Meaning it'd look like this in code if that were the case:

```python
my_table = Table(id="syn1234")
query_results = my_table.query(query="select * from syn1234")
```

We need to write syn1234 twice in this case. This issue is present in my proposed delete_rows method too:

```python
from synapseclient import Synapse
from synapseclient.models import Table

syn = Synapse()
syn.login()

Table(id="syn1234").delete_rows(query="SELECT ROW_ID, ROW_VERSION FROM syn1234 WHERE foo = 'asdf'")
```

I had considered something like being able to use a pre-defined string within the query that gets replaced with the ID of the table, e.g.:

```python
from synapseclient import Synapse
from synapseclient.models import Table

syn = Synapse()
syn.login()

# Current expectation
query_results = Table.query(query="SELECT ROW_ID, ROW_VERSION FROM syn1234 WHERE foo = 'asdf'")

# 'pre-defined' string idea
query_results = Table(id="syn1234").query(query="SELECT ROW_ID, ROW_VERSION FROM :id WHERE foo = 'asdf'")
```

The problem with this idea is that you couldn't ever take the query and execute it in the Synapse UI, or in any other client that doesn't implement this exact behavior.
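
For clarity, the substitution itself would be trivial; the objection is about query portability, not implementation. A sketch of the idea:

```python
def _resolve_query(query: str, table_id: str) -> str:
    """Replace the ':id' placeholder with the table's Synapse ID."""
    return query.replace(":id", table_id)

assert _resolve_query(
    "SELECT ROW_ID, ROW_VERSION FROM :id WHERE foo = 'asdf'", "syn1234"
) == "SELECT ROW_ID, ROW_VERSION FROM syn1234 WHERE foo = 'asdf'"
```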

> What would an operation look like if a user wanted to change the name of a column and alter row content in between store calls?

The syntax should look something like this:

```python
# Query for the rows to update
content_to_update = Table.query(query="select * from syn1234 where foo = 'bar'")

# Update those rows
content_to_update["some_content"] = ["some-new-values"]

# include_columns to rename an existing column
my_table = Table(id="syn1234").get(include_columns=True)
my_table.columns["some_content"].name = "some_new_column_name"

# The `store_rows` function could look for the new column in the pandas dataframe;
# if present, use the data from that column; if not, look for the previous column name
my_table.store_rows(values=content_to_update)
```

Question: Do you think we should expect the user to update the name of the column in the pandas DataFrame? Or should we provide a rename_column function that takes old_column_name: str, new_column_name: str, rows: pd.DataFrame = None and would 1) update the .columns attribute dictionary key, 2) update the Column class instance's .name attribute, and 3) update the name of the column in the passed-in DataFrame?
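
To make the question concrete, a hypothetical sketch of that helper:

```python
import pandas as pd

def rename_column(
    self,
    old_column_name: str,
    new_column_name: str,
    rows: pd.DataFrame = None,
) -> None:
    """Hypothetical helper keeping the dict key, Column.name, and DataFrame in sync."""
    column = self.columns.pop(old_column_name)  # 1) update the .columns dictionary key
    column.name = new_column_name               # 2) update the Column instance's .name
    self.columns[new_column_name] = column      # note: this moves the column to the end of the ordering
    if rows is not None:
        # 3) update the passed-in DataFrame
        rows.rename(columns={old_column_name: new_column_name}, inplace=True)
```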

@BWMac self-requested a review January 9, 2025 21:44
An inline comment was left on this diff excerpt:

```python
) -> str:
    """Takes in a path to a CSV and stores the rows to Synapse.
# TODO: Finish implementation
async def store_rows_async(
```
A project member commented:

Would the upsert functionality be part of this class or would it be elsewhere? (Essentially partial updates)

@BryanFauble (Contributor, Author) replied:

Great question. I ended up putting this into a new method for that specific logic, since partial updates are quite a bit different in how they need to be implemented. It's in an upsert_rows method detailed here:

https://synapsepythonclient--1151.org.readthedocs.build/en/1151/reference/oop/table_refactor/#synapseclient.models.Table.upsert_rows

I also got an implementation working (at least the partial update portion) to test out how Synapse does it.
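
A usage sketch based on those docs (the primary_keys argument shown here is an assumption):

```python
import pandas as pd

from synapseclient import Synapse
from synapseclient.models import Table

syn = Synapse()
syn.login()

# Update rows whose key columns match; insert the rest
Table(id="syn1234").upsert_rows(
    values=pd.DataFrame({"foo": ["asdf"], "some_content": ["updated"]}),
    primary_keys=["foo"],
)
```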

@BWMac (Contributor) commented Jan 10, 2025

Thanks for the explanations on the row operations and querying. That all makes sense.

> Do you think we should expect the user to update the name of the column in the pandas DataFrame?

This gets at the feeling I was getting, which is that having the Column objects completely divorced from the table contents in the OOP model might cause some problems. I can imagine confusion for some users if they were only able to update Columns through the Table object, but only able to update table contents/rows through the DataFrame interface. I think the rename_column functionality you described could be valuable. I also think that, in general, our users are likely familiar with pandas DataFrames, so interfaces using that data structure will probably be easy to understand.

If multiple ways of updating columns, rows, and/or table contents in general are to be supported, that definitely makes the dry_run functionality you have planned for the store function even more important so users can see how their changes will resolve in Synapse before making their operations final.
