Update tap with fastsync #8

brose7230 · 2022-03-17T17:38:27Z

No description provided.

eric-roll · 2022-05-18T21:08:28Z

tap_mssql/sync_strategies/full_table.py

+
+        columns.extend(['_SDC_EXTRACTED_AT','_SDC_DELETED_AT','_SDC_BATCHED_AT'])
+
+        query_df = df = pd.DataFrame(columns=columns) #TODO: delete?


Do we need pandas here? It can slow down performance and may not be needed if the goal is just to create a CSV.

https://stuartsplace.com/computing/programming/python/python-and-sql-server-exporting-data-csv

Since this works as is right now, and switching would require rewriting and retesting it, are you cool with leaving it as pandas while I create another ticket in the R&R epic for updating this method? If it's best to switch to the most efficient approach ASAP, I can make time this week to update this last part; if it's not pressing, I can file the ticket and get back to it.
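For the follow-up ticket, here is a minimal sketch of what a pandas-free export could look like, assuming the goal really is just "query results to CSV". It relies only on the standard `csv` module and the DB-API `fetchmany` call. The function name `export_query_to_csv` and the `batch_rows` parameter are hypothetical and not part of this PR.

```python
import csv


def export_query_to_csv(cursor, sql, path, batch_rows=10000):
    """Stream the result of `sql` into a CSV file, `batch_rows` rows at a time.

    `cursor` is any DB-API 2.0 cursor (pyodbc, pymssql, sqlite3, ...).
    Returns the number of data rows written.
    """
    cursor.execute(sql)
    # DB-API cursors expose the column name as the first item of each
    # entry in `cursor.description`.
    header = [column[0] for column in cursor.description]
    total = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        while True:
            rows = cursor.fetchmany(batch_rows)
            if not rows:
                break
            writer.writerows(rows)
            total += len(rows)
    return total
```

Because rows are written batch by batch, memory use stays bounded by `batch_rows` rather than by the table size. Whether this is actually faster than the DataFrame path for this tap would still need to be measured.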

eric-roll · 2022-05-18T21:09:56Z

tap_mssql/sync_strategies/full_table.py

+        raise Exception('Length must be at least 1!')
+
+    if 0 < length < 8:
+        LOGGER.info('Length is too small! consider 8 or more characters')


Nit: this could be upgraded to a warning.
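A rough sketch of the suggested change, assuming the function is `generate_random_string(length: int = 8) -> str` from this PR. The diff hunk does not show how the string itself is produced, so the `secrets`-based body here is an assumption, and `ValueError` is swapped in for the bare `Exception` in the original.

```python
import logging
import secrets
import string

LOGGER = logging.getLogger(__name__)


def generate_random_string(length: int = 8) -> str:
    """Return a random alphanumeric string of the requested length."""
    if length < 1:
        # The PR raises a bare Exception; ValueError is a narrower choice.
        raise ValueError("Length must be at least 1!")
    if length < 8:
        # Logged as a warning rather than info, per the review comment.
        LOGGER.warning("Length is too small! Consider 8 or more characters")
    # Assumption: the character set and generator are not shown in the diff.
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```

With `LOGGER.warning`, the message still appears when the log level is raised above INFO, which seems to be the point of the nit.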

eric-roll · 2022-05-19T17:29:40Z

tap_mssql/sync_strategies/common.py

 import datetime
+import glob


I don't think any of these imports are being used here.

eric-roll · 2022-05-19T17:36:47Z

tap_mssql/sync_strategies/common.py

-    select_sql = "SELECT {} FROM {}.{}".format(
-        ",".join(escaped_columns), escaped_db, escaped_table
-    )
+    if fastsync:


Can we consider moving this logic to a new folder called fast_sync? Then in full_table.py a conditional can decide whether to use generate_select_sql or fast_sync_generate_select_sql.

This isn't essential, just an idea to keep fast sync separate from the methods already used by other replication types (like incremental).

Oh yeah, good call. I planned on having a condition within full_table.sync_table() that would call one or the other; I meant to do that but forgot to make a note of it. This is updated now.

Oops, I think I read 'folder' as 'function', lol.
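To make the suggestion concrete, here is a simplified sketch of the dispatch. The signatures are reduced for illustration and do not match the tap's real ones; the `"fastsync"` config key, the `build_select_sql` helper, and the exact expressions used for the `_SDC_` columns are assumptions, not what the PR implements.

```python
def generate_select_sql(escaped_db, escaped_table, escaped_columns):
    """Regular SELECT, as used by the existing replication methods."""
    return "SELECT {} FROM {}.{}".format(
        ",".join(escaped_columns), escaped_db, escaped_table
    )


def fast_sync_generate_select_sql(escaped_db, escaped_table, escaped_columns):
    """Fast sync SELECT; would live in a separate fast_sync package."""
    # Hypothetical: how the PR fills the _SDC_ columns is not shown in the diff.
    sdc_columns = [
        "SYSUTCDATETIME() AS _SDC_EXTRACTED_AT",
        "NULL AS _SDC_DELETED_AT",
        "SYSUTCDATETIME() AS _SDC_BATCHED_AT",
    ]
    return generate_select_sql(
        escaped_db, escaped_table, list(escaped_columns) + sdc_columns
    )


def build_select_sql(config, escaped_db, escaped_table, escaped_columns):
    """Pick the query builder based on a (hypothetical) 'fastsync' config flag."""
    if config.get("fastsync"):
        builder = fast_sync_generate_select_sql
    else:
        builder = generate_select_sql
    return builder(escaped_db, escaped_table, escaped_columns)
```

Keeping the branch in one place (for example in `full_table.sync_table()`) means `generate_select_sql` stays untouched for incremental and logical replication.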

eric-roll · 2022-05-19T17:46:50Z

tap_mssql/sync_strategies/full_table.py


        if catalog_entry.tap_stream_id == "dbo-InputMetadata":
            revert_ouput_converter(open_conn, prev_converter)
+


Delete blank line

eric-roll · 2022-05-19T17:47:54Z

tap_mssql/sync_strategies/full_table.py

+
+
+
+def generate_random_string(length: int = 8) -> str:


This could be another function that makes sense to move to a file in a new fast_sync directory.

eric-roll · 2022-05-19T17:53:50Z

tap_mssql/sync_strategies/split_gzip.py

@@ -0,0 +1,172 @@
+"""Functions that write chunked gzipped files."""


It may make sense to move this file up one level out of sync_strategies to tap_mssql.

I used that earlier on during the fastsync work and it is no longer needed, so I removed it in the latest push.

eric-roll · 2022-05-19T18:11:26Z

tap_mssql/sync_strategies/full_table.py

@@ -2,9 +2,18 @@
 # pylint: disable=duplicate-code,too-many-locals,simplifiable-if-expression


Can we add a Fast Sync section to the readme.md? There are a few required config changes, such as "fastsync_batch_rows", that should be called out.
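Alongside the README section, the tap could validate the setting when it loads the config. A minimal sketch follows, assuming `fastsync_batch_rows` is a positive integer; the helper name and the default value are hypothetical and only there for illustration.

```python
# Hypothetical default; the PR does not state one.
DEFAULT_FASTSYNC_BATCH_ROWS = 100_000


def get_fastsync_batch_rows(config):
    """Read and validate 'fastsync_batch_rows' from the tap config dict."""
    raw = config.get("fastsync_batch_rows", DEFAULT_FASTSYNC_BATCH_ROWS)
    try:
        value = int(raw)
    except (TypeError, ValueError) as exc:
        raise ValueError(
            "fastsync_batch_rows must be an integer, got {!r}".format(raw)
        ) from exc
    if value < 1:
        raise ValueError("fastsync_batch_rows must be at least 1")
    return value
```

Failing early with a clear message is friendlier than an error deep inside the sync loop when the key is missing or mistyped.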

eric-roll · 2022-05-19T18:26:17Z

tap_mssql/sync_strategies/common.py

@@ -2,11 +2,19 @@
 # pylint: disable=too-many-arguments,duplicate-code,too-many-locals

 import copy
+import csv


I think a few changes available in master aren't included here. For example, the change to logical.py that updates the state even if no records changed is missing: see line 301 of logical.py in master, which isn't anywhere in this PR.

brose7230 · 2022-06-07T05:58:46Z

@eric-roll I didn't want to miss anything from master, so I created a new branch with all these updates: #11

brose7230 added 30 commits March 17, 2022 12:37

FastSync to tap-mssql update

d91d69f

extra logger before adding code in

6a04e76

Fixed logger position

9971f0c

extra logging for pk values

bc17b43

removed unused function

15d7ef1

fixed ANOTHER error

e3be333

fixed results

a575a15

added utils and testing initial sync duration

4e05478

length of rows

4eb1df3

number of rows

03f592c

fixed extracted/batched vars

26f4882

added copy_table function for chunking results into gzips

e84016c

fixed logging error

866da9c

params

d5c0f32

removed utils and moved 2 functions to common

8fa8f2c

trying pandas;

0b9e0c3

fixed connection

10b60b3

Fixed pandas

730c1b7

removed from common

96bd429

added pandas to setup

4d3d559

Update iwth singer rows

08b81c4

fixed imports

07fb8b7

write_message record

25c1020

updated record_message

27e5e59

in common

81a3e8b

fixed json to dict

c02a49b

Fixed time and version

acacfcd

defined time extracted

8e96274

cleanup

dec761e

cleanup

98d0994

brose7230 added 25 commits March 17, 2022 23:07

fixed pandas columns

0b632f9

removed some logging

303a51b

increased chunk size

b158d02

Changed method of sending

a851137

Updated singer message with full sync

cd169ef

fixed name and message to singer

1ac0851

added sys stdout write for FASTSYNC

2c12035

fixed logger spelling

9d5a7f6

converted dict to str

9590f3b

fixed dict

c2355b5

updated json stinrg

7c087d1

added file names to singer record

3266a4d

changed to filename

1721c8c

added _sdc_ fields to query/dataframe

3d49e18

fixed fastync typo in generate_select_sql function

a514b91

changed to_csv to use chunk_dataframe instead of appending a larger dataframe

8d4665a

sort columns when fastsync

0c19653

removed some logging and cleaned up unused functions

8a96bbe

creates dir if not existing

411749f

removed mkdir

1c1b38d

moved max_pk_values and last_pk_fetched alignment

16e0c84

removed else in sync_table

d0431a1

removed unneeded code

5594a17

changed export_batch_rows to fastsync_batch_rows

edc7799

comment

5c62571

eric-roll suggested changes May 19, 2022


brose7230 added 2 commits June 6, 2022 23:53

review changes

c718040

removed gzip import

592f78f




