Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integer overflow in CohortRelationship.sql #1150

Open
msuchard opened this issue Jan 6, 2025 · 4 comments
Open

Integer overflow in CohortRelationship.sql #1150

msuchard opened this issue Jan 6, 2025 · 4 comments
Assignees
Labels
bug Something isn't working fixed in develop

Comments

@msuchard
Copy link
Member

msuchard commented Jan 6, 2025

@azimov -- CohortDiagnostics within Strategus continues to throw SQL errors in the VA environment, and getting these solved is a high-priority for finishing our ExampleStrategusStudy for tutorials (with @schuemie). Here is the SQL error, and I suspect it is occurring in the SUM().

DBMS:
sql server

Error:
com.microsoft.sqlserver.jdbc.SQLServerException: Arithmetic overflow error converting expression to data type int.

SQL:
SELECT t.cohort_definition_id cohort_id,
	c.cohort_definition_id comparator_cohort_id,
	CAST(1 AS INT) time_id,
	COUNT_BIG(DISTINCT c.subject_id) subjects,
	-- present in both target and comparator
	COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_start_date < DATEADD(day, -9999, t.cohort_start_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_cs_before_ts,
	-- comparator cohort start date before target start date (offset) [How many subjects in comparator cohort start prior to first target cohort start]
	COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_start_date = DATEADD(day, -9999, t.cohort_start_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_cs_on_ts,
	-- comparator cohort start date on target start date (offset) [How many subjects in comparator cohort start with first target cohort start]
	COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_start_date > DATEADD(day, -9999, t.cohort_start_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_cs_after_ts,
	-- comparator cohort start date after target start date (offset) [How many subjects in comparator cohort start after first target cohort start]
	COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_start_date < DATEADD(day, -9999, t.cohort_end_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_cs_before_te,
	-- comparator cohort start date before target end date (offset) [How many subjects in comparator cohort start after first target cohort end]
	COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_start_date = DATEADD(day, -9999, t.cohort_end_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_cs_on_te,
	-- comparator cohort start date on target end date (offset) [How many subjects in comparator cohort start on first target cohort end]
	COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_start_date > DATEADD(day, -9999, t.cohort_end_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_cs_after_te,
	-- comparator cohort start date after target end date (offset) [How many subjects in comparator cohort start after first target cohort end]
	COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_start_date >= DATEADD(day, -9999, t.cohort_start_date)
				AND c.cohort_start_date <= DATEADD(day, 0, t.cohort_end_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_cs_window_t,
	-- comparator cohort subjects start within period (incidence) relative to target start date and end date
	-- [How many subjects in comparator cohort start within a window of days relative to first target start date and first target end date]
		COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_end_date >= DATEADD(day, -9999, t.cohort_start_date)
				AND c.cohort_end_date <= DATEADD(day, 0, t.cohort_end_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_ce_window_t,
	-- comparator cohort subjects start within period (incidence) relative to target start date and end date
	-- [How many subjects in comparator cohort start within a window of days relative to first target start date and first target end date]
	COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_start_date >= DATEADD(day, -9999, t.cohort_start_date)
				AND c.cohort_start_date <= DATEADD(day, 0, t.cohort_start_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_cs_window_ts,
	-- comparator cohort subjects start within period (incidence) relative to target start date [How many subjects in comparator cohort start within a 
	-- window of days relative to first target start date]
	COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_start_date >= DATEADD(day, -9999, t.cohort_end_date)
				AND c.cohort_start_date <= DATEADD(day, 0, t.cohort_end_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_cs_window_te,
	-- comparator cohort subjects start within period (incidence) relative to target end date [How many subjects in comparator cohort start within a 
	-- window of days relative to first target end date]
	COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_end_date >= DATEADD(day, -9999, t.cohort_start_date)
				AND c.cohort_end_date <= DATEADD(day, 0, t.cohort_start_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_ce_window_ts,
	-- comparator cohort subjects end within period (incidence) relative to target start date [How many subjects in comparator cohort end within a 
	-- window of days relative to first target start date]
	COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_end_date >= DATEADD(day, -9999, t.cohort_end_date)
				AND c.cohort_end_date <= DATEADD(day, 0, t.cohort_end_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_ce_window_te,
	-- comparator cohort subjects end within period (incidence) relative to target end date [How many subjects in comparator cohort end within a 
	-- window of days relative to first target end date]
	COUNT_BIG(DISTINCT CASE 
			WHEN c.cohort_start_date >= DATEADD(day, -9999, t.cohort_start_date)
				AND c.cohort_start_date <= DATEADD(day, 0, t.cohort_end_date)
				AND c.cohort_end_date >= DATEADD(day, -9999, t.cohort_start_date)
				AND c.cohort_end_date <= DATEADD(day, 0, t.cohort_end_date)
				THEN c.subject_id
			ELSE NULL
			END) sub_c_within_t,
	-- comparator cohort days within target (offset) days [How many subjects in comparator cohort have their entire cohort period within first target cohort period]
	SUM((
			CASE -- comparator cohort start date before target start date (offset)
				WHEN c.cohort_start_date < DATEADD(day, -9999, t.cohort_start_date)
					THEN datediff(dd,
					              c.cohort_start_date, 
					              CASE --min of comparator end date/target start dates (offset)
          								WHEN c.cohort_end_date < DATEADD(day, -9999, t.cohort_start_date)
          									THEN c.cohort_end_date
          								ELSE DATEADD(day, -9999, t.cohort_start_date)
								          END)
				ELSE 0
				END
			) + 1) c_days_before_ts,
	-- comparator cohort days before target start date (offset)
	SUM((
			CASE -- comparator cohort start date before target end date (offset)
				WHEN c.cohort_start_date < DATEADD(day, -9999, t.cohort_end_date)
					THEN datediff(dd, 
					              c.cohort_start_date, 
					              CASE --min of comparator end date/target end dates (offset)
								          WHEN c.cohort_end_date < DATEADD(day, -9999, t.cohort_end_date)
									          THEN c.cohort_end_date
								          ELSE DATEADD(day, -9999, t.cohort_end_date)
								          END)
				ELSE 0
				END
			) + 1) c_days_before_te,
	-- comparator cohort days before target end date (offset)
	SUM((
			CASE -- comparator cohort days within target days (offset)
				WHEN  c.cohort_end_date >= DATEADD(day, -9999, t.cohort_start_date)
					    AND c.cohort_start_date <= DATEADD(day, 0, t.cohort_end_date)
					THEN datediff(dd, 
					              CASE --min of comparator start date/target start dates (offset)
								            WHEN c.cohort_start_date < DATEADD(day, -9999, t.cohort_start_date)
									          THEN DATEADD(day, -9999, t.cohort_start_date)
								          ELSE c.cohort_start_date
								          END, 
								        CASE --min of comparator end date/target end dates (offset)
								            WHEN c.cohort_end_date > DATEADD(day, 0, t.cohort_end_date)
									          THEN DATEADD(day, 0, t.cohort_end_date)
								          ELSE c.cohort_end_date
								          END)
				ELSE 0
				END
			) + 1) c_days_within_t_days,
	-- comparator cohort days within target cohort days (offset)
		SUM((
			CASE -- comparator cohort end date after target start date (offset)
				WHEN c.cohort_end_date > DATEADD(day, -9999, t.cohort_start_date)
					THEN datediff(dd, 
					              CASE --max of comparator start date/target start dates (offset)
								            WHEN c.cohort_start_date < DATEADD(day, -9999, t.cohort_start_date)
									          THEN DATEADD(day, -9999, t.cohort_start_date)
								          ELSE c.cohort_start_date
								          END,
								          c.cohort_end_date)
				ELSE 0
				END
			) + 1) c_days_after_ts,
	-- comparator cohort days after target start date (offset)
		SUM((
			CASE -- comparator cohort end date after target end date (offset)
				WHEN c.cohort_end_date > DATEADD(day, -9999, t.cohort_end_date)
					THEN datediff(dd, 
					              CASE --max of comparator start date/target start dates (offset)
								            WHEN c.cohort_start_date < DATEADD(day, -9999, t.cohort_end_date)
									          THEN DATEADD(day, -9999, t.cohort_end_date)
								          ELSE c.cohort_start_date
								          END,
								          c.cohort_end_date)
				ELSE 0
				END
			) + 1) c_days_after_te,
	-- comparator cohort days after target end date (offset)
	SUM(datediff(dd, DATEADD(day, -9999, t.cohort_start_date), DATEADD(day, 0, t.cohort_end_date)) + 1) t_days,
	-- target cohort days (no offset)
	SUM(datediff(dd, c.cohort_start_date, c.cohort_end_date) + 1) c_days
-- comparator cohort days (offset)
INTO #cohort_rel_output
FROM #target_cohort_table t
INNER JOIN ORD_OHDSI.bv.StrategusExample c ON c.subject_id = t.subject_id
	AND c.cohort_definition_id != t.cohort_definition_id
	-- comparator cohort overlaps with target cohort during the offset period
	AND c.cohort_end_date >= DATEADD(day, -9999, t.cohort_start_date)
	AND c.cohort_start_date <= DATEADD(day, 0, t.cohort_end_date)
WHERE t.cohort_definition_id IN (19021,19022,19023,19024,19059,19137,19021001,19022101,19137001)
    AND c.cohort_definition_id IN (19022,19023,19024,19059,19137,19021001,19022101,19137001,19021)
GROUP BY t.cohort_definition_id,
	c.cohort_definition_id

R version:
R version 4.4.1 (2024-06-14 ucrt)

Platform:
x86_64-w64-mingw32

Attached base packages:
- stats
- graphics
- grDevices
- datasets
- utils
- methods
- base

Other attached packages:
- Strategus (1.0.0)
- CohortGenerator (0.11.2)
- R6 (2.5.1)
- DatabaseConnector (6.3.2)

To sum using BIGINT each argument should (I believe) be first CAST to BIGINT, like SUM(CAST(column) AS BIGINT)

@schuemie
Copy link
Member

schuemie commented Jan 7, 2025

Agreed. On SQL Server the DATEDIFF function returns an INT, which should be cast to BIGINT before summing.

(The reason we haven't seen this in for example J&J is because we were using RedShift, where DATEDIFF returns a BIGINT. Adding the CAST as Marc suggested would make sure the code runs on all platforms, even including DataBricks, where DATEDIFF also returns an INT).

@azimov : could we have a release soon that implements this? It should be a fairly straightforward change.

@azimov
Copy link
Collaborator

azimov commented Jan 9, 2025

@msuchard @schuemie I have attempted a at hotfix for all datediffs and will attempt to release today. Please note that I will be away for some time from the 10th January but will try to fully test this before then

@azimov
Copy link
Collaborator

azimov commented Jan 10, 2025

@msuchard @schuemie - Ok I have created a PR #1154 that converts all date diffs to big int and the release is ready to go (this is a cherry pick so develop will be in conflict). However, there are two points:

  1. It would be good if this could be tested on the data where this failed - my hotfix is a stab in the dark as I cannot seem to reproduce the issue on my own.

  2. I won't be around to fix any unforeseen consequences of this release

  3. Regardless, its never ideal to release on a Friday

If you can test it on some suitable data, please feel empowered to merge.

@azimov azimov self-assigned this Jan 10, 2025
@azimov azimov added bug Something isn't working fixed in develop labels Jan 10, 2025
@msuchard
Copy link
Member Author

@azimov -- i will coordinate getting this tested in VA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working fixed in develop
Projects
None yet
Development

No branches or pull requests

3 participants