Additional checks for vague date ranges required? #23
Comments
I think there might be an issue already raised with GBIF related to this. Last time I checked they couldn't handle date ranges in the eventDate field.
The GBIF guidance suggests otherwise, unless you mean that there is currently a bug report open. Couldn't your process do some error checking to compare the interpreted and original event dates? For example, in the case above, there is clearly an error in the interpretation of the originally supplied date. This seems like a fairly important issue for modelling trends in IAS, which presumably the aggregated dataset is going to be used for.
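A minimal sketch of the kind of check suggested here, comparing the originally supplied eventDate with the interpreted date. The function names, the status labels and the `max_span_days` threshold are all hypothetical, and it assumes both dates are available as full `YYYY-MM-DD` values (partial dates such as `1700` are not handled).

```python
from datetime import date
from typing import Tuple


def parse_event_date(event_date: str) -> Tuple[date, date]:
    """Parse an ISO 8601 date or 'start/end' interval into (start, end).

    Assumption: every component is a full YYYY-MM-DD date.
    """
    parts = event_date.split("/")
    if len(parts) > 2:
        raise ValueError(f"not a date or date interval: {event_date!r}")
    start = date.fromisoformat(parts[0])
    end = date.fromisoformat(parts[-1])  # same as start for a single date
    return start, end


def check_interpretation(original: str, interpreted: date,
                         max_span_days: int = 365) -> str:
    """Compare the interpreted date with the original eventDate.

    Returns one of three hypothetical labels:
      'outside_range' - interpreted date is not within the original range
      'vague_range'   - original range is wider than max_span_days,
                        so a single-day interpretation is misleading
      'ok'            - nothing suspicious found
    """
    start, end = parse_event_date(original)
    if not (start <= interpreted <= end):
        return "outside_range"
    if (end - start).days > max_span_days:
        return "vague_range"
    return "ok"


# The record discussed in this issue: a 1700-2009 range resolved to one day.
print(check_interpretation("1700-01-01/2009-02-04", date(1700, 1, 1)))
```

Records labelled `vague_range` could then be reviewed or excluded before aggregation, rather than being counted under the first year of the range.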
We could, but I have a suspicion that the
Closed in error
The simplest approach would be just to manually check any record resolved to a day where that day was Julian day 0. This would at least exclude the most egregious errors. Over the past 6 years at BRC I have never seen an automated pipeline that didn't benefit from some manual checks or intervention.
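As a rough sketch of how such records could be shortlisted for manual checking: this reads "Julian day 0" as 1 January (a zero-based day of year), which is an assumption, and the record structure and function names are made up for illustration.

```python
from datetime import date
from typing import Dict, List


def resolved_to_first_day_of_year(interpreted: date) -> bool:
    """True if the interpreted date is 1 January.

    Python's tm_yday is 1-based, so 1 January is day 1 here
    (day 0 in a zero-based convention).
    """
    return interpreted.timetuple().tm_yday == 1


def shortlist_for_manual_check(records: List[Dict]) -> List[Dict]:
    """Return records whose interpreted date falls on 1 January.

    Assumption: each record is a dict with an 'interpreted' date value.
    """
    return [r for r in records if resolved_to_first_day_of_year(r["interpreted"])]


records = [
    {"id": "a", "interpreted": date(1700, 1, 1)},  # candidate for checking
    {"id": "b", "interpreted": date(2009, 2, 4)},  # ordinary dated record
]
flagged = shortlist_for_manual_check(records)
print([r["id"] for r in flagged])
```

This will also pick up records genuinely collected on 1 January, so it only narrows down what needs a human look.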
I agree about manual checks, but we do need to keep this to a minimum for what we envision. In the case of Belgian data we are also the publishers of most of the data, so some problems can and should be fixed in the publication pipelines too.
Looks like the
Thanks @sacrevert for your observation. Screening observation via querying the API endpoint About the parsing of
The eventDate "1700-01-01/2009-02-04" is correct according to the ISO standard. There are still parsing issues on the GBIF side. Once the GBIF issue is solved, we can consider assigning the occurrence randomly to a specific year and add the column
GBIF does now "handle" date ranges, by taking the first date of the range (see gbif/portal-feedback#652 (comment)). That is already an improvement on ignoring the date altogether, which was the case before.
It's up to you guys really, I was just pointing out that early dates resolved to single years are often wrong, and this was obvious within about 10 seconds of looking at your "occurrence cube". My personal opinion is that extremely vague dates should not be arbitrarily assigned to single years, particularly if one is ultimately going to be producing trends for policy or broader ecological use or interpretation. Either the records should be ignored, or presented with full known range, so that later they can either be excluded or known to fall within a particular date range for modelling. I suppose randomly assigning a year is one potential solution, although I would personally choose to exclude such data points, as they don't add any information and are liable to be misinterpreted by any uninitiated users downstream. This would assume that the dates were missing completely at random (in the statistical jargon), which is also unlikely, as the missingness is probably correlated with the true date of collection.
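To make the "full known range" option concrete, here is a small sketch that keeps the earliest and latest possible year for each record and lets a downstream user drop anything too vague. The names `YearRange`, `year_range`, `keep_for_trends` and `max_width_years` are hypothetical, and it again assumes full `YYYY-MM-DD` dates on both sides of the interval.

```python
from datetime import date
from typing import List, NamedTuple


class YearRange(NamedTuple):
    year_min: int  # earliest possible year of the event
    year_max: int  # latest possible year of the event

    @property
    def width(self) -> int:
        """Temporal uncertainty in years (0 for a single-year record)."""
        return self.year_max - self.year_min


def year_range(event_date: str) -> YearRange:
    """Derive the year range from an ISO 8601 date or 'start/end' interval."""
    parts = event_date.split("/")
    start = date.fromisoformat(parts[0])
    end = date.fromisoformat(parts[-1])
    return YearRange(start.year, end.year)


def keep_for_trends(event_dates: List[str], max_width_years: int = 0) -> List[str]:
    """Keep only event dates whose year range is narrow enough for trend models."""
    return [d for d in event_dates if year_range(d).width <= max_width_years]


dates = ["1700-01-01/2009-02-04", "2009-02-04"]
print(year_range(dates[0]))    # the vague record keeps its full range
print(keep_for_trends(dates))  # and is excluded from trend modelling
```

With this layout nothing is assigned to an arbitrary year: the vague record stays visible with its range, and exclusion is an explicit choice made by whoever builds the model.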
@sacrevert completely agree. I just suggested to GBIF to flag such records, so we can exclude them in the future: gbif/gbif-api#4 (comment)
Thanks @sacrevert for your comment. If these data are not that useful, adding a temporal uncertainty column would just make data processing more computationally demanding, with no benefit for the researcher, and make the output larger and less readable.
Early records are less likely to be resolved to single years.
For example, the first exemplar row here
https://zenodo.org/record/3635510#.Xj1LLWj7SHt
1700 | 1kmE3802N3133 | 2287615 | 1 | 301
apparently derives from the GBIF record here https://www.gbif.org/occurrence/477065724
but this seems to misrepresent the original
https://mczbase.mcz.harvard.edu/guid/MCZ:Mala:152567
which gives a collecting date of 1700-2009 (i.e. presumably unknown or not digitised?)