I am curious whether anyone is aware of existing fitness-for-use flags for data. For example, a flag indicating that a record can be used for quantification/abundance, or that it can be used for presence/absence but not for abundance.
An example use case: the net of a standard gear broke during sampling, but something was still caught. The sample is kept, but because the net broke in that specific event, the data cannot be used for abundance. It is still meaningful to say that the organism caught was present at that time and place. In such a situation, the record would be flagged to indicate that it can be used for presence/absence but not for abundance. A user who sees this flag will understand that the net could have caught many more organisms (e.g. krill), so they are aware that individualCount or organismQuantity is unreliable (or why these fields are left empty).
The scientists in our project hope to flag the data they produce to inform users when their data is relevant and how it can be used.
These are different from what I expected, but I appreciate you sharing nonetheless. I have some considerations but the following is my biggest question:
I believe this should be an annotation and not part of the data, because these are not facts but recommendations based on a set of criteria, so it may not make sense to publish them as part of the data. I also don’t know where this information should live.
What I understood from conversations with scientists who are experienced with certain sampling gears and protocols is that many things can go wrong during sampling, and this may not be obvious to downstream users who get the data from an aggregator, even if the facts are documented in e.g. eventRemarks.
The scientists, being the primary users of the data they produce, have the experience and expert knowledge to evaluate and recommend whether a record can be used for qualitative (e.g. presence-absence) or quantitative (e.g. abundance) analyses. I am sure there are more examples, but I am not an expert in this.
Regarding the flags shared, it is not very clear to me how useful “good” or “bad” are, or what exactly the QC criteria were. Data that is “bad” for abundance analysis can be “good” for presence-absence modelling. My gut feeling is that the flags may not be suitable for biodiversity data, but it could be that I am not knowledgeable enough.
As a side note and my personal opinion - I want our data providers to feel encouraged, and negative wording like “bad” data can be discouraging.
EMODnet Biology started an effort to create these flags and a logic behind them. Unfortunately other tasks were prioritized but we were close to having something functional.
I’ve used your post to prompt the right people at VLIZ, but they will not be working on this again until at least this spring. If needed, I can include a brief explanation of it in the next OBIS DCG meeting.
The idea is that, based on the accuracy and completeness of the data, we would annotate each occurrence (and maybe dataset?). These annotations would be developed into filters in the EMODnet data portal so people could select data that are suitable for different types of qualitative or quantitative analysis, as you said @ymgan.
Background info, since Ming’s point is slightly different from my postpublication fitness-for-use labels suggestion, although both topics are related and could be addressed at the same time.
Prepublication fitness for use labels - Ming (AntOBIS)
Data originators/providers want to flag their records to warn users (and aggregators) so that those records are not used for specific types of analysis.
E.g. sampling issues (net breaking mid-trawl) that could result in underestimating the actual quantity sampled, making the record not comparable to other similar quantifications. In these cases, data providers should be able to flag the record so it can be used for presence-only analysis but not for quantitative analysis. For the moment these annotations are going into measurementRemarks, dynamicProperties, eventRemarks, etc.
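To make the prepublication case concrete, here is a minimal sketch of how a provider-side flag could be written into dynamicProperties as JSON. The `fitnessForUse` key and its `suitableFor` / `notSuitableFor` / `reason` entries are hypothetical (no such vocabulary is standardised as far as this thread knows), and the record itself is made up for illustration.

```python
import json


def add_fitness_flag(record, suitable_for, not_suitable_for, reason):
    """Return a copy of `record` with a fitness-for-use note in dynamicProperties.

    The "fitnessForUse" key and its structure are assumptions for this sketch,
    not an agreed Darwin Core vocabulary.
    """
    existing = record.get("dynamicProperties")
    # Keep whatever JSON the provider already stored in dynamicProperties.
    props = json.loads(existing) if existing else {}
    props["fitnessForUse"] = {
        "suitableFor": suitable_for,
        "notSuitableFor": not_suitable_for,
        "reason": reason,
    }
    updated = dict(record)  # leave the input record untouched
    updated["dynamicProperties"] = json.dumps(props, sort_keys=True)
    return updated


# Made-up occurrence record for the broken-net example.
record = {
    "occurrenceID": "example-occurrence-1",
    "scientificName": "Euphausia superba",
    "occurrenceStatus": "present",
    "eventRemarks": "Net broke mid-trawl",
}

flagged = add_fitness_flag(
    record,
    suitable_for=["presence"],
    not_suitable_for=["abundance"],
    reason="net broke mid-trawl",
)
print(flagged["dynamicProperties"])
```

Whether dynamicProperties is the right home for this is exactly the open question in the thread; the sketch only shows that a machine-readable note is possible alongside the free-text remark.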
Postpublication fitness for use labels - Ruben (EurOBIS)
Creation of labels that are attached to records based on a given set of known criteria, usually the completeness and quality of relevant fields (coordinates, date, abundances, taxonomy, temperatures, sampling methodologies). These labels/flags are assigned by the data aggregator/system via an automated method.
Those labels would distinguish how appropriate the record is to be used in data product creation or for other specific applications. The categories are (feel free to suggest changes):
Fitness for use C: (species distribution analysis)
Good metadata: includes citation, title, license, and an abstract with >100 characters. A script to check this exists and can be re-used. The contact field should also be checked; this can be included in future updates.
Coordinates present
Coordinate uncertainty < 5000 m
Year present
Species level taxa
Fitness for use B: (presence / absence analysis)
All the checks from category C plus
Sampling device present
Sampling effort present
Fitness for use A: (quantitative analysis)
All the checks in C and B plus
Individual count in sample
Abundance
Biomass
Fitness for use A+: (e.g. habitat modelling)
Abiotic data is mapped to BODC terms
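As a rough illustration of how the categories above could be turned into logic, here is a minimal Python sketch. It is not the EMODnet/VLIZ implementation. Several things are assumptions: the record is a flat dict, `goodMetadata` and `abioticMappedToBODC` are hypothetical precomputed booleans, `biomass` is a hypothetical key, "sampling device" is read from samplingProtocol, category A is satisfied by any one of the three quantity fields, "species level" means taxonRank equals "species", and A+ builds on A.

```python
def _present(record, key):
    """True if the key holds a non-empty value (0 counts as present)."""
    value = record.get(key)
    return value is not None and value != ""


def fitness_label(record):
    """Return 'A+', 'A', 'B', 'C', or None for one occurrence record (a dict)."""
    meets_c = (
        record.get("goodMetadata") is True  # hypothetical dataset-level check
        and _present(record, "decimalLatitude")
        and _present(record, "decimalLongitude")
        and _present(record, "coordinateUncertaintyInMeters")
        and float(record["coordinateUncertaintyInMeters"]) < 5000
        and _present(record, "year")
        and str(record.get("taxonRank", "")).lower() == "species"
    )
    if not meets_c:
        return None

    meets_b = _present(record, "samplingProtocol") and _present(record, "samplingEffort")
    if not meets_b:
        return "C"

    # Assumption: any one quantity field is enough for category A.
    quantity_keys = ("individualCount", "organismQuantity", "biomass")
    if not any(_present(record, key) for key in quantity_keys):
        return "B"

    if record.get("abioticMappedToBODC") is True:  # hypothetical flag
        return "A+"
    return "A"


# Made-up records for illustration.
base = {
    "goodMetadata": True,
    "decimalLatitude": -62.1,
    "decimalLongitude": -58.4,
    "coordinateUncertaintyInMeters": 500,
    "year": 2019,
    "taxonRank": "species",
}
with_effort = dict(base, samplingProtocol="trawl net", samplingEffort="20 min tow")
with_count = dict(with_effort, individualCount=12)

print(fitness_label(base))         # C
print(fitness_label(with_effort))  # B
print(fitness_label(with_count))   # A
```

Returning None for records that fail the C checks is a design choice of the sketch; a real system might want an explicit "unlabelled" category instead.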
Points in common and Next steps
Creation of fitness for use labels
Creation of logic to assign each label to each record
Based on data collection remarks for the prepublication labels
Based on completeness and accuracy of other fields for postpublication labels
Finding a location (DwC field?) where those flags/labels can be annotated
Automation of label annotation (Logic implementation) for postpublication labels
This could look at the chosen flag field and take it into account in the calculation, e.g. skip the record if the field is already filled in by the originator, or something along those lines.
EDIT: Requested extra step → Propose this as an Unconference topic at the Living Data 2025 conference
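For the automation step in the list above, here is a small sketch of two ways the automated label could take an originator's flag into account. It assumes, purely for illustration, that the originator uses the same C/B/A/A+ vocabulary as the automated labels; the function name and the "skip" and "cap" policy names are made up.

```python
# Least to most demanding, following the categories listed above.
ORDER = ["C", "B", "A", "A+"]


def resolve_label(automated, originator=None, policy="skip"):
    """Combine an automated label with an optional originator flag.

    policy="skip": if the originator filled in a flag, keep it as-is.
    policy="cap":  never assign a label more demanding than the originator's.
    """
    if not originator:
        return automated
    if policy == "skip":
        return originator
    if policy == "cap":
        if automated is None:
            return None
        # Pick whichever label sits lower in ORDER.
        return min(automated, originator, key=ORDER.index)
    raise ValueError(f"unknown policy: {policy}")


# Broken-net case: the fields are complete enough for 'A',
# but the originator says presence/absence only ('B').
print(resolve_label("A", originator="B", policy="cap"))  # B
```

With "cap", a complete-looking record from a broken-net event would still be limited to presence/absence use, which matches the prepublication use case; with "skip", the aggregator simply does not recompute anything the originator already set.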
I hope this is a good summary of the topic, @ymgan please feel free to suggest/clarify.
Hi @ymgan and @rubenperper, I think this is something we can also discuss in the Products Coordination group, as this kind of flagging would be very much welcome to facilitate application of the data in products.
@rubenperper does EurOBIS currently use any of the flags that you shared, or is this still in the plans?