Fit for use flag for data

ymgan · January 16, 2025, 2:13pm

Hey all,

I am curious if anyone is aware of any existing fit for use flags for data? For example, a flag that indicates a record can be used for quantification/abundance, or that the data can be used for presence/absence but not for abundance, something like that.

An example use case: the net of a standard gear broke during sampling, but something was still caught. The sample is kept, but because the net broke in that specific event, the data cannot be used for abundance. It is still meaningful to say that the organism caught was present at that time and place. In such a situation, the record would be flagged to indicate that it can be used for presence/absence but not for abundance. A user who sees this flag will understand that the net could have caught many more organisms (e.g. krill), so they are aware that the individualCount or organismQuantity is unreliable (or why these fields are left empty).

The scientists of our project hope to flag the data they produce to inform users when their data is relevant and how it can be used.

Thanks a lot!

pieterprovoost · January 16, 2025, 2:28pm

Hi Ming,

IODE has a quality flag scheme, but there are others in oceanography. The IODE flags are in NVS, maybe these can be adopted?

IODE scheme: OceanExpert | Document
Mapping between IODE and other schemes: https://odv.awi.de/fileadmin/user_upload/odv/misc/ODV4_QualityFlagSets.pdf
NVS vocab: NVS

Laurent · January 16, 2025, 3:06pm

Dear @pieterprovoost and @ymgan I will move this discussion to the Data publishing channel.

ymgan · January 17, 2025, 10:56am

Thank you very much @pieterprovoost !

These are different from what I expected, but I appreciate you sharing nonetheless. I have some considerations but the following is my biggest question:

I believe this should be an annotation and not part of the data, because these are not facts but recommendations based on a set of criteria, so I feel it may not make sense to publish this as part of the data. I also don’t know where this information should live.

What I understood from conversations with scientists who are experienced with certain sampling gears and protocols is that there are so many things that can go wrong during sampling, and this may not be obvious to downstream users who get the data from an aggregator, even if the facts are documented in e.g. eventRemarks.

The scientists, being the primary users of the data they produce, have the experience and expert knowledge to evaluate and recommend whether a record can be used for qualitative (e.g. presence-absence) or quantitative (e.g. abundance) analyses. I am sure there could be more examples, but I am not an expert in this.

Regarding the flags shared, it is not very clear to me how useful “good” or “bad” are, or what exactly the QC criteria were. “Bad” data for abundance analysis can be “good” data for presence-absence modelling. My gut feeling tells me that the flags may not be suitable for biodiversity data, but it could be that I am not knowledgeable enough.

As a side note and my personal opinion - I want our data providers to feel encouraged, and negative wording like “bad” data can be discouraging.

rubenperper · January 17, 2025, 2:42pm

Hi,

EMODnet Biology started an effort to create these flags and a logic behind them. Unfortunately other tasks were prioritized but we were close to having something functional.

I’ve used your post to prompt the right people at VLIZ, but they will not be working on this again until this spring at the earliest. If needed, I can include a brief explanation of it in the next OBIS DCG meeting.

The idea is that based on the accuracy and completeness of the data we would annotate each occurrence (and maybe dataset?). These annotations would be developed into filters in the EMODnet data portal so people could select data that are suitable for different types of qualitative or quantitative analysis as you said @ymgan

rubenperper · February 21, 2025, 11:34am

Some background info, since Ming’s point is slightly different from my postpublication fitness for use labels suggestion, although both topics are related and could be addressed at the same time.

Prepublication fitness for use labels - Ming (AntOBIS)
Data originators/providers want to flag their records to warn users (and aggregators), so that those records are not used for specific types of analysis.

E.g. sampling issues (net breaking mid-trawl) that could lead to underestimating the actual size of the collection, making the record not comparable to other similar quantifications. In these cases, data providers should be able to flag the record so it can be used for presence-only analysis but not for quantitative analysis. For the moment these annotations are going into measurementRemarks, dynamicProperties, eventRemarks, etc.

Postpublication fitness for use labels - Ruben (EurOBIS)
Creation of labels that are annotated to records based on a given set of known criteria, usually completeness and quality of relevant fields (coordinates, date, abundances, taxonomy, temperatures, sampling methodologies). These labels/flags are annotated by the data aggregator/system via an automated method.

Those labels would distinguish how appropriate the record is to be used in data product creation or for other specific applications. The categories are (feel free to suggest changes):

Fitness for use C: (species distribution analysis)
- Good metadata: includes citation, title, license, and abstract with >100 characters. A script to check this exists and can be re-used. The contact field should also be checked; this can be included in future updates.
- Coordinates present
- Coordinate uncertainty < 5000 m
- Year present
- Species level taxa
Fitness for use B: (presence / absence analysis)
- All the checks from category C plus
- Sampling device present
- Sampling effort present
Fitness for use A: (quantitative analysis)
- All the checks in C and B plus
- Individual count in sample
- Abundance
- Biomass
Fitness for use A+: (e.g. habitat modelling)
- Abiotic data is mapped to BODC terms
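
To make the tiering above easier to discuss, here is a minimal Python sketch of how the checks could cascade. It is only an illustration under several assumptions, not the EMODnet/EurOBIS implementation: the Darwin Core terms used for each check, the `abundance`, `biomass` and `abiotic_mapped_to_bodc` keys, the function names, treating "species level" as `taxonRank == "species"`, requiring only one of the three quantity fields for A, and treating A+ as A plus the BODC check are all guesses on top of the list.

```python
def _present(record, key):
    """True if the key holds a non-empty value."""
    value = record.get(key)
    return value is not None and str(value).strip() != ""


def metadata_ok(meta):
    """Citation, title and license present, abstract longer than 100 characters."""
    has_basics = all(_present(meta, k) for k in ("citation", "title", "license"))
    return has_basics and len(str(meta.get("abstract") or "")) > 100


def fitness_label(record, meta):
    """Return "A+", "A", "B", "C", or None if even the C checks fail."""
    # Category C: species distribution analysis
    tier_c = (
        metadata_ok(meta)
        and _present(record, "decimalLatitude")
        and _present(record, "decimalLongitude")
        and _present(record, "coordinateUncertaintyInMeters")
        and float(record["coordinateUncertaintyInMeters"]) < 5000
        and _present(record, "year")
        # Assumption: "species level taxa" means taxonRank is exactly "species".
        and str(record.get("taxonRank", "")).lower() == "species"
    )
    if not tier_c:
        return None

    # Category B: C plus sampling device and sampling effort
    # (assumption: the device is recorded in samplingProtocol)
    if not (_present(record, "samplingProtocol") and _present(record, "samplingEffort")):
        return "C"

    # Category A: B plus a quantity
    # (assumption: any one of count, abundance or biomass is enough)
    quantity_keys = ("individualCount", "abundance", "biomass")
    if not any(_present(record, k) for k in quantity_keys):
        return "B"

    # Category A+: hypothetical boolean saying abiotic data is mapped to BODC terms
    return "A+" if record.get("abiotic_mapped_to_bodc") is True else "A"
```

A record with complete metadata, coordinates, year and species rank but no sampling information would come out as "C"; adding a sampling protocol and effort lifts it to "B", and adding an individual count lifts it to "A".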

Points in common and Next steps

Creation of fitness for use labels
Creation of logic to assign each label to each record
- Based on data collection remarks for the prepublication labels
- Based on completeness and accuracy of other fields for postpublication labels
Finding a location (DwC field?) where those flags/labels can be annotated
Automation of label annotation (Logic implementation) for postpublication labels
- Could look at the chosen flag field and take it into account in the calculation: either skip the record if the field is already filled in by the originator, or something along those lines.
EDIT: Requested extra step → Propose this as an Unconference topic in living data conference 2025
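
As a rough illustration of the "skip if already filled in by originator" option, a small Python sketch follows. The field name `fitnessForUse` and the function `resolve_label` are hypothetical, since no location for the flag has been agreed yet; the sketch only shows the originator's value taking precedence over the automatically computed one.

```python
def resolve_label(record, computed_label, flag_field="fitnessForUse"):
    """Prefer a flag set by the data originator over the automated label.

    Returns a (label, source) pair so downstream users can see who assigned it.
    `flag_field` is a placeholder name, not an existing Darwin Core term.
    """
    originator_value = record.get(flag_field)
    if originator_value is not None and str(originator_value).strip():
        # The originator already flagged this record: keep their value.
        return str(originator_value).strip(), "originator"
    # Nothing from the originator: fall back to the automated label.
    return computed_label, "automated"
```

For the broken-net example, a record the originator flagged as "B" would stay "B" even if the automated checks alone would have assigned "A".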

I hope this is a good summary of the topic, @ymgan please feel free to suggest/clarify.

ymgan · February 21, 2025, 12:13pm

Thanks @rubenperper !! I believe your interpretation is correct!

I would like to point out that the following was being discussed here and there.

Finding a location (DwC field?) where those flags/labels can be annotated

This is what I gathered. I think a DwC field is probably not the place for the long term because:

I felt that the flag is not data and it can change based on its criteria.
Annotation of annotation can happen
Please have a look at an open issue from the TDWG BDQ group about this: TG2- Storage of Annotations · Issue #149 · tdwg/bdq · GitHub

I would like to suggest another next step if that makes sense - please propose this as an Unconference topic in living data conference this year

silas.principe · March 6, 2025, 4:11pm

Hi @ymgan and @rubenperper, I think this is something we can also discuss in the Products Coordination group, as this kind of flagging would be very much welcome to facilitate application of the data in products.

@rubenperper does EurOBIS currently use any of the flags that you shared, or is this something that is still being planned?

rubenperper · March 17, 2025, 5:07pm

Hi @silas.principe I understand that part of it could be a common effort with OBIS PCG indeed.

EurOBIS currently does not use those flags; it is a work in progress for the moment.

silas.principe · March 18, 2025, 3:45pm

Ah ok @rubenperper! I will bring this topic to the PCG next time
