This transcript was written by Yi Ming Gan (@ymgan / OBIS Antarctica) and is posted on her behalf.
Slides featured during the meeting:
Interoperability & Transition
Q: Is DwC-DP interoperable with DwC-A?
A: Yes. It is anticipated that data expressed in the Darwin Core Archive (DwC-A) format can be converted into the Darwin Core Data Package (DwC-DP) format; the reverse is very difficult to achieve.
Q: In what cases, going forward, would we recommend using DwC-A vs DwC-DP?
A: Once the Darwin Core Data Package (DwC-DP) is implemented, the ideal recommendation is to use DwC-DP for all datasets, especially for those that cannot be easily expressed using the Darwin Core Archive (DwC-A) format. However, if the data already aligns well with DwC-A and there is no current need, capacity, or confidence to transition to DwC-DP, then continuing with DwC-A remains acceptable. DwC-DP is a safe and future-proof option. GBIF is also considering a hybrid approach: even if datasets are produced using DwC-A in the IPT, a DwC-DP version could be automatically generated in parallel.
Tooling & Interface
Q: Will there be tools to support the conversion of DwC-A into DwC-DP?
A: Development is already underway on a DwC-A to DwC-DP converter, available as:
- a Java-based converter library, and
- a web-based conversion tool.
While not fully complete, the converter already handles many cases robustly and losslessly, except for some edge cases where data has been inappropriately structured in DwC-A.
The conversion tool is planned to be offered as:
- A standalone utility
- A web interface
- And will likely be embedded in the IPT, enabling publishers to upgrade or augment their datasets to the DwC-DP format with minimal friction.
The core software stack is shared across all these implementations, and technical users can start testing it.
Q: Are the tools mentioned open source, or will they be implemented primarily on GBIF platforms?
A: All tools developed by GBIF are open source, with the sole exception of internal configuration files.
Q: Do the tools focus on the interpretation/utility of enriched data only or are they tools for transformation of data to DwC-DP?
A: So far, the work has primarily focused on producing the Darwin Core Data Package (DwC-DP), rather than tools for interpretation or transformation.
Q: Will there be tools available to help data providers not only map fields but also choose the appropriate tables when transforming data into DwC-DP?
A: Yes, that’s something we’re actively discussing. While IPT has historically been good at mapping spreadsheets to Darwin Core Archives, more advanced support will be needed for DwC-DP. We’re exploring ways to help with data restructuring, including potentially smarter tools and identifier generation. This is all part of making the transformation process more accessible and scalable.
Q: Is there a roadmap to make a graphic interface (i.e. update the IPT) for mapping DwC-DP files? It is difficult to create a JSON file without prior knowledge.
A: Yes, there is a development branch of the IPT that allows users to create DwC-DP files through a graphical interface. For example, see: https://dwc2-ipt.gbif-uat.org/resource?r=broke-west-fish. While it is not yet ready for public release, you are welcome to request an account to try it out. Details on how to get access can be found here.
Q: The experience of mapping fields to DwC-DP with tools like the IPT is expected to be similar in difficulty to the current process. However, what about the choice of tables: will each table have to be selected manually, as happens now? If so, although it’s not more difficult, it will definitely be more work.
A: Work will be needed on tools that make this easier. Currently, the test IPT only does what the IPT has always done: it lets you select a target table and map to it. We need to consider ways to make this easier for people.
Q: Would it be possible to have a graph visualisation of the dataset in the IPT to help users understand the relationships described in the mappings?
A: A visual representation sounds like a good idea indeed. If anyone has suggestions or ideas on how to approach this, we would be keen to collaborate.
Q: Assuming that guidelines will be designed to help choose a use case for dataset mapping, are there tools available/planned to facilitate mapping to DwC-DP, not just at the field level (as in the IPT) but also for selecting and structuring tables?
A: Yes, the goal is to move beyond manual field-level mapping in the IPT. Future tools are expected to assist users more holistically with the selection and structuring of DwC-DP tables based on use case.
There are ongoing discussions about how the IPT could evolve to support this, for example:
- Normalising flat files into multiple linked tables, and
- Automatically generating record IDs using hashes of fields.
These enhancements aim to reduce the technical burden on data providers and improve consistency in DwC-DP mapping.
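The hash-based ID idea above can be sketched as follows. This is a minimal illustration, not part of any released tool; the table and field names (eventID, locality, etc.) are assumptions chosen for the example.

```python
import hashlib

def record_id(fields: dict, keys: list) -> str:
    """Derive a stable record ID by hashing selected field values.

    The same input values always yield the same ID, so re-running
    the normalisation step does not mint new identifiers.
    """
    material = "|".join(str(fields.get(k, "")) for k in keys)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()[:16]

# A flat row as it might appear in a spreadsheet export
# (column names are hypothetical, for illustration only).
flat_row = {
    "locality": "Prydz Bay",
    "eventDate": "2006-02-14",
    "scientificName": "Electrona antarctica",
}

# Normalise the flat row into two linked tables: an Event record
# and an Occurrence record that references it via eventID.
event = {
    "eventID": record_id(flat_row, ["locality", "eventDate"]),
    "locality": flat_row["locality"],
    "eventDate": flat_row["eventDate"],
}
occurrence = {
    "occurrenceID": record_id(flat_row, ["locality", "eventDate", "scientificName"]),
    "eventID": event["eventID"],
    "scientificName": flat_row["scientificName"],
}
```

Because the IDs are derived from field values rather than generated randomly, re-publishing the same source data produces the same identifiers, which keeps linked tables consistent across exports.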
Q: Why are there multiple JSON files in the Data Package (DP)? Is it to make it easier for computers or for people?
A: This is actually an artifact of the Data Package specification, which allows for multiple local schema files or for the table descriptions to be inlined directly within the datapackage.json.
Please see this example for comparison:
- DwC-A format (inlined schema): archive.zip
- DwC-DP format (separate JSONs): dwc-dp.zip
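For illustration only, here is a minimal, hypothetical datapackage.json showing both styles the Data Package specification allows: an inlined table schema for one resource, and a separate schema file referenced by path for another. Resource and field names are invented for the example.

```json
{
  "name": "example-dwc-dp",
  "resources": [
    {
      "name": "event",
      "path": "event.csv",
      "schema": {
        "fields": [
          {"name": "eventID", "type": "string"},
          {"name": "eventDate", "type": "string"}
        ]
      }
    },
    {
      "name": "occurrence",
      "path": "occurrence.csv",
      "schema": "occurrence-schema.json"
    }
  ]
}
```

Either form is valid; packages with many tables often keep schemas in separate files to stay readable, which is why a DwC-DP can contain multiple JSON files.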
Technical Requirements & User Support
Q: What technical knowledge do people need to have to create DwC-DPs beyond what they already know? Is it only about knowing what other tables can be used, or will they also need to learn other skills such as basic programming?
A: The main challenge lies in understanding the structure and conceptual model of DwC-DP, which can be more complex than DwC-A. While the actual mechanics of creating a DwC-DP are similar to preparing a DwC-A, users will need to handle additional identifiers and relationships (i.e. more tables and JOINs).
Basic programming skills are not strictly required, but a good grasp of relational data concepts (e.g. how tables relate via IDs) will be helpful. The technical lift is more about modelling than coding.
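As a sketch of the relational concepts involved (all table contents and field names here are invented for illustration), joining an occurrence table to its event context via eventID looks like this:

```python
# Two toy tables linked by eventID (illustrative field names only).
events = [
    {"eventID": "ev-1", "locality": "Prydz Bay", "eventDate": "2006-02-14"},
    {"eventID": "ev-2", "locality": "Ross Sea", "eventDate": "2006-03-01"},
]
occurrences = [
    {"occurrenceID": "occ-1", "eventID": "ev-1", "scientificName": "Electrona antarctica"},
    {"occurrenceID": "occ-2", "eventID": "ev-1", "scientificName": "Pleuragramma antarctica"},
]

# The equivalent of an SQL inner JOIN: look up each occurrence's event
# and merge the two records into one flat row.
events_by_id = {e["eventID"]: e for e in events}
joined = [
    {**occ, **events_by_id[occ["eventID"]]}
    for occ in occurrences
    if occ["eventID"] in events_by_id
]
```

Understanding this pattern, that a shared ID column lets separate tables be recombined on demand, is essentially the "modelling" skill DwC-DP asks of data providers.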
Q: I’m enthusiastic about the Darwin Core Data Package and eager to start working with it. However, I have two main concerns. First, how can we effectively support data providers in understanding and using the model? Many of our data providers already struggle with the star schema. Will the shift to DwC-DP be too complex or unscalable for us to manage?
A: You’re asking all the right questions, and it’s great to hear your enthusiasm. You’re right, some data providers struggle with Darwin Core Archives for different reasons. For some, it’s the challenge of understanding how to join tables. For others, it’s the need to flatten data unnaturally, which can be frustrating and error-prone.
Darwin Core Data Packages can actually help address the latter issue by allowing a more natural relational structure. However, to support broader adoption, we will absolutely need tools to help data providers transform and structure their data. Expect developments in tooling, such as features in IPT or perhaps even the use of large language models (LLMs) to suggest data mappings – this is not something that we have explored yet. We also foresee the need for tools to help generate stable identifiers, which are often required across tables.
At least in the beginning, DwC-DP may be better suited to engineering teams or organisations that can script their exports, like ICES.dk. They’re already exploring how DwC-DP could be a fit for complex datasets like pollution or fish gut analysis and working with John Wieczorek on two sample datasets.
Guidance, Community Support & Implementation Strategy
Q: When and how can we develop guidance on engaging with the marine community for GBIF nodes?
A: Community guidance development is in its early stages. The goal is to gather real questions from users and build documentation that responds to practical needs. A FAQ document (led by Kate) has been initiated to collect community questions; these will inform the development of tailored guidance materials.
It is anticipated that “recipes” will be created for common or recurring use case families.
These recipes will show how to fill out tables following a standard pattern and will be shared in a GitHub examples repository to ensure datasets follow a consistent structure.
Contributions are welcome, though no formal infrastructure is in place yet. For now, GitHub issues can be used, with labels to help categorise submissions (e.g. recipe, guidance, overview). Contributors are encouraged to proceed with whatever communication method is most useful to them at this stage. The process will evolve as the community grows.
It would be useful for OBIS to identify where marine use cases overlap with this usage guidance and to collaborate on the effort.
Q: Regarding the GBIF DwC implementation transparency: Is or will the GBIF implementation of the DwC-DP be publicly available so others can adapt it similarly?
A: Yes, the intention is for the Darwin Core Data Package (DwC-DP) to be a TDWG standard, not specific to GBIF. A public review is expected to open in September 2025, with potential ratification in early 2026. Reference implementations, such as in Java, are anticipated and will be openly accessible. Currently, there are already examples available:
- A test implementation in the metabarcoding tool: MDT example dataset (click “files available”).
- A working IPT prototype: dwcdp-ipt.gbif-test.org
Note: The current user experience may be cumbersome, as data must be pre-structured appropriately. There is a recognised need to create tooling to help users reshape data more easily.
Q: When is a good time to start thinking about how OBIS should index the tables?
A: Thinking can start now. GBIF has not yet implemented anything. Current efforts at GBIF (with John and others) are focused on structuring datasets that are currently difficult to mobilise, especially those rich in contextual information but not easily translated into DwC-A.
A major concern from data contributors is that publishing data in a DwC-A can lead to a loss of the essence of the study, making the resulting DOI unsuitable as supplementary material in scientific publications.
To address this, there is an emphasis on publishing richer datasets that preserve scientific value and are worthy of citation via DOI. GBIF plans to enhance dataset discovery, allowing searches based on sampling protocols, types of measurements, and contextual metadata. The traditional focus on occurrences may no longer apply in cases involving specimens, material samples, or sequence-based data (e.g., barcoding, eDNA). GBIF and OBIS may need to infer and materialise occurrences from such data and determine the appropriate approach for doing so.
The influx of information via multiple data routes (natural history collections, INSDC, BOLD, scientific literature, etc.) cannot be fully controlled; hence, clustering will be important to identify related records and represent a single occurrence in nature that is supported by multiple linked data elements (specimens, sequences, citations). This direction will likely require changes in how occurrence stores are managed and how event information is integrated.
A material catalogue may also be necessary to support user needs for locating and revisiting physical material for reanalysis or sequencing.
It is a timely opportunity for OBIS to start planning how to index such tables and collaborate at the engineering level as well as on the data publishing and network capacity level.
Data Interpretation & Portals
Q: The Occurrence concept is central to interacting with the GBIF data system, but the DP allows modelling “occurrences” using just classes such as Material/NucleotideSequence and Identification. How do you envision users interacting (querying, downloading) with the system?
A: Great question, and the answer is: it will depend. The plan is to infer occurrences by linking them to their underlying records of evidence (e.g. multiple specimen parts, sequenced material, etc.). The Occurrence catalogue will remain in place, as users still need event context for Occurrences. We also anticipate offering Material-related services. Additionally, GBIF aims to enhance dataset discovery, allowing users to search for datasets based on taxa, measurements, and other attributes (“I’m looking for datasets that cover these taxa, that have etc”). In many cases, users may directly download the well-structured source data.
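As a loose sketch of what "inferring occurrences from underlying evidence" could mean in practice (the record shapes and the occurrenceHint field are invented for illustration; this is not GBIF's actual pipeline), evidence records from different tables can be grouped so that each group is materialised as one inferred occurrence:

```python
# Hypothetical evidence records; in DwC-DP these would live in separate
# tables (e.g. Material, NucleotideSequence), linked by identifiers.
evidence = [
    {"type": "Material", "materialID": "m-1", "occurrenceHint": "occ-A"},
    {"type": "NucleotideSequence", "sequenceID": "s-1", "occurrenceHint": "occ-A"},
    {"type": "Material", "materialID": "m-2", "occurrenceHint": "occ-B"},
]

# Group evidence so that each group yields one inferred occurrence
# backed by all of its supporting records.
inferred = {}
for rec in evidence:
    inferred.setdefault(rec["occurrenceHint"], []).append(rec)
```

Here "occ-A" would surface as a single occurrence supported by both a specimen and a sequence, which is the kind of linkage the answer above describes.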
Q: My second concern is about data portals. How will portals like OBIS be able to read and present complex relational data from DwC-DP in a user-friendly, filterable way?
A: That’s a critical question. While you may not need to implement this yourself, it’s good to be informed. Translating relational data structures into simple, filterable interfaces will require additional tooling and potentially design adjustments in portals. This will be a collaborative effort between data publishers, portal developers, and community stakeholders to ensure usability doesn’t suffer with increased data richness.
Q: What about controlled vocabularies? Will there be more of them in the future to enable better filtering and discovery in data portals?
A: Absolutely. Controlled vocabularies are essential for building portals with meaningful filters. As we move forward with DwC-DP, expanding and integrating vocabularies will be an important step to ensure richer and more standardised metadata across datasets.