The Specify 6 Paleontological Data Model
Specify Software Project Staff
31 March 2009
As part of the design of Specify 6, we sought to improve Specify's handling of paleontological data. The information model for Specify 5.x for Paleo collections was incomplete, and we had received useful feedback about additional requirements that paleo collections researchers needed us to accommodate. Of central importance is the modeling of stratigraphic information, in particular biostratigraphy, chronostratigraphy, and lithostratigraphy. Paleo collections researchers conceptualize stratigraphy as another index upon which to query and sort holdings. This dimension is complex and dynamic. Stratigraphies, like taxonomic classifications, are composed of tree-structured data, and like biological taxonomies, stratigraphic classifications vary through time. Various authorities update reference standards on regular or irregular schedules. Stratigraphic usage also varies through space, with different classifications preferred by different countries.
The discussion below traces the path we took to derive a new data model for handling paleontological localities and associated stratigraphic data. In our journey, accompanied by colleagues, we noted that Paleo Localities are not the same class of concept as Collection Localities for neontological collections; in terms of data they are not merely points or polygons with x,y coordinate descriptors describing a space where something was found. Localities in paleontology are defined in spatial and temporal dimensions, with additional elements of ordinal position and rank (stratigraphies).
In short, in Specify 6 we model Paleo Context (and stratigraphic information) as direct relations of Collection Objects (Figure 8). In Specify 5, we considered paleo locality data to be indirectly related to Collection Objects, through a link to Collecting Event or Collection Locality. But that approach does not allow unambiguous assignment of the paleo location attributes to individual Collection Objects, given our existing formalizations for neontological Collecting Events and Collection Localities.
Paleontology Collection Data Model Requirements
The following discussion is a walk through the data modeling issues for paleontological 'locality' data, which led us to adopt a new arrangement of data table relationships. We begin by discussing the limitations of the Specify 5.x model.
In Specify 5, our Paleo Collection model consists largely of Collection Object, Collecting Event, (Collecting) Locality, Stratigraphy, and Geologic Time Period data objects (see Figure 1). Here is a summary of those data objects:
Data Object | Description
Collection Object | Represents a collected specimen or core. Collection Object is related to Collecting Event as Many:One.
Collecting Event | Represents a person or group performing a collecting 'action', such as taking a core sample or digging and retrieving a specimen. A Collecting Event refers to a Locality and a Stratigraphy.
Locality | Represents a place, a description of the Locality and an optional georeference (typically lat/long).
Stratigraphy | Represents a Lithostratigraphy. Contains fields: superGroup, lithoGroup, formation, member, bed. Refers to a Geologic Time Period.
Geologic Time Period | Represents time (ChronoStratigraphy) with these fields: rankId, name, fullName, standard, startPeriod, startUncertainty, endPeriod, endUncertainty.
Figure 1 illustrates current Specify 5.x relationships among Collection Object, Collecting Event, Locality, Stratigraphy and Geologic Time Period. But the Specify 5 data model is insufficient to capture basic locality data in some collecting scenarios. For example, Specify 5 doesn't support the information associated with core sampling, where a collector would need to capture information delimiting multiple discrete periods of geological time for one surface Locality and Collecting Event. The Specify 5 model also binds LithoStratigraphy (Stratigraphy) to the ChronoStratigraphy (Geologic Time Period), which is an oversimplification.
Figure 1. Specify 5.x Data Model
Specify 5.x considers Locality to be defined as a modern day place on the earth where something biological was collected or observed; it is usually characterized by latitude and longitude, or UTM coordinates. Specify 5 also accommodates depth and height as a third dimension, but for most spatial queries, mapping and visualizations today, the Locality of a specimen is characterized only by its x,y coordinates on the earth's surface. The 5.x Locality is just that and only that--the place where a collection was taken during a Collecting Event.
In words: one or more collection objects can be taken during a Collecting Event from a single Locality. In Specify 5 usage, if the Locality changes, by definition a new Collecting Event record is created. A Collecting Event can have only one Locality in the database. This spatial model has proven itself sufficient for several years for neontological biology collections. For disciplines like botany, which do not recognize Collecting Events as an explicit concept (every collection object essentially has its own locality), we simply minimize Collecting Event from the Specify user interface, but the relationship is maintained in our model without the knowledge of the user.
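The two Many:One links described above can be sketched in a few lines of Python. This is only an illustration of the cardinalities: the class and field names and the sample values are hypothetical and are not taken from the actual Specify 5 schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, simplified classes that mirror only the cardinalities
# described in the text; they are not the real Specify 5 tables.

@dataclass
class Locality:
    name: str
    latitude: Optional[float] = None   # optional georeference
    longitude: Optional[float] = None

@dataclass
class CollectingEvent:
    locality: Locality                 # exactly one Locality per event (M:1)

@dataclass
class CollectionObject:
    catalog_number: str
    collecting_event: CollectingEvent  # exactly one event per object (M:1)

# Made-up sample data: three objects taken during one event at one place.
site = Locality("Site A")
event = CollectingEvent(locality=site)
specimens = [CollectionObject(f"CO-{n}", event) for n in (1, 2, 3)]

# Every object from the event resolves to the same single Locality.
shared = {id(s.collecting_event.locality) for s in specimens}
print(len(shared))
```

Because each object reaches its Locality only through its Collecting Event, a change of place has to be represented by a new Collecting Event, as the paragraph above notes.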
Paleontological collections, however, have additional 'localization' needs. Paleo researchers need to know not only where the Collection Object came from, but also the paleo placement of that Locality. We had been informally calling that the Paleo Locality, but that is a misleading term, so we have adopted "Paleo Context" as a concept and term from Paul Morris of Harvard University.
Here is an example of a Paleo spatial data requirement, using a vertical core scenario. If you take a core sample through several strata, there is one set of modern lat/long values for everything related to the whole core, but the offset from the surface is different for each slice, as is the geological context for each slice. If we try to put all of that information into the Specify 5 Locality concept (for example by relating Chronostratigraphy one:one to Locality), then Locality becomes a broader and more complex concept. It is no longer just the place on earth where the Collection Objects were taken and where a Collecting Event occurred, but is now also defined partially as the Paleo Context (PC) (or geological time context) for that Collecting Event. Now we have an implementation problem.
The conflict with the existing Specify data model is with the cardinality of Collecting Event to Locality, which as we saw is many-to-one. The conflict arises because a Collecting Event by definition can have only one Locality, but now a single Collecting Event (like a core) can have two or more Paleo Contexts (as in our core example), so the effective relationship would look like Figure 2:
Figure 2. Linking Paleo Context to Locality
We add the relationship between Locality and Paleo Context as M:1, because by definition a Locality can only have one Paleo Context (a core slice), and Paleo Context may have one or more Localities in the database, i.e. a collection may have multiple Localities from the Cretaceous.
But that does not work, because the 5.x model does not accommodate multiple Localities for a Collecting Event. And we need that if we are going to define Locality with Paleo Context attributes. Each slice of the core would have a different Paleo Context and a correspondingly different Locality, as Locality and Paleo Context are related M:1. This model does not work.
This problem can be solved if we change the Specify 5 model to make the relationship between Collecting Event and Locality to be Many:Many as in Figure 3.
Figure 3. Adding a M:M relationship between Collecting Event and Locality
Doing that would mean that Specify databases for all disciplines would have an additional join table between Collecting Event and Locality. The relationship between the Collecting Event table and the Locality table is at the core of most database transactions within Specify. Most database queries use this relationship, and changing it from M:1 to M:M adds another 'join' between the tables, and another table as a 'join table' to make the M:M relationship work. Adding a join table will have significant negative performance consequences for all Specify users for most operations, but that's not the most serious problem. The biggest problem is that there is now no way with a set operation to determine which Collection Object came from which Paleo Context for a particular Collecting Event. The M:1 between Collection Object and Collecting Event means that there is only one Collecting Event record for all of the Collection Objects from that event. So there is no way to point individual Collection Objects to particular Locality records (and thus particular Paleo Context records), and Collection Objects from a particular slice of the core cannot be linked to a particular Paleo Context from that core. This solution solves one problem, but creates a performance issue and another modeling problem, which makes it unworkable.
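The ambiguity can be reproduced with a small in-memory SQLite database. The sketch below is an assumption-laden illustration, not the Specify schema: the table names, column names and sample rows are all hypothetical, and the `event_locality` table stands in for the M:M join table discussed above.

```python
import sqlite3

# Hypothetical schema: Collecting Event <-> Locality is M:M through
# event_locality, and Locality -> Paleo Context is M:1.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE paleo_context (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE locality (
    id INTEGER PRIMARY KEY, name TEXT,
    paleo_context_id INTEGER REFERENCES paleo_context(id));
CREATE TABLE collecting_event (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE event_locality (
    event_id INTEGER REFERENCES collecting_event(id),
    locality_id INTEGER REFERENCES locality(id));
CREATE TABLE collection_object (
    id INTEGER PRIMARY KEY, catalog TEXT,
    event_id INTEGER REFERENCES collecting_event(id));

-- Made-up data: one core, two slices, one object taken from each slice.
INSERT INTO paleo_context VALUES (1, 'upper slice'), (2, 'lower slice');
INSERT INTO locality VALUES (1, 'core site, upper', 1),
                            (2, 'core site, lower', 2);
INSERT INTO collecting_event VALUES (1, 'core extraction');
INSERT INTO event_locality VALUES (1, 1), (1, 2);
INSERT INTO collection_object VALUES (1, 'CO-1', 1), (2, 'CO-2', 1);
""")

# Walk from each Collection Object to its Paleo Context via the event.
rows = con.execute("""
SELECT co.catalog, pc.name
FROM collection_object co
JOIN event_locality el ON el.event_id = co.event_id
JOIN locality l ON l.id = el.locality_id
JOIN paleo_context pc ON pc.id = l.paleo_context_id
ORDER BY co.catalog, pc.name
""").fetchall()

for row in rows:
    print(row)
# ('CO-1', 'lower slice')
# ('CO-1', 'upper slice')
# ('CO-2', 'lower slice')
# ('CO-2', 'upper slice')
```

Each object is paired with every Paleo Context of its event, so nothing in the data records which slice an object actually came from.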
One way to fix the modeling problem, as shown in Figure 4, would be to make the relationship between Collection Object and Collecting Event Many:Many.
Figure 4. Adding another M:M
But then we need to add another join table between Collection Object and Collecting Event. Empirically, when we tested this on a neontological (Entomology) database, performance degraded by more than 30% compared with keeping Collection Object:Collecting Event as M:1. To accommodate this model, we would make all disciplines pay a steep performance price on what is probably the most important relationship in the database in terms of core functions.
Another way to solve this problem would be to make a new relationship (Figure 5) between Collection Object and Paleo Context, so that each Collection Object would have a Paleo Context explicitly assigned as a M:1 relationship. A collection object would have one Paleo Context, but a particular Paleo Context could describe multiple Collection Objects in the collection. With Paleo Context still linked to Locality as well, this presents many challenges when thinking about the query execution paths and joins that would be needed to put a UI on that. It would also raise performance issues again.
Figure 5. Adding a Relationship between Collection Object and Paleo Context
All of the above modeling option considerations brought us back to the starting point--our base information model. We asked: Is Paleo Context actually related to Specify's concept of (Collecting) Locality? Is it a type of Locality? A refinement of it? An attribute of Locality?
For the following reasons, we decided that Paleo Context is not directly related to Locality.
Paleo Context has attributes that are unique to Paleo collections; it incorporates the time dimension as part of its definition of space. A Paleo Locality in Specify 5.x semantics is not simply the x,y coordinate on the surface of the earth where the specimen came from. It is defined by the geological context, which may be time based, or biologically based, or geologically based, or some or all of those strata, or even others (Chemostrata, etc.). Labeling that concept "Paleo Context" helps to disambiguate it from "Locality" when thinking about Paleo Collecting Events and Localities and the geological dimension.
If we don't attach the Paleo Context to Locality in the Specify model, then where do we put it? As a table linked to Collecting Event or to Collection Object? (or possibly as attributes of Collecting Event or Collection Object?)
The case for linking Paleo Context to Collecting Event (Figure 6):
Figure 6. Linking Paleo Context to Collecting Event
Pro: This fits the science conceptualization that a Collecting Event can generate multiple collection objects all from the same Paleo Context. And that a particular Paleo Context could have multiple Collecting Events (in the collection).
Con: This breaks the core sampling use case, where one has one Collecting Event producing many Collection Objects (slices of the core or fossils from different slices of the core), and each Collection Object has its own Paleo Context. Each Collecting Event will have multiple Paleo Contexts. But we cannot support that with this model, as Collecting Event:Paleo Context is M:1. If we left it like this, in practice, a new Collecting Event record would need to be created for each unique combination of Collection Object and Paleo Context, which redefines Collecting Event.
We could solve that problem by making Collecting Event:Paleo Context a M:M relationship, but then we would need to redefine Collecting Event, not as one event of extracting the entire core, but as the process of taking each slice. Collecting Event would become a join table for Paleo Context and Collection Object.
The case for linking Paleo Context to Collection Object (Figure 7):
Figure 7. Linking Paleo Context to Collection Object
Pro: Core sampling use case works. Each Collection Object from a core slice, or from a quarry dig, has its own Paleo Context, allowing for one Collecting Event and one Locality for the entire core or quarry dig. We went to Jones quarry, we found three strata, we collected two fossils from each of the strata, resulting in: six collection objects, one Locality, one collecting event, and three paleo contexts.
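The Jones quarry example can be written out as a small SQLite sketch. The table and column names, the stratum labels and the catalog numbers are hypothetical; only the record counts come from the example above. The nullable `paleo_context_id` column is likewise an assumption, made so that non-Paleo objects could simply leave it empty.

```python
import sqlite3

# Hypothetical schema: Paleo Context hangs directly off Collection Object
# (M:1), while Collecting Event -> Locality stays M:1.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE locality (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE collecting_event (
    id INTEGER PRIMARY KEY,
    locality_id INTEGER NOT NULL REFERENCES locality(id));
CREATE TABLE paleo_context (id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE collection_object (
    id INTEGER PRIMARY KEY, catalog TEXT,
    collecting_event_id INTEGER NOT NULL REFERENCES collecting_event(id),
    paleo_context_id INTEGER REFERENCES paleo_context(id));

INSERT INTO locality VALUES (1, 'Jones quarry');
INSERT INTO collecting_event VALUES (1, 1);
INSERT INTO paleo_context VALUES
    (1, 'stratum A'), (2, 'stratum B'), (3, 'stratum C');
""")

# Two fossils from each of the three strata, all from the same event.
fossils = [(f"CO-{i}", 1, (i - 1) // 2 + 1) for i in range(1, 7)]
con.executemany(
    "INSERT INTO collection_object "
    "(catalog, collecting_event_id, paleo_context_id) VALUES (?, ?, ?)",
    fossils,
)

tables = ("collection_object", "locality", "collecting_event", "paleo_context")
counts = {
    t: con.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in tables
}
print(counts)

# Each object now resolves to exactly one Paleo Context.
per_context = con.execute("""
SELECT pc.label, COUNT(*)
FROM collection_object co
JOIN paleo_context pc ON pc.id = co.paleo_context_id
GROUP BY pc.label
ORDER BY pc.label
""").fetchall()
print(per_context)
```

The counts come out as six collection objects, one Locality, one collecting event and three paleo contexts, and the grouped query assigns two objects to each stratum without any join table.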
One could make the case that the Paleo Context data are simply attributes of Collection Objects and forget about Paleo Context as a table. This would be a denormalized model, as a Paleo Context could have multiple Collection Objects, and Paleo Context data would likely be unnecessarily duplicated. Also, all neontological Specify collections would have those fields in the Collection Object table, which would carry some performance hit, but more importantly the fields are not useful for those collections. For those reasons, a separate table seems optimal.
Con: We can't think of any. Paleo Context is conceptualized as a time (ChronoStrat), substrate (LithoStrat) or biological association (BioStrat) dimension, and not a "space" dimension. All types of stratigraphy are relative indexes of time or of association. ChronoStrat is obvious. LithoStrat does not describe where the stratum physically lies relative to the surface, as depth and canopy height would; those are spatial dimensions. Instead, Litho is a 'space' defined by a geological order dimension. A stratum does occupy space at each site, but (e.g.) the descriptor 'Devonian shale' characterizes a time dimension and not a physical space dimension, although the stratum does have spatial attributes. In other words, "Devonian shale" is not a property value of a spatial dimension. The same is true of Biostratigraphy, which deals with biological associations, not Localities in a spatial sense; e.g. a biostrat labeled 'Trilobite level' is not describing a spatial dimension.
In conclusion, we decided that Paleo Context does not have a direct relationship with Locality in Specify's traditional definition of a (collecting) Locality as a place on earth where something was collected. Paleo Context and Locality are two very different things. If we try to keep them artificially directly linked, major data model problems emerge. By linking Paleo Context as a separate table to Collection Object, every Paleo field collection use case we know of can be supported, and only Paleo database users pay the price for any performance hit, as there would be no costly additional M:M join tables among Collection Object, Collecting Event and Locality.
There is an interesting use case where the spatial description of the collecting Locality becomes congruent with the Paleo Context. When researchers identify the original location of a paleo organism on the earth's surface based on the paleo period during which the organism lived, by modelling the movement of land masses, the locality for a Paleo Collection Object merges with its Paleo Context. Simulation data from those projections of the global position of localities in geological time are not currently stored in our schema, but they could be added if it becomes useful for paleontological collections to cache modeled, georeferenced localities of the past.
Figure 8 shows the Specify 6 data model with the new and rearranged data objects.
Figure 8. Specify 6 Paleo Context and Stratigraphy Data Model
We added a 'parent pointer' to Locality. This provides a way to sub-divide Localities with little overhead or impact on other disciplines. The introduction of a recursive parent/child pointer enables Locality/sub-Locality relationships and allows needed flexibility for other field collection localization methods. For example, with sub-Localities we can accommodate precise locality descriptions from points along transects, or from field site grids and subgrid units.
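A recursive parent pointer of this kind can be sketched as follows. The `Locality` class, the `path` helper and the grid names are hypothetical illustrations, not the Specify 6 implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Locality:
    """Hypothetical Locality with a recursive parent pointer."""
    name: str
    parent: Optional["Locality"] = None  # None marks a top-level Locality

def path(locality: Locality) -> List[str]:
    """Return the names from the top-level Locality down to this one."""
    names = []
    node: Optional[Locality] = locality
    while node is not None:
        names.append(node.name)
        node = node.parent
    names.reverse()
    return names

# Made-up field site grid with a grid cell and a subgrid unit.
site = Locality("Field site grid")
cell = Locality("Grid cell B3", parent=site)
subunit = Locality("Subgrid unit B3-2", parent=cell)

print(path(subunit))
# ['Field site grid', 'Grid cell B3', 'Subgrid unit B3-2']
```

A Locality with no parent behaves exactly as before, which is why disciplines that do not use sub-Localities are largely unaffected.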
Most non-Paleo collections take a Collection Object-centric view of their holdings data. That perspective can be limiting for Paleo collections, many of which prefer a Collecting Event or Stratigraphy view of their collection information. Figure 9 shows a Collecting Event view of the new data model:
Figure 9. Prototype Data Form for Paleo Collecting Events
In this form, it would be possible to remove the Determination information from the form and display/edit it by launching a popup dialog form via a button. The "1 of 1" record controller at the bottom of the form is for creating and controlling the Collecting Events.
We would like to thank Roger Burkhalter, University of Oklahoma; Paul Morris, Harvard University; Jill Krebs, Rudy Serbet, Jill Hardesty, John Counts and Bruce Lieberman, all of the University of Kansas, for their stimulating intellectual discussion and suggestions on this topic.