Part 2: Parsing Heterogeneous Metadata: A Case Study in Multi-Format ETL
The UK Centre for Ecology & Hydrology publishes 210 environmental datasets. Each dataset has metadata available in four different formats:
- JSON — A custom API format
- ISO 19115 XML — The international standard for geospatial metadata (UK GEMINI profile)
- RDF Turtle — Linked data format
- Schema.org JSON-LD — For search engine optimization
Same data. Four representations. I needed to extract it all into a unified data model.
This is the story of building that ETL pipeline, the design patterns that saved me, the edge cases that nearly broke me, and the normalization strategies that made it work.
The Problem: Four Formats, One Truth
Here’s what the same author looks like across formats:
JSON:
{
  "responsibleParties": [{
    "fullName": "Morton, R.D.",
    "role": "author",
    "nameIdentifier": "https://orcid.org/0000-0003-3947-6463",
    "organisationName": "UK Centre for Ecology & Hydrology"
  }]
}
ISO 19115 XML:
<gmd:CI_ResponsibleParty>
  <gmd:individualName>
    <gco:CharacterString>Morton, R.D.</gco:CharacterString>
  </gmd:individualName>
  <gmd:organisationName>
    <gco:CharacterString>UK Centre for Ecology &amp; Hydrology</gco:CharacterString>
  </gmd:organisationName>
  <gmd:role>
    <gmd:CI_RoleCode codeListValue="author"/>
  </gmd:role>
</gmd:CI_ResponsibleParty>
Schema.org:
{
  "@type": "Dataset",
  "creator": {
    "@type": "Person",
    "name": "Morton, R.D.",
    "affiliation": {
      "@type": "Organization",
      "name": "UK Centre for Ecology & Hydrology"
    }
  }
}
The ORCID is in the JSON but not the XML. The role is in both but expressed differently. The name format is inconsistent (Morton, R.D. vs a structured first/last name).
Every format has different fields, different structures, different ways of being incomplete.
The Solution: A Unified Data Model
First, define what you want to end up with:
import re
from typing import Any, Optional

from pydantic import BaseModel, field_validator


class AuthorInfo(BaseModel):
    """Unified author representation."""

    full_name: str
    given_name: Optional[str] = None
    family_name: Optional[str] = None
    orcid: Optional[str] = None
    email: Optional[str] = None
    role: Optional[str] = None
    organisation: Optional[str] = None

    @field_validator('orcid', mode='before')
    @classmethod
    def normalize_orcid(cls, v: Any) -> Optional[str]:
        """Extract and validate an ORCID from various formats."""
        if not v:
            return None
        v = str(v).strip()
        # Handle full URLs:
        # "https://orcid.org/0000-0003-3947-6463" -> "0000-0003-3947-6463"
        if 'orcid.org/' in v:
            v = v.split('orcid.org/')[-1].strip('/')
        # Validate format: 0000-0000-0000-000X
        if re.match(r'^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$', v):
            return v
        return None
The normalize_orcid validator is crucial. The JSON API gives us full URLs (https://orcid.org/0000-...). Some sources give just the ID. Some give garbage. The validator handles all cases and rejects invalid formats.
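A quick sanity check of the validator, using the AuthorInfo model above (the input values are illustrative):

# Full URL from the JSON API is stored as the bare identifier
a = AuthorInfo(full_name="Morton, R.D.",
               orcid="https://orcid.org/0000-0003-3947-6463")
assert a.orcid == "0000-0003-3947-6463"

# Garbage values are rejected rather than stored
b = AuthorInfo(full_name="Morton, R.D.", orcid="not-an-orcid")
assert b.orcid is None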
Similarly for the complete dataset:
class DatasetMetadata(BaseModel):
    """Unified dataset metadata model."""

    identifier: str
    title: str
    description: Optional[str] = None
    abstract: Optional[str] = None
    lineage: Optional[str] = None
    publication_date: Optional[date] = None
    metadata_date: Optional[date] = None
    authors: List[AuthorInfo] = Field(default_factory=list)
    organisations: List[OrganisationInfo] = Field(default_factory=list)
    keywords: List[KeywordInfo] = Field(default_factory=list)
    spatial_extent: Optional[SpatialExtent] = None
    temporal_extent: Optional[TemporalExtent] = None
    doi: Optional[str] = None
    uri: Optional[str] = None
    raw_json: Optional[str] = None
    raw_xml: Optional[str] = None
This is format-agnostic. A parser’s job is to fill this model from whatever source format it understands.
The Template Method Pattern
With four formats, we don’t want four completely independent parsers. They share structure: read input, extract fields, handle errors, produce output.
The Template Method pattern defines the skeleton:
class MetadataParser(ABC):
    """Abstract base class for metadata parsers."""

    def __init__(self, strict: bool = False):
        self.strict = strict
        self.errors: List[str] = []
        self.warnings: List[str] = []

    def parse(self, content: str, identifier: str) -> DatasetMetadata:
        """
        Template method defining the parsing workflow.
        Subclasses implement the abstract methods.
        """
        self.errors = []
        self.warnings = []
        try:
            # Step 1: Parse the raw content into a workable structure
            parsed = self._parse_content(content)

            # Step 2: Extract and validate each field
            metadata = DatasetMetadata(
                identifier=identifier,
                title=self._extract_title(parsed) or "Untitled",
                description=self._extract_description(parsed),
                abstract=self._extract_abstract(parsed),
                lineage=self._extract_lineage(parsed),
                publication_date=self._extract_publication_date(parsed),
                metadata_date=self._extract_metadata_date(parsed),
                authors=self._extract_authors(parsed),
                organisations=self._extract_organisations(parsed),
                keywords=self._extract_keywords(parsed),
                spatial_extent=self._extract_spatial_extent(parsed),
                temporal_extent=self._extract_temporal_extent(parsed),
                doi=self._extract_doi(parsed),
                uri=self._extract_uri(parsed),
            )

            # Step 3: Post-processing and deduplication
            metadata = self._post_process(metadata)
            return metadata
        except Exception as e:
            if self.strict:
                raise
            self.errors.append(f"Parse error: {e}")
            return DatasetMetadata(identifier=identifier, title="Parse Error")

    @abstractmethod
    def _parse_content(self, content: str) -> Any:
        """Parse raw content into a format-specific structure."""
        pass

    @abstractmethod
    def _extract_title(self, parsed: Any) -> Optional[str]:
        """Extract the title from parsed content."""
        pass

    # ... more abstract methods for each field
Each format implements only the extraction logic:
class JSONMetadataParser(MetadataParser):
    def _parse_content(self, content: str) -> dict:
        return json.loads(content)

    def _extract_title(self, parsed: dict) -> Optional[str]:
        return parsed.get("title")


class XMLMetadataParser(MetadataParser):
    NAMESPACES = {
        'gmd': 'http://www.isotc211.org/2005/gmd',
        'gco': 'http://www.isotc211.org/2005/gco',
        'gml': 'http://www.opengis.net/gml/3.2',
    }

    def _parse_content(self, content: str) -> etree._Element:
        return etree.fromstring(content.encode())

    def _extract_title(self, parsed: etree._Element) -> Optional[str]:
        xpath = './/gmd:identificationInfo//gmd:citation//gmd:title/gco:CharacterString'
        element = parsed.find(xpath, self.NAMESPACES)
        return element.text if element is not None else None
The pattern pays off immediately: validation logic, error handling, and post-processing are written once. Each new format only needs to implement the extraction methods.
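Dispatching to the right parser then becomes a one-liner. This is a hypothetical helper to illustrate the idea; the format keys and function name are my own, not part of the project's API:

PARSERS = {
    "json": JSONMetadataParser,
    "xml": XMLMetadataParser,
}


def parse_record(fmt: str, content: str, identifier: str) -> DatasetMetadata:
    """Pick the parser for a format key and run the shared template workflow."""
    parser = PARSERS[fmt](strict=False)
    return parser.parse(content, identifier)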
XPath Nightmares: Navigating ISO 19115
ISO 19115 is verbose. Extracting a title requires navigating five levels of nesting:
MD_Metadata
└── identificationInfo
    └── MD_DataIdentification
        └── citation
            └── CI_Citation
                └── title
                    └── CharacterString
The XPath: .//gmd:identificationInfo//gmd:citation//gmd:title/gco:CharacterString
And that’s a simple field. Here’s geographic extent:
def _extract_spatial_extent(self, parsed: etree._Element) -> Optional[SpatialExtent]:
    """Extract the bounding box from EX_GeographicBoundingBox."""
    bbox_path = (
        './/gmd:identificationInfo'
        '//gmd:extent'
        '//gmd:EX_Extent'
        '//gmd:geographicElement'
        '//gmd:EX_GeographicBoundingBox'
    )
    bbox = parsed.find(bbox_path, self.NAMESPACES)
    if bbox is None:
        return None

    def get_coord(name: str) -> Optional[float]:
        el = bbox.find(f'gmd:{name}/gco:Decimal', self.NAMESPACES)
        if el is not None and el.text:
            try:
                return float(el.text)
            except ValueError:
                return None
        return None

    west = get_coord('westBoundLongitude')
    east = get_coord('eastBoundLongitude')
    south = get_coord('southBoundLatitude')
    north = get_coord('northBoundLatitude')

    if all(v is not None for v in [west, east, south, north]):
        return SpatialExtent(
            west_bound=west,
            east_bound=east,
            south_bound=south,
            north_bound=north,
        )
    return None
Tips for surviving XML namespaces:
- Define namespaces once in a class constant
- Use // liberally to skip intermediate elements we don't care about
- Always check for None — elements might not exist
- Wrap coordinate parsing in try/except — text might not be valid floats; a small helper that bundles these checks is sketched below
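To avoid repeating the find/namespace/None dance in every extractor, the checks can live in one place. A minimal sketch of such a helper on XMLMetadataParser; the method name _find_text is mine, not part of the original parser:

class XMLMetadataParser(MetadataParser):
    # ... NAMESPACES and _parse_content as above ...

    def _find_text(self, parsed: etree._Element, path: str) -> Optional[str]:
        """Find an element by namespaced path and return its stripped text, or None."""
        element = parsed.find(path, self.NAMESPACES)
        if element is not None and element.text:
            return element.text.strip()
        return None

    def _extract_title(self, parsed: etree._Element) -> Optional[str]:
        return self._find_text(
            parsed,
            './/gmd:identificationInfo//gmd:citation//gmd:title/gco:CharacterString',
        )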
Date Parsing: A Special Circle of Hell
Dates come in many formats:
# From JSON
"2017-01-01"
"2017-01"
"2017"
# From XML
"2017-01-01T00:00:00"
"2017-01-01T00:00:00Z"
"2017-01-01T00:00:00+00:00"
# From Schema.org
"2017"
A robust parser:
def parse_date(value: Any) -> Optional[date]:
    """Parse a date from various string formats."""
    if value is None:
        return None
    # datetime is a subclass of date, so check it first
    if isinstance(value, datetime):
        return value.date()
    if isinstance(value, date):
        return value

    s = str(value).strip()
    if not s:
        return None

    # Try formats from most to least specific
    formats = [
        '%Y-%m-%dT%H:%M:%S.%fZ',
        '%Y-%m-%dT%H:%M:%SZ',
        '%Y-%m-%dT%H:%M:%S%z',
        '%Y-%m-%dT%H:%M:%S',
        '%Y-%m-%d',
        '%Y-%m',
        '%Y',
    ]
    for fmt in formats:
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue

    # Last resort: dateutil
    try:
        from dateutil.parser import parse as dateutil_parse
        return dateutil_parse(s).date()
    except Exception:
        return None
Coverage matters as much as order. strptime only accepts an exact match, so without the %Y-%m-%dT%H:%M:%S.%fZ variant in the list, a timestamp like "2017-01-01T00:00:00.000Z" matches nothing and falls all the way through to the dateutil fallback.
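A few examples of what the function returns, using parse_date as defined above (the inputs are illustrative):

assert parse_date("2017-01-01") == date(2017, 1, 1)
assert parse_date("2017-01") == date(2017, 1, 1)    # month precision -> first of the month
assert parse_date("2017") == date(2017, 1, 1)       # year precision -> 1 January
assert parse_date("2017-01-01T00:00:00.000Z") == date(2017, 1, 1)
assert parse_date("not a date") is None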
Deduplication: The Same Author, Three Times
A dataset might list the same person as author, point of contact, and data custodian. The JSON format includes them all in responsibleParties. Naive extraction creates three AuthorInfo objects for one person.
Deduplication by ORCID (when available) or by normalized name:
def _deduplicate_authors(self, authors: List[AuthorInfo]) -> List[AuthorInfo]:
    """Remove duplicate authors, preferring entries with more data."""
    seen_orcids: Dict[str, AuthorInfo] = {}
    seen_names: Dict[str, AuthorInfo] = {}

    for author in authors:
        if author.orcid:
            existing = seen_orcids.get(author.orcid)
            if existing:
                # Merge: keep the one with more fields filled
                seen_orcids[author.orcid] = self._merge_authors(existing, author)
            else:
                seen_orcids[author.orcid] = author
        else:
            # Normalize name for comparison
            normalized = self._normalize_name(author.full_name)
            existing = seen_names.get(normalized)
            if existing:
                seen_names[normalized] = self._merge_authors(existing, author)
            else:
                seen_names[normalized] = author

    # Combine, preferring ORCID-identified authors
    result = list(seen_orcids.values())
    for name, author in seen_names.items():
        # Check if this name matches an ORCID author
        if not any(self._normalize_name(a.full_name) == name for a in result):
            result.append(author)
    return result


def _normalize_name(self, name: str) -> str:
    """Normalize a name for comparison."""
    # "Morton, R.D." -> "morton rd"
    name = name.lower()
    name = re.sub(r'[^\w\s]', '', name)  # Remove punctuation
    name = ' '.join(name.split())        # Normalize whitespace
    return name
The merge logic works field by field, keeping the first non-empty value so that no data is lost:
def _merge_authors(self, a: AuthorInfo, b: AuthorInfo) -> AuthorInfo:
    """Merge two author records, preferring non-null values."""
    return AuthorInfo(
        full_name=a.full_name or b.full_name,
        given_name=a.given_name or b.given_name,
        family_name=a.family_name or b.family_name,
        orcid=a.orcid or b.orcid,
        email=a.email or b.email,
        role=a.role or b.role,  # First role wins
        organisation=a.organisation or b.organisation,
    )
Strict Mode vs. Lenient Mode
Sometimes we want to fail fast (development, testing). Sometimes we want to extract whatever’s possible (production, bulk import).
The strict parameter controls this:
class MetadataParser:
    def __init__(self, strict: bool = False):
        self.strict = strict
        self.errors = []
        self.warnings = []

    def _extract_with_fallback(
        self,
        primary_fn: Callable,
        fallback_fn: Optional[Callable] = None,
        field_name: str = "field",
    ) -> Any:
        """Extract a field with error handling and an optional fallback."""
        try:
            result = primary_fn()
            if result is not None:
                return result
        except Exception as e:
            if self.strict:
                raise
            self.warnings.append(f"Error extracting {field_name}: {e}")

        if fallback_fn:
            try:
                return fallback_fn()
            except Exception as e:
                self.warnings.append(f"Fallback failed for {field_name}: {e}")
        return None
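Extractors can then lean on the helper. Here is a hypothetical example for the DOI field; the two lookup paths (a top-level doi key and a resourceIdentifiers list with codeSpace/code entries) are assumptions for illustration, not the project's actual field names:

def _extract_doi(self, parsed: dict) -> Optional[str]:
    """Try a dedicated DOI field first, then scan resource identifiers."""
    return self._extract_with_fallback(
        primary_fn=lambda: parsed.get("doi"),
        fallback_fn=lambda: next(
            (ident.get("code")
             for ident in parsed.get("resourceIdentifiers", [])
             if "doi" in str(ident.get("codeSpace", "")).lower()),
            None,
        ),
        field_name="doi",
    )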
In production, the loop over all 210 datasets looks like this:
parser = JSONMetadataParser(strict=False)
for identifier in identifiers:
    metadata = parser.parse(content, identifier)

    if parser.errors:
        logger.error(f"{identifier}: {parser.errors}")
    if parser.warnings:
        logger.warning(f"{identifier}: {parser.warnings}")

    # Store even if there were warnings
    repository.upsert(metadata)
We get the data. We log the issues. We fix them incrementally.
The Repository Pattern: Clean Database Access
Once parsed, metadata goes into SQLite via the Repository pattern:
class DatasetRepository:
    def __init__(self, session: Session):
        self.session = session

    def upsert(self, metadata: DatasetMetadata) -> Dataset:
        """Insert or update a dataset from parsed metadata."""
        # Check for an existing record
        existing = self.get_by_identifier(metadata.identifier)
        if existing:
            dataset = existing
            dataset.title = metadata.title
            dataset.description = metadata.description
            # ... update other fields
        else:
            dataset = Dataset(
                identifier=metadata.identifier,
                title=metadata.title,
                description=metadata.description,
            )
            self.session.add(dataset)

        # Handle related entities with their own deduplication
        for author_info in metadata.authors:
            author = self._get_or_create_author(author_info)
            if author not in dataset.authors:
                dataset.authors.append(author)

        self.session.commit()
        return dataset

    def _get_or_create_author(self, info: AuthorInfo) -> Author:
        """Find an existing author by ORCID or create a new one."""
        if info.orcid:
            existing = self.session.query(Author).filter_by(orcid=info.orcid).first()
            if existing:
                return existing

        author = Author(
            full_name=info.full_name,
            given_name=info.given_name,
            family_name=info.family_name,
            orcid=info.orcid,
            email=info.email,
        )
        self.session.add(author)
        return author
The repository owns all database logic. Parsers never touch the database directly. Clean separation.
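Wiring the pieces together is then mostly session management. A minimal sketch, assuming a SQLAlchemy engine plus the Dataset and Author ORM models defined elsewhere; the database path and the records iterable are placeholders:

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine("sqlite:///metadata.db")  # placeholder path

with Session(engine) as session:
    repository = DatasetRepository(session)
    parser = JSONMetadataParser(strict=False)
    for identifier, content in records:  # placeholder: (identifier, raw JSON) pairs
        repository.upsert(parser.parse(content, identifier))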
Lessons Learned
- Design the output model first. Know what you want before you start extracting.
- Template Method is your friend. Common structure, varying implementation.
- Normalize aggressively. ORCIDs as IDs, not URLs. Names lowercased without punctuation. Dates as date objects, not strings.
- Deduplicate at every level. Authors within a dataset. Authors across datasets. Keywords.
- Strict mode for development, lenient for production. Extract what you can, log what fails.
- Store raw documents. When your parser is wrong (it will be), you can reprocess without re-fetching.
The hardest part of ETL isn’t any single format. It’s handling the inconsistencies between them while producing consistent output. These patterns make that manageable.
The final part of this series, RAG Without the LLM: Building Context-First Retrieval Pipelines, will cover how to build a retrieval pipeline that returns structured context, formatted and ready for any LLM (or no LLM at all).
Enjoyed this post? Check out more articles on my blog.