Part 2: Parsing Heterogeneous Metadata: A Case Study in Multi-Format ETL
The UK Centre for Ecology & Hydrology publishes 210 environmental datasets. Each dataset has metadata available in four different formats:
- JSON — A custom API format
- ISO 19115 XML — The international standard for geospatial metadata (UK GEMINI profile)
- RDF Turtle — Linked data format
- Schema.org JSON-LD — For search engine optimization
Same data. Four representations. I needed to extract it all into a unified data model.
This is the story of building that ETL pipeline, the design patterns that saved me, the edge cases that nearly broke me, and the normalization strategies that made it work.
The Problem: Four Formats, One Truth
Here’s what the same author looks like across formats:
JSON:
{
  "responsibleParties": [{
    "fullName": "Morton, R.D.",
    "role": "author",
    "nameIdentifier": "https://orcid.org/0000-0003-3947-6463",
    "organisationName": "UK Centre for Ecology & Hydrology"
  }]
}
ISO 19115 XML:
<gmd:CI_ResponsibleParty>
  <gmd:individualName>
    <gco:CharacterString>Morton, R.D.</gco:CharacterString>
  </gmd:individualName>
  <gmd:organisationName>
    <gco:CharacterString>UK Centre for Ecology &amp; Hydrology</gco:CharacterString>
  </gmd:organisationName>
  <gmd:role>
    <gmd:CI_RoleCode codeListValue="author"/>
  </gmd:role>
</gmd:CI_ResponsibleParty>
Schema.org:
{
  "@type": "Dataset",
  "creator": {
    "@type": "Person",
    "name": "Morton, R.D.",
    "affiliation": {
      "@type": "Organization",
      "name": "UK Centre for Ecology & Hydrology"
    }
  }
}
The ORCID is in the JSON but not the XML. The role is in both but expressed differently. The name format is inconsistent (Morton, R.D. vs a structured first/last name).
Every format has different fields, different structures, different ways of being incomplete.
The Solution: A Unified Data Model
First, define what you want to end up with:
import re
from typing import Any, Optional

from pydantic import BaseModel, field_validator


class AuthorInfo(BaseModel):
    """Unified author representation."""

    full_name: str
    given_name: Optional[str] = None
    family_name: Optional[str] = None
    orcid: Optional[str] = None
    email: Optional[str] = None
    role: Optional[str] = None
    organisation: Optional[str] = None

    @field_validator('orcid', mode='before')
    @classmethod
    def normalize_orcid(cls, v: Any) -> Optional[str]:
        """Extract and validate an ORCID from various formats."""
        if not v:
            return None
        v = str(v).strip()
        # Handle full URLs:
        # "https://orcid.org/0000-0003-3947-6463" -> "0000-0003-3947-6463"
        if 'orcid.org/' in v:
            v = v.split('orcid.org/')[-1].strip('/')
        # Validate format: 0000-0000-0000-000X
        if re.match(r'^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$', v):
            return v
        return None
The normalize_orcid validator is crucial. The JSON API gives us full URLs (https://orcid.org/0000-...). Some sources give just the ID. Some give garbage. The validator handles all cases and rejects invalid formats.
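A quick sanity check of the validator, using the AuthorInfo model above (the input values are illustrative):

# Full URL from the JSON API is stored as the bare identifier
a = AuthorInfo(full_name="Morton, R.D.",
               orcid="https://orcid.org/0000-0003-3947-6463")
assert a.orcid == "0000-0003-3947-6463"

# Garbage values are rejected rather than stored
b = AuthorInfo(full_name="Morton, R.D.", orcid="not-an-orcid")
assert b.orcid is None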
Similarly for the complete dataset:
class DatasetMetadata(BaseModel):
    """Unified dataset metadata model."""

    identifier: str
    title: str
    description: Optional[str] = None
    abstract: Optional[str] = None
    lineage: Optional[str] = None
    publication_date: Optional[date] = None
    metadata_date: Optional[date] = None
    authors: List[AuthorInfo] = Field(default_factory=list)
    organisations: List[OrganisationInfo] = Field(default_factory=list)
    keywords: List[KeywordInfo] = Field(default_factory=list)
    spatial_extent: Optional[SpatialExtent] = None
    temporal_extent: Optional[TemporalExtent] = None
    doi: Optional[str] = None
    uri: Optional[str] = None
    raw_json: Optional[str] = None
    raw_xml: Optional[str] = None
This is format-agnostic. A parser’s job is to fill this model from whatever source format it understands.
The Template Method Pattern
With four formats, we don’t want four completely independent parsers. They share structure: read input, extract fields, handle errors, produce output.
The Template Method pattern defines the skeleton:
class MetadataParser(ABC):
    """Abstract base class for metadata parsers."""

    def __init__(self, strict: bool = False):
        self.strict = strict
        self.errors: List[str] = []
        self.warnings: List[str] = []

    def parse(self, content: str, identifier: str) -> DatasetMetadata:
        """
        Template method defining the parsing workflow.
        Subclasses implement the abstract methods.
        """
        self.errors = []
        self.warnings = []
        try:
            # Step 1: Parse the raw content into a workable structure
            parsed = self._parse_content(content)

            # Step 2: Extract and validate each field
            metadata = DatasetMetadata(
                identifier=identifier,
                title=self._extract_title(parsed) or "Untitled",
                description=self._extract_description(parsed),
                abstract=self._extract_abstract(parsed),
                lineage=self._extract_lineage(parsed),
                publication_date=self._extract_publication_date(parsed),
                metadata_date=self._extract_metadata_date(parsed),
                authors=self._extract_authors(parsed),
                organisations=self._extract_organisations(parsed),
                keywords=self._extract_keywords(parsed),
                spatial_extent=self._extract_spatial_extent(parsed),
                temporal_extent=self._extract_temporal_extent(parsed),
                doi=self._extract_doi(parsed),
                uri=self._extract_uri(parsed),
            )

            # Step 3: Post-processing and deduplication
            metadata = self._post_process(metadata)
            return metadata
        except Exception as e:
            if self.strict:
                raise
            self.errors.append(f"Parse error: {e}")
            return DatasetMetadata(identifier=identifier, title="Parse Error")

    @abstractmethod
    def _parse_content(self, content: str) -> Any:
        """Parse raw content into a format-specific structure."""
        pass

    @abstractmethod
    def _extract_title(self, parsed: Any) -> Optional[str]:
        """Extract the title from parsed content."""
        pass

    # ... more abstract methods for each field
Each format implements only the extraction logic:
class JSONMetadataParser(MetadataParser):
    def _parse_content(self, content: str) -> dict:
        return json.loads(content)

    def _extract_title(self, parsed: dict) -> Optional[str]:
        return parsed.get("title")


class XMLMetadataParser(MetadataParser):
    NAMESPACES = {
        'gmd': 'http://www.isotc211.org/2005/gmd',
        'gco': 'http://www.isotc211.org/2005/gco',
        'gml': 'http://www.opengis.net/gml/3.2',
    }

    def _parse_content(self, content: str) -> etree._Element:
        return etree.fromstring(content.encode())

    def _extract_title(self, parsed: etree._Element) -> Optional[str]:
        xpath = './/gmd:identificationInfo//gmd:citation//gmd:title/gco:CharacterString'
        element = parsed.find(xpath, self.NAMESPACES)
        return element.text if element is not None else None
The pattern pays off immediately: validation logic, error handling, and post-processing are written once. Each new format only needs to implement the extraction methods.
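Dispatching to the right parser then becomes a one-liner. This is a hypothetical helper to illustrate the idea; the format keys and function name are my own, not part of the project's API:

PARSERS = {
    "json": JSONMetadataParser,
    "xml": XMLMetadataParser,
}


def parse_record(fmt: str, content: str, identifier: str) -> DatasetMetadata:
    """Pick the parser for a format key and run the shared template workflow."""
    parser = PARSERS[fmt](strict=False)
    return parser.parse(content, identifier)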
XPath Nightmares: Navigating ISO 19115
ISO 19115 is verbose. Extracting a title requires navigating five levels of nesting:
MD_Metadata
└── identificationInfo
    └── MD_DataIdentification
        └── citation
            └── CI_Citation
                └── title
                    └── CharacterString
The XPath: .//gmd:identificationInfo//gmd:citation//gmd:title/gco:CharacterString
And that’s a simple field. Here’s geographic extent:
def _extract_spatial_extent(self, parsed: etree._Element) -> Optional[SpatialExtent]:
    """Extract the bounding box from EX_GeographicBoundingBox."""
    bbox_path = (
        './/gmd:identificationInfo'
        '//gmd:extent'
        '//gmd:EX_Extent'
        '//gmd:geographicElement'
        '//gmd:EX_GeographicBoundingBox'
    )
    bbox = parsed.find(bbox_path, self.NAMESPACES)
    if bbox is None:
        return None

    def get_coord(name: str) -> Optional[float]:
        el = bbox.find(f'gmd:{name}/gco:Decimal', self.NAMESPACES)
        if el is not None and el.text:
            try:
                return float(el.text)
            except ValueError:
                return None
        return None

    west = get_coord('westBoundLongitude')
    east = get_coord('eastBoundLongitude')
    south = get_coord('southBoundLatitude')
    north = get_coord('northBoundLatitude')

    if all(v is not None for v in [west, east, south, north]):
        return SpatialExtent(
            west_bound=west,
            east_bound=east,
            south_bound=south,
            north_bound=north,
        )
    return None
Tips for surviving XML namespaces:
- Define namespaces once in a class constant
- Use // liberally to skip intermediate elements we don't care about
- Always check for None — elements might not exist
- Wrap coordinate parsing in try/except — text might not be valid floats; a small helper that bundles these checks is sketched below
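To avoid repeating the find/namespace/None dance in every extractor, the checks can live in one place. A minimal sketch of such a helper on XMLMetadataParser; the method name _find_text is mine, not part of the original parser:

class XMLMetadataParser(MetadataParser):
    # ... NAMESPACES and _parse_content as above ...

    def _find_text(self, parsed: etree._Element, path: str) -> Optional[str]:
        """Find an element by namespaced path and return its stripped text, or None."""
        element = parsed.find(path, self.NAMESPACES)
        if element is not None and element.text:
            return element.text.strip()
        return None

    def _extract_title(self, parsed: etree._Element) -> Optional[str]:
        return self._find_text(
            parsed,
            './/gmd:identificationInfo//gmd:citation//gmd:title/gco:CharacterString',
        )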
Date Parsing: A Special Circle of Hell
Dates come in many formats:
# From JSON
"2017-01-01"
"2017-01"
"2017"
# From XML
"2017-01-01T00:00:00"
"2017-01-01T00:00:00Z"
"2017-01-01T00:00:00+00:00"
# From Schema.org
"2017"
A robust parser:
def parse_date(value: Any) -> Optional[date]:
    """Parse a date from various string formats."""
    if value is None:
        return None
    # datetime is a subclass of date, so check it first
    if isinstance(value, datetime):
        return value.date()
    if isinstance(value, date):
        return value

    s = str(value).strip()
    if not s:
        return None

    # Try formats from most to least specific
    formats = [
        '%Y-%m-%dT%H:%M:%S.%fZ',
        '%Y-%m-%dT%H:%M:%SZ',
        '%Y-%m-%dT%H:%M:%S%z',
        '%Y-%m-%dT%H:%M:%S',
        '%Y-%m-%d',
        '%Y-%m',
        '%Y',
    ]
    for fmt in formats:
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue

    # Last resort: dateutil
    try:
        from dateutil.parser import parse as dateutil_parse
        return dateutil_parse(s).date()
    except Exception:
        return None
Coverage matters as much as order. strptime only accepts an exact match, so without the %Y-%m-%dT%H:%M:%S.%fZ variant in the list, a timestamp like "2017-01-01T00:00:00.000Z" matches nothing and falls all the way through to the dateutil fallback.
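A few examples of what the function returns, using parse_date as defined above (the inputs are illustrative):

assert parse_date("2017-01-01") == date(2017, 1, 1)
assert parse_date("2017-01") == date(2017, 1, 1)    # month precision -> first of the month
assert parse_date("2017") == date(2017, 1, 1)       # year precision -> 1 January
assert parse_date("2017-01-01T00:00:00.000Z") == date(2017, 1, 1)
assert parse_date("not a date") is None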
Deduplication: The Same Author, Three Times
A dataset might list the same person as author, point of contact, and data custodian. The JSON format includes them all in responsibleParties. Naive extraction creates three AuthorInfo objects for one person.
Deduplication by ORCID (when available) or by normalized name:
def _deduplicate_authors(self, authors: List[AuthorInfo]) -> List[AuthorInfo]:
    """Remove duplicate authors, preferring entries with more data."""
    seen_orcids: Dict[str, AuthorInfo] = {}
    seen_names: Dict[str, AuthorInfo] = {}

    for author in authors:
        if author.orcid:
            existing = seen_orcids.get(author.orcid)
            if existing:
                # Merge: keep the one with more fields filled
                seen_orcids[author.orcid] = self._merge_authors(existing, author)
            else:
                seen_orcids[author.orcid] = author
        else:
            # Normalize name for comparison
            normalized = self._normalize_name(author.full_name)
            existing = seen_names.get(normalized)
            if existing:
                seen_names[normalized] = self._merge_authors(existing, author)
            else:
                seen_names[normalized] = author

    # Combine, preferring ORCID-identified authors
    result = list(seen_orcids.values())
    for name, author in seen_names.items():
        # Check if this name matches an ORCID author
        if not any(self._normalize_name(a.full_name) == name for a in result):
            result.append(author)
    return result


def _normalize_name(self, name: str) -> str:
    """Normalize a name for comparison."""
    # "Morton, R.D." -> "morton rd"
    name = name.lower()
    name = re.sub(r'[^\w\s]', '', name)  # Remove punctuation
    name = ' '.join(name.split())        # Normalize whitespace
    return name
The merge logic works field by field, keeping the first non-empty value so that no data is lost:
def _merge_authors(self, a: AuthorInfo, b: AuthorInfo) -> AuthorInfo:
    """Merge two author records, preferring non-null values."""
    return AuthorInfo(
        full_name=a.full_name or b.full_name,
        given_name=a.given_name or b.given_name,
        family_name=a.family_name or b.family_name,
        orcid=a.orcid or b.orcid,
        email=a.email or b.email,
        role=a.role or b.role,  # First role wins
        organisation=a.organisation or b.organisation,
    )
Strict Mode vs. Lenient Mode
Sometimes we want to fail fast (development, testing). Sometimes we want to extract whatever’s possible (production, bulk import).
The strict parameter controls this:
class MetadataParser:
    def __init__(self, strict: bool = False):
        self.strict = strict
        self.errors = []
        self.warnings = []

    def _extract_with_fallback(
        self,
        primary_fn: Callable,
        fallback_fn: Optional[Callable] = None,
        field_name: str = "field",
    ) -> Any:
        """Extract a field with error handling and an optional fallback."""
        try:
            result = primary_fn()
            if result is not None:
                return result
        except Exception as e:
            if self.strict:
                raise
            self.warnings.append(f"Error extracting {field_name}: {e}")

        if fallback_fn:
            try:
                return fallback_fn()
            except Exception as e:
                self.warnings.append(f"Fallback failed for {field_name}: {e}")
        return None
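Extractors can then lean on the helper. Here is a hypothetical example for the DOI field; the two lookup paths (a top-level doi key and a resourceIdentifiers list with codeSpace/code entries) are assumptions for illustration, not the project's actual field names:

def _extract_doi(self, parsed: dict) -> Optional[str]:
    """Try a dedicated DOI field first, then scan resource identifiers."""
    return self._extract_with_fallback(
        primary_fn=lambda: parsed.get("doi"),
        fallback_fn=lambda: next(
            (ident.get("code")
             for ident in parsed.get("resourceIdentifiers", [])
             if "doi" in str(ident.get("codeSpace", "")).lower()),
            None,
        ),
        field_name="doi",
    )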
In production, the loop over all 210 datasets looks like this:
parser = JSONMetadataParser(strict=False)
for identifier in identifiers:
    metadata = parser.parse(content, identifier)

    if parser.errors:
        logger.error(f"{identifier}: {parser.errors}")
    if parser.warnings:
        logger.warning(f"{identifier}: {parser.warnings}")

    # Store even if there were warnings
    repository.upsert(metadata)
We get the data. We log the issues. We fix them incrementally.
The Repository Pattern: Clean Database Access
Once parsed, metadata goes into SQLite via the Repository pattern:
class DatasetRepository:
    def __init__(self, session: Session):
        self.session = session

    def upsert(self, metadata: DatasetMetadata) -> Dataset:
        """Insert or update a dataset from parsed metadata."""
        # Check for an existing record
        existing = self.get_by_identifier(metadata.identifier)
        if existing:
            dataset = existing
            dataset.title = metadata.title
            dataset.description = metadata.description
            # ... update other fields
        else:
            dataset = Dataset(
                identifier=metadata.identifier,
                title=metadata.title,
                description=metadata.description,
            )
            self.session.add(dataset)

        # Handle related entities with their own deduplication
        for author_info in metadata.authors:
            author = self._get_or_create_author(author_info)
            if author not in dataset.authors:
                dataset.authors.append(author)

        self.session.commit()
        return dataset

    def _get_or_create_author(self, info: AuthorInfo) -> Author:
        """Find an existing author by ORCID or create a new one."""
        if info.orcid:
            existing = self.session.query(Author).filter_by(orcid=info.orcid).first()
            if existing:
                return existing

        author = Author(
            full_name=info.full_name,
            given_name=info.given_name,
            family_name=info.family_name,
            orcid=info.orcid,
            email=info.email,
        )
        self.session.add(author)
        return author
The repository owns all database logic. Parsers never touch the database directly. Clean separation.
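Wiring the pieces together is then mostly session management. A minimal sketch, assuming a SQLAlchemy engine plus the Dataset and Author ORM models defined elsewhere; the database path and the records iterable are placeholders:

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

engine = create_engine("sqlite:///metadata.db")  # placeholder path

with Session(engine) as session:
    repository = DatasetRepository(session)
    parser = JSONMetadataParser(strict=False)
    for identifier, content in records:  # placeholder: (identifier, raw JSON) pairs
        repository.upsert(parser.parse(content, identifier))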
Lessons Learned
- Design the output model first. Know what you want before you start extracting.
- Template Method is your friend. Common structure, varying implementation.
- Normalize aggressively. ORCIDs as IDs, not URLs. Names lowercased without punctuation. Dates as date objects, not strings.
- Deduplicate at every level. Authors within a dataset. Authors across datasets. Keywords.
- Strict mode for development, lenient for production. Extract what you can, log what fails.
- Store raw documents. When your parser is wrong (it will be), you can reprocess without re-fetching.
The hardest part of ETL isn’t any single format. It’s handling the inconsistencies between them while producing consistent output. These patterns make that manageable.
The final part of this series, RAG Without the LLM: Building Context-First Retrieval Pipelines, will cover how to build a retrieval pipeline that returns structured context, formatted and ready for any LLM (or no LLM at all).
Enjoyed this post? Check out more articles on my blog.