Fixing Pandas datetime Parsing Errors: strptime, ParserError, and Mixed Formats

Fixing Pandas datetime Parsing Errors: strptime, ParserError, and Mixed Formats

Datetime parsing is one of the most common sources of production failures in data pipelines. Real-world data has inconsistent formats, mixed timezones, Excel serial numbers, and Unix timestamps all living in the same column. Pandas exposes all of these issues at parse time — but the error messages require some interpretation. This post covers each error type with its exact traceback, root cause, and the right fix.

pip install pandas python-dateutil

Error 1: ValueError — strptime Format Mismatch

The Traceback

import pandas as pd

dates = pd.Series(["2024-01-15", "2024-02-20", "2024-03-05"])
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
ValueError: time data "2024-01-15" doesn't match format "%d/%m/%Y"

# In older pandas versions:
ValueError: ('Unknown string format:', '2024-01-15')

Root Cause

You specified an explicit format string that does not match the actual date string. Here the data is in YYYY-MM-DD (ISO 8601) but the format says DD/MM/YYYY. Pandas delegates to Python's datetime.strptime under the hood, which requires an exact character-for-character match between the format template and the input string — including separators.

Common mismatches:

  • Dashes vs slashes: 2024-01-15 vs %d/%m/%Y
  • Zero-padded vs non-padded: 1/5/2024 vs %m/%d/%Y (works for 01/05/2024 but fails for single-digit months)
  • Two-digit vs four-digit year: 24-01-15 vs %Y-%m-%d (needs %y)
  • 12-hour vs 24-hour: 03:45 PM vs %H:%M (needs %I:%M %p)

Fix

import pandas as pd

dates = pd.Series(["2024-01-15", "2024-02-20", "2024-03-05"])

# Fix 1: correct the format string
parsed = pd.to_datetime(dates, format="%Y-%m-%d")
print(parsed)
# 0   2024-01-15
# 1   2024-02-20
# 2   2024-03-05
# dtype: datetime64[ns]

# Fix 2: omit format and let pandas infer (slower but tolerant)
parsed = pd.to_datetime(dates)
print(parsed)

# Fix 3: use errors='coerce' to turn unparseable values into NaT
# instead of raising an exception
mixed = pd.Series(["2024-01-15", "not-a-date", "2024-03-05"])
parsed = pd.to_datetime(mixed, format="%Y-%m-%d", errors='coerce')
print(parsed)
# 0   2024-01-15
# 1          NaT
# 2   2024-03-05

# Check how many failed
print(f"Failed to parse: {parsed.isna().sum()} values")

Strptime format reference for the codes you'll use most:

# Common strptime codes
# %Y  — 4-digit year         (2024)
# %y  — 2-digit year         (24)
# %m  — month, zero-padded   (01-12)
# %d  — day, zero-padded     (01-31)
# %H  — hour 24h             (00-23)
# %I  — hour 12h             (01-12)
# %M  — minute               (00-59)
# %S  — second               (00-59)
# %f  — microseconds         (000000-999999)
# %p  — AM/PM
# %z  — UTC offset           (+0000, -0500)
# %Z  — timezone name        (UTC, EST)

# Examples:
pd.to_datetime(s, format="%Y-%m-%dT%H:%M:%S")   # ISO 8601 without tz
pd.to_datetime(s, format="%d/%m/%Y %H:%M")        # UK datetime
pd.to_datetime(s, format="%b %d, %Y")             # Jan 15, 2024
pd.to_datetime(s, format="%Y%m%d")                # 20240115 (compact)

Error 2: ParserError — Mixed Date Formats in a Column

The Traceback

import pandas as pd

# Real-world data exported from multiple systems
dates = pd.Series([
    "2024-01-15",      # ISO 8601
    "15/01/2024",      # European DD/MM/YYYY
    "January 15 2024", # Long form
    "01-15-2024",      # US MM-DD-YYYY
])

parsed = pd.to_datetime(dates, format="%Y-%m-%d")
ValueError: time data "15/01/2024" doesn't match format "%Y-%m-%d"

# Without an explicit format:
# dateutil.parser._parser.ParserError: Unknown string format: 15/01/2024

Root Cause

A single column contains date strings from multiple sources in incompatible formats. No single format string can parse all of them. This is extremely common when merging data from different spreadsheets, CRM exports, or APIs with different regional settings.

Without an explicit format, pandas uses dateutil.parser.parse which is smart but can silently misinterpret ambiguous dates. For example, 01/02/2024 could be January 2 or February 1 depending on locale — dateutil defaults to month-first (US), which may be wrong for European data.

Fix

import pandas as pd
from dateutil import parser as dateutil_parser

dates = pd.Series([
    "2024-01-15",
    "15/01/2024",
    "January 15 2024",
    "01-15-2024",
])

# Fix 1: let pandas infer — works for many cases but ambiguous for DD/MM vs MM/DD
parsed = pd.to_datetime(dates, infer_datetime_format=True, errors='coerce')
print(parsed)

# Fix 2: use a custom parser to handle each format explicitly
def parse_mixed_dates(date_str):
    """Try multiple formats, return NaT on failure."""
    formats = [
        "%Y-%m-%d",
        "%d/%m/%Y",
        "%B %d %Y",
        "%m-%d-%Y",
        "%Y/%m/%d",
        "%d-%m-%Y",
    ]
    for fmt in formats:
        try:
            return pd.Timestamp(pd.to_datetime(date_str, format=fmt))
        except (ValueError, TypeError):
            continue
    # Fall back to dateutil as last resort
    try:
        return pd.Timestamp(dateutil_parser.parse(date_str, dayfirst=True))
    except Exception:
        return pd.NaT

parsed = dates.apply(parse_mixed_dates)
print(parsed)
# 0   2024-01-15
# 1   2024-01-15
# 2   2024-01-15
# 3   2024-01-15
# dtype: datetime64[ns]

# Fix 3: if you know the format is consistent within batches, split and concat
iso_mask = dates.str.match(r'^\d{4}-\d{2}-\d{2}$')
european_mask = dates.str.match(r'^\d{2}/\d{2}/\d{4}$')

parsed = pd.Series(pd.NaT, index=dates.index)
parsed[iso_mask] = pd.to_datetime(dates[iso_mask], format="%Y-%m-%d")
parsed[european_mask] = pd.to_datetime(dates[european_mask], format="%d/%m/%Y")
print(parsed)

Note: infer_datetime_format was deprecated in pandas 2.0 and removed in 2.2. For pandas 2.x, omit it — the parser is now automatically optimized:

import pandas as pd

# pandas 2.0+: just omit infer_datetime_format
parsed = pd.to_datetime(dates, errors='coerce')

Error 3: TypeError — Timezone-Naive vs Timezone-Aware Comparison

The Traceback

import pandas as pd

# One series is timezone-aware (from an API returning UTC)
api_dates = pd.to_datetime(["2024-01-15T10:00:00Z", "2024-02-20T14:30:00Z"])

# Another is timezone-naive (from a local CSV)
csv_dates = pd.to_datetime(["2024-01-10", "2024-02-25"])

# Attempting comparison
mask = api_dates > csv_dates[0]
TypeError: Cannot compare tz-naive and tz-aware datetime-like objects

# Or when merging DataFrames:
TypeError: Cannot join tz-naive and tz-aware DatetimeIndex

Root Cause

Pandas treats timezone-aware and timezone-naive datetimes as fundamentally different types. A timezone-naive datetime has no timezone information — it is ambiguous (is it UTC? local time? something else?). A timezone-aware datetime is anchored to a specific timezone. Python and pandas refuse to compare them directly because the comparison would be logically undefined.

This commonly occurs when:

  • Mixing API responses (which return UTC) with spreadsheet data (which has no timezone)
  • Parsing ISO 8601 strings with Z or +00:00 suffix alongside plain date strings
  • Loading data from PostgreSQL (timezone-aware TIMESTAMPTZ) and comparing with pandas-generated timestamps

Fix

import pandas as pd

api_dates = pd.to_datetime(["2024-01-15T10:00:00Z", "2024-02-20T14:30:00Z"])
csv_dates = pd.to_datetime(["2024-01-10", "2024-02-25"])

print(api_dates.dtype)   # datetime64[ns, UTC]
print(csv_dates.dtype)   # datetime64[ns]

# Fix 1: localize the naive series to UTC, then compare
csv_dates_utc = csv_dates.dt.tz_localize('UTC')
mask = api_dates > csv_dates_utc[0]
print(mask)  # [True, True]

# Fix 2: strip timezone from the aware series (convert to naive UTC)
api_dates_naive = api_dates.dt.tz_localize(None)
mask = api_dates_naive > csv_dates[0]
print(mask)

# Fix 3: use utc=True in pd.to_datetime to normalize everything to UTC at parse time
# This is the cleanest approach for pipelines that ingest from multiple sources
all_dates = pd.to_datetime(
    ["2024-01-15T10:00:00Z", "2024-02-20T14:30:00+05:30", "2024-01-10"],
    utc=True
)
print(all_dates)
# DatetimeIndex(['2024-01-15 10:00:00+00:00',
#                '2024-02-20 09:00:00+00:00',  # converted from IST
#                '2024-01-10 00:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

When loading from a DataFrame, apply the fix at the column level:

import pandas as pd

df = pd.DataFrame({
    'created_at': ["2024-01-15T10:00:00Z", "2024-02-20T14:30:00Z"],
    'local_date': ["2024-01-10", "2024-02-25"]
})

# Parse both columns and normalize to UTC
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
df['local_date'] = pd.to_datetime(df['local_date']).dt.tz_localize('UTC')

# Now comparison works
df['is_after_local'] = df['created_at'] > df['local_date']
print(df)

Special Cases

Excel Serial Dates

Excel stores dates as integers — the number of days since January 1, 1900 (with a deliberate off-by-one bug for 1900 leap year compatibility). If you read an Excel file with pd.read_excel() and get column values like 45306, you have serial dates.

import pandas as pd

# Serial date 45306 = 2024-01-15
excel_serials = pd.Series([45306, 45352, 45385])

# Fix: use pd.to_datetime with unit and origin
parsed = pd.to_datetime(excel_serials, unit='D', origin='1899-12-30')
print(parsed)
# 0   2024-01-15
# 1   2024-02-29  (careful: check leap year handling)
# 2   2024-04-01
# dtype: datetime64[ns]

The origin is 1899-12-30 rather than 1900-01-01 to account for Excel's off-by-one bug. If your dates are off by one or two days, adjust the origin.

Unix Timestamps

import pandas as pd

# Seconds since epoch
unix_seconds = pd.Series([1705312800, 1708430400])
parsed = pd.to_datetime(unix_seconds, unit='s')
print(parsed)
# 0   2024-01-15 10:00:00
# 1   2024-02-20 12:00:00
# dtype: datetime64[ns]

# Milliseconds since epoch (JavaScript, many APIs)
unix_ms = pd.Series([1705312800000, 1708430400000])
parsed = pd.to_datetime(unix_ms, unit='ms')
print(parsed)

# Microseconds
unix_us = pd.Series([1705312800000000, 1708430400000000])
parsed = pd.to_datetime(unix_us, unit='us')
print(parsed)

ISO 8601 with Timezone Offsets

import pandas as pd

# Mixed UTC offsets — use utc=True to normalize all to UTC
iso_dates = pd.Series([
    "2024-01-15T10:00:00Z",
    "2024-01-15T05:00:00-05:00",   # US Eastern = same instant as UTC 10:00
    "2024-01-15T18:30:00+08:30",   # India-ish = same instant
])

parsed = pd.to_datetime(iso_dates, utc=True)
print(parsed)
# All three resolve to the same UTC moment:
# 0   2024-01-15 10:00:00+00:00
# 1   2024-01-15 10:00:00+00:00
# 2   2024-01-15 10:00:00+00:00

# Convert to a specific timezone for display
parsed_eastern = parsed.dt.tz_convert('America/New_York')
print(parsed_eastern)
# 0   2024-01-15 05:00:00-05:00

Defensive Parsing Pattern for Production Pipelines

When you cannot trust the input format, combine all the techniques above into a single robust parser:

import pandas as pd
import numpy as np

def parse_dates_robust(series: pd.Series, timezone: str = 'UTC') -> pd.Series:
    """
    Parse a mixed-format date column defensively.
    - Handles ISO 8601, Unix timestamps (int/float), Excel serials, and common string formats
    - Returns timezone-aware Series in the specified timezone
    - Sets unparseable values to NaT
    """
    result = pd.Series(pd.NaT, index=series.index, dtype='datetime64[ns, UTC]')

    # Numeric: could be Unix timestamp or Excel serial
    numeric_mask = pd.to_numeric(series, errors='coerce').notna()
    if numeric_mask.any():
        numeric_vals = pd.to_numeric(series[numeric_mask])

        # Heuristic: Unix epoch in seconds is ~1.7B for 2024, Excel serials are ~45K
        unix_mask = numeric_vals > 100_000
        excel_mask = ~unix_mask

        if unix_mask.any():
            result[numeric_mask & (numeric_vals > 100_000)] = (
                pd.to_datetime(numeric_vals[unix_mask], unit='s', utc=True)
            )
        if excel_mask.any():
            result[numeric_mask & (numeric_vals <= 100_000)] = (
                pd.to_datetime(numeric_vals[excel_mask], unit='D', origin='1899-12-30')
                .dt.tz_localize('UTC')
            )

    # String: try pd.to_datetime with utc=True (handles ISO 8601 with offsets)
    string_mask = ~numeric_mask & series.notna()
    if string_mask.any():
        parsed_strings = pd.to_datetime(
            series[string_mask], utc=True, errors='coerce'
        )
        result[string_mask] = parsed_strings

    # Convert to requested timezone
    result = result.dt.tz_convert(timezone)
    return result


# Usage
df = pd.DataFrame({
    'date_col': [
        "2024-01-15T10:00:00Z",   # ISO UTC
        "15/01/2024",              # European
        1705312800,                # Unix seconds
        45306,                     # Excel serial
        "bad date",               # Will become NaT
        None,                     # Will become NaT
    ]
})

df['parsed'] = parse_dates_robust(df['date_col'])
print(df[['date_col', 'parsed']])
print(f"\nParse failures: {df['parsed'].isna().sum()}")

Quick Reference: errors Parameter Values

import pandas as pd

s = pd.Series(["2024-01-15", "not-a-date", "2024-03-01"])

# errors='raise' (default) — raises on first failure
# pd.to_datetime(s, format="%Y-%m-%d")  # ValueError

# errors='coerce' — converts failures to NaT
pd.to_datetime(s, format="%Y-%m-%d", errors='coerce')
# 0   2024-01-15
# 1          NaT
# 2   2024-03-01

# errors='ignore' — returns original input for failures (deprecated in pandas 2.2)
# Returns a Series of objects, not datetimes — almost never what you want

In pandas 2.2+, errors='ignore' is deprecated and will raise a warning. Replace it with errors='coerce' followed by explicit handling of NaT values.


Summary

The three categories of pandas datetime errors and their fixes:

  • strptime format mismatch — fix the format string to match the actual input; use errors='coerce' to surface which values fail; omit format to let pandas infer when the format is consistent
  • Mixed formats / ParserError — try each known format in sequence with a for loop; use regex masks to split the series by format and parse each group separately; use pd.to_datetime(s, utc=True) in pandas 2.x for the best automatic inference
  • Timezone mismatch — always normalize to UTC at parse time with utc=True; use dt.tz_localize('UTC') to make naive timestamps aware; use dt.tz_localize(None) to strip timezone info when you need naive datetimes

For production data pipelines, errors='coerce' plus a downstream assert df['date'].isna().sum() == 0 (or an acceptable NaT threshold check) gives you both resilience and observability over parse failures.