Datetime parsing is one of the most common sources of production failures in data pipelines. Real-world data has inconsistent formats, mixed timezones, Excel serial numbers, and Unix timestamps all living in the same column. Pandas exposes all of these issues at parse time — but the error messages require some interpretation. This post covers each error type with its exact traceback, root cause, and the right fix.
pip install pandas python-dateutil
import pandas as pd
dates = pd.Series(["2024-01-15", "2024-02-20", "2024-03-05"])
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
ValueError: time data "2024-01-15" doesn't match format "%d/%m/%Y"
# In older pandas versions:
ValueError: ('Unknown string format:', '2024-01-15')
You specified an explicit format string that does not match the actual date string. Here the data is in YYYY-MM-DD (ISO 8601) but the format says DD/MM/YYYY. Pandas delegates to Python's datetime.strptime under the hood, which requires an exact character-for-character match between the format template and the input string — including separators.
Common mismatches:
2024-01-15 vs %d/%m/%Y1/5/2024 vs %m/%d/%Y (works for 01/05/2024 but fails for single-digit months)24-01-15 vs %Y-%m-%d (needs %y)03:45 PM vs %H:%M (needs %I:%M %p)import pandas as pd
dates = pd.Series(["2024-01-15", "2024-02-20", "2024-03-05"])
# Fix 1: correct the format string
parsed = pd.to_datetime(dates, format="%Y-%m-%d")
print(parsed)
# 0 2024-01-15
# 1 2024-02-20
# 2 2024-03-05
# dtype: datetime64[ns]
# Fix 2: omit format and let pandas infer (slower but tolerant)
parsed = pd.to_datetime(dates)
print(parsed)
# Fix 3: use errors='coerce' to turn unparseable values into NaT
# instead of raising an exception
mixed = pd.Series(["2024-01-15", "not-a-date", "2024-03-05"])
parsed = pd.to_datetime(mixed, format="%Y-%m-%d", errors='coerce')
print(parsed)
# 0 2024-01-15
# 1 NaT
# 2 2024-03-05
# Check how many failed
print(f"Failed to parse: {parsed.isna().sum()} values")
Strptime format reference for the codes you'll use most:
# Common strptime codes
# %Y — 4-digit year (2024)
# %y — 2-digit year (24)
# %m — month, zero-padded (01-12)
# %d — day, zero-padded (01-31)
# %H — hour 24h (00-23)
# %I — hour 12h (01-12)
# %M — minute (00-59)
# %S — second (00-59)
# %f — microseconds (000000-999999)
# %p — AM/PM
# %z — UTC offset (+0000, -0500)
# %Z — timezone name (UTC, EST)
# Examples:
pd.to_datetime(s, format="%Y-%m-%dT%H:%M:%S") # ISO 8601 without tz
pd.to_datetime(s, format="%d/%m/%Y %H:%M") # UK datetime
pd.to_datetime(s, format="%b %d, %Y") # Jan 15, 2024
pd.to_datetime(s, format="%Y%m%d") # 20240115 (compact)
import pandas as pd
# Real-world data exported from multiple systems
dates = pd.Series([
"2024-01-15", # ISO 8601
"15/01/2024", # European DD/MM/YYYY
"January 15 2024", # Long form
"01-15-2024", # US MM-DD-YYYY
])
parsed = pd.to_datetime(dates, format="%Y-%m-%d")
ValueError: time data "15/01/2024" doesn't match format "%Y-%m-%d"
# Without an explicit format:
# dateutil.parser._parser.ParserError: Unknown string format: 15/01/2024
A single column contains date strings from multiple sources in incompatible formats. No single format string can parse all of them. This is extremely common when merging data from different spreadsheets, CRM exports, or APIs with different regional settings.
Without an explicit format, pandas uses dateutil.parser.parse which is smart but can silently misinterpret ambiguous dates. For example, 01/02/2024 could be January 2 or February 1 depending on locale — dateutil defaults to month-first (US), which may be wrong for European data.
import pandas as pd
from dateutil import parser as dateutil_parser
dates = pd.Series([
"2024-01-15",
"15/01/2024",
"January 15 2024",
"01-15-2024",
])
# Fix 1: let pandas infer — works for many cases but ambiguous for DD/MM vs MM/DD
parsed = pd.to_datetime(dates, infer_datetime_format=True, errors='coerce')
print(parsed)
# Fix 2: use a custom parser to handle each format explicitly
def parse_mixed_dates(date_str):
"""Try multiple formats, return NaT on failure."""
formats = [
"%Y-%m-%d",
"%d/%m/%Y",
"%B %d %Y",
"%m-%d-%Y",
"%Y/%m/%d",
"%d-%m-%Y",
]
for fmt in formats:
try:
return pd.Timestamp(pd.to_datetime(date_str, format=fmt))
except (ValueError, TypeError):
continue
# Fall back to dateutil as last resort
try:
return pd.Timestamp(dateutil_parser.parse(date_str, dayfirst=True))
except Exception:
return pd.NaT
parsed = dates.apply(parse_mixed_dates)
print(parsed)
# 0 2024-01-15
# 1 2024-01-15
# 2 2024-01-15
# 3 2024-01-15
# dtype: datetime64[ns]
# Fix 3: if you know the format is consistent within batches, split and concat
iso_mask = dates.str.match(r'^\d{4}-\d{2}-\d{2}$')
european_mask = dates.str.match(r'^\d{2}/\d{2}/\d{4}$')
parsed = pd.Series(pd.NaT, index=dates.index)
parsed[iso_mask] = pd.to_datetime(dates[iso_mask], format="%Y-%m-%d")
parsed[european_mask] = pd.to_datetime(dates[european_mask], format="%d/%m/%Y")
print(parsed)
Note: infer_datetime_format was deprecated in pandas 2.0 and removed in 2.2. For pandas 2.x, omit it — the parser is now automatically optimized:
import pandas as pd
# pandas 2.0+: just omit infer_datetime_format
parsed = pd.to_datetime(dates, errors='coerce')
import pandas as pd
# One series is timezone-aware (from an API returning UTC)
api_dates = pd.to_datetime(["2024-01-15T10:00:00Z", "2024-02-20T14:30:00Z"])
# Another is timezone-naive (from a local CSV)
csv_dates = pd.to_datetime(["2024-01-10", "2024-02-25"])
# Attempting comparison
mask = api_dates > csv_dates[0]
TypeError: Cannot compare tz-naive and tz-aware datetime-like objects
# Or when merging DataFrames:
TypeError: Cannot join tz-naive and tz-aware DatetimeIndex
Pandas treats timezone-aware and timezone-naive datetimes as fundamentally different types. A timezone-naive datetime has no timezone information — it is ambiguous (is it UTC? local time? something else?). A timezone-aware datetime is anchored to a specific timezone. Python and pandas refuse to compare them directly because the comparison would be logically undefined.
This commonly occurs when:
Z or +00:00 suffix alongside plain date stringsTIMESTAMPTZ) and comparing with pandas-generated timestampsimport pandas as pd
api_dates = pd.to_datetime(["2024-01-15T10:00:00Z", "2024-02-20T14:30:00Z"])
csv_dates = pd.to_datetime(["2024-01-10", "2024-02-25"])
print(api_dates.dtype) # datetime64[ns, UTC]
print(csv_dates.dtype) # datetime64[ns]
# Fix 1: localize the naive series to UTC, then compare
csv_dates_utc = csv_dates.dt.tz_localize('UTC')
mask = api_dates > csv_dates_utc[0]
print(mask) # [True, True]
# Fix 2: strip timezone from the aware series (convert to naive UTC)
api_dates_naive = api_dates.dt.tz_localize(None)
mask = api_dates_naive > csv_dates[0]
print(mask)
# Fix 3: use utc=True in pd.to_datetime to normalize everything to UTC at parse time
# This is the cleanest approach for pipelines that ingest from multiple sources
all_dates = pd.to_datetime(
["2024-01-15T10:00:00Z", "2024-02-20T14:30:00+05:30", "2024-01-10"],
utc=True
)
print(all_dates)
# DatetimeIndex(['2024-01-15 10:00:00+00:00',
# '2024-02-20 09:00:00+00:00', # converted from IST
# '2024-01-10 00:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)
When loading from a DataFrame, apply the fix at the column level:
import pandas as pd
df = pd.DataFrame({
'created_at': ["2024-01-15T10:00:00Z", "2024-02-20T14:30:00Z"],
'local_date': ["2024-01-10", "2024-02-25"]
})
# Parse both columns and normalize to UTC
df['created_at'] = pd.to_datetime(df['created_at'], utc=True)
df['local_date'] = pd.to_datetime(df['local_date']).dt.tz_localize('UTC')
# Now comparison works
df['is_after_local'] = df['created_at'] > df['local_date']
print(df)
Excel stores dates as integers — the number of days since January 1, 1900 (with a deliberate off-by-one bug for 1900 leap year compatibility). If you read an Excel file with pd.read_excel() and get column values like 45306, you have serial dates.
import pandas as pd
# Serial date 45306 = 2024-01-15
excel_serials = pd.Series([45306, 45352, 45385])
# Fix: use pd.to_datetime with unit and origin
parsed = pd.to_datetime(excel_serials, unit='D', origin='1899-12-30')
print(parsed)
# 0 2024-01-15
# 1 2024-02-29 (careful: check leap year handling)
# 2 2024-04-01
# dtype: datetime64[ns]
The origin is 1899-12-30 rather than 1900-01-01 to account for Excel's off-by-one bug. If your dates are off by one or two days, adjust the origin.
import pandas as pd
# Seconds since epoch
unix_seconds = pd.Series([1705312800, 1708430400])
parsed = pd.to_datetime(unix_seconds, unit='s')
print(parsed)
# 0 2024-01-15 10:00:00
# 1 2024-02-20 12:00:00
# dtype: datetime64[ns]
# Milliseconds since epoch (JavaScript, many APIs)
unix_ms = pd.Series([1705312800000, 1708430400000])
parsed = pd.to_datetime(unix_ms, unit='ms')
print(parsed)
# Microseconds
unix_us = pd.Series([1705312800000000, 1708430400000000])
parsed = pd.to_datetime(unix_us, unit='us')
print(parsed)
import pandas as pd
# Mixed UTC offsets — use utc=True to normalize all to UTC
iso_dates = pd.Series([
"2024-01-15T10:00:00Z",
"2024-01-15T05:00:00-05:00", # US Eastern = same instant as UTC 10:00
"2024-01-15T18:30:00+08:30", # India-ish = same instant
])
parsed = pd.to_datetime(iso_dates, utc=True)
print(parsed)
# All three resolve to the same UTC moment:
# 0 2024-01-15 10:00:00+00:00
# 1 2024-01-15 10:00:00+00:00
# 2 2024-01-15 10:00:00+00:00
# Convert to a specific timezone for display
parsed_eastern = parsed.dt.tz_convert('America/New_York')
print(parsed_eastern)
# 0 2024-01-15 05:00:00-05:00
When you cannot trust the input format, combine all the techniques above into a single robust parser:
import pandas as pd
import numpy as np
def parse_dates_robust(series: pd.Series, timezone: str = 'UTC') -> pd.Series:
"""
Parse a mixed-format date column defensively.
- Handles ISO 8601, Unix timestamps (int/float), Excel serials, and common string formats
- Returns timezone-aware Series in the specified timezone
- Sets unparseable values to NaT
"""
result = pd.Series(pd.NaT, index=series.index, dtype='datetime64[ns, UTC]')
# Numeric: could be Unix timestamp or Excel serial
numeric_mask = pd.to_numeric(series, errors='coerce').notna()
if numeric_mask.any():
numeric_vals = pd.to_numeric(series[numeric_mask])
# Heuristic: Unix epoch in seconds is ~1.7B for 2024, Excel serials are ~45K
unix_mask = numeric_vals > 100_000
excel_mask = ~unix_mask
if unix_mask.any():
result[numeric_mask & (numeric_vals > 100_000)] = (
pd.to_datetime(numeric_vals[unix_mask], unit='s', utc=True)
)
if excel_mask.any():
result[numeric_mask & (numeric_vals <= 100_000)] = (
pd.to_datetime(numeric_vals[excel_mask], unit='D', origin='1899-12-30')
.dt.tz_localize('UTC')
)
# String: try pd.to_datetime with utc=True (handles ISO 8601 with offsets)
string_mask = ~numeric_mask & series.notna()
if string_mask.any():
parsed_strings = pd.to_datetime(
series[string_mask], utc=True, errors='coerce'
)
result[string_mask] = parsed_strings
# Convert to requested timezone
result = result.dt.tz_convert(timezone)
return result
# Usage
df = pd.DataFrame({
'date_col': [
"2024-01-15T10:00:00Z", # ISO UTC
"15/01/2024", # European
1705312800, # Unix seconds
45306, # Excel serial
"bad date", # Will become NaT
None, # Will become NaT
]
})
df['parsed'] = parse_dates_robust(df['date_col'])
print(df[['date_col', 'parsed']])
print(f"\nParse failures: {df['parsed'].isna().sum()}")
import pandas as pd
s = pd.Series(["2024-01-15", "not-a-date", "2024-03-01"])
# errors='raise' (default) — raises on first failure
# pd.to_datetime(s, format="%Y-%m-%d") # ValueError
# errors='coerce' — converts failures to NaT
pd.to_datetime(s, format="%Y-%m-%d", errors='coerce')
# 0 2024-01-15
# 1 NaT
# 2 2024-03-01
# errors='ignore' — returns original input for failures (deprecated in pandas 2.2)
# Returns a Series of objects, not datetimes — almost never what you want
In pandas 2.2+, errors='ignore' is deprecated and will raise a warning. Replace it with errors='coerce' followed by explicit handling of NaT values.
The three categories of pandas datetime errors and their fixes:
errors='coerce' to surface which values fail; omit format to let pandas infer when the format is consistentfor loop; use regex masks to split the series by format and parse each group separately; use pd.to_datetime(s, utc=True) in pandas 2.x for the best automatic inferenceutc=True; use dt.tz_localize('UTC') to make naive timestamps aware; use dt.tz_localize(None) to strip timezone info when you need naive datetimesFor production data pipelines, errors='coerce' plus a downstream assert df['date'].isna().sum() == 0 (or an acceptable NaT threshold check) gives you both resilience and observability over parse failures.