As a web scraping expert with over 5 years of experience, I find myself constantly dealing with datetime strings extracted from websites and APIs. These strings come in hundreds of formats – from ISO 8601 and RFC 2822 to completely ambiguous natural language dates.
Manually handling the parsing and normalization of these datetimes would be a nightmare!
Fortunately, Python provides some great libraries to automate most of this datetime parsing work. In this comprehensive guide, I'll share my experiences using one such useful library – dateparser.
Here's what I'll cover:
- The pain points of manually parsing datetimes
- Why dateparser is a lifesaver for this problem
- Using dateparser to extract datetimes from text
- Handling ambiguous and incomplete dates
- Advanced dateparser settings and preferences
- Limitations and caveats of relying on dateparser
- Best practices for production datetime parsing
So let's get started!
The Pain and Suffering of Manual Datetime Parsing
In my early days as a Python developer, I used to handle datetime parsing manually using the standard library's datetime module and the third-party dateutil.parser. I quickly realized what a huge pain it was!
Just look at some of these common frustrations:
- Too many formats – The number of possible datetime string formats is insane. From ISO 8601 (e.g. 2011-11-04T00:05:23Z) to RFC 2822 (e.g. Fri, 04 Nov 2011 00:05:23 +0000) to casually written text (e.g. “Tuesday, November 3rd 2015 2:30pm”), trying to manually handle them all requires crazy regexes and date logic.
- Regional differences – Date formats vary across countries and regions. As a scraper from the US, I initially coded for MM/DD/YYYY format. But European dates in DD/MM/YYYY would break my parsers. Not to mention Asian, African and Middle Eastern formats!
- Implicit components – Many HTML pages don't explicitly include year or timezone information, e.g. “Posted on March 04”. My parsers would need to infer missing components.
- Human language issues – Humans don't write dates in perfectly machinable formats. We use relative terms like “yesterday”, “next month”, “3 weeks ago” etc. Making sense of these requires almost AI-level date reasoning!
So as you can see, I was banging my head trying to handle all these edge cases. My parsers only worked for a narrow set of US-centric date formats and broke for anything slightly ambiguous. There had to be a better way!
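To make the pain concrete, here is a minimal sketch of the kind of hand-rolled parser I mean, using only the standard library. The name parse_with_formats and the short format list are my own illustrative choices, not code from any library; a real scraper would need far more entries, and relative phrases like “3 weeks ago” are out of reach entirely.

```python
from datetime import datetime

# Hypothetical, deliberately short list of formats to try in order.
KNOWN_FORMATS = [
    "%Y-%m-%dT%H:%M:%SZ",        # ISO 8601 with a UTC designator
    "%a, %d %b %Y %H:%M:%S %z",  # RFC 2822
    "%m/%d/%Y",                  # US-style numeric date
]


def parse_with_formats(text, formats=KNOWN_FORMATS):
    """Return the first successful strptime() result, or None."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


print(parse_with_formats("2011-11-04T00:05:23Z"))
print(parse_with_formats("Fri, 04 Nov 2011 00:05:23 +0000"))
print(parse_with_formats("25/12/2021"))   # European order: no format matches
print(parse_with_formats("3 weeks ago"))  # relative date: no format matches
```

Every new site tends to add another entry to the list, and the US-style entry quietly misreads any European date whose day is 12 or lower.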
Dateparser to the Rescue!
After struggling for months with my hand-written datetime parsers, I came across the dateparser library. And boy, was I glad to find it! dateparser is an open-source Python package that handles datetimes in a smart and intuitive way. Here's what makes it so useful:
- Robust parsers – It has a comprehensive set of parsing rules and regexes to extract dates from ISO8601, RFC2822, US, European and many other international formats.
- Contextual parsing – It looks at language, locale, key terms, and ordering to disambiguate between potential date matches. This enables it to parse even relative and incomplete dates.
- Well-tested – As a popular open-source library, dateparser has been battle-tested on millions of real-world dates and edge cases. So it works surprisingly well out-of-the-box.
- Extensible – Its rules-based parsing is customizable via settings and preferences. So you can plug it into non-English workflows.
- Active support – With ~2500 GitHub stars and regular updates, bugs and issues are quickly identified and fixed.
In summary, dateparser took away a ton of headache for me when it came to parsing ambiguous web-scraped dates. It could intelligently handle edge cases that my hand-rolled parsers failed at. The next sections walk through dateparser's features and usage in more depth.
Using Dateparser to Extract Datetimes
The main interface of dateparser is the parse() function. You pass it a string containing a date, and it returns a parsed Python datetime object:

    from dateparser import parse

    date_str = "Friday April 30 2021 10:30pm"
    parsed_date = parse(date_str)
    print(parsed_date)  # 2021-04-30 22:30:00
This handles many different date formats:
    parse("March 21, 2022")
    parse("03/21/2022")
    parse("Fri January 21, 2022")
    parse("2022-01-21 14:30:15Z")
Some things to note about parse():
- It aims to return a valid datetime object even for ambiguous inputs, based on best guesses from available context.
- If parsing fails, it returns None. So you may need additional validation after parse().
- By default, it returns a naive datetime without timezone information, unless the string itself carries a timezone. More on handling timezones later.
- The returned datetimes have a default time of 00:00:00 unless explicitly specified.
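Because a failed parse comes back as None rather than as an exception, it helps to keep failures visible in batch jobs. The helper below is a minimal sketch: split_parsed and iso_only are hypothetical names of mine, and the parser is passed in as an argument so the demo runs without dateparser installed. When no parser is given, it falls back to dateparser.parse.

```python
from datetime import datetime


def split_parsed(strings, parser=None):
    """Parse each string and return (parsed, failed).

    `parser` is any callable that returns a datetime or None. When omitted,
    dateparser.parse is imported and used (assumes dateparser is installed).
    """
    if parser is None:
        from dateparser import parse as parser
    parsed, failed = {}, []
    for text in strings:
        result = parser(text)
        if result is None:
            failed.append(text)  # keep the raw string for later inspection
        else:
            parsed[text] = result
    return parsed, failed


def iso_only(text):
    """Stand-in parser for the demo: accepts ISO 8601 strings only."""
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


parsed, failed = split_parsed(["2021-04-30 22:30:00", "not a date"], parser=iso_only)
print(parsed)
print(failed)  # ['not a date']
```

Logging the failed list per site is usually the quickest way to spot a format that needs extra settings.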
Now let's look at some ways dateparser handles ambiguous and incomplete dates.
Handling Ambiguous and Partial Dates
Date strings extracted from websites are often incomplete or ambiguous. For example:
- 01/02/2020 – Jan 2 or Feb 1?
- August 2021 – Which day in August?
- Friday – Friday of which week?
To handle such cases, dateparser has to make some “intelligent guesses” on missing components. You can guide its guessing by providing relevant context through settings.
Specifying Date Order
For ambiguous dates like 01/02/2020, the order format matters:

    # Spanish date format is DD/MM/YYYY
    parse('12/06/2021', settings={'DATE_ORDER': 'DMY'})
    # US format is MM/DD/YYYY
    parse('06/12/2021', settings={'DATE_ORDER': 'MDY'})
Valid DATE_ORDER values include:
- MDY – Month, Day, Year (US style)
- DMY – Day, Month, Year (European style)
- YMD – Year, Month, Day (ISO 8601 style)
So if you know the region of the website, you can pass the relevant format.
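One way to wire this into a scraper is a small lookup from the site's country to a settings dict. This is only a sketch: REGION_DATE_ORDER and settings_for_region are hypothetical names, and the handful of country entries are illustrative rather than a complete or authoritative table.

```python
# Hypothetical mapping from a site's country code to a dateparser DATE_ORDER.
# The entries are illustrative, not exhaustive.
REGION_DATE_ORDER = {
    "US": "MDY",
    "GB": "DMY",
    "DE": "DMY",
    "ES": "DMY",
    "JP": "YMD",
}


def settings_for_region(country_code, default="MDY"):
    """Build a settings dict suitable for dateparser.parse(..., settings=...)."""
    order = REGION_DATE_ORDER.get(country_code.upper(), default)
    return {"DATE_ORDER": order}


print(settings_for_region("es"))  # {'DATE_ORDER': 'DMY'}
print(settings_for_region("??"))  # {'DATE_ORDER': 'MDY'}
```

The result can then be passed straight through, e.g. parse('12/06/2021', settings=settings_for_region('ES')).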
Handling Missing Date Components
For incomplete dates like “August 2021”, you can guide dateparser on how to fill missing components:
    # Picking first day for dates with missing day
    parse('August 2021', settings={'PREFER_DAY_OF_MONTH': 'first'})
    # Choosing last day
    parse('August 2021', settings={'PREFER_DAY_OF_MONTH': 'last'})
    # When year is missing, prefer the most recent past occurrence
    parse('June 15', settings={'PREFER_DATES_FROM': 'past'})
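The PREFER_DATES_FROM setting accepts 'past', 'future' or 'current_period', and it matters for inputs that don't say which direction they point in, such as a bare “Friday” or a month and day with no year. Phrases like “3 weeks ago” or “next month” already state their direction.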
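To show what the 'first' and 'last' day preferences amount to for a month-only input, here is a small standard-library sketch. It is not dateparser's implementation, and resolve_day is a hypothetical name; it only mirrors the idea of filling in the missing day.

```python
import calendar
from datetime import date


def resolve_day(year, month, prefer="first"):
    """Pick a concrete day for a year/month pair that has no day."""
    if prefer == "first":
        day = 1
    elif prefer == "last":
        # monthrange() returns (weekday of the 1st, number of days in the month)
        day = calendar.monthrange(year, month)[1]
    else:
        raise ValueError(f"unsupported preference: {prefer!r}")
    return date(year, month, day)


print(resolve_day(2021, 8, "first"))  # 2021-08-01
print(resolve_day(2021, 8, "last"))   # 2021-08-31
print(resolve_day(2024, 2, "last"))   # 2024-02-29 (leap year)
```

Whichever preference you pick, apply it consistently across a dataset, since "August 2021" can otherwise land up to 30 days apart between runs.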
Parsing Ambiguous Times
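Times can also be ambiguous in terms of 12-hour vs 24-hour format. As far as I can tell from the settings documentation, dateparser has no setting that forces a 12-hour reading, so the input itself has to carry the am/pm marker:

    # 24-hour format is unambiguous
    parse('02/03/2021 15:00')
    # 12-hour format needs an explicit am/pm marker
    parse('02/03/2021 3:00 pm')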
So by using relevant settings, you can guide dateparser to return the correct datetime components even when the input string is incomplete or ambiguous.
Advanced Dateparser Settings
In addition to the above, dateparser provides many more settings and preferences to control parsing behavior:
Timezone Handling
    # Interpret the date as being in the UTC timezone
    parse('July 22, 2022 10:30pm', settings={'TIMEZONE': 'UTC'})
    # Convert parsed date to US/Pacific timezone
    parse('July 22, 2022 10:30pm', settings={'TO_TIMEZONE': 'US/Pacific'})
    # Return timezone-aware datetimes
    parse('2022-07-22T10:30:00Z', settings={'RETURN_AS_TIMEZONE_AWARE': True})
This enables handling datetimes from known timezones.
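The difference between interpreting a date in a timezone and converting it to a timezone is easy to blur. The sketch below shows the two operations with the standard library only; it illustrates the concept behind TIMEZONE and TO_TIMEZONE rather than dateparser's internals, and the fixed UTC-7 offset is my stand-in for US/Pacific in July.

```python
from datetime import datetime, timedelta, timezone

naive = datetime(2022, 7, 22, 22, 30)  # "July 22, 2022 10:30pm", no timezone

# Fixed offset standing in for US/Pacific in July (UTC-7). A real pipeline
# would use a named zone so daylight saving time is handled for it.
pacific_summer = timezone(timedelta(hours=-7), "PDT")

# Interpret: the wall-clock time stays the same, only the zone is attached.
as_utc = naive.replace(tzinfo=timezone.utc)

# Convert: the instant stays the same, the wall-clock time changes.
in_pacific = as_utc.astimezone(pacific_summer)

print(as_utc.isoformat())      # 2022-07-22T22:30:00+00:00
print(in_pacific.isoformat())  # 2022-07-22T15:30:00-07:00
```

Both values describe the same instant, which is why comparing or subtracting them is safe once they are timezone-aware.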
Culture-Specific Parsing
    # Parse date in French
    parse('22 Juillet 2022', languages=['fr'])
    # Parse in Japanese
    parse('2022年7月22日', languages=['ja'])

This is useful for non-English websites. dateparser's documentation lists support for over 200 language locales, and passing languages explicitly skips automatic language detection.
Miscellaneous Settings
    # Skip words like 'by', 'on', 'at' during parsing
    parse('by January 2025', settings={'SKIP_TOKENS': ['by']})
    # Use custom datetime formats (date_formats is an argument of parse(), not a setting)
    parse('10/11/2025', date_formats=['%d/%m/%Y'])
See the full settings docs for more options.
Limitations and Caveats of Dateparser
While dateparser is very useful, it's important to know its limitations:
- Performance – Being based on regex and rules, dateparser is not very fast when parsing large volumes of dates.
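- Language Support – dateparser ships data for over 200 language locales, including Chinese and Arabic, but in my experience coverage is uneven across them and automatic language detection can guess wrong on short strings. Pass languages explicitly when you know the source language.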
- Implicit Timezones – Common abbreviations such as EST in the input are recognized, but abbreviations like IST are inherently ambiguous (India, Israel and Ireland all use it). When you know the source timezone, provide it explicitly.
- Daylight Savings – Dateparser does not handle DST transitions or quirks automatically. The time may be off by an hour during DST periods.
- Limited Parser Customization – The PARSERS setting lets you enable, disable or reorder the built-in parsers, but there is no supported way to add your own rules. Beyond that, you have to rely on the pre-defined rule sets.
- Over-Eager Matching – parse() tries hard to return a date, so input that only loosely resembles one can still produce a result, while input it cannot match at all yields None. Post-validation is required.
So dateparser is not a one-stop solution for all datetime parsing needs. Critical applications may need additional checks and processing.
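On the performance point, scraped pages tend to repeat the same date strings many times, so memoizing the parser is a cheap first step. This is a sketch under that assumption: make_cached_parser is a hypothetical helper, and the demo uses a stand-in parser that counts its calls instead of dateparser itself.

```python
from functools import lru_cache


def make_cached_parser(parser, maxsize=10_000):
    """Wrap a single-argument parser so repeated strings are parsed once."""
    return lru_cache(maxsize=maxsize)(parser)


# Demo with a stand-in parser that records every call it receives.
calls = []


def slow_parser(text):
    calls.append(text)
    return text.upper()  # placeholder for an expensive parse


cached = make_cached_parser(slow_parser)
results = [cached(s) for s in ["1 May 2021", "1 May 2021", "2 May 2021"]]
print(results)
print(len(calls))  # 2: the duplicate string was served from the cache
```

Two caveats: a settings dict is not hashable, so bind it first (for example with functools.partial) and cache the resulting one-argument function; and relative inputs like “yesterday” should not be cached across days, because their correct answer changes over time.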
Best Practices for Production Datetime Parsing
Based on many years of hands-on experience parsing web-scraped dates, here are my recommendations:
- Validate after parsing – Always validate that the parsed datetime is within expected range. Discard outliers.
- Compare multiple parsers – Try parsing with dateparser, dateutil, datefinder etc. and take the consensus.
- Divide and conquer – Break up parsing into multiple steps – extract date first, then time, then consolidate.
- Culture-specific parsing – Detect language/locale and apply relevant settings for each parser.
- Look for clues – Utilize day-of-week, season, ordinal terms to pick the right date.
- Normalize early – Convert all dates to UTC or ISO format for easier storage and processing.
- Contribute fixes – For incorrect parses, dig into source code and add missing rules or regexes.
The key takeaway is not to blindly rely on any single parser. Use an ensemble of smart parsers and validators for catching edge cases.
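The first two recommendations, validating the range and comparing parsers, can be combined in a few lines. The following is a minimal sketch rather than a production design: in_expected_range and consensus_parse are hypothetical names, the parsers are assumed to return naive datetimes (mixing naive and aware values would raise a TypeError on comparison), and the demo uses two strptime-based stand-ins where a real pipeline would pass dateparser.parse, dateutil.parser.parse and so on.

```python
from collections import Counter
from datetime import datetime


def in_expected_range(dt, earliest, latest):
    """True if dt is a datetime inside the inclusive [earliest, latest] window."""
    return dt is not None and earliest <= dt <= latest


def consensus_parse(text, parsers, earliest, latest):
    """Run several parsers and return the result a strict majority agrees on.

    Each parser is a callable returning a naive datetime or None. A parser
    that raises ValueError or OverflowError counts as a failed parse.
    Returns None when no in-range result wins more than half the parsers.
    """
    votes = Counter()
    for parser in parsers:
        try:
            result = parser(text)
        except (ValueError, OverflowError):
            continue
        if in_expected_range(result, earliest, latest):
            votes[result] += 1
    if not votes:
        return None
    winner, count = votes.most_common(1)[0]
    return winner if count * 2 > len(parsers) else None


# Stand-in parsers for the demo.
def us_style(text):
    return datetime.strptime(text, "%m/%d/%Y")


def eu_style(text):
    return datetime.strptime(text, "%d/%m/%Y")


window = (datetime(2000, 1, 1), datetime(2030, 1, 1))
print(consensus_parse("25/12/2021", [us_style, eu_style, eu_style], *window))
print(consensus_parse("01/02/2020", [us_style, eu_style], *window))  # None: parsers disagree
```

Returning None on disagreement is deliberate: an ambiguous date that gets flagged for review is cheaper than a wrong date that silently enters the dataset.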
Summary
Handling the myriad datetime formats on the web is tricky but unavoidable for any real-world web scraping pipeline. dateparser is an invaluable tool for tackling this problem. It gets rid of the pain of writing complex datetime parsing logic yourself. I hope these dateparser tips and tricks from my web scraping career help you efficiently parse the dates and times in your own projects!