Extracting Published Dates from web pages
03 May 2019

One objectively useful piece of information often present on news and blog
articles online is the date of publication.
It can be used to determine how fresh and relevant an article is and, when used
in conjunction with other processing, allows you to get a feel for the subject
of the article, be it a company, person or event.
At Synoptica I worked on improving the accuracy of getting such a published date (and sometimes time).
Interestingly, this is a harder task than you might naively believe it to be.
Challenges
Surely you can just grab the first date you see on a web page and be done
with it right? Nope.
Often a web page will have many dates on it, some from other articles,
adverts or even today’s date.
So grabbing the first one you see is not good enough, but even if it were we’d
still have the problem of deciding what counts as a date.
Americans like to use the (confusing) date format `mm/dd/yyyy` compared to
`dd/mm/yyyy`, but websites often also use a textual representation of the date,
like `Monday 21st January 2008` or `August 3, 2009`, with many variations of
order and punctuation.
Some websites might not even include the date of an article on its page as
text, but instead encode it in the URL, like
`https://example.com/2012/1/1/happy-new-year`, or in metadata elements within
the page.
But there must be “a standard” for presenting dates on articles online right? Otherwise, how do the likes of Google, Bing, Twitter, Facebook and others show nice summaries of web links to news articles?
Well, there are standards, plural:

- The Open Graph protocol is a standard that allows for adding rich metadata to a web page that the likes of Facebook consume; this includes `article:published_time`, which can be used for determining a published date.
- schema.org promotes another (large) standard for adding metadata to web pages, including quite a few different variations of dates that can be used as published dates. It can also be encoded as JSON-LD embedded in a web page.
- Many content management systems (CMSs) and publishers also have their own ways of encoding a published date in a page’s metadata or tags. E.g. The Wall Street Journal uses the `<meta>` tag named `article.published` and Metro uses one named `sailthru.date`.
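To make these concrete, here is roughly what each encoding can look like in a page’s markup (the values here are invented for illustration):

```html
<!-- Open Graph: consumed by Facebook and others -->
<meta property="article:published_time" content="2019-05-03T09:00:00+01:00" />

<!-- schema.org metadata encoded as JSON-LD -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "datePublished": "2019-05-03T09:00:00+01:00"
}
</script>

<!-- Publisher-specific <meta> tags -->
<meta name="article.published" content="2019-05-03T09:00:00Z" />
<meta name="sailthru.date" content="2019-05-03 09:00:00" />
```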
Solving it
So that’s 3 places we can get a published date (and maybe time) from:
- Text within the web page
- The web page’s URL
- The metadata and tags on a web page
All these can be in various formats so we’d need to be able to parse them to dates (and maybe times for some) reliably.
A web page might have many dates on it so we’d also want to be able to determine which one is the most likely published date as well.
We’d want to cover as many web pages as possible and make it easy to add more “rules” to such a system, so we can incrementally improve it.
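To give a feel for the parsing side, here is a minimal sketch in Java (the language the library was written in), assuming just a handful of the formats mentioned above; real coverage needs many more patterns:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

public class DateParser {
    // A few of the many formats seen in the wild. Ambiguous dates like
    // 03/04/2008 resolve to whichever format is tried first.
    private static final List<DateTimeFormatter> FORMATS = List.of(
            DateTimeFormatter.ISO_LOCAL_DATE,                           // 2008-02-21
            DateTimeFormatter.ofPattern("MM/dd/yyyy"),                  // 02/21/2008 (US)
            DateTimeFormatter.ofPattern("dd/MM/yyyy"),                  // 21/02/2008
            DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH) // August 3, 2009
    );

    /** Tries each known format in turn, returning the first successful parse. */
    public static Optional<LocalDate> parse(String text) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDate.parse(text, format));
            } catch (DateTimeParseException ignored) {
                // Not this format; try the next one.
            }
        }
        return Optional.empty();
    }
}
```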
A decent solution to this (and the one I took) would be to:

- Encode all the common patterns into their own rules. For instance, create a process to decode a URL and extract anything that looks like a date in it, like `https://example.com/2008-02-21/article` or `https://example.com/2012/12/15/article` (a sketch of this follows below); other processes for the common metadata patterns; and more still for searching for date patterns in a page’s content.
- With the dates output by the above processes, evaluate them and determine a most likely published date.
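As an example of the first kind of rule, a URL-based extractor might look for date-like path segments with a regular expression. A minimal sketch, with a hypothetical `UrlDateExtractor` that is not the original code:

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlDateExtractor {
    // Matches yyyy-mm-dd or yyyy/mm/dd path segments,
    // e.g. /2008-02-21/ or /2012/12/15/.
    private static final Pattern URL_DATE =
            Pattern.compile("/(\\d{4})[-/](\\d{1,2})[-/](\\d{1,2})(?:/|$)");

    public static Optional<LocalDate> extract(String url) {
        Matcher matcher = URL_DATE.matcher(url);
        if (matcher.find()) {
            try {
                return Optional.of(LocalDate.of(
                        Integer.parseInt(matcher.group(1)),
                        Integer.parseInt(matcher.group(2)),
                        Integer.parseInt(matcher.group(3))));
            } catch (DateTimeException e) {
                // Looked like a date but was not a valid one, e.g. /2012/34/56/.
            }
        }
        return Optional.empty();
    }
}
```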
The first step is simple enough and can be structured in such a way that it is
easy to add more rules as you discover them and to modify existing ones.
I did this using classical Java and object-oriented patterns: I defined a
common interface for processing a web page and its URL, then implemented types
based on this interface, taking advantage of inheritance to divide similar
rules into simple type hierarchies to reuse common behaviours.
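The interface itself can stay small. A minimal sketch of the kind of structure described above (the names are illustrative, not the library’s actual API):

```java
import java.net.URI;
import java.time.ZonedDateTime;
import java.util.List;

/** A single rule that knows how to find candidate published dates. */
interface PublishedDateRule {
    /** Returns any candidate dates found; empty if the rule does not apply. */
    List<ZonedDateTime> extract(URI url, String html);
}

/** Shared behaviour for rules that read a single named <meta> tag. */
abstract class MetaTagRule implements PublishedDateRule {
    /** Subclasses only state which name or property identifies their tag. */
    protected abstract String tagName();

    @Override
    public List<ZonedDateTime> extract(URI url, String html) {
        // Locating the <meta> tag named tagName() and parsing its content
        // attribute would live here, shared by every subclass
        // (elided in this sketch).
        return List.of();
    }
}

/** Open Graph's article:published_time, consumed by Facebook and others. */
class OpenGraphRule extends MetaTagRule {
    @Override
    protected String tagName() {
        return "article:published_time";
    }
}
```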
The second step requires some intelligence and reasoning. One simple algorithm might be to use the date that appears the most for a given page, ordering those with similar counts by recency. A more intelligent one might weight the output of the various rules, since some are more likely to be correct than others; e.g. a date found in the URL is usually right.
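A minimal sketch of that counting algorithm (again with illustrative names, not the original code); a weighted variant would replace the raw count with a per-rule score:

```java
import java.time.ZonedDateTime;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PublishedDatePicker {
    /**
     * Picks the candidate date reported most often across all rules,
     * breaking ties in favour of the most recent date.
     */
    public static Optional<ZonedDateTime> pick(List<ZonedDateTime> candidates) {
        // Count how many times each candidate date was reported.
        Map<ZonedDateTime, Long> counts = candidates.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        Comparator<Map.Entry<ZonedDateTime, Long>> byCount = Map.Entry.comparingByValue();
        Comparator<Map.Entry<ZonedDateTime, Long>> byRecency = Map.Entry.comparingByKey();
        return counts.entrySet().stream()
                .max(byCount.thenComparing(byRecency))
                .map(Map.Entry::getKey);
    }
}
```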
With these steps in place you could now easily evaluate a list of URLs and determine when they were published, outputting that information in a standard way (ISO 8601 is nice), e.g.
| URL | Published Date (with time) |
|---|---|
| https://edition.cnn.com/2018/09/02/health/cuba-china-state-department-microwaves-sonic-attacks/index.html | 2018-09-02T20:13:34Z |
| https://www.bbc.co.uk/news/business-45394226 | 2018-09-03T14:27:52+01:00 |
| https://www.dawn.com/news/1430365 | 2018-09-02T00:55:02Z |
| https://dolphin-emu.org/blog/2018/09/01/dolphin-progress-report-august-2018 | 2018-09-01T00:00:00Z |
| https://www.wsj.com/articles/new-speed-bump-planned-for-u-s-stock-market-1535713321 | 2018-08-31T11:02:00Z |
| https://www.bbc.co.uk/news/uk-england-surrey-44291716 | 2018-05-29T15:49:31+01:00 |
Conclusion
I built such a library in Java over a short period of time; it covered many
cases across many web pages, 51 in my original hand-curated tests.
It could handle at least 19 different date formats that I had seen, some
including time and offset information.
It was also easily extensible, with tooling allowing me to explore pages for
potential new patterns and improvements to existing ones.
In an effort to be both polite and efficient I saved local copies of each web
page I was testing and built the tests and tooling around these, so it became
easy to add more rules, and more web pages, as needed.
At Synoptica the extracted date is used to
determine which articles are relevant to a company and to score companies in
various categories like funding, corporate social responsibility, security,
recruitment etc.
The more accurate the date (and perhaps even time), the more accurate those results are.
As this was done for work, the code is the property of Synoptica, but
the process itself should be relatively easy to reproduce; in fact there
already exists at least one project that does something similar in
Python for those requiring a ready-made solution:
Webhose article-date-extractor.
I am unsure of the licensing of this project, however, and do not know if it is
actively maintained. I encountered it when I had already developed most of the
rules in my own project and was relieved to see they had taken a similar
approach to my own (albeit in differently structured Python code).
Overall this was an interesting self-contained problem that was well defined and relatively easy to solve. I’d be tempted to attempt a solution again in my own time and evaluate potential alternatives and new possible patterns.
One interesting possibility would be to implement this using machine learning to assist in step 2. That is, if given enough context around each extracted date you could potentially train a model that would decide which is the most likely date for a given web page. A similar approach is taken to eliminate adverts and boilerplate parts of web pages by the project Dragnet. I had this idea initially but thought a heuristic based approach would be faster to implement and give good enough results.