Designing a Comment System

In my previous post I mentioned one possible solution for adding comments to my blog using the built-in support for data files in Jekyll. This approach was pioneered by Damien Guard.
In this post I hope to have a crack at designing such a system myself and implementing it.

What do I want?

My first step in designing this comment system will be to decide what my goals are.

  • Foremost I want to allow people to leave comments on my blog (obviously)
  • Adding comments should be relatively easy
    • The format should be a common one; I am leaning toward Markdown as it is the format used by Jekyll and is familiar from sites such as Reddit
  • I want to curate all incoming comments and approve them if they seem legitimate
    • I’ve had experience in the past running WordPress blogs where there were a lot of spam and irrelevant comments that it would have been nice to filter out
    • Manual approval would likely be fine for me since my blog is low traffic
    • I could still augment the approval process so that obvious spam comments are filtered out automatically
    • Being able to block certain known offenders would be a nice feature as well; obviously this is non-trivial, but a simple IP blacklist could help
    • Using a CAPTCHA, or a honeypot-style alternative, would obviously be advantageous
    • Browser fingerprinting could be another technique to detect when many requests come from the same source
  • I want to preserve comments in a format that is easy to store, process and potentially migrate
    • In addition to this, the comments should be stored in a static way that is in keeping with Jekyll’s general approach
  • I want to allow users to have an avatar if they desire it
    • Gravatar is quite popular and would be nice to support
    • Twitter profiles may be useful to support
    • GitHub profiles again would be useful to support
  • The comment system should be relatively lightweight
    • By this I mean there shouldn’t be too many moving parts to it, and it should not require any heavyweight systems
    • I am thinking of running this on an EC2 instance or as AWS Lambda functions, so ideally nothing should be processing intensive

That’s quite a few things I want, but it is all fairly doable.

Some drawbacks of such a design are:

  • No guarantee that commenters are who they say they are
    • This extends further: multiple comments are not guaranteed to be from the same person. In a way this is much like a traditional guestbook on older websites
  • Comments will take time to appear on the website
    • Since they will be merged into the blog via GitHub, they will take a non-trivial amount of time to be approved
    • While waiting for a comment to be approved, a user may not realise this and attempt to leave another

Overall I am willing to live with these drawbacks, at least for the moment.

The comment system boils down to the following kind of top-level flow:

Simple system diagram

  1. The user reads a post from the blog
  2. The user decides to leave a comment
  3. The comment system determines if the comment should be allowed and updates the blog

This diagram does simplify parts of the design, like approving the comments; however, approval could be seen as outside the current scope since it would be an external process.

Some Prototyping

I often find it helpful to work through and prototype some ideas roughly before implementing them properly.

Input Data

One such prototype that is (in my opinion) always helpful is thinking about what kind of inputs a system will take and what outputs it will produce.
So in this example a comment might look something like the following (as JSON for ease of reading):

{
  "uuid": "UUID",
  "post": "post-id or permalink",
  "displayName": "User display name",
  "avatar": "URL to an avatar",
  "webLink": "a URL provided",
  "comment": "markdown comment"
}

Some of these fields could be optional. I might also want to include a client-generated date with the data if I want to be able to show when a comment was posted and in what timezone/offset.

You might notice I included a UUID in my fields; I find these useful as correlation IDs when debugging a system.
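
Putting that together, a fuller version of the example with a client-generated date might look like:

{
  "uuid": "UUID",
  "post": "post-id or permalink",
  "displayName": "User display name",
  "avatar": "URL to an avatar",
  "webLink": "a URL provided",
  "comment": "markdown comment",
  "dateTime": "2019-05-23T09:34:43.581Z"
}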

So this is the data provided for the comment, but there will also be some additional data associated with the comment request:

  • Time and date of the request
  • Source of the request (IP address etc.)
  • Browser metadata and headers

This data could all be used in conjunction with the user-generated data, especially the date and time.

Form Prototype

I am not the most visual person, as you can probably tell by the simple design of this blog. Nevertheless, it is important to decide and visualise how the comment form might look. Below is my attempt:

Leave a comment

It’s a relatively simple form that relies on the HTML5 form elements and the default theme styling.

For an actual implementation I may add some additional checks and use an AJAX request instead of a form submit action.

The advantages of using an AJAX call are:

  • I can do some checks on the client side
    • I could prevent sending bad form data, perhaps even do some client-side checks for the existence of user-provided links
    • I could reduce the chances of duplicate requests being sent
  • I can decide upon the encoding of the data and add any additional data to the request
  • I can react to the response on the blog post page without navigating away
  • Basic spam-bots and web-crawlers that do not render JavaScript won’t be able to post comments

Of course there are disadvantages too, mainly that browsers with limited or no JavaScript support won’t be able to leave comments. This may include some screen-reading software used by the visually impaired. However, since the display of comments will be handled by Jekyll and its Liquid templating language, existing comments will still be readable in any browser.
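
To make this concrete, here is a minimal sketch of what the client-side submission could look like. The /comments endpoint, the form id and the field names are assumptions for illustration rather than a finished design:

// Minimal sketch of submitting a comment without navigating away.
// The endpoint, form id and field names are assumptions, not a final design.
const form = document.querySelector('#comment-form');
form.addEventListener('submit', async (event) => {
  event.preventDefault(); // stop the default form submission
  const payload = {
    uuid: crypto.randomUUID(), // correlation ID (modern browsers only)
    post: window.location.pathname, // identify the post being commented on
    displayName: form.elements['displayName'].value,
    comment: form.elements['comment'].value,
    dateTime: new Date().toISOString() // client-generated date with offset info
  };
  const response = await fetch('/comments', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload)
  });
  // React to the result on the page without leaving it
  alert(response.ok ? 'Comment submitted for approval.' : 'Something went wrong.');
});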

Experimenting with Data Files

I am not super familiar with Jekyll’s data files and the Liquid templating language so I thought it prudent to research and experiment with them more.

Jekyll supports YAML, JSON, CSV and TSV files. Out of these file types, YAML and JSON are probably the best suited to storing comments since they do not use commas or tabs as separators. My personal preference between YAML and JSON is JSON, mostly because it is a simpler format with much wider support, including native JavaScript support in web browsers.

Data files are stored in the _data folder and can be placed in subfolders, which is good because it makes it easier to organise the comments I receive into folders based on the posts they are for.

Jekyll makes data accessible by namespace. The example given in the documentation uses the files _data/orgs/jekyll.yml and _data/orgs/doeorg.yml, which are associated with the namespace site.data.orgs and accessible when iterating over that namespace’s members.

Applying this to comments I can see several possible ways of implementing such a system:

Folders for each blog post

Since blog posts in Jekyll are stored as markdown files they can be identified with names that are safe to use in the file system. For example a blog post might be named 2019-05-09-comments-on-static-blog.md on the file system, which translates to the link https://lyndon.codes/2019/05/09/comments-on-static-blog/.
Now I can use that filename as a folder in the data directory, something like _data/comments/2019-05-09-comments-on-static-blog/, and store all comments for that given post in that folder.
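
The on-disk layout could end up looking something like this (the second post and the file names here are placeholders, just for illustration):

_data/
  comments/
    2019-05-09-comments-on-static-blog/
      comment-1.json
      comment-2.json
    2019-05-01-another-post/
      comment-1.json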

The benefits to this are:

  • It is easy to keep track of all comments for each post
  • It’s easy to migrate comments with posts if you change post names or even move to a new blogging system.
  • When merging in new comments they can all be kept in separate files, reducing problems with merges in Git.

One potential issue with this is ordering comments by their posted time; thankfully Liquid seems to support this with filters. Even if it didn’t, comments could be given filenames that sort correctly, using an incrementing count or just the current time.

A single file for each blog post

A similar approach to using folders with a file per comment would be using a single file per blog post. This has similar benefits but other drawbacks, like merges being harder when a blog post has multiple comments awaiting approval.

Testing

So a test of the folder approach, with _data/test/comments/ containing three files named _0.json, _1.json and _2.json, would look something like this:

{% for comment in site.data.test.comments %}
* {{ comment }} 
{% endfor %}

With the rendered output of:

  • _0{"uuid"=>"example UUID", "displayName"=>"John Smith", "comment"=>"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.", "dateTime"=>"2019-05-23T09:34:43.581Z"}

  • _2{"uuid"=>"example UUID", "displayName"=>"Mr Test", "comment"=>"Foo bar \n abc 123 xyz foo.bar()", "dateTime"=>"2019-05-23T09:21:01.120Z"}

  • _1{"uuid"=>"example UUID", "displayName"=>"Jane Test", "comment"=>"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.", "dateTime"=>"2019-05-23T09:37:21.520Z"}

Notice that the prefix before each item is the filename: when iterating over a data folder, Liquid yields filename and data pairs. So to render the data within we need to select the second element of each pair. Something like the following could render the comments:

<table>
  <thead>
    <tr>
      <th>DateTime</th>
      <th>Author</th>
      <th>Comment</th>
    </tr>
  </thead>
  <tbody>
    {% for comment_hash in site.data.test.comments %}
    {% assign comment = comment_hash[1] %}
    <tr>
      <td>{{ comment.dateTime }}</td>
      <td>{{ comment.displayName }}</td>
      <td>{{ comment.comment }}</td>
    </tr>
    {% endfor %}
  </tbody>
</table>

Which would render to:

DateTime Author Comment
2019-05-23T09:34:43.581Z John Smith Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
2019-05-23T09:21:01.120Z Mr Test Foo bar abc *123* _xyz_ `foo.bar()`
2019-05-23T09:37:21.520Z Jane Test Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

One important thing you might notice is that the first example used a Markdown list to render the comments, which meant the Markdown in the comment text was automatically rendered too.
In the second, HTML example the comment text is not parsed as Markdown and is rendered raw. This means that the special Markdown characters appear literally and the newline character is added to the document itself.
I want to support Markdown in my comments so I’ll have to use a Liquid filter like {{ comment.comment | markdownify }}.
With that in place the comment with markdown would be rendered like so:

Foo bar abc 123 xyz foo.bar()

Remember that line breaks aren't added in Markdown unless there is a blank line (for a new paragraph) or the preceding line ends with two spaces (for a manually added line break).

Additionally Jekyll provides Liquid filters for displaying dates in a friendlier manner. These include:

  • date_to_xmlschema
  • date_to_rfc822
  • date_to_string
  • date_to_long_string

So the date 2019-05-23T09:21:01.120Z could be rendered as:

  • 2019-05-23T09:21:01+00:00
  • Thu, 23 May 2019 09:21:01 +0000
  • 23 May 2019
  • 23rd May 2019

Or some other variations based on possible settings for the filters.
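
Applied in a template these look like the following; the "ordinal" style argument on the last filter is one such setting, and is what produces the 23rd May 2019 variant:

{{ comment.dateTime | date_to_xmlschema }}
{{ comment.dateTime | date_to_rfc822 }}
{{ comment.dateTime | date_to_string }}
{{ comment.dateTime | date_to_long_string: "ordinal" }}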

Closing

This post has got quite long already so I will end it here.

You can see a lot of the ideas and thought process happening here; however, I still haven’t looked into the pull request side of this system. I am aware of JGit for Java, which could be used to create branches in Git, and have had a quick look at the GitHub API for creating pull requests, but I will write a separate post on that.

Comments on a Static Blog

I have been considering adding comments to my blog for a little while now.

Other people with static blogs built on Jekyll tend to use Disqus for their comments. Disqus is a purpose-built third-party plugin for embedding comments in websites and works quite well.
However, as Victor Zhou outlined in his blog post, Disqus is actually a very heavy plugin that makes a lot of external requests. It is also closed source and powered by adverts (at least at the free tier).
Victor actually recommended an alternative called Commento, which is much more lightweight (in size and requests), open source, values privacy and can even be self-hosted if desired.

When looking for alternatives to Disqus I also came across a blog post by Phil Haack mentioning quite a cool solution that Damien Guard had suggested to him: using Jekyll data files and pull requests on GitHub to make a comment system that is static.
This approach seems quite cool in my opinion and would be a nice self-contained project to have a go at implementing.

Stay tuned to see what kind of solution I go with. I may well opt to continue not having any comments on my blog.

Extracting Published Dates from web pages

One objectively useful piece of information often present on news articles and blog posts online is the date of publication.
It can be used to determine how fresh and relevant an article is, and, when used in conjunction with other processing, allows you to get a feel for the subject of the article, be it a company, person or event.

At Synoptica I worked on improving the accuracy of getting such a published date (and sometimes time).

Interestingly this is a harder task than you might naively believe it to be.

Challenges

Surely you can just grab the first date you see on a web page and be done with it, right? Nope.
Often a web page will have many dates on it: some from other articles, some from adverts, some even being today’s date.

So grabbing the first date you see is not good enough, but even if it were we’d still have the problem of deciding what counts as a date.
Americans like to use the (confusing) mm/dd/yyyy date format rather than dd/mm/yyyy, but websites often also use a textual representation of the date like Monday 21st January 2008 or August 3, 2009, with many variations of order and punctuation.

Some websites might not even include the date of an article on its page in text, but instead encode it in the URL, like https://example.com/2012/1/1/happy-new-year, or in elements within the page.

But there must be “a standard” for presenting dates on articles online right? Otherwise, how do the likes of Google, Bing, Twitter, Facebook and others show nice summaries of web links to news articles?

Well there are standards, plural:

  • The Open Graph protocol is a standard for adding rich metadata to a web page that the likes of Facebook consume; this includes article:published_time, which can be used to determine a published date (see the example after this list)
  • schema.org promotes another (large) standard for adding metadata to web pages, including quite a few different date variations that can serve as published dates. It can also be encoded as JSON-LD embedded in a web page
  • Many content management systems (CMSs) and publishers also have their own ways of encoding a published date in a page’s metadata or tags. E.g. The Wall Street Journal uses a <meta> tag named article.published and Metro uses one named sailthru.date
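
As an example, the first two standards look something like this within a page’s <head> (the values here are placeholders):

<!-- Open Graph, consumed by the likes of Facebook -->
<meta property="article:published_time" content="2018-09-02T20:13:34Z" />

<!-- schema.org metadata encoded as JSON-LD -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "datePublished": "2018-09-02T20:13:34Z"
}
</script>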

Solving it

So that’s 3 places we can get a published date (and maybe time) from:

  • Text within the web page
  • The web page’s URL
  • The metadata and tags on a web page

All these can be in various formats so we’d need to be able to parse them to dates (and maybe times for some) reliably.

A web page might have many dates on it so we’d also want to be able to determine which one is the most likely published date as well.

We’d want to cover as many web pages as possible and make it easy to add more “rules” to such a system, so we can incrementally improve it.

A decent solution to this (and the one I took) would be to:

  1. Encode all the common patterns into their own rules.
    For instance, create a process to decode a URL and extract anything that looks like a date in it, like https://example.com/2008-02-21/article or https://example.com/2012/12/15/article; other processes for the common metadata patterns; and more still for searching for date patterns in a page’s content.
  2. With the dates output by the above processes, evaluate them and determine the most likely published date.

The first step is simple enough and can be structured in such a way that it is easy to add more rules as you discover them and to modify existing ones.
I did this using classical Java and object-oriented patterns: I defined a common interface for processing a web page and its URL, then implemented types based on this interface, taking advantage of inheritance to divide similar rules into simple type hierarchies that reuse common behaviours.

Simple class diagram
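
To illustrate the shape of this, here is a minimal sketch; the names are my own invention for this post, not the actual Synoptica code:

import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of a common rule interface; names and
// signatures are invented for this example.
interface PublishedDateRule {
    List<ZonedDateTime> extract(String url, String pageHtml);
}

// Example rule: dates encoded in the URL path, e.g. /2012/12/15/article
class UrlDateRule implements PublishedDateRule {
    private static final Pattern URL_DATE =
            Pattern.compile("/(\\d{4})[/-](\\d{1,2})[/-](\\d{1,2})");

    @Override
    public List<ZonedDateTime> extract(String url, String pageHtml) {
        List<ZonedDateTime> candidates = new ArrayList<>();
        Matcher matcher = URL_DATE.matcher(url);
        while (matcher.find()) {
            try {
                candidates.add(LocalDate.of(
                                Integer.parseInt(matcher.group(1)),
                                Integer.parseInt(matcher.group(2)),
                                Integer.parseInt(matcher.group(3)))
                        .atStartOfDay(ZoneOffset.UTC));
            } catch (java.time.DateTimeException e) {
                // Matched numbers that are not a valid calendar date; skip
            }
        }
        return candidates;
    }
}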

The second step requires some intelligence and reasoning. One simple algorithm might be to use the date that appears most often for a given page, ordering those with similar counts by recency. A more intelligent one might weight the output of the various rules, since some are more likely to be right than others, e.g. the date in a URL is likely correct.
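
A sketch of the simple counting algorithm could look like this (again illustrative only):

import java.time.LocalDate;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Pick the candidate date seen most often,
// breaking ties in favour of the more recent date.
final class PublishedDatePicker {
    static Optional<LocalDate> mostLikely(List<LocalDate> candidates) {
        Map<LocalDate, Long> counts = new HashMap<>();
        for (LocalDate date : candidates) {
            counts.merge(date, 1L, Long::sum); // count occurrences of each date
        }
        return counts.entrySet().stream()
                .max(Map.Entry.<LocalDate, Long>comparingByValue()
                        .thenComparing(Map.Entry.comparingByKey()))
                .map(Map.Entry::getKey);
    }
}

Weighting the rules would then just mean replacing the raw counts with per-rule scores.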

With these steps in place you could now easily evaluate a list of URLs and determine when they were published, outputting the information in a standard way (ISO 8601 is nice), e.g.

URL Published Date (with time)
https://edition.cnn.com/2018/09/02/health/cuba-china-state-department-microwaves-sonic-attacks/index.html 2018-09-02T20:13:34Z
https://www.bbc.co.uk/news/business-45394226 2018-09-03T14:27:52+01:00
https://www.dawn.com/news/1430365 2018-09-02T00:55:02Z
https://dolphin-emu.org/blog/2018/09/01/dolphin-progress-report-august-2018 2018-09-01T00:00:00Z
https://www.wsj.com/articles/new-speed-bump-planned-for-u-s-stock-market-1535713321 2018-08-31T11:02:00Z
https://www.bbc.co.uk/news/uk-england-surrey-44291716 2018-05-29T15:49:31+01:00

Conclusion

I built such a library in Java over a short period of time, covering many cases across many web pages: 51 in my original hand-curated tests.
It could handle at least 19 different date formats that I had seen, some including time and offset information.
It was also easily extensible, with tooling allowing me to explore pages for potential new patterns and improvements to existing ones.
In an effort to be both polite and efficient I saved local copies of each web page I was testing and built the tests and tooling around these, so it became easy to add more rules, and more web pages, when I needed to.

At Synoptica the extracted date is used to determine which articles are relevant to a company, and to score companies in various categories like funding, corporate social responsibility, security, recruitment etc.
The more accurate the date (and maybe even the time), the more accurate the results.

As this was work related, the code is the property of Synoptica, but the process itself should be relatively easy to reproduce. In fact there already exists at least one project that does something similar in Python for those requiring a ready-made solution: Webhose article-date-extractor.
I am unsure of the licensing of this project, however, and do not know if it is actively maintained. I encountered it when I had already developed most of the rules in my own project and was relieved to see they had taken a similar approach to my own (albeit in differently structured Python code).

Overall this was an interesting self-contained problem that was well defined and relatively easy to solve. I’d be tempted to attempt a solution again in my own time and evaluate potential alternatives and new possible patterns.

One interesting possibility would be to implement this using machine learning to assist in step 2. That is, given enough context around each extracted date, you could potentially train a model to decide which is the most likely date for a given web page. A similar approach is taken by the Dragnet project to eliminate adverts and boilerplate parts of web pages. I had this idea initially but thought a heuristic-based approach would be faster to implement and give good enough results.

Vertx Config Toml

Today I have quickly thrown together a library to add support for TOML to the Vert.x Configuration service.

The repository can be viewed here: https://github.com/LyndonArmitage/vertx-config-toml

I did this after reading two blog posts on why JSON and YAML are not so great as configuration file formats, and then looking at some of the applications I was working on that use JSON as a configuration format. The two blog posts in question were by Martin Tournoij (arp242), the first about JSON and the second about YAML. Both were written in 2016, but the YAML one became popular on Reddit this week.
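
Usage should look something like the following sketch, assuming the library registers a "toml" format with the standard vertx-config file store (see the repository README for the actual details):

import io.vertx.config.ConfigRetriever;
import io.vertx.config.ConfigRetrieverOptions;
import io.vertx.config.ConfigStoreOptions;
import io.vertx.core.Vertx;
import io.vertx.core.json.JsonObject;

// Sketch: loading a TOML file through vertx-config.
// Assumes the library registers the "toml" format with the file store.
public class TomlConfigExample {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        ConfigStoreOptions store = new ConfigStoreOptions()
                .setType("file")
                .setFormat("toml")
                .setConfig(new JsonObject().put("path", "config.toml"));
        ConfigRetriever retriever = ConfigRetriever.create(
                vertx, new ConfigRetrieverOptions().addStore(store));
        retriever.getConfig(result -> {
            if (result.succeeded()) {
                JsonObject config = result.result();
                System.out.println(config.encodePrettily());
            }
        });
    }
}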

Forced use of HTTPS on Blog

Quick update to this blog.
I have configured and forced the use of HTTPS.

Previously I started to look into this but had forgotten about it until a recent blog post by Scott Helme reminded me of just how important HTTPS is. For more information on why HTTPS should be the default for all websites I suggest this link: https://doesmysiteneedhttps.com/

This site is hosted on GitHub pages and uses Jekyll as a static site template. Enabling HTTPS was simple enough, below is a list of useful links in case anyone struggles:

The most annoying parts were waiting for DNS entries to update and then editing files in my Jekyll settings to make sure https:// was used instead of http:// where appropriate.

In other news, I have noticed I need to fix some formatting in my previous post so it looks good on smaller devices; unfortunately I did not have Jekyll running on the machine I wrote that post on, so I could not preview it before publishing.