
Webscale website webstats

My personal website has been statically generated for a long time. A couple of months ago I moved it to Netlify, but it’s been on lots of hosts, from the mammoth GitHub Pages to the nostalgic Neocities (who I honestly cannot recommend enough). There was even a strange, brief period where I hosted it myself with a bespoke nginx build.

Static hosts often don’t provide even basic hit counts or statistics. After all, there are almost-universal services like Google Analytics for that, so I understand why they don’t build it. Neocities provides basic hit counting, but it’s for the entire website, not individual pages.

For the privacy-minded webmistress who prefers not to hand her visitors’ browser resolutions and other unnecessary data to the likes of Google, but still needs to stay within her web font license and feed her ego, what else is there to do?

Build a Rube Goldberg machine, I guess.

Privacy-focused beacons

Beacons, or tracking pixels, have a lot of potentially dangerous uses, but in the simplest case they provide very little user data: IP address, the URL the visitor hit, and the browser’s user agent. This is all information that’s transmitted by a user visiting your page anyway.

Beacons become dangerous when they send additional information (perhaps via the query string) or set cookies without the user’s consent. Currently, EasyPrivacy (shipped by default in uBlock Origin) only blocks resources named pixel.gif if they contain a query string.

Calling it what it is instead of trying to hide from privacy filters is the only ethically correct thing to do; let privacy-conscious visitors know what you’re doing.

Where should the pixel live?

It’d be simple enough to host a pixel somewhere that provides request logs, but there are a few advantages to the request being same-origin: you don’t have to add to your Content-Security-Policy header, and browsers can request the pixel on the same connection they used for the rest of your content.

Netlify supports request proxying, bizarrely enough, and it works seamlessly. So that solves the same-origin desire.

Netlify uses a worldwide CDN, so in order to minimize request time to load the beacon, we want our beacon’s host to be as close as possible to the… worldwide CDN.

The solution I picked was to push the beacon content out to Amazon CloudFront via Lambda@Edge. Each CloudFront POP responds to a request as quickly as it can, then takes its time throwing the logs into an S3 bucket.

Here’s the CloudFront origin request function I’m using (nodejs8.10 runtime):

exports.handler = async (event) => {
  const uri = event.Records[0].cf.request.uri;
  if (uri === '/pixel.gif') {
    return {
      status: '200',
      headers: {
        'access-control-allow-origin': [{key: 'access-control-allow-origin', value: '*'}],
        // Tell browsers (and CloudFront itself) never to cache the pixel,
        // so every page view becomes a logged request.
        'cache-control': [{key: 'cache-control', value: 'no-cache, no-store, must-revalidate'}],
        'content-type': [{key: 'content-type', value: 'image/gif'}],
        'expires': [{key: 'expires', value: 'Mon, 01 Jan 1990 00:00:00 GMT'}],
        'pragma': [{key: 'pragma', value: 'no-cache'}],
        'x-content-type-options': [{key: 'x-content-type-options', value: 'nosniff'}],
      },
      // A 1×1 transparent GIF, base64-encoded.
      body: 'R0lGODlhAQABAID/AP///wAAACwAAAAAAQABAAACAkQBADs=',
      bodyEncoding: 'base64',
    };
  } else {
    return {status: '404'};
  }
};

(If I could change one thing about Lambda, it would be to make their integrations with other services less… obtusely nested.)

We couple that with a _redirects file in our Netlify website (the 200 status makes Netlify proxy the request rather than redirect, and the ! applies the rule even if a file exists at that path):

/pixel.gif https://d123456example.cloudfront.net/pixel.gif 200!

And we now have a same-origin beacon saving request logs to S3.
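
All that’s left is referencing the pixel from each page; markup along these lines in the page template does the trick (illustrative, not necessarily verbatim what this site uses):

<img src="/pixel.gif" alt="">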

Anonymizing access logs

CloudFront access logs have a lot of fields. There are the usual things, like the date and time, host and path, client IP address, referer, and user agent. There’s also the request ID, the airport code of the edge location that answered the request, the X-Forwarded-For value, and (for some reason, I’m sure) the encryption algorithm used for the TLS stream.
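
For reference, each log file begins with a #Fields comment line naming the columns; at the time of writing it starts like this (abridged):

#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) …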

I don’t particularly care to keep most of this data; I really just want the timestamp and the referer header.

Fortunately, S3 is a great entry point for developing an incredibly complicated chain of services. We’re going to start this chain by running an AWS Lambda function whenever logs get put into the log bucket. The code is provided at the end of this post, but the gist is:

The file is rewritten to contain only the date, time, and referer (split into host and path to make queries a bit nicer).
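
Condensed, that function looks something like this (a sketch, not the exact code linked below: it assumes CloudFront’s standard tab-separated log format, uses an illustrative key layout, and omits error handling):

'use strict';
// A condensed sketch of the log-rewriting function (nodejs8.10-era aws-sdk v2).
const AWS = require('aws-sdk');
const zlib = require('zlib');
const { URL } = require('url');
const s3 = new AWS.S3();

exports.handler = async (event) => {
  // Fired by S3 whenever CloudFront drops a gzipped log file in the bucket.
  // (The trigger should be scoped so rewritten files don't re-trigger us.)
  const bucket = event.Records[0].s3.bucket.name;
  const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
  const obj = await s3.getObject({Bucket: bucket, Key: key}).promise();
  const lines = zlib.gunzipSync(obj.Body).toString('utf-8').split('\n');

  let date;
  const rows = [];
  for (const line of lines) {
    // Skip blank lines and the #Version/#Fields comment lines.
    if (!line || line.startsWith('#')) continue;
    const fields = line.split('\t');
    // In the standard log format, fields 0, 1, and 9 are date, time,
    // and cs(Referer).
    date = fields[0];
    const referer = fields[9];
    let host = '-';
    let path = '-';
    if (referer !== '-') {
      const url = new URL(referer);
      host = url.hostname;
      path = url.pathname;
    }
    rows.push([fields[1], host, path].join('\t'));
  }

  if (rows.length > 0) {
    await s3.putObject({
      Bucket: bucket,
      Key: `partitioned/date=${date}/${key.split('/').pop()}`,
      Body: zlib.gzipSync(rows.join('\n') + '\n'),
    }).promise();
  }
};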

Instead of calling PutObject with new data, you could store the data somewhere else, in an actual database. But if you’re like me and freeze up when deciding how to store data, have I got a solution for you!

Querying access logs

It turns out AWS has a service for running queries over a bunch of text data in S3: Amazon Athena. With it, you define a SQL-like table from a data source, and you can perform SQL queries on that data. It can handle line-by-line data formats complicated enough to require a regular expression, but it’s easiest if you have tab-delimited data.

Athena bills by how much data it has to read to answer your query. A way to reduce this amount is to partition your data in S3 based on your planned queries. Because I expect to answer questions about the last n days, I’ve partitioned my data by date.

The log-snarfing Lambda function above rewrites data into paths such as partitioned/date=YYYY-MM-DD/log-file.gz; by informing Athena that the date field is a partition, we reduce the amount of data it needs to read.

These files then contain the time, host, and path of the request; for example:

12:34:56	linuxwit.ch	/blog/2019/05/webscale-website-webstats/

In the Athena console, we can create a table for this data:

CREATE EXTERNAL TABLE `website_logs` (
    `time` string,
    `host` string,
    `path` string
) PARTITIONED BY (`date` date)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://bucket-name/partitioned'

To load the partitions, repair the table:

MSCK REPAIR TABLE `website_logs`
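
MSCK REPAIR only picks up partitions that already exist in S3, so it needs re-running as new dates appear; alternatively, a single partition can be added by hand (the date below is just an example):

ALTER TABLE `website_logs` ADD IF NOT EXISTS
    PARTITION (`date` = '2019-05-01')
    LOCATION 's3://bucket-name/partitioned/date=2019-05-01/'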

And now we can execute queries:

SELECT COUNT(*) FROM `website_logs`
    WHERE host='linuxwit.ch'
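
Because date is a partition column, constraining it in the WHERE clause limits how much data Athena scans, and therefore the bill. For example, to count the last week of hits (date needs double quotes in queries, since it’s a reserved word):

SELECT COUNT(*) FROM `website_logs`
    WHERE host='linuxwit.ch'
    AND "date" > date_add('day', -7, current_date)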

Putting it all together

I’ve put my entire setup into a CloudFormation stack called “staticstat”, available over on GitHub.

This is the stack that served pixel.gif when you loaded this page in your browser! Neat!