TL;DR – As the title says: if you are running your database on RDS, set up a read replica of it, then download and install Metabase on another EC2 instance and point it at the replica. You're done.
Why do this?
No digital business can be built without a solid analytics dashboard. All of us start out using Google Analytics, Mixpanel, Heap, Kissmetrics and a host of other JavaScript-based analytics tools. These are all good for evaluating the efficiency of the UX, by measuring conversions at various points of the funnel. But they do not qualify as Business Intelligence.
Solid Business Intelligence is gained only by analyzing the data we already own – the data in our databases. So the simplest way to start is to wire up some queries, grab some row-wise data in the form of (x, y) pairs, and then use a graphing library to plot graphs on the frontend. We did exactly that for a few months. SQLAlchemy lets you write such queries very easily, and to plot the graphs, we used Rickshaw JS. We still use this setup for a few graphs. But writing ad hoc queries and graphs for every new question that comes to mind is time consuming. Plus, running these queries on the production database was dangerous. We even crashed the server a couple of times by running ill-advised, voluminous queries.
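For illustration, here is a minimal sketch of what such an ad hoc query looked like. The "orders" table, its columns, and the connection string are hypothetical stand-ins, not our actual schema:

    # A minimal sketch of the ad hoc approach, assuming a hypothetical
    # "orders" table with a created_at column. The (x, y) pairs produced
    # here would be handed to a frontend graphing library such as
    # Rickshaw JS.
    from sqlalchemy import MetaData, Table, create_engine, func, select

    engine = create_engine("mysql+pymysql://user:password@prod-db-host/shop")
    metadata = MetaData()
    orders = Table("orders", metadata, autoload_with=engine)

    # Daily order counts as (date, count) pairs.
    stmt = (
        select(func.date(orders.c.created_at), func.count())
        .group_by(func.date(orders.c.created_at))
        .order_by(func.date(orders.c.created_at))
    )

    with engine.connect() as conn:
        points = [(str(day), count) for day, count in conn.execute(stmt)]

Multiply this by every new question anyone asks, and the maintenance burden becomes obvious.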
To avoid such disasters, the recommended approach is to set up a separate Business Intelligence database. It is also recommended that this database have a separate schema – one suited to analytics queries – as opposed to the production database, whose schema is suited to transactional queries. The performance requirements differ as well: your transactional db should be able to respond to a very high frequency of low-latency queries, while the analytics db should be able to process high-latency queries arriving at a much lower frequency.
So typically it is recommended that we use a database meant for analytics as the BI datastore. Amazon Redshift is the most popular option. But getting a Redshift DB up and running is a bit of a hassle:
1. Redshift cannot always use the same schema as your production db (it is not meant to, as explained earlier). So you need to spend some time coming up with an analytics-specific schema.
2. Syncing the data between your production db and Redshift is either going to be cumbersome or quite expensive.
(i) The simplest DIY method is to write scripts that periodically take dumps of the production db, transform the data into the Redshift schema, and then clear and refill the Redshift DB with the new data (a rough sketch of such a job follows this list). But clearly, deleting and refilling the full database every so often is not efficient. And given how time consuming the operation is, it cannot be done frequently. If it is done once a day, the analytics data is always stale by a day, which may not be acceptable in many cases.
(ii) Alternatively, we can take advantage of the binlog replication feature of MySQL, which allows changes in the database to be listened to, processed, and applied immediately to another datastore. But this is sufficiently complicated that it is hard to do in house. There are third-party providers for the same, and they charge quite heftily. AWS has a nice article about this approach.
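To make option (i) concrete, here is a rough sketch of what such a periodic job might look like. Every hostname, bucket, table and IAM role below is a hypothetical placeholder, and the schema transformation step is elided:

    # A rough sketch of the dump-and-reload job from (i), assuming
    # hypothetical hosts, buckets and tables. Not production code.
    import subprocess
    import psycopg2

    # 1. Dump the relevant tables from the production MySQL db.
    subprocess.run(
        ["mysqldump", "-h", "prod-db-host", "-u", "report_user",
         "--tab=/tmp/dump", "shop", "orders"],
        check=True,
    )

    # 2. Transform the dump into the Redshift-specific schema
    #    (denormalize, re-encode, ...) -- elided here.

    # 3. Stage the file on S3; Redshift's COPY loads from S3,
    #    not from local files.
    subprocess.run(
        ["aws", "s3", "cp", "/tmp/dump/orders.txt",
         "s3://analytics-dumps/orders.txt"],
        check=True,
    )

    # 4. Clear and refill the Redshift table.
    conn = psycopg2.connect(host="redshift-host", port=5439,
                            dbname="analytics", user="loader",
                            password="...")
    with conn, conn.cursor() as cur:
        cur.execute("TRUNCATE analytics_orders;")
        cur.execute("""
            COPY analytics_orders
            FROM 's3://analytics-dumps/orders.txt'
            IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
            DELIMITER '\t';
        """)

Even in this simplified form, you can see the moving parts you would have to babysit for every table you care about.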
If you are short of both time and money, it is understandable that Redshift's ROI is not too attractive. A simpler option is needed when there is no dedicated analytics team to build and manage the analytics pipeline.
So here is a simple hack to have a separate analytics database and a visualization tool.
1. Separation from the production database can be attained by using the replication feature of AWS RDS. If you are already on RDS, setting up a replica database is fairly simple, and AWS has straightforward documentation on how to do it (a scripted sketch follows the next point). Since the db load is likely to be light, one of the lower RDS tiers should be enough for the replica instance, which won't pinch the budget too much. Since this is a read replica, there is no danger of any accidental data corruption. Plus, the analytics db is always up to date with production data. And since it has the same schema as the production db, there is no additional dev effort required to maintain a separate schema.
2. Next up is the choice of a solid visualization tool. There are hundreds of tools in the market – so saturated that you could pay someone just to help you choose. Tableau is the market leader, but it is too pricey for an early stage startup with limited needs. There are many other tools like QlikView, Cluvio and Chartio. All of these have a lot in common, but each has its own USPs, and some, like Cluvio, are quite affordable as well. We were in fact about to settle for Cluvio when we stumbled upon Metabase, an open source business intelligence platform that can be self hosted. It has an interesting take on BI, letting you build your BI dashboard as a collection of questions – which makes sense, since that is how we tend to think of analytics. Plus, being open source and self hosted, it allows unlimited questions and unlimited dashboards. That settled the question for us. Setting it up was a breeze. Their docs are straightforward. You can spin up a Heroku node or an AWS Elastic Beanstalk instance, or you can run the Docker container on your own instance. We chose the third option – deploying it on a spare DigitalOcean node we had (the one-liner is shown below).
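As promised above, creating the read replica from point 1 can even be scripted. A minimal sketch using boto3, with hypothetical instance identifiers (the AWS console works just as well):

    # A minimal sketch of creating an RDS read replica with boto3.
    # Instance identifiers, region and instance class are hypothetical.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")
    rds.create_db_instance_read_replica(
        DBInstanceIdentifier="analytics-replica",
        SourceDBInstanceIdentifier="production-db",
        # A lower tier is enough for a light analytics load.
        DBInstanceClass="db.t2.small",
    )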
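And for point 2, the Docker route really is a one-liner, along these lines (check the current Metabase docs for the exact command):

    docker run -d -p 3000:3000 --name metabase metabase/metabase

That brings Metabase up on port 3000, ready to be pointed at the replica.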
Once it was set up, it was just a simple matter of connecting it to the replica database. Metabase has an intuitive visual interface for building SQL queries, so even non-programmers can use the tool. Some of our team-mates are already getting the hang of writing queries with it. And for more complicated queries, you always have the option of writing the SQL by hand – which is what I prefer.
So there it is. A simple, no-frills, nearly no-cost (except for the minor extra cost of the RDS replica) way of setting up a proper BI dashboard usable by everyone. And that is what we are using at Inkmonk for now, until our analytics needs outgrow this solution and we are forced to set up a proper Redshift cluster.
PS: There is a timezone issue in Metabase which is a bit of a bother, though. Our database, for example, stores datetimes in UTC, but of course we want them converted to IST before generating reports. Metabase has a feature that lets you set the Reporting Timezone. Setting it to IST ensures that a SELECT NOW() query returns the current IST datetime. Unfortunately, the query builder does not apply a similar conversion to existing datetime columns in the database. Which means that if you run a query like

    SELECT date(item.purchased_at), count(*) AS items_count
    FROM item
    GROUP BY date(item.purchased_at)
    ORDER BY date(item.purchased_at);

it won't return the chart you expect. The bars will be inconsistent: items purchased in the early morning hours between 12 AM and 5:30 AM IST get added to the previous day's bar. The solution is to call convert_tz yourself, like this:

    SELECT date(convert_tz(item.purchased_at, '+00:00', '+05:30')) AS purchase_date,
           count(*) AS items_count
    FROM item
    GROUP BY purchase_date
    ORDER BY purchase_date;
If Metabase did this on its own whenever a datetime column is used in a query generated by the GUI query builder, it would be awesome. There are some pending issues regarding this. But until those are resolved, timeline graphs have to be generated by writing the SQL by hand, as shown above.