The Startup CTO’s Guide to Ops (2 of 3): My Operations Toolbox

This is part 2 of a 3-part series on the minimum required operations setup for an early-stage startup. Previously, we discussed our guiding principles and requirements. Here I discuss my specific preferred tools.

There are many (many!) tools available for monitoring, metrics, deployments, etc. In this post I’ll step through the tools we used for a recent SaaS project: a Python Pyramid web application running on Ubuntu Linux. Deployment and production configuration are big enough topics that I’ll cover those separately in part 3.

We knew that this product would take about 8 weeks to build, and would soon generate a few thousand dollars a month with tens of thousands of unique sessions. The ops infrastructure is selected with this revenue, usage, and time investment in mind.

Outline

Hosting — where and how

Choosing how much to outsource

My locally hosted tools

My 3rd party tools

Hosting

For a typical startup, the hosting decision boils down to: cloud hosting (AWS EC2, Google Cloud, etc.) vs. leasing dedicated servers. Of course, the waters get muddied by service providers who offer hybrid dedicated-and-cloud options. There are other options I’ll pass over, like managed hosting with Heroku (it can quickly become too expensive) and serverless cloud with AWS Lambda (not a great fit for large web applications, and also harder to debug and monitor).

Here’s my very rough decision process:

  • Are you just starting out with a typical web stack? Just get 1 or 2 dedicated servers, prove the business is viable, and work on scaling later. (I’ll give more reasoning on this below).
  • Is your company getting traction? It’s time to scale your web tier horizontally with cheap virtual hosts. Either move to cloud hosting or work with your service provider to add VMs.
  • Do you have very “spiky” loads (bursts of traffic, batch jobs, mercurial customers)? This is where the “elastic” part of EC2 really shines; pick EC2 so you can dynamically scale up/down and not pay for unused capacity. (I’m saying “EC2” for simplicity, but it could be any major cloud hosting service).
  • Are you an established, sizable web property? Both EC2 and dedicated server costs are expensive at volume. It’s reasonable to seriously consider buying your own servers, especially when you factor in the accounting benefits of CapEx and depreciation. BUT: your developers will rage-quit if it takes 12 weeks to get a freakin’ VM approved and set up, so make sure you have the process and ops infrastructure in place to add capacity smoothly.
  • Do you need to do something fancy like managing thousands of nodes, or massive stream processing, or machine learning on a fleet packed with GPUs? This discussion doesn’t apply to you.

Returning to the first point: when you’re just starting out, I’d probably suggest a dedicated leased server approach over EC2 because:

  • If you start with EC2, you’ll need to think about scale from day 1. EC2 hosts are less powerful than dedicated machines and (anecdotally) are more prone to failure. This means you’ll need to do more up-front ops work on deployment and scaling.
  • EC2 costs more. For example, I could easily run a full website stack (Nginx, PostgreSQL, multiple web processes) on one decent dedicated server with an SSD for ~$120/mo. Compare this to an AWS environment with: an on-demand PostgreSQL instance on a db.m4.large image with 100GB of SSD storage, 2 m4.large web servers, storage, one elastic IP, and elastic load balancing; which comes to ~$350/mo (and this doesn’t include capacity for staging and monitoring). Granted, that price difference might be chump change relatively speaking, but every cent matters when you’re a startup.
  • On a dedicated server, you have complete control, and all the IO.

These reasons are enough to nudge me towards a dedicated server, but it’s certainly not a slam-dunk argument—EC2 is a reasonable option and the AWS ecosystem offers a nice suite of tools. Do your thing.

Ugh, don’t be like the guy who chose poorly… (Source: my childhood)

How much infrastructure should you outsource?

No more [PROBLEM]. Don’t waste time maintaining your own [COMPONENT]. We [BRAG ABOUT SCALE]! Get started in minutes!

— Madlibs version of the typical ops outsource pitch

For pretty much any DevOps task you can imagine, there is a managed 3rd party service offering to take it off your hands. For example, you could ship your logs to Papertrail, run jobs on Iron.io, use CodeShip for CI/deployment, report errors with Sentry, and so on.

(I picked the above companies pretty much at random — in each category, there are multiple competitors.)

While I buy the argument that a startup should focus on features (its core competency) and outsource operations, I’m a bit wary of introducing extensive reliance on multiple 3rd party managed services:

  • Costs can add up
  • There are operational risks and added complexity in relying on many external providers
  • The sheer number of services can become a headache to manage

I lean towards open source tools I can host locally when possible, but I’ll use a 3rd party service if there’s value in a task being off-site or it’s something I absolutely don’t want to manage, such as email deliverability.

Locally hosted tools

What could possibly go wrong? (Source)

Development stack

Vagrant/VirtualBox: Local development

A Vagrantfile describes your machine; for example:

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/xenial64"
  config.vm.network "forwarded_port", guest: 8000, host: 8000
  config.vm.network "forwarded_port", guest: 5432, host: 15432, autocorrect: true
  config.vm.synced_folder ".", "/myservice"
  config.vm.provider "virtualbox" do |vb|
    vb.memory = "4096"
  end
  config.vm.provision :shell, :path => "provision.sh"
end

The above does:

  • Use Ubuntu 16.04 base image (xenial64)
  • Set up port forwarding on our MacOS machine so we can visit http://localhost:8000 to hit the web server, and we can connect to our PostgreSQL instance on port 15432.
  • Make the current folder (on the MacOS side) map to a shared folder at /myservice in the virtual machine
  • Give the VM 4GB of RAM
  • After initial setup, Vagrant will run the script provision.sh as root. We will use this to install our packages and application dependencies.

A provision.sh file for local development may look something like:

# Update repos 
sudo apt-get update
# Install our packages
sudo apt-get -y install vim postgresql-client postgresql postgresql-contrib python-dev redis-server ntp
# Allow PostgreSQL connections from the host machine (for the forwarded port)
echo "host all all 0.0.0.0/0 md5" | sudo -u postgres tee -a /etc/postgresql/9.5/main/pg_hba.conf
echo "listen_addresses = '*'" | sudo -u postgres tee -a /etc/postgresql/9.5/main/postgresql.conf
# Restart PostgreSQL so the config changes take effect
sudo /etc/init.d/postgresql restart
# ...add your own commands to set up the DB role, schema,
# and populate initial data...
# Install pip and virtualenv
sudo apt-get -y install python-pip
sudo pip install --upgrade pip
sudo pip install virtualenv
cd /myservice
virtualenv venv --system-site-packages
source venv/bin/activate
# If you have a Pyramid project also called myservice
# which has a setup.py file listing Python package
# dependencies...
cd /myservice/myservice
pip install -e .

With this in place, you can just run vagrant up to kick off the one-time machine provisioning, which may take 10–30 minutes. When it’s done, you can access your new virtual machine with vagrant ssh.

Note that your web server must bind to 0.0.0.0:8000 for the port forward to work as http://localhost:8000. If you’re using Python Pyramid, edit development.ini:

[server:main]
use = egg:waitress#main
host = 0.0.0.0
port = 8000

GrayLog: Logging, dashboards, and logfile monitoring

The ELK (Elasticsearch, Logstash, Kibana) stack has significant traction, but we opted for GrayLog because it was so very easy to install and was familiar to us as Splunk users. You’ll need to install MongoDB and Elasticsearch, but minimal configuration is required and the documentation explained the steps clearly.

Here are some examples of what GrayLog looks like (these images come from their website):

Note that GrayLog’s use cases extend beyond searching logs and monitoring health; it’s also an incredible tool for business insights. We use it to answer questions like “how many active users were there in the last hour? 24 hours?” or to follow the click trail of a specific user.

There are many ways to send logs to GrayLog. We opted to send entries in its native message format (GELF) over UDP and via HTTP POST.

Our Python application uses the graypy package to add GrayLog as a logging handler. In the __init__.py main() method:

import logging

import graypy

# ...
# Later, in main()...
# ...
logger = logging.getLogger(__name__)
handler = graypy.GELFHandler(
    settings['graylog.host'],
    int(settings['graylog.port']),
    level_names=True)
logger.addHandler(handler)

I suggest that when you write out log lines you format the data as key=value pairs. This format makes it easier to extract field values in GrayLog (this is true of Splunk, too), and it improves readability. For example, we log a StatusCake health check to / as:

StartRequest ip_address=188.226.158.160 user=none plan=none path=/ user_agent="Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/98 Safari/537.4 (StatusCake)"
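As an illustration, here is a minimal sketch of a helper that builds such lines (the kv_line name and example fields are hypothetical, not part of the project):

import logging

logger = logging.getLogger(__name__)

def kv_line(event, **fields):
    # Build "Event key1=value1 key2=value2", quoting values that contain spaces
    parts = [event]
    for key, value in sorted(fields.items()):
        value = str(value)
        if " " in value:
            value = '"%s"' % value
        parts.append("%s=%s" % (key, value))
    return " ".join(parts)

# Produces a line similar to the StartRequest example above
logger.info(kv_line("StartRequest", ip_address="188.226.158.160", user="none", plan="none", path="/"))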

We configured GrayLog to send monitoring alerts to email and Slack callbacks on these events:

  • Any status=ERROR (includes exceptions)
  • Excessive status=WARN
  • Various significant user events (payment, etc)
  • Lack of expected events, like a nightly backup

Machine monitoring with Bash and GrayLog

#!/bin/bash
# Gather machine stats: memory use (%), root disk use (%), and CPU load
MEMORY=$(free -m | awk 'NR==2{printf "%.2f", $3*100/$2 }')
DISK=$(df -h | awk '$NF=="/"{printf "%s", $5}')
DISK=${DISK/\%/}
CPU=$(top -bn1 | grep load | awk '{printf "%.2f", $(NF-2)}')
MESSAGE="memory_pct=$MEMORY disk_pct=$DISK cpu_load=$CPU"
# Post the stats to GrayLog's GELF HTTP input
curl -XPOST http://grayloghost/gelf -p0 -d "{\"message\":\"$MESSAGE\", \"category\":\"machine_stat\", \"facility\":\"production\"}"
echo $MESSAGE

The $MESSAGE we send to GrayLog looks like: memory_pct=2.18 disk_pct=54 cpu_load=0.81. In GrayLog, it’s straightforward to trend and set alerts on these machine stats.

Redis


Redis is an in-memory data store which persists to disk. It can be used for a wide range of functions: cache, queue, NoSQL database, set operations, fixing the kitchen sink, you name it.

Note that if you use Redis as a website cache, you should probably change the default out-of-memory policy to allow key eviction (the default noeviction policy starts rejecting writes once the memory limit is reached, which is probably no good for you).

# /etc/redis/redis.conf
maxmemory 1gb
maxmemory-policy allkeys-lru
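With eviction enabled, a cache layer can stay very simple. Here’s a minimal sketch using the redis-py client (the fetch_profile_from_db function and key names are hypothetical):

import json

import redis

# Assumes Redis is running locally on the default port
cache = redis.StrictRedis(host="localhost", port=6379)

def get_user_profile(user_id):
    # Return a cached profile, falling back to the database on a cache miss
    key = "profile:%s" % user_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    profile = fetch_profile_from_db(user_id)  # hypothetical database query
    cache.set(key, json.dumps(profile), ex=300)  # expire after 5 minutes
    return profile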

3rd Party Services

As a matter of pricing strategy, I’ve found that it’s often advantageous to pick the #2 or #3 player in a category instead of the market leader because the up-and-comers offer more features at the free tier or cost less.

Case in point: while top-tier blog posts would punch up their content with the occasional LOLcat, we’ll go with Corgis (source)

Slack: Collaboration, notifications (free)

We configured Slack webhooks to receive notifications for Git checkins, issue tracking changes, and monitoring alerts.
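For our own alerts (like the GrayLog callbacks above), posting to a Slack incoming webhook takes only a few lines. A minimal sketch, assuming the requests package is installed and using a placeholder webhook URL:

import json

import requests

# Placeholder; create an incoming webhook in Slack and use your own URL
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify_slack(text):
    # Post a plain-text message to the channel tied to the webhook
    requests.post(SLACK_WEBHOOK_URL,
                  data=json.dumps({"text": text}),
                  headers={"Content-Type": "application/json"})

notify_slack("Nightly backup completed")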

StatusCake: External monitoring (free)

GitLab: Hosting code, issue tracking, documentation (free)

S3: Backups, sharing large files (< $10/mo)

$ aws s3 sync backups/ s3://myservice-backups/ --exclude "*" --include "*.sql.gz"

Stripe: Payments (per-transaction fee)

There are a few minor “gotchas.” If you’re a new corporation, it may take about 10 days after registering your federal tax ID (EIN) for it to be verifiable by Stripe. Also, the reporting is surprisingly minimal; it’s hard to answer questions like “how many paying customers do I have?” Fortunately there’s a huge ecosystem of free services around Stripe.
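As a workaround for the thin reporting, you can pull numbers yourself from Stripe’s API. A rough sketch with the stripe Python library, assuming each paying customer has one active subscription:

import stripe

stripe.api_key = "sk_live_..."  # placeholder; use your own secret key

# Walk all active subscriptions and count them
active = 0
for subscription in stripe.Subscription.list(status="active", limit=100).auto_paging_iter():
    active += 1
print("Paying customers (active subscriptions): %d" % active)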

MailGun and Gmail: Customer and alert emails (both free)

We use a Gmail account to send operations emails from our production host, via Postfix using Gmail SMTP as the relayhost.

Google Analytics: Site metrics (free)

I know I’m preaching to the choir, but you should use Google Analytics from day 1. Take the time to instrument key user events, clicks, etc.

Namecheap: Domain and DNS ($11/year)

ssl2buy: SSL wildcard cert ($42/year)

This finishes our tour through a set of service providers and tools. In the next (and last) part of this series, we’ll do a deep-dive into a specific example of production configuration and a deployment system.

Click Here for Part 3: A Minimal Production and Deployment Setup >>

Consulting CTO open to projects. I’m a serial entrepreneur, software engineer, and leader at early- and mid-stage companies. https://www.chuckgroom.com
