Swift-Informant: realtime telemetry for Swift

The problem.

Need to know how many object PUTs your cluster is receiving per second?
Or how many 401s are being generated? Is your cluster getting spammed
with container creates?

Having good insight into what your swift cluster is doing is hugely
important to keeping it healthy and to running down issues when they
crop up. As your cluster grows, it also becomes apparent that 5-minute
resolution on a graph isn’t enough anymore. If you’re doing 5/10/100k
requests a second, a lot can happen in 5 minutes.

You need realtime data, or at least data within a few seconds. The data
needs to be flexible, and you don’t want to slow down requests or
decrease proxy performance to obtain it.

So… what’s an ops monkey to do?

You deploy graphite + statsd and then use swift-informant on all your proxy servers. Easy. You’ll be back to playing Skyrim in no time.

Eh, wtf is swift-informant?

swift-informant is a really small piece of middleware you run on your swift-proxy server that reports, in real time, which status codes you’re returning and which methods you’re servicing. It breaks this all down by whether it was a swift op on an account, container, or object.

How it works:

After a proxy request has been serviced (via an eventlet posthook), it fires statsd counters incrementing the request method and the status code of the request that was just serviced.

It also breaks these up by whether the request was an operation on an Account, Container, or an Object.
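The account/container/object breakdown falls out of the request path. Here’s a rough sketch of that classification; the function name and the `acct` prefix are my own illustration, not necessarily informant’s actual code:

```python
def metric_prefix(path):
    """Classify a Swift proxy request path as an account, container,
    or object operation.

    Swift paths look like /v1/<account>[/<container>[/<object>]].
    """
    # Drop leading/trailing slashes and split off at most 4 pieces so
    # object names containing slashes stay in one chunk.
    parts = path.strip('/').split('/', 3)
    if len(parts) >= 4:       # /v1/acct/cont/obj...
        return 'obj'
    elif len(parts) == 3:     # /v1/acct/cont
        return 'cont'
    else:                     # /v1/acct
        return 'acct'
```

The prefix then gets combined with the method and status code to produce event names like `obj.GET` and `obj.200`.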

Sample of the generated statsd events:

#a successful object get
obj.GET:1|c
obj.200:1|c
#a successful container delete
cont.DELETE:1|c
cont.204:1|c
# a client disconnected prematurely
obj.GET:1|c
obj.499:1|c

The beautiful thing here is that these events get sent to statsd via UDP. Since it’s UDP-based, it’s completely fire-and-forget: minimal overhead for
the proxy, and no TCP connection to keep established. It’s awesome.
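A fire-and-forget send is just one `sendto()` on a datagram socket; nothing blocks, and nothing cares whether statsd is even listening. A minimal sketch (the function name and error handling are illustrative, not informant’s source):

```python
import socket

def fire_counter(name, statsd_host='127.0.0.1', statsd_port=8125):
    """Send a single statsd counter increment ("name:1|c") over UDP."""
    payload = ('%s:1|c' % name).encode('utf-8')
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # sendto() returns immediately; if nothing is listening the
        # datagram is silently dropped, which is exactly the point.
        sock.sendto(payload, (statsd_host, statsd_port))
    except socket.error:
        # A telemetry failure must never fail the client request.
        pass
    finally:
        sock.close()
```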

To use informant:

Load informant as the first pipeline entry (even before catch_errors):

pipeline = informant catch_errors healthcheck cache ratelimit proxy-server

And add the following filter config:

[filter:informant]
use = egg:informant#informant
# statsd_host = 127.0.0.1
# statsd_port = 8125
## standard statsd sample rate: 0.0 <= rate <= 1
# statsd_sample_rate = 0.5

The commented-out values are the defaults; adjust them to fit your
environment. This middleware does not require any additional statsd
client modules.

The sample rate controls how many events we fire. A sample rate of 1
fires events for every request received. If your proxy is servicing
1000 requests a second, you’ll be generating 2000 UDP packets per second
(one packet for the status code, and one for the method).

A sample rate of 0.5 fires for every other request, 0.25 for every
1 in 4. If your cluster is already doing a few thousand requests a
second, a sample rate of 0.25 should be sufficient.

The sample rate is reported along with the event so that statsd can
adjust values accordingly before flushing to graphite.
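The standard statsd wire format tags a sampled counter with `|@<rate>`, which is how statsd knows to scale the count back up. A sketch of the gating logic, assuming that standard format (the function name is mine):

```python
import random

def sampled_event(name, sample_rate=1.0):
    """Return the statsd counter payload for this request, or None if
    the request falls outside the sample."""
    if sample_rate >= 1.0:
        return '%s:1|c' % name
    if random.random() < sample_rate:
        # e.g. "obj.200:1|c|@0.25" -- statsd treats this one packet
        # as representing 4 events before flushing to graphite.
        return '%s:1|c|@%s' % (name, sample_rate)
    return None
```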

Next steps:

Got Object-server LockTimeouts… rsync issues… memcache connection
errors? Next time, I’ll show you how to generate statsd events in
realtime from your swift error logs. So that you can:

http://img.ronin.io/360480859784.jpeg