MCollective Is a Chainsaw (Not a Hammer) - an Experience Report

MCollective pain points

I was recently on a team that ripped out MCollective halfway through a project. Once it was removed, deployments were much more stable, and when something did go wrong it was much easier to troubleshoot. MCollective was given a fair chance - it was used for several months and was an essential piece of a Continuous Delivery pipeline that deployed 4 apps and 7 microservices into several environments. It was getting the job done - but there were pain points.

This is an email I sent to my team before MCollective was removed:

MCollective runs are non-deterministic

If MCollective does not find a server within 2 seconds, it assumes the server doesn’t exist. If MCollective is down or a server is slow to respond, the deployment will skip that server but “succeed” without even a warning.
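
To make this concrete, here is a hedged sketch of the kind of check we could have added, assuming the standard mco --discovery-timeout option; the role=appnode filter is specific to our setup and illustrative only:

    # List what discovery actually found, with a wider window than the 2-second default
    mco find -W role=appnode --discovery-timeout 10
    # The deployment run itself; a node that is down or slow never appears in discovery,
    # so the run still reports success - the "skips a server but succeeds" problem above
    mco puppet runonce -W role=appnode --discovery-timeout 10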

Swallowing errors

I have seen many errors that should have failed builds but were ignored. This might be an implementation mistake rather than a framework issue - but the complexity of the framework makes such mistakes difficult to avoid.

Difficult to change

Changing agents

Putting the commands on the agent side (instead of on the invoker) makes them a little painful to change. The process is roughly: update git, wait for the build to produce a new package, use mco to invoke Puppet on all agents (this installs the agent change), then run mco reload_agents or restart all agents. Only then can you use the new command. Sometimes the new agent fails to load.
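
For anyone who hasn’t lived through that dance, a rough sketch of it in shell (the role filter, service name, and the deploy agent/action at the end are placeholders for our setup):

    # 1. Commit the agent change and wait for CI to publish a new package
    git push origin master
    # 2. A Puppet run via mco installs the new agent code on every node
    mco puppet runonce -W role=appnode
    # 3. mcollectived only picks up the new agent after a reload or restart
    for host in $(mco find -W role=appnode); do
      ssh "$host" sudo service mcollective restart
    done
    # 4. Only now can the new command be invoked - and sometimes the agent fails to load
    mco rpc deploy status -W role=appnode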

Changing facts

Changing facts requires a restart. We had a corrupt facts file in place for two weeks before MCollective was restarted and we found it.
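
For reference, a hedged sketch of the fact-change routine, assuming the common YAML facts plugin with its file at /etc/mcollective/facts.yaml (the path and the "role" fact are conventions of our setup):

    # Validate the facts file before restarting, so a corrupt file is caught immediately
    ruby -ryaml -e 'YAML.load_file("/etc/mcollective/facts.yaml")' || exit 1
    # New facts are only served after a restart
    sudo service mcollective restart
    # Spot-check that nodes now report the expected value
    mco facts role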

Failing silently

If the facts file or a new agent is corrupt, MCollective logs the exception, then keeps running with the old agent or facts. This hides the fact that we’ve committed a flawed change.
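
The only signal is a line in the daemon log, so the best check I can think of is grepping the logs after each change - a sketch, assuming the packaged default log path and our role filter:

    # Surface agent/fact load failures that MCollective otherwise shrugs off
    for host in $(mco find -W role=appnode); do
      echo "== $host =="
      ssh "$host" "grep -iE 'error|exception' /var/log/mcollective.log | tail -n 5"
    done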

No output flush for long-running jobs

We aren’t flushing output - results do not show up in Go until the command has finished. This causes Go to think the job has stalled (a job can go many minutes without output when run through Go). I don’t think it’s possible to flush sooner, because of the way MCollective serializes responses. MCollective was not designed for long-running jobs. Queuing and Scheduling seem to have been dropped from the roadmap. I also don’t think that stdout in a hash with whitespace escaped is very readable.
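
A small contrast sketch of what this looks like from Go’s point of view (the deploy agent, migrate action, host, and script path are placeholders):

    # Plain ssh streams output as it is produced, so Go sees progress line by line
    ssh appnode01 'sudo /opt/deploy/migrate.sh'
    # The equivalent mco rpc call stays silent until the action returns its serialized reply hash
    mco rpc deploy migrate -I appnode01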

Readability

MCollective puts stdout and stderr into a hash, which messes with the formatting.

Cost

MCollective is more expensive than other options when the number of concurrent operations is low. In theory it outscales other options, but that requires you to scale ActiveMQ (and the Puppet master/DB if you’re using them). Right now our ActiveMQ is a single point of failure, and our mco puppet runs die at 5 concurrent nodes.
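
For what it’s worth, the standard RPC batching options can keep concurrency below the point where runs start dying - at the cost of the parallelism MCollective was supposed to buy us. A sketch, assuming the puppet application accepts the usual --batch/--batch-sleep options, with illustrative numbers:

    # Run Puppet on at most 4 nodes at a time, pausing 60 seconds between batches
    mco puppet runonce -W role=appnode --batch 4 --batch-sleep 60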

Install & dev complexity

MCollective is more difficult to install than alternatives. We only have a working package for CentOS (and that is an outdated fork). This makes it difficult to test MCollective changes locally. Ideally I would have an mco client installed on my machine and use it to test deployments (to a Vagrant VM running an Iba appnode) or to work through the mco runbook (e.g. checking facts).
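
The local loop I have in mind would look roughly like this, assuming a Vagrant box that runs the middleware plus an appnode, and mco’s per-user client config at ~/.mcollective:

    # Bring up a VM with ActiveMQ, mcollectived and an appnode (hypothetical box)
    vagrant up
    # Confirm the local client can reach the VM, then exercise runbook steps locally
    mco ping --config ~/.mcollective
    mco facts role --config ~/.mcollective
    # Test a deployment against the VM ("appnode.local" is a placeholder identity)
    mco puppet runonce -I appnode.local --config ~/.mcollective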

These are not hypothetical pain points. I have encountered every one of these problems with our current setup.

Other reports

A former colleague gave a good summary of MCollective:

Capistrano for the most part relies on a central list of servers but it’s also simpler because it’s just essentially ssh in a for loop. MCollective can automatically discover the machines to execute commands in, which might be a better fit for a very dynamic cloud environment, but it also adds a lot of complexity with extra middleware.

This is why I feel MCollective is often an over-engineering red flag. Most projects are not “a very dynamic cloud environment”.

We were not the only team that had these sorts of problems. A colleague at a different client wrote:

My client has just replaced MCollective with a more hand rolled solution as it recently took out our production site. I don’t have many more details on why it went wrong - config mistakes or shonky software, but they lost confidence in it.

Perhaps it was config mistakes or shonky software. If so, is it fair to lose confidence in MCollective? Yes! If a system is complex or difficult to understand, and you can get the same or better results with a simpler system, then it is rational to be more confident in the simpler system. MCollective is a powerful tool, but it can obfuscate change management. Non-deterministic runs, swallowed errors, unflushed output, unreadable output, complexity for developers to test, complex security: MCollective isn’t winning confidence.

Wrapping up - MCollective is a chainsaw

Does that mean you should never use MCollective? No - but it shouldn’t be your go-to tool. It’s not a hammer - it’s more like a chainsaw. You can do some amazing things with chainsaws. They get many jobs done much faster, but they lack finesse (you can’t see the fine details as you carve), and they don’t work well for long-running tasks (they overheat).

MCollective is a very innovative niche tool. Just realize that when you use it outside of its main niche, it may not be the simplest thing that could possibly work. Personally, I’ve found that scheduling Puppet runs is not a hard problem, and I usually go with masterless Puppet and a fancy SSH “for loop.”
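
In case “fancy SSH for loop” sounds hand-wavy: the core of it really is this small (hostnames and manifest path are placeholders; in practice ours adds logging and per-host reporting):

    # Masterless Puppet via a plain ssh loop - fail loudly on the first broken host
    for host in appnode01 appnode02 appnode03; do
      ssh "$host" 'sudo puppet apply /etc/puppet/manifests/site.pp' || { echo "FAILED: $host"; exit 1; }
    done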
