How to create supportable software

As a business person it’s hard to see whether software is supportable and maintainable, right up until the point where it breaks and nobody can fix it! So let’s be proactive and see what it takes to make software that’s easy to keep running.

To keep your costs down, it’s important that your team creates software that is easy to support and maintain. This involves a little more work up-front, but will pay you back in spades over the long run.

Once your software team has built your software and it has gone live, they’re going to need to support and maintain it for you, probably for several years. Even if your software has already gone live it’s not too late to make it more supportable, although thinking about support and maintenance right from the start will make things easier and cheaper.

Principles of Supportable Software: MALT

Here are the key areas to consider – when gathering requirements and building software – when thinking about making software easy to support and maintain. They make the acronym MALT.

Monitoring: How can we tell if the software is running as expected?

Alerting: How will we know when something goes wrong?

Logging: How can we see inside the black-box at the detail of how the software ran?

Traceability: How can we trace the source of problems back to their root cause?

A detailed example

Let’s look at each of these areas in more detail, with a real-world example. For simplicity, I’m going to use the example of a little process that brings in sales data from a distributor who sells our products. It has a very simple task: it just reads a data file once a day and stores the contents in a database, then saves the file somewhere in case we need it again.

Monitoring

Without any monitoring, nobody would know whether the process had run or not. I’ve seen processes that people thought were actually running in the background and storing essential business continuity data and backups, but actually they hadn’t run for two years! If this was business-critical data, and there was no ability to use these backups because they didn’t exist, the business would go bust overnight!

A nice way to monitor our example process would be to write into the database each time the process runs. We would store a few things of interest:

Date and time that the process started
Date and time that the process ended, so we can see how long it took
The name of the data file – it probably has a date in the name, eg SalesData_2015-12-01.csv
How many lines of data it processed
How many lines of data had an error

Separately we might also choose to store each error message in the database, for each faulty line of data, so that we could build up a complete picture of what was going wrong. It would then be easy to go back to the company supplying our data and tell them that they were giving us bad data, and give them the information they need to fix it quickly.

So by adding monitoring we’ve gone from a total state of not knowing what’s happening to being able to check exactly when the process has been running and what it’s been doing.

Sometimes it’s appropriate to build a dashboard to give a nice visual indication of how a process has been running. I like Geckoboard for this – it gives an attractive and quick way to build a dashboard that can be viewed on a desktop computer, phone or tablet.

If you are working on a web project and need good monitoring and alerting, consider services such as Pingdom or New Relic.

Alerting

What if the process doesn’t receive a data file? Perhaps something has gone wrong with the network, or perhaps our data supplier has stopped sending us files due to a problem at their end which they haven’t even noticed.

What if the data suddenly starts going bad, eg financial fields are all set to zero, or valuable marketing email addresses are coming out as jumbled nonsense?

We need to put some alerts in place so that somebody on our team can start to look into the problems as soon as we notice them. We don’t want to wait to be told by a third-party that our system isn’t working, or wait for a disaster to find that the system broke years ago and nobody even realised.

Alerts can take several forms. They could be an email, an SMS text message, or shown in a special app. Email alerts are simple and traditional, and are usually good enough. If the alert is time-critical, SMS might be more appropriate.

My preferred way of working is NOT to send an alert when everything is successful, and to ONLY send an alert when something has gone wrong. Otherwise people just suffer “alert blindness”, and don’t notice the one day that everything has suddenly gone wrong because they’re used to seeing success messages and mentally screen them out.

A good alert message will give enough detail to be able to understand the general problem, and so that the support technician will have some idea of where to start investigating in more detail.

I usually suggest that alerts go to a business person AND to a technical person. The business person can determine priority and severity, and can understand any impact on the business. The technical person can then just focus on finding what went wrong and understanding how to fix it. In the example of our data process, if the data goes bad then the business person will probably have the relationship with the distributor, and can work with them to get the problem looked into.

Logging

Although business people might see monitoring and alerting information, they are very unlikely to see logging, because this is more of a technical area. A lot of software produces log files as it runs – web software especially – and this can show things like which web pages were called, what information the user entered into the page, and which specific database actions took place. Log files tend to be actual files, rather than being written to a database, because they can get very large.

The idea here is that if the software goes rogue, you should be able to figure out what’s happened. Did a particular piece of the software run when it wasn’t supposed to? Did an expected action not take place? Did the user give us some bad data that we weren’t expecting?

If you’re writing the requirements for some software, you don’t need to specify anything specific for logging. You should make a general note along the lines that “Logging is required in order to trouble-shoot any anticipated problems.” If you make a list of possible things that might go wrong, the developers can make sure they have enough logging to be able to easily trouble-shoot this. In the example of our data process, we can reasonably expect that sometimes the data file won’t appear, sometimes the data file will contain errors, etc.

Good logging should at least let us understand that a problem has happened, and perhaps show some information about the internals of our software and how it has responded to the problem, even if that’s just a big error message that gets dumped into the log file at the point where things went pear-shaped!

Traceability

Traceability is all about identifying where things came from. It is best illustrated with an example.

In our data process, we write each line of data from our file into the database. But if somebody runs a file of test data through the process by mistake, or runs the same file twice, we need to figure out which data came from that particular file and take it out of the database, so we don’t end up with bad data or duplicates.

If we can trace each line of data in our database back to the exact file and process-run that it came from, we can very easily unpick our problem and get rid of the dodgy data. Without this traceability we’re pretty much sunk, and have no practical way of knowing which data is valid and which was faulty test data that needs to be removed.

On a website, traceability might include which user ran a certain report or saw a particular error message. If our website is calling out to a payment provider to take a credit card payment from the customer, if we get an error it’s good to make that error traceable by making it easy to link the error message back to a specific customer, his specific order, the call-centre operator that took the order, etc.

Conclusion

I’ve only really scratched the surface of making software that’s easy to support and maintain, but I hope you can see the general principles at work.

Think about these principles when you are putting together the requirements for your software – even if you’re a business person and not technical – and you’ll get much lower Total Cost of Ownership from your software as you are able to more quickly and proactively find and fix problems.