Tracking Metrics to Surface and Solve Problems: Metric Tracking Practices I’ve Learned So Far

It is a pleasant evening. You are sipping coffee and reviewing your code one final time, gathering enough confidence to hit the deploy button.

But a fact of life as a software engineer is that things can go wrong. Small changes may result in unexpected outcomes, including outages, errors or a negative impact on customers.

And when problems occur, we can either run random checks and validations that may or may not solve the problem, or follow a disciplined problem-solving approach that relies on data rather than intuition.

Metrics and Telemetry

To enable a disciplined problem-solving method, we need our software to track the right metrics at the right places. We need to design our systems so that they are continually creating telemetry.

What is Telemetry?

The DevOps handbook defines telemetry as,

An automated communication process by which metrics are collected at remote points and are sent over to receiving equipment.

When designing systems, it is a high-leverage activity to treat telemetry as a first-class citizen, so that metrics are easy to track at every level needed, from business-level metrics down to the deployment pipeline.

Levels Of Metrics Tracking

As engineers, the software we write impacts the organization at multiple levels, from infrastructure to the product to the business. Thus, to resolve problems quickly, we need to track metrics at all these levels.

The following levels of metrics have served me well as a checklist for adding the right metrics to the software that I write.

1. Business Level Metrics:

These metrics directly affect the business, so they are really important to keep an eye on.

Examples include sales transactions, number of items clients sent, total successful items processed, hourly processing rates, etc.

2. Application Level Metrics:

These metrics track the functioning of the application.

Examples include API latency, query response time, number of errors, etc.
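As a rough illustration of application-level tracking, here is a minimal sketch of a decorator that records latency and error counts for a function. The names `tracked`, `LATENCIES_MS`, `ERROR_COUNTS` and `get_order` are made up for this example, and the in-memory dictionaries stand in for whatever metrics backend you actually use.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory stores; a real system would ship these to a metrics backend.
LATENCIES_MS = defaultdict(list)  # metric name -> observed latencies in milliseconds
ERROR_COUNTS = defaultdict(int)   # metric name -> number of calls that raised


def tracked(name):
    """Record latency and error count for every call of the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                ERROR_COUNTS[name] += 1
                raise  # the caller still sees the original error
            finally:
                # Runs for both successful and failed calls.
                elapsed_ms = (time.perf_counter() - start) * 1000
                LATENCIES_MS[name].append(elapsed_ms)
        return wrapper
    return decorator


@tracked("get_order")
def get_order(order_id):
    # Stand-in for a real API handler.
    if order_id < 0:
        raise ValueError("order_id must be non-negative")
    return {"id": order_id}
```

Because the bookkeeping lives in the decorator, adding the same two metrics to another handler is a one-line change.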

3. Infrastructure Level Metrics:

These metrics track the infrastructure that runs our application.

Examples include CPU usage, available memory, IOPS spikes etc.

4. Product Level Metrics:

These metrics track the product progress and results. As product engineers, it’s a high leverage task to track product-related metrics too.

Examples include A/B test results, feature toggle results, product progress, product extensibility and configurability, etc.

By having telemetry coverage in all these areas, we will be able to see the health of everything that our software relies on and of the things that rely on our software.
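The four levels above can double as a literal checklist. The following is only a sketch: the metric names and the `(level, name)` pair format are assumptions made for this example, not a standard.

```python
# The four levels described above, in checklist order.
LEVELS = ("business", "application", "infrastructure", "product")


def missing_levels(metrics):
    """Return the levels that have no metric registered, in checklist order."""
    covered = {level for level, _name in metrics}
    return [level for level in LEVELS if level not in covered]


# Hypothetical metrics a service might emit, as (level, name) pairs.
service_metrics = [
    ("business", "items_processed_total"),
    ("application", "api_latency_ms"),
    ("application", "errors_total"),
    ("infrastructure", "cpu_usage_percent"),
]
```

Here `missing_levels(service_metrics)` would report that the product level has no coverage yet, which is exactly the kind of gap the checklist is meant to surface.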

Conclusion

With the limited time that I’ve spent in the software industry, I’ve come to realize the importance of metrics. Even the parts that don’t involve any software must be tracked and measured. The key idea here is that you cannot improve what you don’t measure.

With the right telemetry built into the software that we write, we’ll be able not only to solve problems as they arise but also to surface the latent ones before they catch fire.

That’s all, folks!

 


Organization Archetypes And The Concept Of Market-Oriented “Solver Teams”


Organizations which design systems are constrained to produce designs which are copies of the communication structures of those organizations.

In other words, how we organize our teams has a powerful effect on the software we produce, as well as our resulting architectural and production outcomes.

Thus, in order to get a fast flow of work from Development to Operations, with high quality, great customer outcomes and fast delivery, we must use team structure to our advantage.

Done poorly, team structure can prevent teams from working safely and independently. Instead, they’ll be tightly coupled, all waiting on each other for work to be done.

At SQUAD, teams are structured as market-oriented teams, to quickly respond and solve customer needs. At SQUAD, we call them “Solver Teams”.

 

Organizational Archetypes


There are primarily three types of organization structures that inform how we design our DevOps value stream: functional, matrix and market.

Functional-oriented:

Organizations optimize for expertise, division of labor and reducing cost. These organizations centralize expertise and have tall hierarchical structures.

Ex. Server admins, SREs, Data admins

Matrix-oriented:

Organizations attempt to combine functional and market orientation. This results in complicated organization structures, such as a single person reporting to multiple managers.

Market-oriented:

Organizations optimize for responding quickly to customer needs. These organizations tend to be flat, composed of multiple cross-functional disciplines (ex. marketing, engineering, machine learning).

Each market-oriented team is responsible for feature delivery, operational tracking and service support.

Market-Oriented “Solver Teams” at SQUAD

At SQUAD, we have a bunch of interesting problems to solve, each with a high impact on customer needs.

Broadly speaking, solver teams at SQUAD are market-oriented teams that bring together disciplines like engineering, marketing, machine learning and data analysis to “solve” a customer problem.

These teams are responsible not only for feature development but also for user experiments, testing, optimization, deployment and operational tracking of services, from idea conception, to successful launch, to retirement, all without dependencies on other solver teams.

Advantages of Market-Oriented Teams

  1. Small teams working independently and safely.
  2. Faster execution and delivery of work.
  3. Enables team members to be “E-Shaped” specialists.

 

Enable every team member to be a generalist


As we rely on an ever-increasing number of technologies, we want engineers who can contribute to multiple areas of the value stream.

Another major advantage of market-oriented teams is that, because they are cross-disciplinary by nature and cover the entire value stream from development to operations, they give team members opportunities to develop multi-specialist capabilities. Such team members are also called E-Shaped specialists.

When team members start becoming “E-Shaped” experts, the business benefits of the faster flow this enables are overwhelming.

Because the same team member can contribute at multiple points in the value stream, work flows much more smoothly and quickly than when a specialist works on a single point in the stream without comprehensive knowledge of the whole.

 

Conclusion

We saw how organizational structure dramatically affects our outcomes. Done well, it works to our advantage and helps teams move and deliver faster.

At SQUAD, we structure teams as “solver teams”, which are responsible for owning the entire value stream of the problem they are solving.

Solver teams are small and can move fast and safely without having dependencies on other solver teams.

That’s all, folks!

 

DevOps and The Principle Of Flow


In the technology value stream, work typically flows from Development to Operations, the functional areas between our business and our customers.

As stated in the lean principles developed by Toyota, we should optimize to get a single-piece fast and smooth flow for our releases.

We increase flow by:

  1. Making work visible
  2. Reducing batch sizes and intervals of work
  3. Building in quality, preventing defects from being passed to downstream work centers

Why is a fast flow needed?

By speeding up the flow through the technology value stream, we reduce the lead time required to fulfill internal and external customer requests, further increasing the quality of the work while making us more agile.

Our goal is to decrease the amount of time required to deploy the changes into production and increase the reliability of those services.

Make our work visible


A significant difference between manufacturing and technology value streams is that our work is invisible.

It’s so easy for work to keep bouncing between teams while we have no visual control over it.

To prevent this and make our work more visible, we can use something like a Kanban board. (I prefer Trello for this.)

Ideally, our Kanban board will span the entire value stream, defining work as completed only when it reaches the right side of the board.

Work is not done when development completes, but only when our application is running successfully in production.

Limit Work In Progress (WIP)

In technology, our work is far more dynamic than in manufacturing. Teams have to satisfy the demands of multiple stakeholders. As a result, daily work gets dominated by urgent requests coming through every communication channel possible.

We can limit multi-tasking by using a Kanban board, for instance by codifying and enforcing WIP limits for each column on the board.

For example, we may set a WIP limit of three cards for testing. When there are already three cards in the testing column, no new cards can be added until one moves out.

Using Kanban ensures that work is visible and WIP doesn’t pile up.
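To make the WIP limit rule concrete, here is a small sketch of a board that refuses to accept a card into a full column. The `KanbanBoard` class and its column names are invented for this example and are not modelled on Trello or any other tool.

```python
class WipLimitExceeded(Exception):
    """Raised when a card would push a column past its WIP limit."""


class KanbanBoard:
    def __init__(self, wip_limits):
        # wip_limits maps column name -> maximum number of cards (None = unlimited).
        self.wip_limits = dict(wip_limits)
        self.columns = {name: [] for name in wip_limits}

    def add(self, column, card):
        limit = self.wip_limits[column]
        if limit is not None and len(self.columns[column]) >= limit:
            raise WipLimitExceeded(f"{column!r} is at its WIP limit of {limit}")
        self.columns[column].append(card)

    def move(self, card, source, target):
        if card not in self.columns[source]:
            raise ValueError(f"{card!r} is not in {source!r}")
        # add() raises before anything changes if the target column is full.
        self.add(target, card)
        self.columns[source].remove(card)
```

With a limit of three on a "testing" column, a fourth card stays where it is until something in testing moves on, which is the behaviour described above.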

Reduce Batch Sizes


Another key component of creating smooth and fast flow is performing work in small batch sizes. Prior to the lean manufacturing revolution, it was common practice to manufacture in large batches.

However, large batch sizes result in skyrocketing levels of WIP. According to lean principles, the ideal is single-piece flow, where each batch has a size of just one.

Let’s take an example:

Suppose we have ten brochures to mail and mailing each one of them requires 4 steps:

  1. fold the paper
  2. insert the paper into the envelope
  3. seal the envelope
  4. stamp the envelope

Now in the traditional batch processing flow, we will perform each step sequentially for all ten envelopes.

In the lean one-piece flow, only one envelope can be at any given step. In other words, we fold the paper, insert it into the envelope, seal the envelope and stamp it before starting the next one.

How is one-piece flow dramatically better?

In the above example, suppose each step takes 10 seconds. In batch processing, we get our first complete envelope after 310 seconds, but with one-piece flow we get it after just 40 seconds.
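The numbers above can be checked with a few lines of arithmetic. This sketch assumes a single worker, equal step durations and no cost for switching between steps; the function names are made up for the example.

```python
def first_item_done_batch(items, steps, seconds_per_step):
    """Time until the first finished item when each step is done for all items in turn."""
    # Every step except the last is completed for all items first,
    # then the last step is applied to the first item.
    return (steps - 1) * items * seconds_per_step + seconds_per_step


def first_item_done_one_piece(steps, seconds_per_step):
    """Time until the first finished item when one item goes through all steps."""
    return steps * seconds_per_step


def all_items_done(items, steps, seconds_per_step):
    """Total time for all items; with one worker it is the same for both flows."""
    return items * steps * seconds_per_step
```

For ten envelopes, four steps and 10 seconds per step, this gives 310 seconds for batch processing and 40 seconds for one-piece flow. Under these assumptions the total time is 400 seconds either way; what changes is how early the first result, and the first feedback, arrives.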

Worse, what if we find that the way we have folded the paper doesn’t allow the envelope to be sealed? With batch processing, we would discover this only after folding all ten papers, and we would be in much bigger trouble.

Eliminating hardships and wastes in the value stream

According to the Toyota Production System pioneer Shigeo Shingo, a waste is:

The use of any material or resource beyond what the customer required or is willing to pay for

In the software development value stream, a waste is anything that causes a delay for the customer, such as activities that can be bypassed without affecting the result.

The following are some common categories of waste that we encounter when implementing lean in the software value stream.

  1. Partially done work
  2. Extra processes
  3. Extra features
  4. Task switching
  5. Waiting on QA or testing or acceptance testing
  6. Defects and bugs
  7. Non-standard or manual work

Explaining each of the above points deserves a post of its own. I will do that soon.

Conclusion

Improving flow through the technology value stream is essential to achieving DevOps outcomes. We do this by making work visible, limiting WIP, reducing batch sizes and eliminating wastes from our processes.

All of this will allow us to become more agile and will help in reducing lead times dramatically, and at the same time increasing the quality of releases.

That’s all, folks!

 

Practical Problem Solving Framework: Inspired By The Toyota Way


We can all agree, to a certain extent, that having a system or process for anything reduces the chances of errors.

As an engineer, or as someone people look to for solutions to problems, it’s beneficial to have a framework in place to solve problems effectively.

Recently I was reading The Toyota Way, and it suggested a framework for Practical Problem Solving. It felt almost obvious that this sort of framework would be invaluable to software engineers too (in fact, to everyone).

When confronted with a problem, we first want to make it crystal clear and get a grasp of the real point of cause. That’s followed by a series of 5 WHYs to investigate the root cause. And finally come countermeasures, evaluation and standardization.

1. Initial Problem Perception

A large, vague and complicated problem is presented. The first step is to take in all the information available at this point in time.

Ex. “Hey! Metric X is showing incorrect value”

This doesn’t show the actual problem, but just a perception of how some internal user saw it.

2. Clarify The Problem

The next step is to clarify the problem and scope it down. Go and see the problem yourself. Analyse it and get a clear understanding.

As we are seeing the problem first hand, we want to gather as much information as possible.

Ex. So the entire analytics data was actually not consistent.

3. Locate Point Of Cause

The next step is to dig a little deeper and try to find the point of cause.

Where is the problem observed? Where is the likely cause? This will lead us to the vicinity of the root cause, which we find in step 4.

Ex. Analytics system is working correctly, just that it sometimes doesn’t get updated every 5 minutes like it’s supposed to.

Here we rule out other possible causes, like a bug in the code or wrong data being tracked in the first place.

4. Ask 5 WHYs: Investigation of the root cause

Here, starting from the direct cause, we dig down to the root cause of the problem by asking WHY five times.

Ex.

1. 1st Why: Why was the data inconsistent? Because analytics didn’t get updated on time.

2. 2nd Why: Why were analytics not updated on time? Because scheduled ETL jobs didn’t run on time.

3. 3rd Why: Why didn’t the scheduled jobs run on time? Because CPU usage was at 100%.

4. 4th Why: Why did CPU usage reach 100%? Because the server instance size was not enough to handle the increased number of jobs.

5. 5th Why: Why was the server size not enough to handle the spike in usage? Because our auto-scaling is slow.

By asking a series of 5 whys, we can generally get to the root cause of the problem and fix it there, instead of just duct-taping it and waiting for it to arise again.

5. Countermeasures

This step fixes the root cause of the problem so that it doesn’t come up again.

Ex. Moved to a more sophisticated auto-scaler to manage spikes in usage, and set up alerts to monitor performance.

6. Evaluate

After the countermeasures have been executed, it’s important to evaluate their effect. Was the problem solved?

Ex. “Now analytics are always in sync and even if they miss getting updated, we get an alert to know it beforehand and take action.”
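An alert like the one in this example could be as simple as a freshness check on the last update time. This is only a sketch: the 5 minute interval comes from the example above, while the one minute grace period, the function names and the `alert` callback are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

EXPECTED_INTERVAL = timedelta(minutes=5)  # from the example: updates every 5 minutes
GRACE = timedelta(minutes=1)              # hypothetical tolerance before alerting


def is_stale(last_updated, now, interval=EXPECTED_INTERVAL, grace=GRACE):
    """True if the data is older than the expected interval plus the grace period."""
    return now - last_updated > interval + grace


def check_freshness(last_updated, now, alert):
    """Call alert(message) if analytics are stale; return True when they are fresh."""
    if is_stale(last_updated, now):
        alert(f"analytics not updated for {now - last_updated}")
        return False
    return True
```

In practice `alert` would post to whatever paging or chat tool the team uses. Passing it in as a callback keeps the check itself easy to test.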

7. Standardize

This resonates with another Toyota principle, jidoka, meaning building in quality.

How can we standardize the countermeasures such that similar problems are not faced again? How can we propagate our learnings across the organization?

Ex. “Document and standardize the rule that all our instances and jobs must have proper alerts in place, so that we know when they are malfunctioning.”

Conclusion

This was my take on how we can learn from an organization in a different discipline, like Toyota, about having a process and framework in place to solve problems effectively.

After all, problem-solving is supposed to be fun, and having a proper framework in place helps us keep it that way!

That’s all, folks!

 

Estimation Peril: How To Estimate Software Projects Effectively (or How Not To Lie)


Imagine you are a rockstar engineer and you are given a task by your favorite person, your project manager: show some new fields in the dashboard.

As usual, you are asked to estimate it as soon as possible. It seems like a quickie, and you are tempted to estimate it at a day. But, having been burnt before, you decide to look carefully at the fields that are to be added. These fields are for analytics. You think, OK, let’s make it 2 days then. But being more cautious, you dig deeper and find that those analytics are not even being tracked in the app.

Now to complete the story, you’ll have to track the analytics, send them to the server, make the backend accept and store them, show them on the dashboard, write tests, etc.

What seemed a simple task is now a 1-2 week thing. Very hard to estimate. And your manager was expecting a response like, “would be done by end of day”.

What is the problem with estimates?

The main problem with an estimate is that the “estimate” gets translated into commitment. And when you miss a commitment, you breed distrust.

Most estimations are poor because we don’t know what they are for. They are uncertain. A problem that seemed simple on the whiteboard turns out not to be so simple. There are non-functional requirements, codebase friction, some unfortunate bugs, etc. We deal with uncertainty.

There is a rule in software engineering that everything takes 3X more time than you think it should, and this holds true even when you know this and take it into account!

Estimates can go the other way too, that is when you overestimate. This is as dangerous as underestimating.

What should an estimate look like?

An estimate should have 3 characteristics :

  1. Honest (Hardest)
  2. Accurate
  3. Precise

1. Honest : 

You have to be able to communicate bad news when the news is bad. And when the continuous outrage of your managers and stakeholders is on your face, you need to be able to continue and assert that the news is bad.

Honesty is important because it breeds trust. You are not eliminating disappointment, rage and people getting mad, but you will eliminate distrust.

2. Accurate :

You are given a task and you estimate it to take somewhere between now and the end of the universe. That’s definitely accurate; it’ll be done within that time.

We won’t breed distrust, but we definitely will breed something else.

Which brings us to the 3rd characteristic.

3. Precise : 

An estimate should have just the right amount of precision.

What is the most honest estimation that you can make? I don’t know!

This is as honest as it can get. You really don’t know. But this estimate is neither accurate nor precise.

But when we try to make precise estimates, we must note that we are assuming that everything goes right: we get the right breakfast, traffic doesn’t suck, our co-worker is having a good day, no meetings, no hidden requirements, no non-functional complexities, etc.

Estimating by work breakdown

The most common way to estimate a complex task is to break it down into smaller tasks, into sub-tasks. And then those sub-tasks into sub-sub-tasks and so on until each task in hand is manageable and ideally not more than 4 hours of work.

Imagine this forming a tree, with executable tasks at the bottom as leaves. You just estimate the leaves and it all adds up.
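The tree can be sketched in a few lines. The `Task` class below is hypothetical, and the hours in the sample breakdown of the dashboard story are invented purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    hours: float = 0.0  # only meaningful on leaves
    subtasks: list = field(default_factory=list)

    def estimate(self):
        """Sum of the leaf estimates under this task."""
        if not self.subtasks:
            return self.hours
        return sum(task.estimate() for task in self.subtasks)

    def oversized_leaves(self, max_hours=4):
        """Names of leaves that are still too big and should be broken down further."""
        if not self.subtasks:
            return [self.name] if self.hours > max_hours else []
        return [name for task in self.subtasks
                for name in task.oversized_leaves(max_hours)]


# Made-up breakdown of the dashboard story from earlier in this post.
dashboard_fields = Task("show new fields on dashboard", subtasks=[
    Task("track analytics in the app", subtasks=[
        Task("add tracking calls", hours=4),
        Task("send events to the server", hours=3),
    ]),
    Task("backend", subtasks=[
        Task("accept and store events", hours=6),
    ]),
    Task("dashboard UI", hours=4),
    Task("tests", hours=3),
])
```

Here `dashboard_fields.estimate()` adds up the leaves to 20 hours, and `oversized_leaves()` flags "accept and store events" as a leaf that is still above the 4 hour guideline. Note that nothing in the tree accounts for integration between the leaves or for tasks that were never listed.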

This approach works, but there are 2 problems :

  1. We missed the integration cost
  2. We missed some tasks

There is a fundamental truth about work breakdown structure estimates:

The only way to estimate accurately using a work breakdown chart, that is, to know exactly what the sub-tasks are, is to implement the feature!

What to expect from an estimate?

Estimates are uncertain. There is no guarantee that your estimate will work itself out. And that’s OK. It’s your manager’s job to manage that risk. We are not asking them to do something outside of their job.

The problem arises when you make a commitment. If you make a commitment, you must make it. Be ready to move heaven and earth to make it. But if you are not in a position to make a commitment, then don’t make one.

Because your manager is going to set up a whole bunch of dominoes based on that commitment, and if you fail to deliver, everything fails.

Some interesting links :

https://medium.com/swlh/your-app-is-an-onion-why-software-projects-spiral-out-of-control-bb9247d9bdbd

Uncle Bob on Estimates: https://www.youtube.com/watch?v=eisuQefYw_o

Happy Estimating!

That’s all, folks!

E-Summit ’17 IIT Bombay — Experience

E-Summit is the flagship entrepreneurship event organized by IITB. The two-day annual summit promises to be an amazing meeting ground for industry experts, business leaders, investors and entrepreneurs and of course, students, many of whom are aspiring entrepreneurs.

I attended this event in its 2017 edition and had mixed feelings on how the whole thing turned out holistically. There were some good parts and some not so good parts, but as a whole the event was worth attending.

There were many short talks spread over the two days. Obviously, you cannot attend all of them; you have to select a few according to the schedule and feasibility.

My personal takeaway: choose topics that you are not familiar with, as the talks are pretty basic and don’t go into great depth.

Following are the talks and keynotes that I attended.

Day 1 :

  1. Keynote by Raj Jaswa :

The first event of day 1 was the keynote by Raj Jaswa. The most prominent thing he covered, in a nutshell, was the areas in which one should look for business opportunities.

Some being,

  1. Cloning and localisation
  2. Long tail business
  3. Adapt an existing business model to a new sector.

2. Digital Marketing :

This talk was presented by the founder of E2M, a digital media company. I found this talk too basic, as I had already taken an online course on digital marketing.

Some topics discussed were,

  1. SEO
  2. PPC
  3. Social Media
  4. Emergence of mobile platforms

3. Brembo Company Presentation :

Brembo is a braking technology company and a dominant force in the market. A manager from Italy presented the company’s operations in India.

He shared a quote from the founder of Brembo that I found very captivating,

“Anyone can do simple things, but only few can handle difficult ones. We have to do difficult ones”.

4. Chat with Rahul Yadav :

Next session I attended was a Q&A session with Rahul Yadav, the founder of Housing.com.

It was nice to see him talking about his mistakes and telling people not to repeat them.

5. Wealth creating through financial planning :

This was conducted by Reliance Mutual Funds. In a nutshell, it was all about SIPs (Systematic Investment Plans).

6. Keynote by Rajat Sharma :

The day ended with a keynote by Rajat Sharma. He discussed his journey, and his humility and wisdom were notable and inspiring.

7. Stand up comedy by Vipul Goyal and Sapan Verma

Nice performances by both of them, as always.

Day 2 :

Day 2 of the event was more power-packed. I found both the speakers and the talk topics to be of a higher level.

  1. Building a brand that Indians love :

This was presented by an ISB professor. Basic point conveyed in the talk was that business customers have two currencies that they spend : time and money.

Thus, trigger point of all the businesses must be how customers are spending these two.

2. Protecting your brand : Trademarks, Copyrights and Patents :

I had no prior knowledge of patents and thus decided to attend this talk.

It nicely packed in information on what, when and where to file a patent.

3. Startup Scaling : Overcoming key operational challenges :

The central theme of this talk was the resource visibility issues that startups face.

The speaker was from a company called OutThink LLC. They advocated that such challenges can be overcome by businesses collaborating and providing services to each other instead of doing everything by themselves in isolation.

This is where OutThink helps its customers, through what they call SRM : Strategic Resource Mapping.

4. Most Common Startup Budget Mistakes:

This talk was presented by a startup investor and mentor from Ireland.

The talk revolved around funding sources, funding advice and bootstrapping.

5. Final Keynote : By Bibop Gresta : COO Hyperloop TT

The most exciting event of the summit was the final keynote by the COO of Hyperloop. He presented an overview of Hyperloop and how it is planning to carry out its operations in India.

It was notable how fit and fun he was at the age of 40. Something that we can all learn from.

Conclusion :

To conclude, the summit was a thumbs up. It was not entirely of the standard that I was expecting, but it was still OK.

It was great if you had networking as your primary goal, not so good if you wanted hands-on knowledge of topics.

Finally, it was nice to see other aspiring and existing entrepreneurs facing the problems that you are also facing. It makes you feel that you are not alone and that if they can pull it off, you can too.

The Blue Ocean Strategy : How To Create Uncontested Market Space and Make the Competition Irrelevant

When Henry Ford made cheap, reliable cars people said, ‘Nah, what’s wrong with a horse?’ That was a huge bet he made, and it worked.
The whole idea of The Blue Ocean Strategy is to create uncontested market spaces that create new demand and make the competition irrelevant.

The book describes Red Oceans as known market places that have bloody competition among businesses trying to win customers. Here there is a fixed existing demand of which every company wants a share.

The Blue Ocean, on the other hand, is an uncontested market place that creates demand for itself and is not known to others. This makes competition irrelevant. The focus is on creating, not competing.

Value Innovation :

Value innovation occurs when a company aligns innovation with utility, price and cost positions. Instead of using competition as the benchmark, companies focus on taking leaps in value for customers.

The idea behind value innovation is to break out of the value-cost trade-off.

Reducing Costs :

Reduced costs for the products are achieved by eliminating and reducing the factors that the conventional industry competes on.

The best example to illustrate this is the case study of the Ford Model T.

Ford eliminated all factors like multiple colors and design variants and focused only on creating better cars for the masses.

Identifying Blue Oceans :

Identifying blue oceans requires the managers and strategists of the company to brainstorm on the strategy canvas, where each manager holds his/her department accountable.

The strategy canvas’ focus must be shifted from competition to alternatives and from customers to non-customers.

Reconstruct Market Boundaries :

The authors propose a 6 step framework for identifying blue oceans in new market places :

  1. Look across alternative industries
  2. Look across strategic groups within industries
  3. Look across complementary product and service offerings
  4. Look across the chain of buyers
  5. Look across functional and emotional appeal to buyers
  6. Look across time

Reaching Beyond Existing Demands

To reach the customers in new markets, think of non-customers before customer differentiations.

There are 3 tiers of non-customers :

  1. Jump Ship : These can switch to competitors at any moment.
  2. Refusing : These are using competitors’ products.
  3. Distant : The product doesn’t appeal to these customers.

Examples of Blue Ocean Strategies Implemented by Famous Companies :

  1. Ford :

Ford standardized the car and limited the options. This increased the quality of the car and brought the price point down.

2. GM :

General Motors found their blue ocean in making the cars fun, fashionable and comfortable.

3. Watson :

Watson computers introduced tabulators for businesses for the first time. They also introduced leasing pricing models which made it easy for businesses to own a tabulator.

4. Apple :

Apple created Apple II and tapped the new market for ready-made, easy to use personal computers.

5. Dell :

Dell, on the other hand, found its blue ocean by changing the purchasing and delivery experience of the buyer. It allowed customization of the machines according to the needs of the buyer.

It is evident from the above examples that blue oceans are not unleashed by technology innovation per se but by linking technology to elements valued by buyers.

Strategy for Blue Ocean Implementation :

Two views on industry structure are related to strategic actions.

  1. Structuralist View :

Based on the structure-conduct-performance paradigm, in which market structure shapes conduct and performance. This view on strategy deals with making sure that the company is making money in the red oceans.

2. Reconstructionist View :

This view is based on endogenous growth. It focuses on creativity rather than systematic approaches.

This view is responsible for finding blue oceans for the company.

Both views of strategy are necessary to ensure that the company is making money while also exploring new markets to remain competitive in the future.