Web Application Failure

First, xMatters integrates with hundreds of IT systems and monitoring tools so that it can ingest signals and automate incident response. In this example, an event is triggered in New Relic, which sends relevant New Relic data to xMatters. There seems to be an issue with the Inventory Order Service.

Before notifying resolvers, xMatters attempts to trigger a “Restart Service” runbook in Ansible Tower. The goal is to avoid manual intervention that creates toil, and to prevent this event from escalating into an incident. In this case the restart failed, and xMatters updated the JIRA issue and Slack channel to keep everyone informed, regardless of which tool they are using.
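The remediate-first, escalate-on-failure logic described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the xMatters or Ansible Tower API: the Event class, handle_event, and the run_runbook and notify_resolvers callables are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    """Hypothetical stand-in for an incoming monitoring event."""
    service: str
    summary: str
    updates: List[str] = field(default_factory=list)  # notes for tracking tools


def handle_event(event: Event,
                 run_runbook: Callable[[str], bool],
                 notify_resolvers: Callable[[Event], None]) -> str:
    """Try automated remediation first; engage people only if it fails."""
    try:
        restarted = run_runbook(event.service)
    except Exception as exc:  # a crashed runbook counts as a failed attempt
        restarted = False
        event.updates.append(f"Runbook error: {exc}")

    if restarted:
        event.updates.append("Restart succeeded; no escalation needed.")
        return "remediated"

    # Record the failure so every tool shows the same status, then escalate.
    event.updates.append("Restart failed; engaging on-call resolvers.")
    notify_resolvers(event)
    return "escalated"
```

In the scenario above, run_runbook would return False for the Inventory Order Service, so the event would be escalated with the failed restart already recorded in its updates.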

Because the auto-remediation attempt has failed, xMatters automates the engagement process by identifying the team that owns the impacted service and escalating based on self-service on-call calendars. Once responders are identified, they are notified across multiple channels according to their personal preferences.

The notification they receive contains data from New Relic, but xMatters also provides Signal Enrichment automation by adding potentially relevant data from other tools in the organization.

In this example, in addition to the New Relic alert details, xMatters provides additional insights that automate much of the otherwise manual triage work. We see that there are ~7500 messages stuck in infrastructure queues, and that new code related to this service was recently committed and deployed into production. Lastly, we give technical teams the insight that xMatters already attempted to restart the service, but the restart failed due to an error.
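As a rough sketch of what signal enrichment amounts to, the function below attaches context from several other tools to the original alert before it is sent out. It assumes each tool is wrapped in a simple lookup function; enrich_alert, the enricher names, and the field names are hypothetical and do not reflect how xMatters implements this.

```python
from typing import Any, Callable, Dict

# Hypothetical: an enricher takes a service name and returns context data.
Enricher = Callable[[str], Dict[str, Any]]


def enrich_alert(alert: Dict[str, Any],
                 enrichers: Dict[str, Enricher]) -> Dict[str, Any]:
    """Return a copy of the alert with context gathered from other tools."""
    enriched = dict(alert)  # leave the original alert untouched
    context: Dict[str, Any] = {}
    for name, fetch in enrichers.items():
        try:
            context[name] = fetch(alert["service"])
        except Exception as exc:
            # One unavailable tool should not block the notification.
            context[name] = {"error": str(exc)}
    enriched["context"] = context
    return enriched


alert = {"service": "inventory-order-service", "source": "New Relic"}
enriched = enrich_alert(alert, {
    "queues": lambda service: {"stuck_messages": 7500},
    "deployments": lambda service: {"recent_deploy": True},
    "remediation": lambda service: {"restart_attempted": True, "succeeded": False},
})
```

Catching errors per enricher is a deliberate choice in this sketch: a responder is better served by a partially enriched alert than by a delayed one.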

This automation can be rolled out quickly with no code thanks to the xMatters Flow Designer. Flow Designer brings existing tools together, extending their value by automating actions across teams, processes, and tools.

The alerts are actionable, not just contextual.

In response to the enriched alert, an SRE on the development team understands that the impact may be widespread and that a Major Incident may need to be declared.

The SRE's response automatically updates all relevant tools, such as JIRA, Slack, and ServiceNow, to show that a Major Incident is linked to this issue. Day to day, the SRE does not work much in the corporate standard ITSM platform, so the response automatically triggers a Major Incident in ServiceNow, following the response playbook defined by the Operations teams.

As important as it is to have all information everywhere, it is equally important for an incident commander to have all the information in one place for frictionless incident management.

Incident commanders can seamlessly manage the status of technical engagements. Commanders can clearly see which teams are being engaged and which services they support. They can also proactively trigger quick communications to subscribers or key stakeholders.

Incident commanders can see and interact with all collaboration channels related to this incident, since certain teams prefer to collaborate on their chosen platforms.

As the incident progresses, xMatters builds a consolidated incident timeline that captures activity across teams and tools in one place. We can see activity in ServiceNow, JIRA, MS Teams, and Slack, along with notifications, responses, and escalations, all in a single timeline.
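Conceptually, a consolidated timeline is a merge of per-tool activity feeds ordered by time. The sketch below shows that idea with the Python standard library, assuming each feed is already sorted by timestamp; TimelineEntry and consolidate are hypothetical names, and nothing here describes how xMatters stores or merges its data.

```python
import heapq
from datetime import datetime
from typing import Iterable, List, NamedTuple


class TimelineEntry(NamedTuple):
    """Hypothetical record of one activity in one tool."""
    timestamp: datetime
    source: str
    message: str


def consolidate(*feeds: Iterable[TimelineEntry]) -> List[TimelineEntry]:
    """Merge per-tool feeds into one chronological timeline.

    Assumes every feed is already sorted by timestamp, which is what
    heapq.merge requires to produce a correctly ordered result.
    """
    return list(heapq.merge(*feeds, key=lambda entry: entry.timestamp))
```

If a feed cannot be guaranteed to arrive in order, sorting the concatenated entries by timestamp is the simpler and safer alternative.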

The incident management console also displays all impacted services. It features Service Intelligence, which gives a clear understanding of the services impacted by the incident and of the related services they depend on in complex environments.

Service dependencies can be viewed right from the incident console to visualize the potential root cause, and to run service-centric automations that minimize the impact of the major incident and the manual effort required of incident response teams.
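To make the idea of walking service dependencies concrete, here is a minimal sketch that lists everything an impacted service depends on, directly or indirectly, using a breadth-first traversal. The dependency map and every service name other than the inventory order service are invented for illustration.

```python
from collections import deque
from typing import Dict, List


def dependencies_of(service: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """List all direct and transitive dependencies, nearest first."""
    seen = {service}
    ordered: List[str] = []
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dependency in depends_on.get(current, []):
            if dependency not in seen:  # also guards against cycles
                seen.add(dependency)
                ordered.append(dependency)
                queue.append(dependency)
    return ordered


# Hypothetical dependency map for the scenario in this walkthrough.
depends_on = {
    "inventory-order-service": ["message-queue", "inventory-db"],
    "message-queue": ["storage"],
    "inventory-db": ["storage"],
}
candidates = dependencies_of("inventory-order-service", depends_on)
```

Listing the nearest dependencies first gives responders a reasonable order in which to check candidates for the root cause.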

In addition to viewing the related services, an incident commander can easily see each service owner and engage them in the incident response process.

When a major IT incident poses a business continuity threat, a business continuity response plan can be triggered as an automation. In this case, our inventory order service being down is impacting the organization’s ability to order essential inventory.

In this case a supply chain business continuity response plan is triggered to engage cross-functional business continuity teams that are outside of IT.

Everbridge Crisis Management plans automate business continuity response SOPs, assign tasks to cross-functional teams, and offer real-time dashboards of task status and risks.

Once the IT issue is identified and addressed, xMatters offers a detailed Post Incident report that gathers key metrics, such as mean time to detect, respond, and resolve, to offer data-driven insights for critical thinking and continual process improvement. These reports also include a detailed incident timeline and after actions, and are exportable to save the time and manual effort of gathering and curating incident data from multiple sources.
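For readers unfamiliar with these metrics, the sketch below shows one common way to compute them from incident timestamps. Definitions vary between organizations; this example assumes time to detect runs from start to detection, time to respond from detection to acknowledgement, and time to resolve from start to resolution. IncidentRecord and mean_times are hypothetical names, not part of any xMatters report format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, Iterable, List


@dataclass
class IncidentRecord:
    """Hypothetical set of timestamps captured for one incident."""
    started: datetime
    detected: datetime
    acknowledged: datetime
    resolved: datetime


def _average(deltas: Iterable[timedelta]) -> timedelta:
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mean_times(incidents: List[IncidentRecord]) -> Dict[str, timedelta]:
    """Compute mean time to detect, respond, and resolve.

    Raises statistics.StatisticsError if the list of incidents is empty.
    """
    return {
        "detect": _average(i.detected - i.started for i in incidents),
        "respond": _average(i.acknowledged - i.detected for i in incidents),
        "resolve": _average(i.resolved - i.started for i in incidents),
    }
```

Whichever definitions are chosen, applying them consistently across incidents matters more than the exact boundaries, since the value of these metrics lies in comparing them over time.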