AutoTSG: learning and synthesis for incident troubleshooting

Authors:

Sai Pramod Upadhyayula,

Arjun Radhakrishna,

Anurag Gupta

ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

Pages 1477 - 1488

https://doi.org/10.1145/3540250.3558958

Published: 09 November 2022

Abstract

Incident management is a key aspect of operating large-scale cloud services. To enable faster and more efficient resolution of incidents, engineering teams document frequent troubleshooting steps in the form of Troubleshooting Guides (TSGs) for use by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute the necessary steps. This leads to issues such as on-call fatigue, reduced productivity, and human error. In this work, we conduct a large-scale empirical study of over 4K TSGs mapped to incidents and find that TSGs are widely used and significantly reduce mitigation effort. We then analyze feedback on TSGs provided by 400+ OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To close these gaps, we investigate the automation of TSGs and propose AutoTSG, a novel framework that converts TSGs into executable workflows by combining machine learning and program synthesis. Our evaluation of AutoTSG on 50 TSGs shows its effectiveness in both identifying TSG statements (accuracy 0.89) and parsing them for execution (precision 0.94 and recall 0.91). Lastly, we survey ten Microsoft engineers and show the importance of TSG automation and the usefulness of AutoTSG.
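
For readers less familiar with the evaluation metrics quoted in the abstract, the following minimal sketch shows how accuracy, precision, and recall are conventionally computed from binary labels. The function name `binary_metrics` and the example labels are hypothetical and chosen only for illustration; they are not the paper's data or its evaluation code.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")

    # Count the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    total = len(y_true)
    return {
        # Fraction of all predictions that are correct.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Fraction of predicted positives that are truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Fraction of true positives that were found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels, purely illustrative (not from the paper).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision and recall are reported separately because they penalize different errors: a spurious prediction lowers precision, while a missed item lowers recall.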

Published In

ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

November 2022

1822 pages

ISBN: 9781450394130

DOI: 10.1145/3540250

General Chair: Abhik Roychoudhury, National University of Singapore, Singapore

Program Chairs: Cristian Cadar, Imperial College London, UK; Miryung Kim, University of California at Los Angeles, USA

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ESEC/FSE '22

ESEC/FSE '22: 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

November 14 - 18, 2022

Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%
