research-article
Open access
Just Accepted

MasterPlan: A Reinforcement Learning Based Scheduler for Archive Storage

Online AM: 17 December 2024

Abstract

With the sheer volume of data in today’s world, archive storage systems play a significant role in persisting cold data. Due to stringent cost concerns, one popular design is to organize disks into groups and periodically switch which groups are powered on to serve user requests. Scheduling thus becomes critical for both CapEx and performance. Unfortunately, field results indicate that existing schedulers are often suboptimal. Our further analysis suggests that the main reason is the mismatch between ever-changing workloads and the fixed set of coarsely configured parameters in current heuristic-based schedulers.
In this paper, we propose MasterPlan, a reinforcement learning (RL) based scheduler for archive storage systems. By identifying the unique characteristics of archive storage services, we design a state space and reward function for the RL agent. MasterPlan includes a continuous action encoding approach to guarantee efficient exploration, and a meta adaptation module to extract features of workload series. Experiments show that, compared to existing solutions, MasterPlan improves throughput by 1.25×, 99th-percentile latency by 2.16×, and power draw by 1.47×.

Published In

ACM Transactions on Architecture and Code Optimization Just Accepted
EISSN:1544-3973
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Archive storage system
  2. reinforcement learning
