Burst buffer

In high-performance computing, a burst buffer is a fast intermediate storage layer positioned between the front-end computing processes and the back-end storage systems. It bridges the performance gap between the processing speed of the compute nodes and the input/output (I/O) bandwidth of the storage systems. Burst buffers are often built from arrays of high-performance storage devices, such as NVRAM and SSDs, and typically offer one to two orders of magnitude higher I/O bandwidth than the back-end storage systems.

Use cases

Burst buffers accelerate scientific data movement on supercomputers. The life cycle of a scientific application typically alternates between computation phases and I/O phases.[1] After each round of computation (the computation phase), all the computing processes concurrently write their intermediate data to the back-end storage systems (the I/O phase), and then begin another round of computation and data movement. With burst buffers deployed, processes can instead write their data quickly to a burst buffer after each round of computation, rather than to the slower hard-disk-based storage system, and proceed immediately to the next round of computation without waiting for the data to reach the back-end storage system;[2][3] the data are then flushed asynchronously from the burst buffer to the storage system during the next round of computation. In this way, the long I/O time spent moving data to the storage system is hidden behind the computation time. Buffering data in a burst buffer also gives applications the opportunity to reshape the data traffic sent to the back-end storage systems so that their bandwidth is used more efficiently.[4][5]

In another common use case, scientific applications stage their intermediate data in and out of the burst buffer without interacting with the slower storage systems. Bypassing the storage systems allows applications to realize most of the performance benefit of the burst buffer.[6]
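The way asynchronous flushing hides I/O time behind computation can be illustrated with a small timing model. The following Python sketch is not taken from any of the cited systems; the function names, the parameters, and the assumption that the burst buffer holds exactly one round of output (so a new write must wait until the previous flush has finished) are all hypothetical simplifications chosen for illustration.

```python
def runtime_without_buffer(rounds, compute_s, data_gb, pfs_gbps):
    """Each round computes, then blocks while writing to the back-end storage."""
    return rounds * (compute_s + data_gb / pfs_gbps)


def runtime_with_buffer(rounds, compute_s, data_gb, pfs_gbps, bb_gbps):
    """Each round computes, writes to the burst buffer, and moves on.

    The flush of round i to the back-end storage overlaps with the
    computation of round i + 1. Assumption: the buffer has room for one
    round of output, so a write waits for the previous flush to finish.
    """
    clock = 0.0        # wall-clock time seen by the application
    flush_done = 0.0   # time at which the previous flush completes
    for _ in range(rounds):
        clock += compute_s                 # computation phase
        clock = max(clock, flush_done)     # wait if the buffer is still draining
        clock += data_gb / bb_gbps         # fast write into the burst buffer
        flush_done = clock + data_gb / pfs_gbps  # background flush starts now
    # The job is complete only once the last flush has reached the back end.
    return max(clock, flush_done)


if __name__ == "__main__":
    # Hypothetical numbers: 10 rounds, 100 s of compute per round,
    # 1000 GB written per round, 10 GB/s back end, 100 GB/s burst buffer.
    print(runtime_without_buffer(10, 100.0, 1000.0, 10.0))
    print(runtime_with_buffer(10, 100.0, 1000.0, 10.0, 100.0))
```

In this model the benefit shrinks when the flush time exceeds the computation time, because the application then stalls waiting for the buffer to drain; the back-end bandwidth still bounds sustained throughput.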

Representative burst buffer architectures

There are two representative burst buffer architectures in high-performance computing: the node-local burst buffer and the remote shared burst buffer. In the node-local architecture, burst buffer storage is located on the individual compute nodes, so the aggregate burst buffer bandwidth grows linearly with the number of compute nodes. This scalability benefit has been well documented in the literature.[7][8][9][10] It also creates the need for a scalable metadata management strategy that maintains a global namespace for data distributed across all the burst buffers.[11][12] In the remote shared architecture, burst buffer storage resides on a smaller number of I/O nodes positioned between the compute nodes and the back-end storage systems, and data moving between the compute nodes and the burst buffer must cross the network. Placing the burst buffer on the I/O nodes allows the burst buffer service to be developed, deployed and maintained independently. Several well-known commercial software products, such as DataWarp and Infinite Memory Engine, have been developed to manage this type of burst buffer. As supercomputers are deployed with multiple heterogeneous burst buffer layers, such as NVRAM on the compute nodes and SSDs on the dedicated I/O nodes, there is a need to move data transparently across multiple storage layers.[13][14][15]
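One simple way to picture a global namespace over node-local burst buffers is to partition the metadata by hashing each file path to an owning node. The Python sketch below is only an illustration of that general idea; it does not describe the design of the cited metadata systems, and the names `owner_node` and `GlobalNamespace`, as well as the in-memory dictionaries standing in for per-node metadata shards, are hypothetical.

```python
import hashlib


def owner_node(path: str, num_nodes: int) -> int:
    """Map a file path to the node that owns its metadata (hash partitioning)."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class GlobalNamespace:
    """Toy metadata service: one dictionary per node acts as that node's shard."""

    def __init__(self, num_nodes: int):
        self.num_nodes = num_nodes
        self.shards = [dict() for _ in range(num_nodes)]

    def record(self, path: str, data_node: int, offset: int, length: int) -> None:
        """Register that `data_node` buffers bytes [offset, offset + length) of `path`."""
        shard = self.shards[owner_node(path, self.num_nodes)]
        shard.setdefault(path, []).append((data_node, offset, length))

    def lookup(self, path: str) -> list:
        """Return the (data_node, offset, length) segments recorded for `path`."""
        shard = self.shards[owner_node(path, self.num_nodes)]
        return list(shard.get(path, []))


if __name__ == "__main__":
    ns = GlobalNamespace(num_nodes=4)
    # Two processes on different nodes write different segments of one file.
    ns.record("/ckpt/step1", data_node=2, offset=0, length=1024)
    ns.record("/ckpt/step1", data_node=3, offset=1024, length=1024)
    print(ns.lookup("/ckpt/step1"))
```

Because any node can compute the owner of a path locally, a lookup needs to contact only one metadata shard, which is one reason hash partitioning is a common starting point for scalable metadata designs.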

Supercomputers deployed with burst buffer

Burst buffers have been widely deployed on leadership-scale supercomputers. For example, node-local burst buffers have been installed on the DASH supercomputer at the San Diego Supercomputer Center,[16] the Tsubame supercomputers at Tokyo Institute of Technology, the Theta and Aurora supercomputers at Argonne National Laboratory, the Summit supercomputer at Oak Ridge National Laboratory, and the Sierra supercomputer at Lawrence Livermore National Laboratory. Remote shared burst buffers have been adopted by the Tianhe-2 supercomputer at the National Supercomputer Center in Guangzhou, the Trinity supercomputer at Los Alamos National Laboratory, the Cori supercomputer at Lawrence Berkeley National Laboratory, and the ARCHER2 supercomputer at the Edinburgh Parallel Computing Centre.

References

  1. ^ Liu, Zhuo; Lofstead, Jay; Wang, Teng; Yu, Weikuan (September 2013). "A Case of System-Wide Power Management for Scientific Applications". 2013 IEEE International Conference on Cluster Computing (CLUSTER). IEEE. pp. 1–8. doi:10.1109/CLUSTER.2013.6702681. ISBN 978-1-4799-0898-1. S2CID 6156410.
  2. ^ Wang, Teng; Oral, Sarp; Wang, Yandong; Settlemyer, Brad; Atchley, Scott; Yu, Weikuan (October 2014). "BurstMem: A High-Performance Burst Buffer System for Scientific Applications". 2014 IEEE International Conference on Big Data (Big Data). IEEE. pp. 71–79. doi:10.1109/BigData.2014.7004215. ISBN 978-1-4799-5666-1. OSTI 1150929. S2CID 16764901.
  3. ^ Liu, Ning; Cope, Jason; Carns, Philip; Carothers, Christopher; Ross, Robert; Grider, Gary; Crume, Adam; Maltzahn, Carlos (April 2012). "On the Role of Burst Buffers in Leadership-Class Storage Systems". 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST). IEEE. pp. 1–11. doi:10.1109/MSST.2012.6232369. ISBN 978-1-4673-1747-4. S2CID 9676920.
  4. ^ Wang, Teng; Oral, Sarp; Pritchard, Michael; Wang, Bin; Yu, Weikuan (September 2015). "TRIO: Burst Buffer Based I/O Orchestration". 2015 IEEE International Conference on Cluster Computing. IEEE. pp. 194–203. doi:10.1109/CLUSTER.2015.38. ISBN 978-1-4673-6598-7. OSTI 1265517. S2CID 12482308.
  5. ^ Kougkas, Anthony; Dorier, Matthieu; Latham, Rob; Ross, Rob; Sun, Xian-He (March 2017). "Leveraging Burst Buffer Coordination to Prevent I/O Interference". 2016 IEEE 12th International Conference on e-Science (E-Science). IEEE. pp. 371–380. doi:10.1109/eScience.2016.7870922. ISBN 978-1-5090-4273-9. OSTI 1366308. S2CID 14514395.
  6. ^ Wang, Teng; Mohror, Kathryn; Moody, Adam; Sato, Kento; Yu, Weikuan (November 2016). "An Ephemeral Burst-Buffer File System for Scientific Applications". SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE. pp. 807–818. doi:10.1109/SC.2016.68. ISBN 978-1-4673-8815-3. S2CID 260667.
  7. ^ "BurstFS: A Distributed Burst Buffer File System for Scientific Applications" (PDF). November 2015.
  8. ^ Moody, Adam; Bronevetsky, Greg; Mohror, Kathryn; Supinski, Bronis R. de (November 2010). "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System". 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. ACM. pp. 1–11. doi:10.1109/SC.2010.18. ISBN 978-1-4244-7557-5. S2CID 7352923.
  9. ^ Rajachandrasekar, Raghunath; Moody, Adam; Mohror, Kathryn; Panda, Dhabaleswar K. (DK) (June 2013). "A 1 PB/s File System to Checkpoint Three Million MPI Tasks" (PDF). Proceedings of the 22nd international symposium on High-performance parallel and distributed computing - HPDC '13. ACM. p. 143. doi:10.1145/2493123.2462908. ISBN 9781450319102.
  10. ^ Zhao, Dongfang; Zhang, Zhao; Zhou, Xiaobing; Li, Tonglin; Wang, Ke; Kimpe, Dries; Carns, Philip; Ross, Robert; Raicu, Ioan (October 2014). "FusionFS: Toward supporting data-intensive scientific applications on extreme-scale high-performance computing systems". 2014 IEEE International Conference on Big Data (Big Data). IEEE. pp. 61–70. doi:10.1109/BigData.2014.7004214. ISBN 978-1-4799-5666-1. S2CID 5288472.
  11. ^ Wang, Teng; Moody, Adam; Zhu, Yue; Mohror, Kathryn; Sato, Kento; Islam, Tanzima; Yu, Weikuan (May 2017). "MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers". 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE. pp. 1174–1183. doi:10.1109/IPDPS.2017.39. ISBN 978-1-5386-3914-6. S2CID 8148699.
  12. ^ Li, Tonglin; Zhou, Xiaobing; Brandstatter, Kevin; Zhao, Dongfang; Wang, Ke; Rajendran, Anupam; Zhang, Zhao; Raicu, Ioan (May 2013). "ZHT: A Light-Weight Reliable Persistent Dynamic Scalable Zero-Hop Distributed Hash Table". 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. IEEE. pp. 775–787. CiteSeerX 10.1.1.365.7329. doi:10.1109/IPDPS.2013.110. ISBN 978-1-4673-6066-1. S2CID 16614868.
  13. ^ Wang, Teng; Byna, Suren; Dong, Bin; Tang, Houjun (Sep 2018). "UniviStor: Integrated Hierarchical and Distributed Storage for HPC". 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE. pp. 134–144. doi:10.1109/CLUSTER.2018.00025. ISBN 978-1-5386-8319-4. S2CID 53235423.
  14. ^ "Hermes: a heterogeneous-aware multi-tiered distributed I/O buffering system". ACM. June 2018. doi:10.1145/3208040.3208059. S2CID 47019714.
  15. ^ Tang, Houjun; Byna, Suren; Tessier, Francois; Wang, Teng; Dong, Bin; Mu, Jingqing; Koziol, Quincey; Soumagne, Jerome; Vishwanath, Venkatram; Liu, Jialin; Warren, Richard (May 2018). "Toward Scalable and Asynchronous Object-centric Data Management for HPC". 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE. pp. 113–122. doi:10.1109/CCGRID.2018.00026. ISBN 978-1-5386-5815-4. S2CID 13811397.
  16. ^ He, Jiahua; Jagatheesan, Arun; Gupta, Sandeep; Bennett, Jeffrey; Snavely, Allan (November 2010). "DASH: a Recipe for a Flash-based Data Intensive Supercomputer" (PDF). 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. ACM. pp. 1–11. doi:10.1109/SC.2010.16. ISBN 978-1-4244-7557-5. S2CID 7349294.