High-Availability Distributed Storage Systems

A Benchmarking Framework for High-Availability Distributed Storage Systems

Sponsored by NSF-CNS

Abstract:

The availability and robustness of the I/O system are crucial to large-scale applications that generate and analyze terabytes of data. Storage systems are vulnerable to numerous hardware failures (such as I/O and metadata server crashes) and contribute to as much as 25% of all system failures. Highly available data storage is becoming increasingly critical as high-end computing (HEC) systems scale up in size. A key challenge in building highly available storage systems is characterizing their availability in addition to their performance.

This research investigates high-availability data and I/O services and how to benchmark them. The investigators take an organized approach to developing a benchmarking framework that measures storage performance together with availability under various fault conditions. The research involves four tasks: 1) develop a fault/error model and design fault injection schemes for storage systems; 2) develop an innovative benchmarking framework for high-availability distributed storage systems under different fault conditions; 3) implement an Availability and Performance Evaluation Toolset that integrates the fault injection and stress testing libraries and captures the raw block-level performance of storage systems under various faults; 4) validate the benchmarking framework by applying the toolset to block-level storage systems.
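The idea of measuring performance and availability together under injected faults can be illustrated with a minimal sketch. It is not the project's toolset: the in-memory device, the random fault model, the simulated service and timeout costs, and every name in it (FaultyBlockDevice, run_benchmark, RunResult) are assumptions made for illustration only.

```python
import random
from dataclasses import dataclass


@dataclass
class RunResult:
    requests: int
    succeeded: int
    elapsed_ms: float

    @property
    def availability(self) -> float:
        # Request-level availability: fraction of requests that succeeded.
        return self.succeeded / self.requests if self.requests else 0.0

    @property
    def throughput(self) -> float:
        # Successful requests per simulated millisecond.
        return self.succeeded / self.elapsed_ms if self.elapsed_ms else 0.0


class FaultyBlockDevice:
    """Hypothetical in-memory block device with a simple injected fault rate."""

    def __init__(self, num_blocks: int, fault_rate: float, rng: random.Random):
        self.blocks = [bytes(512) for _ in range(num_blocks)]
        self.fault_rate = fault_rate
        self.rng = rng

    def read(self, index: int) -> bytes:
        # Fault injection: fail the read with probability fault_rate.
        if self.rng.random() < self.fault_rate:
            raise IOError("injected read fault")
        return self.blocks[index]


def run_benchmark(fault_rate: float, requests: int = 10_000, seed: int = 0) -> RunResult:
    """Issue random block reads and record availability and throughput."""
    service_ms = 1.0  # assumed cost of a successful read
    timeout_ms = 5.0  # assumed cost of a failed read
    rng = random.Random(seed)
    device = FaultyBlockDevice(num_blocks=1024, fault_rate=fault_rate, rng=rng)
    succeeded = 0
    elapsed = 0.0
    for _ in range(requests):
        try:
            device.read(rng.randrange(1024))
        except IOError:
            elapsed += timeout_ms
            continue
        succeeded += 1
        elapsed += service_ms
    return RunResult(requests, succeeded, elapsed)


if __name__ == "__main__":
    for rate in (0.0, 0.05, 0.2):
        result = run_benchmark(rate)
        print(f"fault_rate={rate:.2f} "
              f"availability={result.availability:.3f} "
              f"throughput={result.throughput:.3f} req/ms")
```

Sweeping the fault rate in this way reports one availability value and one throughput value per fault condition, which is the kind of paired measurement the tasks above call for. A real toolset would replace the simulated device and costs with actual block-level I/O and timing.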

This research contributes directly to understanding highly available data and I/O services for HEC systems and to establishing a general benchmarking framework for characterizing storage systems under fault conditions. It thereby benefits society by guiding the development of high-availability-oriented distributed storage systems, which are crucial to many applications.

Personnel

- Investigators

Dr. Xubin He, Virginia Commonwealth University, PI
Dr. Stephen Scott, Oak Ridge National Lab, Co-PI

- Collaborators

Dr. Jizhong Han, Institute of Computing Technology, Chinese Academy of Sciences
Prof. Changsheng Xie, Huazhong University of Science and Technology, China

- Graduate Students

Chentao Wu, Virginia Commonwealth University (PhD student, Fall 2010- )
Xin Chen, Tennessee Tech University (PhD student, Spring 2007-August 2010)
Fang Han, Tennessee Tech University (MS student, Fall 2009-August 2010)
Jeremy Langston, Tennessee Tech University (MS student, Fall 2007- Spring 2009)

- Undergraduate Students

Vladislav Sorkin, Virginia Commonwealth University (Spring-Summer 2011)
David Lyons, Virginia Commonwealth University (Spring 2011)
Ben Eckart, Tennessee Tech University (2007-2008)
James Warren, Tennessee Tech University (2009-Spring 2010)

Recent Publications

G. Wu, X. He, and B. Eckart, "An Adaptive Write Buffer Management Scheme for Flash-based SSD," ACM Transactions on Storage, Vol. 8, No. 1, February 2012.
Xin Chen, Xubin He, He Guo, and Yuxin Wang, “Design and Evaluation of an Online Anomaly Detector for Distributed Storage Systems”, Journal of Software, Vol. 6, No. 12, December 2011, pp. 2379-2390.
C. Wu, X. He, G. Wu, S. Wan, X. Liu, Q. Cao, and C. Xie, "HDP Code: A Horizontal-Diagonal Parity Code to Optimize I/O Load Balancing in RAID-6," Proceedings of the 41st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN2011), June 27-June 30, 2011, Hong Kong, China (acceptance rate: 26/148=17.6%).
S. Wan, Q. Cao, J. Huang, S. Li, X. Li, S. Zhan, L. Yu, C. Xie, and X. He, "Victim Disk First: An Asymmetric Cache to Boost the Performance of Disk Arrays under Faulty Conditions", The USENIX Annual Technical Conference, Portland, OR, June 15-17, 2011 (acceptance rate: 27/180=15%).
Xin Chen, Xubin He, He Guo, and Yuxin Wang, “An Online Performance Anomaly Detector in Cluster File Systems”, the 3rd International Symposium on Parallel Architectures, Algorithms, and Programming (PAAP), December 18-20, 2010.
S. Wan, Q. Cao, C. Xie, B. Eckart, and X. He, "Code-M: A Non-MDS Erasure Code Scheme to Support Fast Recovery from up to Two-Disk Failures in Storage Systems," accepted to appear in the Proceedings of the 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN2010), June 28-July 1, 2010, Chicago, USA.
X. Chen, J. Warren, F. Han, and X. He, "Characterizing the Dependability of Distributed Storage Systems Using a Two-layer Hidden Markov Model-Based Approach," accepted to appear in the Proceedings of the 5th IEEE International Conference on Networking, Architecture, and Storage (NAS), July 15-17, Macau, China. Best Paper Award!
X. Chen, J. Langston, X. He, and F. Mao, “An Adaptive I/O Load Distribution Scheme for Distributed Systems,” The 9th International Workshop on Performance Modeling, Evaluation, and Optimization of Ubiquitous Computing and Networked Systems, to be held in conjunction with IPDPS 2010, April 19-23, 2010.
Ben Eckart, Xubin He, Qishi Wu, and Changsheng Xie, “A Dynamic Performance-Based Flow Control Method for High-Speed Data Transfer”, IEEE Transactions on Parallel and Distributed Systems, January 2010.
Xubin He, Li Ou, Christian Engelmann, Xin Chen, and Stephen Scott, “Symmetric Active/Active Metadata Service for High Availability Parallel File Systems,” Journal of Parallel and Distributed Computing (JPDC), vol. 69, no. 12, December 2009. Preprint: doi:10.1016/j.jpdc.2009.08.004.
Li Ou, Xubin He, and Jizhong Han, “An Effective Design for Fast Memory Registration in RDMA”, Journal of Network and Computer Applications, Vol. 32, no. 3, 2009.
James Warren, Xin Chen, and Xubin He, "Analysis and Investigation of Tools to Effectively Gather Data for Benchmarking Distributed File System Availability," the 20th Annual Argonne Symposium for Undergrad Research, Chicago, IL, November 13, 2009.
Xin Chen, Jeremy Langston, Xubin He, and Stephen Scott, “Design and Evaluation of a User-Oriented Availability Benchmark for Distributed File Systems,” the 21st IASTED International Conf. on Parallel and Distributed Computing and Systems, Cambridge, MA, Nov. 2-4, 2009.
X. Liu, J. Han, C. Han, Y. Zhong, and X. He, "Implementing WebGIS on Hadoop: A Case Study of Improving Small File I/O Performance on HDFS," IEEE Cluster, New Orleans, LA, August 31-September 3, 2009.
Jeremy Langston, Guanying Wu, and Xubin He, “Evaluation of the Impacts of Data Hot-Spots in Disk Arrays on Performance and Availability,” 41st IEEE Southeastern Symposium on System Theory, Tullahoma, TN, March 15-17, 2009.
Ben Eckart, Xin Chen, Xubin He, and Stephen Scott, “Failure Prediction Models for Proactive Fault Tolerance Within Storage Environment”, Proceedings of the 16th Annual Meeting of the IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS’2008), September 8-10, Baltimore, MD.
Xin Chen and Xubin He, "Tolerating Temporal Correlated Failures from Cyclic Dependency in High Performance Computing Systems," Proc. of the 14th IEEE International Conference on Parallel and Distributed Systems (ICPADS'2008), Australia, Dec. 8-10, 2008.
X. Chen, B. Eckart, X. He, C. Engelmann, and S. L. Scott. An online controller towards self-adaptive file system availability and performance. In Proceedings of the 5th High Availability and Performance Workshop (HAPCW) 2008, in conjunction with the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, April 3-4, 2008.
Ben Eckart, Xin Chen, and Xubin He, “A Failure Prediction Model for Disk Arrays”, High Performance Computer Science Symposium, Denver, April 2-4, 2008.

Thesis/Dissertation

[PhD] Xin Chen, "Dependability Modeling and Benchmarking for Distributed Storage Systems", Date Graduated: August 2010. First employment after graduation: Dell Inc., Austin, TX.

[MS] Jeremy Langston, "Availability and Performance Analysis of Data Hot-spots in Distributed Storage Systems", Date Graduated: May 2009. First employment after graduation: Redstone Technical Test Center, Huntsville, AL.

Sponsor

National Science Foundation (NSF)