BENCHMARKING OF CLOUD STORAGE SYSTEM (Master's Thesis)

flydragonfly

Tags: system, benchmarking, storage system testing, performance, file

Article link: https://blog.csdn.net/flydragonfly/article/details/5949317

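This document is a master's thesis on benchmarking cloud storage systems, with a focus on private cloud storage. The author introduces the concept of cloud storage and the difference between public and private clouds, and discusses different cloud storage solutions. Through a survey and an experimental implementation, the author models the tool with UML, choosing Java as the development language and NetBeans as the IDE. The thesis describes in detail the benchmark tool that was implemented, including read, write, re-read and re-write tests, and benchmarks the DsNet system to obtain its read and write performance. Although the performance of DsNet leaves room for improvement, the author points out that it may improve as more computers are added.

Summary generated automatically by CSDN.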

1. Introduction

1.1 My Subject: Cloud Storage

1.2 Cloud Storage Technologies

1.2.1 What is cloud storage

1.2.2 Public and Private Cloud Storage

1.2.3 Cloud Storage Solutions

1.3 Description of IRCCyN

1.3.1 Presentation

1.3.2 Research Field

1.4 The role of my adviser

2. Bibliography

2.1 Introduction

2.2 Relevant Techniques and Concepts

2.2.1 Cloud Computing and Cloud storage

2.2.2 Distributed file system

2.2.3 Storage Virtualization

2.2.4 Storage performance

2.3 Conclusion

3. Modeling

3.1 Introduction

3.1.1 Definitions of the Tests

3.1.2 Benchmark Functions

3.2 UML

3.2.1 Use Case Diagram

3.2.2 Class Diagram

3.2.3 Sequence Diagram

3.3 Development Environment

3.3.1 Operating System: Linux

3.3.2 Programming Language: Java

3.3.3 IDE: NetBeans

4. Realization

4.1 Introduction

4.1.1 All Test

4.1.2 Specific Test

4.1.3 Throughput Test

4.2 Implementation

4.2.1 Read Test

4.2.2 Write Test

4.2.3 Reread Test

4.2.4 Rewrite Test

4.2.5 Random read test

4.2.6 Random write test

4.3 Running

4.3.1 Main Window

4.3.2 All Test Window

4.3.3 Specific Test Window

4.3.4 Throughput Test Window

4.4 Benchmark of DsNet

4.4.1 Introduction of DsNet

4.4.2 Installation of DsNet

4.4.3 Benchmark with Small Files

4.4.4 Influence of Memory Cache and Disk Cache

4.4.4 Benchmark with Large Files and Make Comparison

4.5 Conclusion of Benchmark

5. Conclusion

5.1 What I have done

5.2 Unaccomplished Work

5.2 My Future Work

6. References

1. Introduction

1.1 My Subject

My internship lasted six months, from March 2010 to August 2010. My subject is the benchmarking of cloud storage systems. Cloud storage is a model of networked storage in which data is stored on multiple virtual servers. I explain the concept of cloud storage in more detail later in this chapter.

My internship is connected with the company Fizians, which works in the field of cloud storage. At present, Fizians has prototyped its cloud storage technology by developing its own system.

Fizians has two main needs today:

• Validate (assess) the scalability of its solution,

• Benchmark competitors' solutions.

The tasks of my internship include:

• Carry out a survey of private cloud storage,

• Design tools to benchmark competitors' solutions (Parascale and Cleversafe).

1.2 Cloud Storage Technologies

1.2.1 What is cloud storage

Cloud storage is a natural extension of SaaS (Software-as-a-Service; applications like Salesforce delivered over the web as a service) and cloud computing (CPU cycles available for rent over the web). Made popular by Google, Amazon and VMware, a cloud computing architecture is defined as: “The architecture behind cloud computing is a massive network of "cloud servers" interconnected as if in a grid running in parallel, sometimes using the technique of virtualization to maximize the utilization of the computing power available per server.”

Cloud storage technologies have developed rapidly in recent years, but some challenges remain. Cloud storage combines various devices, multiple applications and diverse services. It must develop along with other technologies, such as broadband networks, Web 2.0, distributed file systems and storage virtualization. Even so, cloud storage is expected to improve storage performance and make data more manageable.

At the beginning of my internship, I searched the Internet and books for material on cloud storage technologies in order to become familiar with the field, which was very helpful for my later work. I first studied the definition, concepts and characteristics of cloud storage. With this basic theoretical knowledge, I then studied the categories and modes of cloud storage and analyzed the existing solutions and architectures.

1.2.2 Public and Private Cloud Storage

There are two kinds of cloud storage: public and private. The difference between private and public storage clouds is simple. A public cloud is offered as a service, usually over an Internet connection. Private clouds are deployed inside a firewall and managed by the user organization. This simple difference leads to very different experiences and capabilities for the end user.^[7]

Public clouds typically charge a monthly usage fee per GB, combined with bandwidth transfer charges. Users can scale the storage on demand and do not need to purchase storage hardware. Service providers manage the infrastructure and pool resources into capacity that customers can claim.^[7]

Private clouds are built from software running on customer-supplied commodity hardware. The data is typically not shared outside the enterprise, and the organization retains full control. Scaling the cloud is as simple as adding another server to the pool, while the self-managing architecture expands the cloud by adding performance and capacity.^[7]

During my internship, I concentrated mainly on private cloud storage.

1.2.3 Cloud Storage Solutions

Broadly speaking, there are three kinds of cloud storage solution.

(1) Service solution (SV).

With this approach, an organization that wants cloud storage turns to a company that provides it as a service, for example Amazon S3 or Rackspace CF. By buying the service, you do not need to maintain the storage devices and can focus on the applications that use the data. However, many companies are unwilling to store their sensitive data outside their own premises.

(2) Hardware solution (HW).

We can also purchase dedicated hardware from vendors to build cloud storage, for example Cleversafe or EMC Atmos. The hardware is well developed and integrated with its software, so it is convenient to use, but it is not easy to scale and upgrade.

(3) Software solution (SW).

Some companies, such as Parascale, have chosen to develop software solutions. A software solution can provide cloud storage on top of existing storage devices, which is convenient for companies that do not want to buy expensive appliances to build their cloud.

After acquiring this background knowledge, I began to study existing cloud storage systems in order to understand cloud storage better. There are many such systems, each with its own advantages and disadvantages. Before learning about the system developed by Fizians, I studied other existing systems, such as Parascale, Cleversafe dsNet, Rackspace (Cloud FS), EMC Atmos, Google FS, Amazon S3, Hadoop HDFS, Nirvanix and pNFS.

1.3 Description of IRCCyN

I spent six months in the laboratory (IRCCyN) for my internship. In this section, I describe the laboratory where I did my research.

1.3.1 Presentation

The Research Institute in Communications and Cybernetics of Nantes (IRCCyN) is a joint research unit of the National Center for Scientific Research (CNRS UMR 6597). It is attached primarily to the Institute of Engineering Sciences and Systems (INSIS) and secondarily to the Institute of Computer Science and their Interactions (INS2I) and the Institute of Biological Sciences (INSB). Its local supervising institutions are Ecole Centrale de Nantes, Nantes University and the Ecole des Mines de Nantes.

The research of IRCCyN aims not only to produce new knowledge, as all research does, but also has a strongly technology-oriented mission: developing methods and tools that provide solutions to practical problems arising from economic and social issues. This proactive policy particularly benefits the development of industrial and service companies. In turn, it is a source of research problems that are investigated by IRCCyN researchers.

1.3.2 Research Field

The research carried out at IRCCyN covers a very broad scientific area, which includes:

• automation of complex systems

• signal and image processing

• video communications and handwriting processing

• robotics and articulated mechanical systems

• computer-aided mechanical design

• customer-oriented design

• modeling and optimization of production processes

• virtual engineering for the improvement of industrial performance

• real-time systems

• modeling and verification of embedded systems

• logistics and production systems

• discrete event systems

• cognitive psychology and ergonomics

1.4 The role of my adviser

My adviser for this subject is Pierre EVENOU, a professor at Polytech'Nantes who works in the IRCCyN laboratory. He proposed the subject. At the beginning of my internship, he explained the concept of cloud storage and my subject to me, and told me my responsibilities and tasks. During the internship, whenever I had questions, I turned to him and he gave me suggestions. With his help, I solved some difficult problems in my subject, which was very important for making progress in my research. I would like to express my gratitude to him for his advice and support throughout my internship in the laboratory.

2. Bibliography

2.1 Introduction

In the first chapter, I introduced my subject, described cloud storage and presented the tasks of my internship. In this chapter, I discuss the techniques and concepts that are needed for my work.

First, I explain the concept of cloud computing, which is closely related to cloud storage, and how the two work together. In the second part, I discuss distributed file systems. A cloud storage system is a kind of distributed file system, so discussing distributed file systems helps us understand cloud storage. We then look at storage virtualization, which helps us understand how cloud storage technologies are realized. Finally, we look at storage performance. My goal is to benchmark a cloud storage file system, so this final part helps in defining the benchmark functions.

2.2 Relevant Techniques and Concepts

2.2.1 Cloud Computing and Cloud storage

Cloud storage is a natural extension of SaaS (Software-as-a-Service) and cloud computing, and it is closely related to cloud computing. Therefore, I need to understand something about cloud computing technologies. Cloud computing is Internet-based computing, whereby shared resources, software, and information are provided to computers and other devices on demand, like the electricity grid. Cloud architecture, the systems architecture of the software systems involved in the delivery of cloud computing, typically involves multiple cloud components communicating with each other over application programming interfaces, usually web services. This resembles the Unix philosophy of having multiple programs each doing one thing well and working together over universal interfaces.^[4]

Figure 2-1

Cloud storage is a model of networked computer data storage where data is stored on multiple virtual servers, generally hosted by third parties, rather than being hosted on dedicated servers. Hosting companies operate large data centers, and people who require their data to be hosted buy or lease storage capacity from them and use it for their storage needs. The data center operators, in the background, virtualize the resources according to the requirements of the customer and expose them as virtual servers, which the customers can themselves manage. Physically, the resource may span multiple servers.^[7]

2.2.2 Distributed file system

In computing, a distributed file system or network file system is any file system that allows access to files from multiple hosts sharing via a computer network. This makes it possible for multiple users on multiple machines to share files and storage resources. The client nodes do not have direct access to the underlying block storage but interact over the network using a protocol. This makes it possible to restrict access to the file system depending on access lists or capabilities on both the servers and the clients, depending on how the protocol is designed.^[6]

In contrast, in a shared disk file system all nodes have equal access to the block storage where the file system is located. On these systems the access control must reside on the client. Distributed file systems may include facilities for transparent replication and fault tolerance. That is, when a limited number of nodes in a file system go offline, the system continues to work without any data loss.^[6]

2.2.3 Storage Virtualization

Storage virtualization is a concept in IT system administration, referring to the abstraction (separation) of logical storage from physical storage so that it may be accessed without regard to physical storage or heterogeneous structure. This separation gives system administrators increased flexibility in how they manage storage for end users. Most implementations allow for heterogeneous management of multi-vendor storage devices, within the scope of a given implementation's support matrix. This means that the following capabilities are not limited to a single vendor's device (as with similar capabilities provided by specific storage controllers) and are in fact possible across different vendors' devices.^[8]

Data replication techniques are not limited to virtualization appliances and as such are not described here in detail. However, most implementations will provide some or all of these replication services. When storage is virtualized, these services must be implemented above the software or device that is performing the virtualization. The physical storage resources are aggregated into storage pools, from which the logical storage is created. More storage systems, which may be heterogeneous in nature, can be added as and when needed, and the virtual storage space will scale up by the same amount. The software or device providing storage virtualization becomes a common disk manager in the virtualized environment. Logical disks (vdisks) are created by the virtualization software or device and are mapped (made visible) to the required host or server, thus providing a common place or way for managing all volumes in the environment.

2.2.4 Storage performance

Every data center manager is concerned with storage system performance. Knowing how a system will run in the data center -- and predicting its limitations in your particular environment -- is absolutely essential for data center managers. Solid performance data can be an invaluable guide for making new acquisitions, allowing money to be allocated for the most beneficial products. Performance testing can analyze system behaviors under different or added load conditions, helping managers plan necessary storage and infrastructure upgrades. Testing can also reveal possible bottlenecks or potential problems in storage systems, significantly aiding in the troubleshooting process.

Throughput is the amount of data transferred in a unit of time and is most commonly measured in kilobytes per second (KBps) or megabytes per second (MBps). The throughput in an environment depends on many factors, related to both hardware and software. Among the important factors are Fibre Channel link speed, number of outstanding I/O requests, number of disk spindles, RAID type, SCSI reservations, and caching or prefetching algorithms.
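As a small worked illustration of this definition, the sketch below converts a byte count and an elapsed time into KBps and MBps. The class and method names are hypothetical and are not taken from the thesis tool, and the use of binary units (1 KB = 1024 bytes) is an assumption.

```java
// Illustrative sketch only: the class and method names are hypothetical.
final class Throughput {
    private Throughput() {}

    /** Throughput in kilobytes per second, assuming 1 KB = 1024 bytes. */
    static double kiloBytesPerSecond(long bytes, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsed time must be positive");
        }
        double seconds = elapsedNanos / 1_000_000_000.0;
        return (bytes / 1024.0) / seconds;
    }

    /** Throughput in megabytes per second, assuming 1 MB = 1024 KB. */
    static double megaBytesPerSecond(long bytes, long elapsedNanos) {
        return kiloBytesPerSecond(bytes, elapsedNanos) / 1024.0;
    }

    public static void main(String[] args) {
        // Example: 64 MB transferred in 2 seconds.
        long bytes = 64L * 1024 * 1024;
        long elapsedNanos = 2_000_000_000L;
        System.out.println(kiloBytesPerSecond(bytes, elapsedNanos) + " KBps");
        System.out.println(megaBytesPerSecond(bytes, elapsedNanos) + " MBps");
    }
}
```

In a real measurement the elapsed time would typically come from two calls to `System.nanoTime()` placed around the I/O operation.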

A virtualized environment makes effective use of available resources, but at the same time it can impose more load on the storage infrastructure because of increased consolidation levels. An I/O command generated in a virtualized environment must pass through extra layers of processing that enable all the useful features of virtualization. It is important to understand the potential bottlenecks at various layers and make the necessary configuration changes to get optimal storage performance.

2.3 Conclusion

In this chapter, I focused on four topics: cloud computing, distributed file systems, storage virtualization, and storage performance. Having studied these technologies, we can understand the relationship between cloud computing and cloud storage, how a distributed file system works, and how cloud storage is structured.

Cloud storage is a natural extension of cloud computing and a newly developing kind of distributed file system. It integrates various advanced technologies, such as storage virtualization, storage pools, network protocols, and Web 2.0. The development of cloud storage should greatly improve the performance and capacity of data storage, and it will be applied in many areas.

With the technologies covered above, we can understand cloud storage better. In the next chapter, we focus on the performance and capacity of cloud storage. I will benchmark the cloud storage file system with the tool that I develop.

3. Modeling

3.1 Introduction

To accomplish the tasks of my internship, I designed a file system benchmark tool for benchmarking cloud storage systems. The tool can generate and measure a variety of file operations. It is developed in the Java programming language and can therefore run under different operating systems, such as Linux and Windows.

The benchmark tool can test I/O performance for the following operations: read, write, re-read, re-write, random read, random write.

3.1.1 Definitions of the Tests

(1) Read: This test measures the performance of reading an existing file.

(2) Write: This test measures the performance of writing a new file. When a new file is written, not only does the data need to be stored, but also the overhead information that keeps track of where the data is located on the storage media. This overhead is called the “metadata”. It consists of the directory information, the space allocation and any other data associated with a file that is not part of the data contained in the file.

(3) Re-Read: This test measures the performance of reading a file that was recently read. It is normal for the performance to be higher as the operating system generally maintains a cache of the data for files that were recently read. This cache can be used to satisfy reads and improves the performance.

(4) Re-write: This test measures the performance of writing a file that already exists. When a file is written that already exists the work required is less as the metadata already exists. It is normal for the rewrite performance to be higher than the performance of writing a new file.

(5) Random Read: This test measures the performance of reading a file with accesses being made to random locations within the file. The performance of a system under this type of activity can be affected by several factors, such as the size of the operating system's cache, the number of disks, seek latencies, and others.

(6) Random Write: This test measures the performance of writing a file with accesses being made to random locations within the file. Again, the performance of a system under this type of activity can be affected by several factors, such as the size of the operating system's cache, the number of disks, seek latencies, and others.

From these definitions, we can see that the tool can test a file system in six ways. We can choose which tests to run and how to parameterize them. For example, we can focus on only the “Read” and “Write” I/O operations, and we can specify the “file size” and “block size” for the benchmark.
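The six operations defined above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the implementation described later in the thesis: the class name `FileIoSketch`, its method names and the `seed` parameter are hypothetical, random accesses are assumed to be block-aligned, and "rewrite" is approximated by truncating and overwriting an existing file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

// Illustrative sketch only: class and method names are hypothetical.
final class FileIoSketch {
    private FileIoSketch() {}

    private static void checkBlockSize(int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
    }

    /** Writes fileSize bytes in blocks of blockSize and returns the elapsed nanoseconds. */
    static long timedWrite(Path path, long fileSize, int blockSize) throws IOException {
        checkBlockSize(blockSize);
        byte[] block = new byte[blockSize];
        long start = System.nanoTime();
        // Creates the file, or truncates and overwrites it if it already exists.
        try (OutputStream out = Files.newOutputStream(path)) {
            for (long remaining = fileSize; remaining > 0; ) {
                int n = (int) Math.min(blockSize, remaining);
                out.write(block, 0, n);
                remaining -= n;
            }
        }
        return System.nanoTime() - start;
    }

    /** Reads the whole file sequentially and returns the elapsed nanoseconds. */
    static long timedRead(Path path, int blockSize) throws IOException {
        checkBlockSize(blockSize);
        byte[] block = new byte[blockSize];
        long start = System.nanoTime();
        try (InputStream in = Files.newInputStream(path)) {
            while (in.read(block) != -1) {
                // The data is discarded; only the transfer time matters.
            }
        }
        return System.nanoTime() - start;
    }

    /** Reads or writes count blocks at random block-aligned offsets of an existing file. */
    static long timedRandomAccess(Path path, int blockSize, int count, boolean write, long seed)
            throws IOException {
        checkBlockSize(blockSize);
        byte[] block = new byte[blockSize];
        Random random = new Random(seed);
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), write ? "rw" : "r")) {
            long blocks = Math.max(1, file.length() / blockSize);
            long start = System.nanoTime();
            for (int i = 0; i < count; i++) {
                file.seek((long) (random.nextDouble() * blocks) * blockSize);
                if (write) {
                    file.write(block);
                } else {
                    file.read(block);
                }
            }
            return System.nanoTime() - start;
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("benchmark", ".dat");
        int blockSize = 4 * 1024;
        long fileSize = 1024 * 1024;
        System.out.println("write (ns):        " + timedWrite(path, fileSize, blockSize));
        System.out.println("rewrite (ns):      " + timedWrite(path, fileSize, blockSize));
        System.out.println("read (ns):         " + timedRead(path, blockSize));
        System.out.println("re-read (ns):      " + timedRead(path, blockSize));
        System.out.println("random read (ns):  " + timedRandomAccess(path, blockSize, 100, false, 1L));
        System.out.println("random write (ns): " + timedRandomAccess(path, blockSize, 100, true, 1L));
        Files.delete(path);
    }
}
```

Note that this sketch does not force data to disk before stopping the clock, so the operating system's cache influences the timings, which is consistent with the re-read behavior described in the definitions above.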

3.1.2 Benchmark Functions

The benchmark tool contains three main functions: “All Test”, “Specific Test” and “Throughput Test”. They are designed to make it easier to benchmark a cloud storage system: users can observe the capability and performance of the file system in different ways, and can easily compare these characteristics across various systems. The three functions are described below.

(1) All Test

This function is quite general. It includes all kinds of tests: “Read Test”, “Write Test”, “Reread Test”, “Rewrite Test”, “Random Read Test”, and “Random Write Test”.

To run these six tests and produce the data we expect, we can enter various parameters, such as “Maximum Size of file”, “Minimum Size of file”, “Maximum Size of block”, “Minimum Size of block” and “Test Path”. With different parameters, we obtain different results from the same file system. In this way, we can observe the performance both broadly and accurately.

When the tests have finished, all results are recorded in a “.txt” file, which makes it convenient to analyze the results and compare them with those of other file systems.
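One way to picture how “All Test” might walk through its parameter ranges is the sketch below, which enumerates (file size, block size) combinations between the given minimum and maximum values. The doubling step, the rule that a block is never larger than the file, and all names (`AllTestSweep`, `Combination`, `combinations`) are assumptions made for illustration; the text above does not specify how the tool steps through the ranges.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names and the doubling step are assumptions.
final class AllTestSweep {
    private AllTestSweep() {}

    /** One (file size, block size) combination, both in bytes. */
    static final class Combination {
        final long fileSize;
        final int blockSize;

        Combination(long fileSize, int blockSize) {
            this.fileSize = fileSize;
            this.blockSize = blockSize;
        }

        @Override
        public String toString() {
            return "file=" + fileSize + " B, block=" + blockSize + " B";
        }
    }

    /** Enumerates combinations by doubling both sizes from their minimum to their maximum. */
    static List<Combination> combinations(long minFile, long maxFile, int minBlock, int maxBlock) {
        if (minFile <= 0 || minBlock <= 0 || minFile > maxFile || minBlock > maxBlock
                || maxFile > Long.MAX_VALUE / 2) {
            throw new IllegalArgumentException("invalid size range");
        }
        List<Combination> result = new ArrayList<>();
        for (long file = minFile; file <= maxFile; file *= 2) {
            // A block larger than the file would never be filled, so it is skipped.
            for (long block = minBlock; block <= maxBlock && block <= file; block *= 2) {
                result.add(new Combination(file, (int) block));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Example ranges: files from 64 KB to 256 KB, blocks from 4 KB to 128 KB.
        for (Combination c : combinations(64 * 1024, 256 * 1024, 4 * 1024, 128 * 1024)) {
            System.out.println(c);
        }
    }
}
```

Each combination would then be passed to the individual tests, and one result line per combination could be appended to the “.txt” report.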

(2) Specific Test

Compared with “All Test”, “Specific Test” is more targeted. Instead of giving ranges of file size and block size, users set specific values for the file size and block size. “All Test” takes a long time to run, whereas “Specific Test” takes much less time because it focuses only on the given file size and block size. This makes it an efficient way for users to benchmark file systems.

(3) Throughput Test

This function benchmarks the capability of a file system by creating multiple threads. For example, in a read test, several threads are created at the same time and perform read operations concurrently. The benchmark tool then calculates the average and total transfer speed of the reads. This function can be used for stress testing, which helps to clarify the capacity of the file system.
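The multi-threaded read described above could be sketched with a fixed thread pool, as below. This is only an illustration under assumptions: the names `ThroughputTestSketch`, `Result`, `concurrentRead` and `readSpeedMBps` are hypothetical, every thread reads the same file, the "total" speed is taken to be the sum of the per-thread speeds, and the "average" is that sum divided by the number of threads. The actual tool may define these figures differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: names and the definition of "total" are assumptions.
final class ThroughputTestSketch {
    private ThroughputTestSketch() {}

    /** Total and average read speed over all threads, in MB/s. */
    static final class Result {
        final double totalMBps;
        final double averageMBps;

        Result(double totalMBps, double averageMBps) {
            this.totalMBps = totalMBps;
            this.averageMBps = averageMBps;
        }
    }

    /** Reads the whole file once and returns the observed speed in MB/s. */
    static double readSpeedMBps(Path path, int blockSize) throws IOException {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        byte[] block = new byte[blockSize];
        long bytes = 0;
        long start = System.nanoTime();
        try (InputStream in = Files.newInputStream(path)) {
            int n;
            while ((n = in.read(block)) != -1) {
                bytes += n;
            }
        }
        long elapsed = Math.max(1, System.nanoTime() - start);
        return (bytes / (1024.0 * 1024.0)) / (elapsed / 1_000_000_000.0);
    }

    /** Starts the given number of reader threads together and combines their speeds. */
    static Result concurrentRead(Path path, int threads, int blockSize)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch startSignal = new CountDownLatch(1);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                futures.add(pool.submit(() -> {
                    startSignal.await(); // wait until every task has been submitted
                    return readSpeedMBps(path, blockSize);
                }));
            }
            startSignal.countDown(); // release all readers at the same moment
            double total = 0;
            for (Future<Double> future : futures) {
                total += future.get();
            }
            return new Result(total, total / threads);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path path = Files.createTempFile("throughput", ".dat");
        Files.write(path, new byte[4 * 1024 * 1024]);
        Result result = concurrentRead(path, 4, 64 * 1024);
        System.out.println("total MB/s:   " + result.totalMBps);
        System.out.println("average MB/s: " + result.averageMBps);
        Files.delete(path);
    }
}
```

The latch is used so that the readers start together; without it, early tasks could finish before later ones begin, and the run would no longer stress the file system with concurrent requests.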

3.2 UML

In this section I present the modeling of my project with UML, using three kinds of diagram: the Use Case Diagram, the Class Diagram, and the Sequence Diagram.

3.2.1 Use Case Diagram

The benchmark tool provides users with three ways of testing: All Test, Specific Test, and Throughput Test.