Virtual machine image clustering deduplication algorithm based on improved Simhash

Authors: ZHANG Can-Yang (College of Computer Science, Sichuan University); LIU Xiao-Jie (College of Cyberspace Security, Sichuan University)

Received: 2019-05-08. Year, Volume(Issue): Pages: 2020, 57(1): 57-65

Journal: Journal of Sichuan University (Natural Science Edition)

Keywords: Cloud computing; Virtual machine image; Data storage; Deduplication

Funding: National Key Basic Research Program of China; National Natural Science Foundation of China; others

Abstract

In cloud environments, traditional physical servers are gradually being replaced by virtual machines of all kinds, and the storage occupied by virtual machine images hosted in cloud data centers has grown dramatically. How to manage these image files efficiently has become one of the research hotspots in cloud computing. A virtual machine image contains a large number of blank, duplicate data blocks, which leads to a high degree of redundancy within a single image. In addition, different virtual machine images may run the same operating system and applications, so a large amount of duplicate data also exists across images. An efficient deduplication method is therefore particularly important for reducing the storage footprint of images. For massive numbers of virtual machine images, traditional deduplication strategies incur huge time overhead and consume large amounts of memory and CPU resources, which degrades data center performance. This paper proposes a multi-level deduplication method for massive virtual machine images based on an improved Simhash algorithm. The method splits a complete image file into an operating system image segment and an application data image segment, extracts feature values from each part, and uses the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm to group the image segments so that segments with high similarity fall into the same class. Global deduplication is thereby decomposed into deduplication within groups that are smaller and have higher duplication rates, which allows the fingerprint index to be kept entirely in memory, greatly reduces the number of disk I/O operations, and shortens deduplication time.
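The pipeline outlined in the abstract (feature extraction, similarity grouping, then deduplication inside each group) can be illustrated with a minimal sketch. It uses the classic Simhash, not the paper's improved variant, whose details the abstract does not give. The 64-bit fingerprint width, the SHA-1 and SHA-256 hashes, the fixed 4096-byte block size, and the function names `simhash`, `hamming`, `dbscan`, and `dedup_group` are all assumptions made for illustration. The split into operating system and application data segments is omitted.

```python
import hashlib
from collections import deque

BITS = 64  # fingerprint width; an assumption, not stated in the abstract


def simhash(features):
    """Classic Simhash over an iterable of byte-string features."""
    weights = [0] * BITS
    for feat in features:
        h = int.from_bytes(hashlib.sha1(feat).digest()[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def dbscan(points, eps, min_pts):
    """Plain DBSCAN under Hamming distance; returns one label per point, -1 = noise."""
    def neighbours(i):
        return [j for j in range(len(points)) if hamming(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neigh = neighbours(i)
        if len(neigh) < min_pts:  # the neighbourhood count includes the point itself
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neigh)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbours(j)
            if len(nj) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nj)
    return labels


def dedup_group(segments, block_size=4096):
    """Deduplicate fixed-size blocks of one group with an in-memory fingerprint index.

    Returns (total_blocks, unique_blocks).
    """
    index = {}
    total = 0
    for seg in segments:
        for off in range(0, len(seg), block_size):
            block = seg[off:off + block_size]
            total += 1
            index.setdefault(hashlib.sha256(block).digest(), block)
    return total, len(index)
```

In use, each image segment would be reduced to one fingerprint with `simhash` (for example, using its blocks as features), the fingerprints would be grouped with `dbscan`, and `dedup_group` would then run once per group, so that each index only has to hold the fingerprints of a single group.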
