1.BGI-Shenzhen, Shenzhen 518083, Guangdong, China
2.Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, Guangdong, China
3.Guangdong Provincial Key Laboratory of Genome Read and Write, BGI-Shenzhen, Shenzhen 518120, Guangdong, China
4.Guangdong Provincial Academician Workstation of BGI Synthetic Genomics, BGI-Shenzhen, Shenzhen 518120, Guangdong, China
5.China National GeneBank, BGI-Shenzhen, Shenzhen 518120, Guangdong, China
6.Big Data Institute, University of Oxford, Oxford, OX3 7LF, United Kingdom
7.TAICHI AI Ltd., London, N1 7GU, United Kingdom
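PING Zhi (平质, born 1987), male, Ph.D., assistant research fellow. Research interests: synthetic biology, DNA storage, and bioinformatics analysis algorithms. E-mail: pingzhi@genomics.cn
ZHANG Haoling (张颢龄, born 1996), male, assistant research fellow. Research interests: synthetic biology, computer science, and neural networks. E-mail: zhanghaoling@genomics.com
ZHU Sha (朱砂, born 1985), male, Ph.D., senior statistical analyst at Roche (UK). Research interests: probability theory related to biological genetic models. He has long worked on computational statistical methods and software development. E-mail: sha.joe.zhu@gmail.com
SHEN Yue (沈玥, born 1986), female, Ph.D., research fellow. Research interests: synthetic biology, synthetic genomics, and the development of DNA synthesis technologies and tools. E-mail: shenyue@genomics.cn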
Received: 2020-12-01,
Revised: 2020-12-31,
Published in print: 2021-06-30
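In recent years, DNA storage has attracted wide attention for its advantages in data storage density and retention time. It is expected to serve as a new mode of information storage alongside traditional media such as optical discs and hard drives, meeting the urgent demand for massive data storage and for encrypted data storage in specialized applications. Within the DNA storage workflow, the method for converting between binary information and DNA base sequences (i.e., encoding and decoding) is the core step that links digital information technology with biotechnology. Although research on DNA storage coding has made substantial progress, there is still no system for quantitatively comparing and evaluating the compatibility of coding methods with existing upstream and downstream technologies, their adaptability to different stored files, their storage robustness, and their data security. This study therefore developed Chamaeleo, an extensible integration and evaluation platform for DNA storage coding methods. Chamaeleo integrates previously developed coding methods in a modular way, analyzes and evaluates them systematically and quantitatively, and can output the preferred coding scheme for different file types. Chamaeleo is open source, so that new coding methods and evaluation metrics can continue to be added, promoting open exchange in the field and its standardized, orderly development.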
The emerging field of DNA-based data storage has attracted considerable interest because of the enormous potential of DNA as a high-density, durable storage medium. Compared with traditional magnetic, optical, and electronic storage media, DNA is considered a promising solution to the global demand for storing a rapidly growing amount of data. In addition, DNA storage adds an extra layer of protection for the stored information, because its coding and decoding process relies on the combined use of DNA synthesis and sequencing technologies, which are not as widely available as conventional information and communication technologies. Transcoding between binary digital data and quaternary DNA sequences is the most important step in the whole process of DNA-based data storage. Several coding methods have been developed in different programming languages over the past decades; however, their differing software architectures and parameters make it difficult to compare their overall performance. This makes it hard for researchers to develop the methods further and for users to compare them and choose a suitable one. In this study, we introduce an integrated evaluation platform, "Chamaeleo", to address these issues. One key feature of Chamaeleo is that it integrates existing coding schemes and modularizes functions, including data handling, transcoding, index operations, and error correction, in a user-friendly design. The other key feature is its ability to evaluate a coding scheme both qualitatively and quantitatively. A set of widely recognized and accepted indexes is used to evaluate compatibility with DNA writing and reading technologies, robustness against introduced errors or data loss, and the complexity of the transcoding rules. Considering the rapid advancement of this field, Chamaeleo is designed as an open-source platform into which researchers can incorporate new coding schemes and evaluation indexes, encouraging the community to contribute together to shaping the future of DNA-based data storage.
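To make the idea of transcoding and quantitative evaluation concrete, the following is a minimal sketch rather than Chamaeleo's actual interface. It assumes a hypothetical fixed mapping of two bits per nucleotide, and the function names (`encode`, `decode`, `gc_content`, `max_homopolymer`, `information_density`) are illustrative choices, not names taken from the platform. GC content and homopolymer length are shown as examples of indexes commonly used to judge compatibility with DNA synthesis and sequencing; the abstract does not specify which indexes Chamaeleo itself uses.

```python
from itertools import groupby

# Hypothetical fixed mapping (an assumption for illustration only):
# every pair of bits becomes one nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Convert binary data into a DNA string, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    """Invert encode(): map each base back to two bits, then regroup into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def gc_content(dna: str) -> float:
    """Fraction of G and C bases; assumes a non-empty sequence."""
    return sum(base in "GC" for base in dna) / len(dna)


def max_homopolymer(dna: str) -> int:
    """Length of the longest run of one repeated base; assumes a non-empty sequence."""
    return max(len(list(run)) for _, run in groupby(dna))


def information_density(data: bytes, dna: str) -> float:
    """Bits of payload stored per nucleotide."""
    return 8 * len(data) / len(dna)


if __name__ == "__main__":
    payload = b"Hi"
    dna = encode(payload)
    print(dna, gc_content(dna), max_homopolymer(dna), information_density(payload, dna))
    assert decode(dna) == payload
```

A direct mapping like this reaches 2 bits per nucleotide but places no constraint on GC content or homopolymer runs, which is the kind of trade-off a quantitative comparison between coding schemes is meant to expose.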