On Monday, October 23, 2017, Professor Paul Anderson and Walter Blair, a student in the Computer Science Master’s program, presented the paper “Streamlining the Genomics Processing Pipeline via Named Pipes and Persistent Spark Datasets” with colleagues Leonard de Melo Joao and Larry Davis at the 17th IEEE International Conference on Bioinformatics and Bioengineering in Washington, DC.
In this paper, we investigate the use of Unix named pipes and an in-memory datagrid to reduce the I/O requirements of conventional and exploratory genomics processing pipelines. Apache Spark provides an in-memory framework for distributed computational genomics that has realized significant improvements over conventional pipelines in speed and flexibility. Even in the Spark framework, however, pipeline components create I/O bottlenecks by reading and writing intermediate files that are later discarded. Apache Ignite provides a framework for persisting a Spark dataset in memory between modular pipeline applications, and Unix named pipes have long provided a mechanism by which data can be transferred in memory. We compared the runtime performance of a standard genomics pipeline that transmits Spark data using named pipes and/or Ignite’s in-memory datagrid. Our results demonstrate that Ignite can improve the runtime performance of in-memory RDD actions and that keeping pipeline components in memory with Ignite and named pipes eliminates a major I/O bottleneck.
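For readers unfamiliar with named pipes, the idea can be illustrated with a minimal Python sketch. It is not the authors' pipeline and involves neither Spark nor Ignite: it only shows one stage handing records to the next through a Unix FIFO, so that the data passes through a kernel buffer rather than an intermediate file on disk. The function name `stream_through_fifo`, the file name `stage.fifo`, and the sample records are hypothetical, and the code assumes a Unix-like system where `os.mkfifo` is available.

```python
import os
import tempfile
import threading


def stream_through_fifo(records):
    """Pass text records from a producer stage to a consumer stage via a named pipe."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "stage.fifo")  # hypothetical name
    os.mkfifo(fifo_path)  # creates the pipe entry; no record data is stored on disk

    def producer():
        # Opening a FIFO for writing blocks until a reader opens the other end.
        with open(fifo_path, "w") as fifo:
            for record in records:
                fifo.write(record + "\n")

    writer = threading.Thread(target=producer)
    writer.start()
    try:
        # The consumer reads lines until the producer closes its end (EOF).
        with open(fifo_path) as fifo:
            received = [line.rstrip("\n") for line in fifo]
    finally:
        writer.join()
        os.remove(fifo_path)
        os.rmdir(workdir)
    return received


if __name__ == "__main__":
    # Hypothetical tab-separated records standing in for pipeline output.
    print(stream_through_fifo(["chr1\t100\tA", "chr2\t200\tG"]))
```

In a real pipeline the producer and consumer would usually be separate processes, such as two command-line tools, each opening the same pipe path as if it were an ordinary file.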
The aim of this year’s IEEE International Conference on Bioinformatics and Bioengineering is to build synergy between bioinformatics and bioengineering/biomedicine, two complementary disciplines that hold great promise for the advancement of research and development in complex medical and biological systems, agriculture, the environment, public health, and drug design. Research and development in these areas are influencing science and technology in fields such as medicine, food production, and forensics, among others, through advancing fundamental concepts in molecular biology, understanding living organisms at multiple levels, developing innovative implants and bio-prosthetics, and improving tools and techniques for the detection, prevention, and treatment of diseases.
The BIBE series provides a common platform for the cross-fertilization of ideas and for shaping knowledge and scientific achievements by bringing these two important, complementary disciplines together in an interactive forum.
The BIBE conference series was founded in 2000 and is the first leading conference of its kind.