Scientific data crunching on the cloud

October 6th, 2010 | by Aaron Tan

Cloud computing is becoming an attractive option for scientists who require massive computing power to run experiments in high-energy physics and astronomy, among other scientific fields.

In 2008, the U.S. National Aeronautics and Space Administration (NASA) launched its Nebula cloud computing platform, which combines existing and newly developed open-source software to provide high-performance, instantly available computing, storage, and network resources to NASA’s research community. This eliminates the need to construct expensive data centers whenever scientists require additional processing capacity.

The U.S. National Science Foundation is also partnering with Microsoft to give American researchers free access to Redmond’s Azure cloud computing platform for scientific data crunching. Announced in February this year, the project aims to help researchers cope with exploding volumes of data that must be analyzed to yield meaningful results.

Outside the U.S., science researchers are starting to use the Amazon EC2 cloud computing platform to generate and process large data sets. In the Asia-Pacific region, at least one group of researchers in a new phase of the Belle project has been using Amazon EC2 to augment its own grid computing infrastructure.

The Belle project, whose earlier observations of large matter-antimatter asymmetry helped earn Japanese physicists the 2008 Nobel Prize in Physics, is run by the KEK High Energy Accelerator Research Organization near Tokyo.

According to a KEK press statement on its Nobel Prize win, “symmetry means that a physical situation will be unchanged under certain transformations.

“One of the important examples of broken symmetry arose immediately after the Big Bang. The most basic form of the Big Bang Theory predicts that equal amounts of matter and antimatter should have been created in the early stages of the universe. They should have annihilated each other, but some matter remained. This broken symmetry is thought to be responsible for the visible parts of the universe.”

In Belle II, the next phase of the project, researchers will aim to identify the reasons for this asymmetry. According to Professor Martin Servoir from the University of Melbourne in Australia, the amount of data generated in this phase is likely to be 50 times greater than before.

Servoir is among 343 researchers from 13 countries, including China, Germany, India, Japan, Korea, Russia, and the United States, who are jointly working on the project.

“What we’d like to do is go one step further and win a Nobel Prize for ourselves,” Servoir said. “We need to make measurements that are more precise than what we’ve done so far, and discover something that no one has seen before,” he added.

Generating the data required for this new project phase is no small feat: the researchers will need 50 times the computing power they currently have, Servoir said.

The project team is now preparing for the grand experiment, which won’t be up and running until 2014. “Although Moore’s Law is great, there is not enough time to get over the 50-times-more-computing-power problem,” Servoir said.
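Servoir’s arithmetic holds up under the usual rule of thumb. A quick back-of-the-envelope sketch, assuming the common formulation that computing performance doubles roughly every two years:

```python
# Back-of-the-envelope check: can Moore's Law alone deliver a 50x
# increase in computing power between 2010 and 2014?
# Assumes the common rule of thumb of a doubling every ~2 years.

def moores_law_growth(years: float, doubling_period: float = 2.0) -> float:
    """Expected performance multiplier after `years` years."""
    return 2.0 ** (years / doubling_period)

growth_by_2014 = moores_law_growth(2014 - 2010)   # 4 years -> ~4x
shortfall = 50 / growth_by_2014                   # factor still missing

print(f"Moore's Law alone: ~{growth_by_2014:.0f}x by 2014")
print(f"Remaining gap to 50x: ~{shortfall:.1f}x")
```

Even with an optimistic doubling every two years, four years buys only about a 4x gain, leaving a roughly 12x gap that the project proposes to close with rented cloud capacity instead.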

“We’re looking to find computing resources as quickly and cheaply as we can, and have turned to Amazon EC2 as a means of providing those resources,” he added.

On Amazon EC2, Servoir is able to replicate the project’s grid computing infrastructure on some 250 instances — or units of compute resources — that are fully accessible to his own in-house systems.

Additionally, “storage elements” were created on EC2 instances to hold the temporary data required to generate more data. This helps ensure efficient use of cloud computing resources, which Amazon bills by the hour.
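Because EC2 bills by the instance-hour, idle capacity costs the same as busy capacity, which is why staging intermediate data next to the compute matters. A minimal cost sketch (the hourly rate and utilization figure below are hypothetical illustrations, not numbers from the project):

```python
# Rough cost model for an hourly-billed EC2 fleet.
# The $0.10/hour rate and 60% utilization are hypothetical
# illustrations, not the Belle project's actual numbers.

def billed_cost(instances: int, hours: float, hourly_rate: float) -> float:
    """Total spend: every wall-clock instance-hour is billed."""
    return instances * hours * hourly_rate

def cost_per_useful_hour(instances: int, hours: float,
                         hourly_rate: float, utilization: float) -> float:
    """Idle time is still billed, so the effective price of useful
    compute rises as utilization drops."""
    useful_hours = instances * hours * utilization
    return billed_cost(instances, hours, hourly_rate) / useful_hours

# A 250-instance fleet (the size mentioned above) for one week:
print(f"Billed: ${billed_cost(250, 24 * 7, 0.10):,.0f}")
print(f"Effective $/useful hour at 60% busy: "
      f"${cost_per_useful_hour(250, 24 * 7, 0.10, 0.6):.3f}")
```

The point of the sketch is that halving utilization doubles the effective price of each useful compute hour, so keeping instances fed with nearby temporary data pays for itself.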

Currently, all the data generated on Amazon EC2 instances is sent back to the project’s own computing systems rather than stored in the cloud. This partly alleviates the security concerns shared by most cloud computing adopters.

Besides, “the data generated is sufficiently difficult to analyze that we couldn’t imagine anyone else who might be interested in it,” Servoir said.

In any case, security is ensured by using a secure protocol to move data around, as well as “a secure version of Linux that is very hard to hack”, Servoir revealed.

“We were also invited to look at the extra security measures available on EC2, which go well above and beyond what we used to have and are only available in virtualized environments hosted by Amazon.

“That extra layer totally blew any concern out of the water,” he added.

While generating all that scientific data to answer some of life’s biggest questions is laudable, more work needs to be done on the long-term preservation of scientific data, which is increasingly created in diverse, largely proprietary formats that may not exist forever.

For now, data curation and preservation remain an afterthought in most scientific endeavors. Servoir, for one, acknowledges the problem, but when quizzed, passes the buck to colleagues specializing in data preservation. Data preservation, in fact, should start right at the beginning, with data creation, since only researchers directly involved in a project know the context in which its data was created.

As Carole Palmer, director of the Center for Informatics Research in Science and Scholarship at the University of Illinois, noted in ScienceDaily, data curation and preservation “is about doing good science, and ensuring the science is reproducible, but it’s also about finding ways to bring data together from different sources and using them in new ways.”

“To replicate a study or re-use data you have to know where a data set came from and how it’s been processed. Tracking all the context and transformations is part of the curation process,” she said.
