This blog post was co-authored by Joseph Sirosh, Corporate Vice President, and Rimma V. Nehme, Principal Software Engineer, at the Data Group at Microsoft.
Figure 1: Visible objects of the Sloan Digital Sky Survey (SDSS) DR7 dataset.
Over the last 15 years a database helped revolutionize an entire field of science. Astronomical discovery and sophisticated analyses of properties of the aggregate universe was turbocharged by a vast public mapping effort of the sky, called the Sloan Digital Sky Survey, whose data was served in a public database built with Microsoft SQL Server. This was the first in the field and opened up an entirely new window into the Universe.
Figure 2: The Fourth Paradigm: Data-Intensive Scientific Discovery book dedicated to Jim Gray.
Now scientists in every field, from astronomy to zoology, are recognizing that the rate of accumulation of data in their fields are greatly outstripping the rate of accumulation of interpretation, i.e. the rate at which the scientific community can assimilate data into an interpretive framework. And there’s widespread recognition that powerful scientific discoveries lie hidden in such massive data. The Fourth Paradigm of scientific discovery, driven by novel techniques for analyzing massive data, is a driving force in science like never before.
Sloan Digital Sky Survey: The Cosmic Genome Project
It all started in the early 90’s when Dr. Alex Szalay together with the late Dr. Jim Gray took on a daring endeavor to build what could be called the first “DataScope” – an efficient data intensive computing infrastructure for astronomers called the Sloan Digital Sky Survey (SDSS) using Microsoft SQL Server as the back-end database.
Figure 3: Jim Gray, Alex Szalay and other astronomers at Super Computing 2003
SDSS had a bold goal – to create a map of the universe in a database for exploration by all. It is often referred to as the Cosmic Genome Project. A dedicated 2.5-m-diameter telescope in New Mexico used a 120-megapixel camera to image more than one-quarter of the entire night sky, 1.5 square degrees of sky at a time, about eight times the area of the full Moon, both inside and outside of the Milky Way, and helped create a three-dimensional (3D) map of millions of galaxies, quasars and stars.
The SDSS maps sparked a revolution in the way astronomy is practiced. No longer did scientists have to wait months for access to a telescope to learn about the night sky; instead, entire research projects could be accomplished by querying the online database. The SDSS made its entire data set available through SkyServer database – an online portal for public use, and invited volunteer contributions to scientiﬁc research. Prior to SDSS, only the leading scientists and astronomers had telescopes and instruments to collect data for serious research, with most others largely excluded from direct and active engagement with astronomy. Now, with access to the visual data that SkyServer offers, anyone with Internet access could explore the universe with data just as the top scientists do.
Figure 4: SDSS-IV can view the whole Milky Way
Why Microsoft SQL Server?
While building applications to study the correlation properties of galaxies, Szalay and his team have discovered that many of the patterns in their statistical analysis involved tasks that were much better performed inside the database engine than outside, on flat files. The Microsoft SQL Server gave them high-speed sequential search of complex predicates using multiple CPUs, multiple disks and large main memories. It also had sophisticated indexing and data joining algorithms far outperforming hand-written programs against flat files. Many of the multi-day batch files were replaced with database queries that ran in minutes thanks to the sophisticated query optimizer.
Figure 5: Dr. Jim Gray in front of the Sloan telescope in Apache Point, NM
The most recent version of the database has a 15TB queryable public dataset, with about 150TB additional raw and calibrated files. A recent scan of the logs showed more than 1.6 billion web hits in the past 14 years and more than four million distinct IP addresses accessing the site. The total number of professional astronomers worldwide is only about 15,000. Furthermore, the multiuser collaborative environment in SDSS called CasJobs which allows users to launch extensive analyses has more than 6,820 registered users – almost half of the professional astronomy community.
SDSS has been successful in generating new scientiﬁc discoveries, including the measurements of thousands of asteroids, maps of the complicated merger history of the outer Milky Way, and the ﬁrst detection of the baryon acoustic peak – a measurement of how structure formed from ultra-low frequency standing sound waves in the early universe. These surveys have produced data to support 5,800 papers with more than 245,000 citations. This has made SDSS one of the highest impact projects in the field of astronomy.
The amount of astronomical data in SkyServer is truly unprecedented. When the SDSS began in 1998, astronomers had data for less than 200,000 galaxies. Within five years after SDSS began, SkyServer had data on 200 million galaxies in the database. Today, the SDSS data exceeds 150 terabytes, covering more than 220 million galaxies and 260 million stars. The images alone include 2.5 trillion pixels of original raw data. SkyServer allows users to search for stars at a given position in the sky, or they can search for galaxies brighter than a certain limit. Users can also enter queries to the database in SQL directly, which allows more ﬂexible and sophisticated searches.
Examples of queries users can ask in SkyServer:
- What resources are in this part of the sky?
- What is the common area of these surveys?
- Is this point in the survey?
- Give me all objects in this region
- Give me all “good” objects (exclude “bad” areas)
- Give me the cumulative counts over areas
- Compute fast spherical transforms of densities
- Interpolate sparsely sampled functions (extinction maps, dust temperature, …)
Figure 6: SkyServer portal
Another project that SDSS data access has enabled is a “citizen science” website, called Galaxy Zoo, where Internet volunteers have classified galaxies using SDSS images. Typically, astronomers used to classify galaxies by eye. If you have 200 million galaxies, on average at three per minute, classification would take 600 million minutes or 1142 years of 24 hours per day, seven days per week. Galaxy Zoo was the first astronomy crowdsourcing portal which allowed private citizens to look at data by eye, and contribute classifications to scientists in a much shorter time.
Figure 7: Hanny’s Voorwerp. The mass (shown here in green) is a new cosmic object discovered by a Dutch school teacher, an astronomy novice, while using Galaxy Zoo.
There have been a number of scientific discoveries using Galaxy Zoo including determination of the relation between the morphology of galaxies and their environment and the discovery by a Dutch school teacher of Hanny’s Voorwerp – a very rare type of astronomical object called a quasar ionization echo. These discoveries would not have been possible without the participation of thousands of Galaxy Zoo volunteers – between them, they have visually classiﬁed over 40 million galaxies to date.
From SkyServer to SciServer: Big Data infrastructure for science
A new effort called SciServer, a descendant from SkyServer, aims to go beyond astronomy and build a long-term, flexible ecosystem for scientists to provide access to the enormous data sets from observations and simulation to enable collaborative research. SciServer aims to meet the challenges of Big Data in scientific world. By building a common infrastructure, the goal is to create data access and analysis tools useful to all areas of science. Led by Alex Szalay, the work on SciServer will deliver significant benefits to the scientific community by extending the infrastructure developed for SDSS astronomy data to many other areas of science.
Figure 8: SciServer: A collaborative research environment for large-scale data-driven science.
The approach in designing SciServer is the same as in the SkyServer: bring the analysis to the data. This means that scientists can search and analyze Big Data without downloading terabytes or petabytes of data, resulting in much faster processing times. Bringing analysis to data also makes it much easier to compare and combine datasets allowing researchers to discover new and surprising connections between data and make experiments more reproducible.
To help ease the burden on researchers, the team developed SciDrive, a cloud data storage system for scientific data that allows scientists to upload and share data using a Dropbox-like interface. The interface automatically reads the data into a database, and one can search online and cross-correlate with other data sources. SciDrive tries to address the “long tail” of a huge number of small data sets that scientists have. The goal is to try bring many small, seemingly unrelated data to a single place and see if new value emerges. People can simply drag and drop (and share) their data without any metadata required.
In the heart of it all is SQL Server
SDSS team in collaboration with Jim Gray took on the enormous task of putting all of the astronomy data into SQL Server database, preserving as much provenance as possible, and making the data as accessible and query-able as possible.
Database logical design
Figure 9: Dr. Alex Szalay
The processed image data were stored in databases. The logical database design consisted of photographic and spectrographic objects. They were organized into a pair of snowflake schemas. Sub-setting views and many indices gave convenient and fast access to the conventional subsets (such as stars and galaxies). Procedures and indices were defined to make spatial lookups even more convenient and faster.
Database physical design
SkyServer initially took a simple approach to database design (see Figure 11 below) and it worked right from the beginning. The design counted on the SQL storage engine and the query optimizer to make all the intelligent decisions about data layout and data access. As Alex Szalay put it: “Great query optimizer made all the difference. Even ‘the worst’ query plans were actually quite good!”
Figure 11: The photoObj table at left is the center of one star schema describing photographic objects. The specObj table at right is the center of a star schema describing spectrograms and the extracted spectral lines. The photoObj and specObj tables are joined by objectId. Not shown are the dataConstants table that names the photoObj flags and tables that support web access and data loading.
“Indexing the Sky”
To speed up the access, the base tables were heavily indexed (these indices also benefited view access). In addition to the indices, the database design includes a fairly complete set of foreign key declarations to insure that every profile has an object; every object is within a valid field, and so on. The design also insisted that all fields were non-null. These integrity constraints were invaluable tools in detecting errors during loading and they aided tools that automatically navigated the databases.
Beyond the file group striping (to automatically get the sum of the disk bandwidths without any special user effort), SkyServer used, for the most part, all of the SQL Server default values; there was not much special tuning. This is the hallmark of SQL Server – the system aims to have the out-of-the box performance to be great, and the SkyServer project has been a true testimonial to that goal.
Spatial data access
Figure 12: Hierarchical triangular mesh
“Spatial was special.” Astronomers are particularly interested in executing spatial queries to obtain galactic clustering and large-scale structure of the universe. The common theme in SDSS experience was that it was possible to embed spatial concepts in a relational framework in a very simple manner. To make spatial area queries run quickly, SDSS team integrated the hierarchical triangular mesh (HTM) code with the SQL Server, which became a new “spatial access method” in the engine. HTM is a method to subdivide the surface of a sphere into spherical triangles of similar, but not identical, shapes and sizes. It is basically a quad-tree that is particularly good at supporting searches at different resolutions, from arc seconds to hemispheres. The HTM library was an external stored procedure wrapped in a table-valued stored procedure spHTM_Cover(<area>).
So all the users had to do was to simply invoke the procedure call similar to this: select * from spHTM_Cover(‘Circle J2000 12 5.5 60.2 1’) which would return the table with four rows, each row defining the start and end of a 12-deep HTM triangle like below.
Another optimization technique used by SkyServer was the zoning idea (segmenting space into zone buckets and then segmenting zones by an offset). The main idea behind zoning was to try to push the logic entirely into SQL (the zone code was all native to SQL), which allowed the query optimizer to do a very efficient job at filtering the objects. In particular, the zone design gave a three-fold speedup for the table-valued functions.
integration of .NET common language runtime (CLR) with SQL Server in 2005 enabled astronomers to implement user code that runs inside the database server process. CLR was in particular a very important feature to SDSS as it gave astronomers the ability to write astronomy-specific logic in the form of user-defined functions, aggregates and stored procedures to build critical science functionality and run the compiled code in the database. As Alex put it, “Support for object-oriented types made a dramatic change for SkyServer.”
Astronomers wanted a tool that would be able to quickly answer questions like: “find asteroid candidates” or “find other objects like this one”, which originally gave the motive to build the SQL-based backend. Indeed, right from the beginning Jim Gray asked Alex Szalay to define 20 typical queries astronomers might want to ask and then together they designed the SkyServer database to answer those queries. The anecdote is that the conversation went as follows:
Jim: What are the 20 questions you want to ask?
Alex: Astronomers want to ask anything! Not just 20 queries.
Jim: Ok, start with 5 queries.
[it took Alex 30 minutes to write them all down]
Jim: Ok, add another 5 queries.
[it took Alex 1 hour to write them all down]
Jim: Ok, now add another 5 queries.
[Alex gave up and went home to think about them]
Alex (said later): In 1.5 hours, Jim taught me a lot of humility!
Alex (said later): It also taught us the importance of long-tail distribution and how to prioritize.
The queries corresponded to typical tasks astronomers would do. Translating the queries into SQL required a good understanding of astronomy, a good understanding of SQL, and a good understanding of the databases. As Alex put it: “We were surprised and pleased to discover that all 20 queries had fairly simple SQL equivalents.” Below is one of the query examples used in SkyServer to detect asteroids:
Q: Provide a list of moving objects consistent with an asteroid.
select objID, -- return object ID sqrt( power(rowv,2) + power(colv, 2) ) as velocity, -– velocity dbo.fGetUrlExpId(objID) as Url -- url of image to examine it. into ##results from PhotoObj -- check each object. where (power(rowv,2) + power(colv, 2)) between 50 and 1000 -- square of velocity and rowv >= 0 and colv >=0 -- negative values indicate error
This is a sequential scan of the PhotoObj table to evaluate the predicate on each of the objects. It finds asteroid candidates. Here is a picture of one of such objects:
Above query returns ‘slow moving’ objects. To find fast moving objects one can write a slightly different query which looks for streaks in the sky that line up. These streaks are not close enough to be identified as a single object.
SELECT r.objID as rId, g.objId as gId, dbo.fGetUrlExpEq(g.ra, g.dec) as url FROM PhotoObj r, PhotoObj g WHERE r.run = g.run and r.camcol=g.camcol _ and abs(g.field-r.field)<2 -- nearby -- the red selection criteria and ((power(r.q_r,2) + power(r.u_r,2)) > 0.111111 ) and r.fiberMag_r between 6 and 22 and r.fiberMag_r < r.fiberMag_g and r.fiberMag_r < r.fiberMag_i and r.parentID=0 and r.fiberMag_r < r.fiberMag_u and r.fiberMag_r < r.fiberMag_z -- the green selection criteria and ((power(g.q_g,2) + power(g.u_g,2)) > 0.111111 ) and g.fiberMag_g between 6 and 22 and g.fiberMag_g < g.fiberMag_r and g.fiberMag_g < g.fiberMag_i and g.fiberMag_g < g.fiberMag_u and g.fiberMag_g < g.fiberMag_z and g.parentID=0 -- the matchup of the pair and sqrt(power(r.cx -g.cx,2)+ power(r.cy-g.cy,2)+power(r.cz-g.cz,2))*(10800/PI())< 4.0 and abs(r.fiberMag_r-g.fiberMag_g)< 2.0_ And you can also add a third query select top 10 ra, dec, (rowv*rowv + colv*colv ) as velocityVector, * from PhotoObj where -- object SATURATED | BRIGHT | BLENDED and object DEBLENDED_AS_MOVING (flags & ( cast(0x0000000000040000 as bigint) | cast(0x0000000000000002 as bigint) | cast(0x0000000000000008 as bigint) ) ) = 0 AND (flags & cast(0x0000000100000000 as bigint)) > 0 -- PSF magnitude / psfCount r r range between 14.5 and 21.5 AND type = 6 AND (psfMag_r > 14.5) and (psfMag_r < 21.5) -- veolocity vector larger than 0.05 deg/day and smaller than 0.5 deg/day. AND (rowv*rowv + colv*colv > 0.0025) AND (rowv*rowv + colv*colv < 0.25) And dec > -1.25 AND dec < 1.25 -- Limit to specific part of the Stripe-82 region AND (ra > 300 or ra < 60) order by (rowv*rowv + colv*colv ) desc
Here is a picture of one of such ‘faster moving’ objects:
When asked about T-SQL, one of the astronomers said that it was ‘almost like English’ to them, and they could easily understand what was going on. Another astronomer put it: “SQL can serve as a ‘helpdesk’ – if somebody has a problem, another person can answer the question when query is sent to them.” A graphical query plan that’s viewable before submitting an MS-SQL query provided details on which query steps would take the largest fraction of execution time and—in most cases— gave users all the information necessary to improve query performance.
The configuration for multiple release support in SDSS is shown in Figure 12 below. DR12 (the latest release) DB servers have the following hardware configuration today:
- Total data size: 12 TB
- Number of filegroups: Two (Primary has 8 files, Secondary has one file, see Figure 13)
- Servers: Four identical nodes with one copy of DB on each
- System manufacturer: Supermicro
- System type: x64-based PC
- Processor(s): Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (2 processors)
- Logical cores: 32
- Physical CPUs: 2
- Total physical memory: 128 GB
- HDD size: 3.0 TB
- Total HDDs: 24
Figure 13: DR12 Hardware configuration at JHU for SDSS Servers.
In the production cluster, there are three to four DB servers per release, so that the public and collaboration users can be adequately supported, and the queries can be load-balanced on different boxes. Quick and long queries are pointed to separate servers.
Figure 14: DR12 database properties
“When we started working on SDSS, we had fun, and we were hoping people will use it. Working with SQL Server was fun, but astronomy is also fun” – said Alex Szalay. Science is increasingly driven by data (big and small), combined with changing sociology – surveys today are analyzed by individuals from all over the world, not just by a few experts. The move from hypothesis-driven to data-driven science is a reality, and the SQL Server-powered SDSS is the first true “telescope” for data that has made the vision of a ‘DataScope’ a reality.
Joseph & Rimma
Follow Joseph on twitter @josephsirosh, and Rimma at @rimmanehme.