Interviews by Stephen Ibaraki, I.S.P.
Mike Stonebraker: World-renowned Database Legend
This week, Stephen Ibaraki, ISP, has an interview with Mike Stonebraker.
Dr. Stonebraker, widely acknowledged as the
world’s foremost database expert, brings a long history of outstanding database
research, entrepreneurial achievement and experience to his ventures. This
year, Dr. Stonebraker received the 2005 IEEE John von Neumann Medal,
which is the IEEE’s most prestigious technical honor.
Until 2000, Dr. Stonebraker was an electrical engineering and computer science professor at the University of California at Berkeley. He is currently a professor at MIT. Dr. Stonebraker has held visiting professorships at the Pontifícia Universidade Católica (PUC), Rio de Janeiro, Brazil; the University of California, Santa Cruz; and the University of Grenoble, France. Additional professional activities include: leading alternative data management strategies for NASA’s Earth Observing System; Chairman, Technology Council, Science Tools Corporation; General Chairman and other significant positions for SIGMOD from 1982; member of the Technical Advisory Committee for Citicorp, DB Software, and Bull; and member of SIMC’s (Security Industry Middleware Council, Inc.) Board of Directors.
Q: Mike, with your remarkable research and entrepreneurial history, we are indeed fortunate to have you for this interview. Congratulations on your John von Neumann Medal, a particularly signature achievement! Can you comment?
A: I am thrilled to be the 14th recipient of this prestigious prize and to thereby be included in a group of winners that includes Gordon Bell, Fred Brooks, and Carver Mead. It is indeed a special honor.
Q: What are your short, medium, and long-term strategies, goals, hopes for StreamBase?
A: The short-term goal for StreamBase is to make the first collection of customers very successful. Our medium-term goal is to change the way computer people think about streaming data and get them to realize that StreamSQL (SQL with stream extensions) is the right paradigm for real-time, low-latency stream processing. The long-term goal is to participate in the “sea change” that will be caused by cheap micro-sensor technology. This will cause everything on the planet of material significance to be sensor tagged to report its state and/or location in real time. The downstream firehose of information will be processed by engines such as ours.
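StreamSQL’s stream extensions add windowed operations to familiar SQL aggregates. As a rough illustration only (not StreamBase’s actual syntax or API), a Python sketch of a sliding-window aggregate over an arriving stream might look like this:

```python
from collections import deque

def sliding_window_avg(stream, window_size):
    """Yield the average of the last `window_size` values as each new
    value arrives -- a toy analogue of a StreamSQL windowed aggregate.
    Names and window semantics here are illustrative, not StreamBase's."""
    window = deque(maxlen=window_size)  # oldest value drops off automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# Example: a hypothetical price feed averaged over a 3-message window.
prices = [10.0, 12.0, 11.0, 13.0]
print(list(sliding_window_avg(prices, 3)))  # [10.0, 11.0, 11.0, 12.0]
```

The point of the paradigm is that the query runs continuously over data in motion, emitting a result per message rather than storing the data and querying it later.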
Q: Your company is currently focused on financial services since they have an immediate need to analyze/correlate information from multiple feeds in milliseconds and at much lower costs. One test, in which you processed 140,000 messages per second on a $1,500 PC versus 900 for a major RDBMS, illustrates the clear advantages. How do you see your technology specifically being applied to this and other areas in the future, and what competitive advantages will it bring?
A: Our stream processing engine is especially beneficial in low-latency high volume financial services applications such as feed processing, electronic trading, real time risk analysis, compliance and real-time bond pricing. Off into the future, similar opportunities exist in network management, homeland security, military applications, real-time weather alerting, and industrial process control; any place where there is a firehose of real time information that must be processed quickly. One way to think of electronic trading is to imagine a field being plowed by a collection of bulldozers. One of them turns up a nickel and if you are the quickest one to run out and get it, then you get to keep it. Our engine provides competitive advantage in such situations.
Q: You have a proven method of demonstrating the power of your StreamBase technology by solving a customer’s most difficult problems within a week. Detail a typical scenario and explain why standard solutions do not work effectively.
A: StreamSQL provides the right high level operations to build a certain class of applications very quickly. For example, one large firm subscribes to several feeds of financial trade data. They wanted to forward to their trading engines the best available data; i.e. they wanted to consolidate the feeds by passing on the first arriving data from whichever feed is most timely and then discard the late duplicates. We wrote this pilot in half a day using a total of 18 of our operations. It could process more than 100,000 messages per second. This compares very favorably with a general purpose language such as Java or C++, where development times would be measured in weeks or months.
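The feed-consolidation pilot described above is essentially first-arrival-wins deduplication across feeds. A minimal Python sketch of the idea (field names are hypothetical, and this stands in for what StreamBase expresses with its built-in operators):

```python
def consolidate(arrivals):
    """First-arrival-wins feed consolidation: pass each trade through the
    first time its ID is seen, and drop later duplicates from slower feeds.
    A toy sketch -- message fields are made up for illustration."""
    seen = set()
    for msg in arrivals:
        if msg["trade_id"] not in seen:
            seen.add(msg["trade_id"])
            yield msg

# Two feeds interleaved by arrival time; feed B delivers trade 1 late.
arrivals = [
    {"trade_id": 1, "feed": "A", "px": 99.5},
    {"trade_id": 2, "feed": "B", "px": 99.6},
    {"trade_id": 1, "feed": "B", "px": 99.5},  # late duplicate, discarded
]
print([m["feed"] for m in consolidate(arrivals)])  # ['A', 'B']
```

In a general-purpose language, the surrounding plumbing (feed parsing, threading, flow control) is what consumes the weeks of development time; a stream engine supplies those pieces as reusable operators.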
Q: What prompted your decision to support Sun Solaris, Linux, and Windows running on Intel servers?
A: These are the platforms that our customers ask us to support.
Q: With StreamBase, you read TCP/IP data streams producing asynchronous messages and you have APIs for consuming the messages in customer applications. Without the need to store data, your StreamSQL creation allows for the processing of data streams producing SQL joins and aggregates and correlation of multiple streams. You have adaptors for working with popular financial services’ feed formats and a workflow-based GUI for rapid application development. Can you further comment on how this works? What types of time-series operations can be performed?
A: Your question contains a high level description of the StreamBase engine. Using our GUI, you “drag and drop” our operators from a palette onto a workspace. When you are satisfied with an application you can test it with our synthetic message generator. When an application works correctly, you can deploy it across multiple computer systems (for ultimate scalability). On each system, StreamBase has a real time scheduler that “pushes” messages through our operators as quickly as possible. By avoiding process switches and the necessity of storing the data, we can produce exceptional performance.
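The “push” scheduling idea can be sketched briefly: each operator hands its output directly to the next operator in the chain, so a message flows through the whole application without being written to storage in between. This is an assumption-laden toy in Python, not StreamBase’s internal design or API:

```python
class Operator:
    """Minimal push-based operator. Each message is handed straight to the
    downstream operator, with no intermediate storage -- a sketch of the
    push-scheduling idea described above, not StreamBase's implementation."""
    def __init__(self, fn, downstream=None):
        self.fn = fn                # transform; returning None drops the message
        self.downstream = downstream

    def push(self, msg):
        out = self.fn(msg)
        if out is not None and self.downstream is not None:
            self.downstream.push(out)

# Build a small chain: filter even messages, double them, collect results.
results = []
sink = Operator(results.append)
double = Operator(lambda m: m * 2, downstream=sink)
only_even = Operator(lambda m: m if m % 2 == 0 else None, downstream=double)

for msg in [1, 2, 3, 4]:
    only_even.push(msg)
print(results)  # [4, 8]
```

Because each message traverses the chain as a sequence of function calls, there are no process switches or buffer writes on the critical path, which is where the performance claim comes from.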
A: Since StreamBase is a novel paradigm that requires customers to rethink the way they do things, we usually ask them to point us at their hardest problem. We go away, write a StreamBase application that solves it, and return in a few days to show them the result. The customer usually can see how to move our application into one they can deploy in production. Hence, the most effective benchmark is the customer’s actual business problem.
Q: Ingres and Postgres are open source, and you support the open source model, much as with Linux. Describe its future in 5 and 10 years.
A: I think open source DBMSs will capture a substantial piece of the DBMS market because of their attractive price. I expect the open source movement to grow healthier over the next decade.
Q: With your deep knowledge of enterprises, technology and business value, choose three areas that need addressing and share your views in each area.
Area 1: Constructing a solution that will allow one to retrieve information from a mix of textual files (e.g. HTML) and structured data in databases. One can think of this as providing a Google-like extension to what has been called “the hidden web”, because it is hidden behind databases. I wish I had a good idea on how to do this. Obviously, text retrieval is ineffective in the hidden web, and SQL (even extended with text) is not an end-user language.
Area 2: Data warehouses. Customers are putting data into data warehouses at an accelerating rate and then asking ad-hoc queries that paw over very large subsets. Many warehouse users are in considerable pain and I have some ideas on how to provide pain-relief in this area.
Area 3: Semantic heterogeneity. There is great hype surrounding web services to glue together information and services from different enterprises. However, imagine that you are the French administrator and your salaries are net pay after taxes, including a lunch allowance, and in Euros. In contrast, I am the U.S. administrator and my salaries are gross amounts in dollars. Although these two elements can both be called “salary”, obviously there is considerable meta-data required to interpret a value. A web service to merely read the value from a local database will be unhelpful, because the reader has no idea what the units are or how to interpret the data. Dealing with schemas that were designed independently but need to inter-operate is a really hard challenge.
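The salary example can be made concrete. The sketch below is entirely hypothetical (the tax rate, exchange rate, lunch allowance, and metadata schema are invented for illustration); the point is that the raw number is uninterpretable without per-site metadata:

```python
# Hypothetical metadata describing how each site reports "salary".
SCHEMA_META = {
    "fr": {"currency": "EUR", "basis": "net", "includes_lunch": True},
    "us": {"currency": "USD", "basis": "gross", "includes_lunch": False},
}

def to_gross_usd(site, salary, eur_usd=1.10, tax_rate=0.30, lunch=500.0):
    """Normalize a reported salary to gross USD using per-site metadata.
    All rates and the lunch allowance are made-up illustration values."""
    meta = SCHEMA_META[site]
    if meta["includes_lunch"]:
        salary -= lunch              # strip the lunch allowance
    if meta["basis"] == "net":
        salary /= (1 - tax_rate)     # back out taxes to get gross pay
    if meta["currency"] == "EUR":
        salary *= eur_usd            # convert euros to dollars
    return round(salary, 2)

print(to_gross_usd("us", 70000.0))  # 70000.0 (already gross USD)
print(to_gross_usd("fr", 35500.0))  # 55000.0 (net EUR incl. lunch -> gross USD)
```

A web service that merely ships the two raw values across the wire would compare 70000 with 35500 and draw the wrong conclusion; the hard part is capturing and agreeing on the metadata, since the schemas were designed independently.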
Q: Mike, it is such a privilege to have you come in sharing your deep insights. Considering your busy schedule, we appreciate and thank you for the time you have spent with us.
A: Thank you for your time. It has indeed been a pleasure.