WIRED magazine advertises science that deals with massive amounts of data

Saturday, August 28th, 2010




New trends in computing appear every now and then. Dealing with massive amounts of data is not new but a few interesting applications that can greatly benefit from our increased processing power are on the horizon. Wired magazine describes these applications in the following two articles that appeared recently:


Sergey Brin’s Search for a Parkinson’s Cure


What You Want: Flickr Creator Spins Addictive New Web Service

The first article deals with how new drugs are developed. It turns out that the biggest problem is not whether a drug is useful but whether it poses any danger (in other words, whether it has any side effects). For example, aspirin was discovered in 1899, but it was not until 100 years later that it was noticed that patients who take aspirin regularly have a decreased risk of heart attack. In this case, the side effect was positive. However, in many other cases it is negative. The difficulty of testing a new drug is that it takes a lot of time to establish a strong correlation between taking the drug and a certain change in the patient’s body. For example, increased body temperature can result from many things – food, the outside environment, the people the patient talks to, etc. A more comprehensive monitoring system is needed to take those things into account. However, working with such a multi-dimensional data set is only possible with automated tools and requires lots of processing power. This is what Google is good at.

The second article describes Hunch, a system that tries to build a psychological model of you. It asks you a number of random questions, for example whether you believe in alien abductions, and then matches your answers to those of other people. Then it can give you recommendations based on what people with similar answers like. The system goes a step further, however. It can try to guess your answers to arbitrary questions based on what people who are similar to you answered. I tried to let Hunch learn a fair amount of my preferences – I answered over 100 questions. After that it started to make recommendations that seemed interesting. However, when it tried to predict my answers, only around 50% of its guesses were correct. So the system has not learned much yet. On the other hand, a human being is a far more complex creature whose model obviously does not fit into 100 questions.
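The idea of guessing an answer from similar people can be sketched in a few lines of Python. This is only an illustration of the general nearest-neighbour approach, not Hunch’s actual algorithm, which the article does not describe in detail. The function names, the sample questions, and the choice of k are all made up for the example.

```python
from collections import Counter

def similarity(a, b):
    """Fraction of shared questions on which two users gave the same answer."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(a[q] == b[q] for q in shared) / len(shared)

def predict(target, others, question, k=2):
    """Guess target's answer by majority vote among the k most similar users
    who have answered `question`. Returns None if nobody answered it."""
    candidates = [u for u in others if question in u]
    candidates.sort(key=lambda u: similarity(target, u), reverse=True)
    votes = Counter(u[question] for u in candidates[:k])
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical answers; each dict maps a question id to an answer.
me = {"aliens": "no", "coffee": "yes", "cats": "yes"}
others = [
    {"aliens": "no", "coffee": "yes", "cats": "yes", "camping": "yes"},
    {"aliens": "no", "coffee": "yes", "cats": "no", "camping": "yes"},
    {"aliens": "yes", "coffee": "no", "cats": "no", "camping": "no"},
]

print(predict(me, others, "camping"))
```

With only a handful of shared questions the similarity scores are very coarse, which may be one reason why predictions stay poor until a user has answered many questions.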

Another system mentioned in the article, which I liked a lot more, is Aardvark. It uses a unique combination of computing technologies: messaging, tagging, social networking, etc. Its idea is simple and I guess many people have thought of it. What if you have a question but you don’t know whom to ask? Then most likely you head to a web forum or a mailing list and ask there. The problem is that you have to find an appropriate forum and wait a couple of days to get an answer.

More generally, the Internet does not allow you to find the right people instantly. Try typing “Who knows Chinese out there” into a search engine and see what happens. Aardvark is a kind of social search engine. Once you type your question, the system determines its topic and tags it appropriately. Then it searches its database for users who have indicated that they are experts in this particular area. In addition, Aardvark checks who is online at the moment to avoid sending your question to someone who is possibly on vacation. The expert gets an IM notification from Aardvark asking whether (s)he wants to help. If yes, the expert can type the answer immediately and even chat with the person who asked the question. The whole process happens in real time, which actually encourages people to ask questions.
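The routing steps described here (tag the question, find users who declared expertise in that topic, keep only those who are online) can be sketched as follows. This is a toy model under my own assumptions: the keyword table, the User class, and the route function are hypothetical and say nothing about how Aardvark is really implemented.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    expertise: set = field(default_factory=set)  # topics the user signed up for
    online: bool = False

# Hypothetical keyword-to-topic table; a real system would need a much
# richer classifier than exact keyword matching.
TOPIC_KEYWORDS = {
    "chinese": "languages",
    "translate": "languages",
    "linux": "linux",
    "kernel": "linux",
}

def tag_question(text):
    """Return the set of topics whose keywords appear in the question."""
    words = {w.strip("?.,!").lower() for w in text.split()}
    return {topic for kw, topic in TOPIC_KEYWORDS.items() if kw in words}

def route(text, users):
    """Names of online users whose declared expertise overlaps the question's topics."""
    topics = tag_question(text)
    return [u.name for u in users if u.online and u.expertise & topics]

users = [
    User("alice", {"languages"}, online=True),
    User("bob", {"languages"}, online=False),   # expert, but away
    User("carol", {"linux"}, online=True),
]

print(route("Who knows Chinese out there?", users))
```

Filtering on the online flag is what makes the exchange feel instant: the question only goes to people who can answer right now.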

I have tried Aardvark both to ask questions and to answer Linux questions. In both cases the experience was positive. I got answers that were quite valuable. For example, when I asked how to learn Chinese, I was given a link to a web site with an online language course which even had an iPhone app to facilitate learning. I would not have found it in the App Store otherwise.

Aardvark contacted me through Google Talk a few times. It is actually fun to talk to a robot because this way you can help out real people. I guess this is one cool application of artificial intelligence – robots are helping people to socialize!

Reading list Spring 2010

Friday, April 16th, 2010

I have read the first three issues of Communications of the ACM for 2010: January, February, and March. Overall, I have noticed that CACM is aiming at a broader scope, covering not only CS topics but also biology and physics. So nowadays it is more like Science magazine or Nature. But of course in every article there is a computational aspect that connects computer science with another area of knowledge. I found that cross-disciplinary articles are more engaging than purely technical ones. Nature has lots of secrets that computer science helps reveal.

Jan 2010
Rebuilding for Eternity. Bundler – open source version of Photosynth.
Automated translation of Indian Languages
New Search Challenges and Opportunities
Data in Flight. Implementation of StreamSQL. Stanford streams, MIT Aurora, SQL Stream.
Other people’s data – XIgnite

Last but not least – two articles that discuss Google’s parallel engine, MapReduce. I have noticed that CACM contains lots of articles dedicated to Google’s technology; for example, there is an article discussing the evolution of the Google File System in one of the following issues. At the same time there are no articles from other software giants, for example Microsoft, Apple, or IBM. This is not because those companies do not innovate. Everybody knows that programmers went nuts writing iPhone apps. The reason for Google’s domination is, I believe, the amount of sponsor money that it gives to ACM. That is fine, Google has created lots of innovative frameworks, but other companies deserve attention as well.

Map Reduce and Parallel DBMSs: Friends or Foes?
MapReduce: A Flexible Data Processing Tool

Feb 2010

The best issue I have ever read! To start with, its cover story is dedicated to a new model of computation, quantum algorithms. This topic is not new. When I was an undergraduate student in Russia in the late 1990s there was lots of buzz about how quantum algorithms could change cryptography. With their strong mathematical tradition, Russians were trying to explain quantum algorithms from the number theory point of view. To me it was totally incomprehensible. Or I should say that my mind was more inclined toward an algorithmic perspective on quantum computers. In this article CACM does a great job of explaining the notion of a quantum algorithm at the level that was most appropriate to me as a software engineer. It briefly mentions computational complexity challenges and explains how quantum algorithms might help tackle those.

Recent progress in Quantum algorithms

Type Theory Comes of Age. Aura, Jif for security. Philip Wadler
An interview with Michael Rabin

A few billion lines of code later.

Another great article in the same issue! When I was a student (again) but this time in a graduate school in the United States I was lucky to witness the emergence of a new technology – practical bug detection using static analysis. But I will start with a brief introduction on how industrial research is transformed into a widely adopted mature technology.

In my life so far I have seen two such events. More experienced people might name a few other cases, but here is what I can say. In the late 1990s computer graphics advanced rapidly because of increased processing power. Researchers began experimenting with massive amounts of data, or images. This is how light field mapping technology was developed simultaneously at several universities as well as at Microsoft and Intel. Its idea is to build a 3D model of an object from a number of images taken with an inexpensive camera. I was lucky to participate in the development of this technology as an undergraduate intern at Intel-Nizhny Novgorod in 2001-2002. However, it was only a research project, which was soon abandoned. Now, in 2010, there is a commercialized version of this technology, Photosynth, which Microsoft has created.

When I joined graduate school at Stony Brook in 2002, application security was a hot research area. Everybody was thinking about how to protect programs against viruses. This is why we created DIRA – a dynamic protection tool that instrumented programs with additional instructions that made them resilient against buffer overflow attacks. But again, the project was soon abandoned. Dawson Engler, however, was able to transform the technology landscape with his static bug finder. In this article he describes his experience of making a commercial tool from a research project.

Software Model Checking takes off
Assessing the Changing US IT R&D Ecosystem

March 2010
Chasing the AIDS virus

The cover story is another must-read article! It explains the mechanics of the AIDS virus. I never thought that it could transform itself to evade the medicine it is exposed to.

Making decisions based on the Preferences of Multiple Agents

This article describes various voting algorithms with applications to social networks. A very comprehensive discussion.
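To give a flavour of why the choice of voting algorithm matters, here is a small sketch of two textbook rules, plurality and Borda count. I picked these two for illustration; the ballots are invented and the function names are my own, not taken from the article.

```python
from collections import Counter

def plurality(ballots):
    """Winner is the candidate ranked first on the most ballots."""
    return Counter(b[0] for b in ballots).most_common(1)[0][0]

def borda(ballots):
    """Each ballot gives n-1 points to its first choice, n-2 to the second,
    and so on down to 0; the highest total wins."""
    scores = Counter()
    for ballot in ballots:
        n = len(ballot)
        for rank, candidate in enumerate(ballot):
            scores[candidate] += n - 1 - rank
    return scores.most_common(1)[0][0]

# Invented electorate: each ballot ranks candidates from most to least preferred.
ballots = (
    [["A", "B", "C"]] * 3
    + [["B", "C", "A"]] * 2
    + [["C", "B", "A"]] * 2
)

print(plurality(ballots), borda(ballots))
```

On these ballots plurality picks A, who has the most first places, while Borda picks B, who is nobody’s last choice. The same preferences produce different winners depending on the rule.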

Engineering the web’s third decade
Orchestrating coordination in pluralistic networks
GFS: Evolution on fast-forward
Global IT management: structuring for scale, responsiveness, and innovation

Reading a few Communications of ACM articles

Friday, January 8th, 2010

During the holidays I have read a number of ACM articles from the issues I received earlier as well as from December 2009 issues that I received a few days ago. The most interesting articles are:

Ready for Web OS? Mentions Sam King, tablet crunch pad.
A Smart Cyberinfrastructure for Research. Microformats, data portability, codeplex, mit breadcrumbs, zune social, livelabs entity extraction.
An Interview with Ping Fu
You Don’t Know Jack about Software Maintenance
Scratch: Programming for All
Sound Index: Charts For the People, By the People
What Intellectual Property Law Should Learn from Software
The Status of the P versus NP Problem
Just for You. Greg Linden ran personalized news site Findory
The Pathologies of Big Data
CTO Roundtable: Cloud Computing. Animoto on Facebook
Hard-Disk Drives: The Good, the Bad, and the Ugly
Database and Information Retrieval Methods for Knowledge Discovery. MSR Libra, Cimple DBLife, KnowItAll/TextRunner, YAGO WordNet NAGA

Research trend of the year: Parallel Computing

Wednesday, December 30th, 2009

So what were those cool ideas this year? In the last few issues of CACM the topic of parallel computing has received lots of attention. Basically, researchers are saying that lots of time and money have been spent on parallel research but most programmers are still writing single-threaded programs or even if they are multi-threaded they do not scale with the number of processors.

Here are the articles on this topic which I found only in three issues of CACM from September through November 2009:

When I noticed the increased attention to parallel computing, I started wondering whether I had encountered parallel programming before. When I was an intern at Intel I attended an introductory course on parallel computing, during which we implemented standard algorithms such as sorting on a parallel computer using OpenMP. That was in 2001 or so. Since then I saw OpenMP in the literature every now and then until it suddenly disappeared in 2005. None of the subsequent articles on parallel computing that I read even mentioned OpenMP as a predecessor of whatever new framework they were dealing with. Thus I felt relieved when I read Face the Inevitable, an article by an independent writer. The experiences of that author are very similar to mine. The author attributes the lack of attention to OpenMP to its very specific applications.

A couple of years ago another parallel programming framework was extremely popular, but its fate was the same – it fell into oblivion. I mean Google’s MapReduce technology and its open-source version, Hadoop. The explanation of its current unpopularity is probably the same – the applications are quite limited.
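For readers who have not seen the MapReduce programming model, here is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to the classic word count problem. It only illustrates the shape of the model; it is not Google’s or Hadoop’s API, and all function names are my own. A real framework would run the map and reduce calls on many machines in parallel.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each map_phase call is independent of the others, and so is each
    # reduce_phase call; that independence is what allows parallel execution.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

docs = ["the quick fox", "the lazy dog", "the fox"]
counts = word_count(docs)
print(counts)
```

The model fits problems that decompose into independent per-record work followed by per-key aggregation, which is also a fair description of its limits.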

The authors of the Berkeley article at least learned the lessons of the previous frameworks. Their article proposes an application-driven approach. The authors consider a number of potential killer applications of parallel computing. They use a multi-layered approach. The application writer will need to adopt a number of parallel design patterns. Then the developers of the middleware will create libraries that implement such design patterns. The target hardware on which these libraries are executed is not specified yet. Possibly, it is a multi-processor computer with homogeneous or heterogeneous processors. The authors propose an FPGA architecture to facilitate flexible experimentation.

Besides the lack of a parallel killer app, the ideal parallel hardware is also a moving target. So far, success has been achieved only in special domains. For example, Anton is a special-purpose computer for biological simulation which features long pipelines executing specialized instructions that compute forces of interaction among molecules. This is an exceptional architecture because long pipelines are considered harmful for parallel processors in general. Thus, an ideal parallel computer is something that researchers have not created yet.

To summarize, after a decade of research on parallel computing it is not clear which paradigm programmers will accept, which middleware they will use, and on which hardware the programs will be executed. We are entering a new decade with lessons learned from previous failures and lots of ideas on how to design an ideal parallel computing stack. Thus I think that in 5-10 years we will use parallel programming on a daily basis.

Distinguished Lecture Series, continued

Friday, October 9th, 2009

One of the best things that you can enjoy if you are a graduate student in the United States is the opportunity to meet renowned researchers. When I was at Stony Brook University it was called the Distinguished Lecture Series. Through this series I met a lot of interesting people, for example Charles Leiserson, Avi Silberschatz, Randy Katz, etc. There were also lectures by Michael Brin, the father of one of Google’s founders, and by Hector Garcia-Molina, their academic advisor, but unfortunately I did not attend those talks.

When I was an undergraduate student in the early 2000s I worked in a research lab at the Intel Nizhny Novgorod site. We worked on a light field mapping project whose goal was to rebuild a 3D model of an object from a hundred images and render it quickly from an arbitrary angle. At that time I got to know the research projects of the Graphics Lab at Stanford University, in particular those of Prof. Marc Levoy. I was considering going to graduate school then, but of course a PhD program at Stanford was extremely competitive to get into. So I ended up at SUNY Stony Brook.

This is where I met one of the former students of that lab, Olaf Hall-Holt. He was working on visualizing urban areas. At that time I was a lot more interested in other areas of computer science, even though I had spent some time at Intel working on a computer graphics project. But it was nice to meet Olaf as a connection with the Stanford graphics lab.

A couple of years later, while I was working on computer security issues at SUNY, another representative of the Stanford graphics lab visited. It was Pat Hanrahan, who came to give his distinguished lecture.

I left Stony Brook in 2006 and eventually went to work for a big company in Finland. And a few days ago I got a chance to meet another professor from the Stanford Graphics Lab – Marc Levoy. He was describing his recent work, and when I asked him whether he still works on light fields he answered that yes, he is applying that work to microbiology.

To summarize, over the span of almost 10 years I have been working in different areas of computer science such as security, mobile technology, and others. But wherever I am, I keep on meeting people from the Stanford graphics lab. Isn’t that surprising? I wonder, after all, whether I should have stuck with the first computer graphics project that I worked on at Intel instead of trying out so many exciting things. Another surprising finding is that the work I am doing now is related to that of the Stanford graphics lab, even though they might look like two different areas at first glance. Is there some hidden connection that keeps me close to graphics?