Yahoo Pipes tutorial

Yahoo Pipes is a framework that lets you build mashups within minutes once you learn how to use it. However, like any other application, Yahoo Pipes has its tricks. In this tutorial we will build a pipe and discuss the tricky parts.

There has been a lot of activity in the area of mashups recently, and a number of companies offer similar frameworks.

The goal of such frameworks is programming without knowing how to program. A typical user works with a number of information sources: first and foremost web pages, then news feeds, pictures, and maps. Web pages are examples of unstructured information, but some are better described as semi-structured. Imagine a bus schedule represented as an HTML table. This information is semi-structured: each bus stop is listed on a new row, whereas arrival and departure times are in different columns of the table. Yahoo Pipes is well suited to transforming this type of information into RSS feeds, a format that is convenient for further processing. For example, it is possible to automatically geocode the information, that is, to find the location names within the data and calculate their coordinates. After geocoding you can display the RSS feed on a Yahoo map.

Here are two examples that implement this idea: a Bay Area train schedule and an Indian bus route.

As you can see, there are three parts in these examples:

  1. Converting semi-structured HTML pages into RSS feeds.
  2. Enhancing the structured information, for example geocoding it.
  3. Visualizing the results, for example drawing on a map.

I think that Yahoo is particularly good at the first two parts. The visual interface for working with regular expressions makes it very easy to parse HTML and extract only the relevant parts of it. To make sense of the text after it has been extracted, Yahoo provides a number of services and APIs. One very useful service is Yahoo Shortcuts. It parses the text and automatically tries to make sense of each word. For example, if I am describing my adventures in Helsinki, Yahoo Shortcuts will understand that Helsinki is a city in Finland and enhance the RSS feed with geotagging information. It is then immediately possible to display the RSS feed on a map. Thus, we get geoblogging at no cost.

Another possibility for geocoding is to use the Geocoding API. The Shortcuts service is designed as a plugin for bloggers, whereas the API is intended for automated use.

As for the visualization part, Yahoo can display results on a map. However, there are a number of other interesting visualization options, including Google Charts and Timeline. These visualizers are not built into the Yahoo Pipes framework, but the output of a pipe is a standardized feed in either RSS or JSON format, so you can supply it to any visualizer you want. It is also possible to subscribe to the output feed in Google Reader like any other blog.


running pipe

All right, let's discuss how to build a pipe. This is a real-world example. I am a road runner and I have a schedule of marathons in different places across Finland. I do not know where these cities are; they may be next to me or far away. Looking them up manually takes a lot of time, so I would like to display them on a map. Here is the resulting pipe. To open it in the editor, click on Edit source.


fetch block

The fetch page block downloads the web page. We want only the table with the schedule, so I cut it out using the text at the beginning and at the end of the table; the rest of the HTML is discarded. The HTML tag that starts each table row is used as the delimiter. The result of this block is a set of items corresponding to individual rows of the table.
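
The cut-and-split behaviour described above can be imitated outside Pipes. The following is only a rough Python sketch, not how Yahoo Pipes is implemented: the function name fetch_rows, the attribute name content, the markers, the delimiter and the sample page (with made-up marathon names) are all assumptions chosen for illustration.

```python
def fetch_rows(html, cut_from, cut_to, delimiter):
    """Keep the text between two markers, then split it into items."""
    start = html.find(cut_from)
    if start == -1:
        return []  # start marker not found: nothing to extract
    start += len(cut_from)
    end = html.find(cut_to, start)
    if end == -1:
        end = len(html)  # no end marker: keep everything after the start
    fragment = html[start:end]
    # Every non-empty piece becomes one item with a single attribute.
    return [{"content": piece.strip()}
            for piece in fragment.split(delimiter)
            if piece.strip()]


# Hypothetical page with made-up races, standing in for the real schedule.
sample_page = (
    "<html><body><h1>Races</h1>"
    '<table id="schedule">'
    "<tr><td>Spring Marathon, Helsinki</td></tr>"
    "<tr><td>Lakeside Marathon, Tampere</td></tr>"
    "</table><p>footer</p></body></html>"
)

rows = fetch_rows(sample_page, '<table id="schedule">', "</table>", "<tr>")
for row in rows:
    print(row["content"])
```

Each printed line is the raw HTML of one table row, which is what the later blocks work on.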

A set is the only data structure in the Yahoo Pipes programming language. Of course, there are input and output formats such as RSS, CSV, etc., but they are external to the language. The notion of a set is intuitive: basically, it is a collection of anything. Each item has named attributes. If there is an attribute called title, it is used as the name of the item as a whole. If there is no title, a unique number is assigned to the item.
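
One way to picture such a set is as a list of dictionaries. The sketch below is an analogy in Python, not the internal representation used by Pipes; the helper item_name and the sample items are invented for illustration, and the position in the list stands in for the unique number.

```python
def item_name(item, position):
    """Name of an item: its title if it has one, otherwise a number."""
    title = item.get("title")
    return title if title else str(position)


# Two made-up items; the second one has no title attribute.
items = [
    {"title": "Spring Marathon", "content": "Spring Marathon, Helsinki"},
    {"content": "Lakeside Marathon, Tampere"},
]

names = [item_name(item, i) for i, item in enumerate(items)]
print(names)
```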


rename block

The rename block copies the information in each row into a separate attribute, because we are going to extract the city names from the original for geocoding. The copied information is included in the output RSS feed.
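
In Python terms, this step amounts to copying one attribute to another name before the original is modified. The sketch below assumes the attribute names content and description, which are illustrative and not taken from the actual pipe.

```python
def copy_attribute(items, source, target):
    """Return new items in which `source` is also available as `target`."""
    copied = []
    for item in items:
        new_item = dict(item)  # leave the input items untouched
        if source in new_item:
            new_item[target] = new_item[source]
        copied.append(new_item)
    return copied


rows = [{"content": "<td>Spring Marathon, Helsinki</td>"}]
kept = copy_attribute(rows, "content", "description")
print(kept[0]["description"])
```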


regex block

The regex block extracts city names. Basically, it removes all HTML tags and the name of the marathon, which is the part of the sentence before the comma. The word after the comma is the name of the city. At this point we have completed the first part, that is, we have structured the information: we have a city name for each row of the table. Of course, you have to learn regular expressions to use this block. The syntax follows Perl conventions. The following flags are available:

  1. global – replace all occurrences, not just the first one
  2. single line – allow the dot character to match newline characters as well
  3. multi-line search
  4. case insensitive search
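
For readers who want to try the extraction outside Pipes, here is a rough Python equivalent. The two patterns are my guesses at rules that do the job described above, not the ones used in the actual pipe, and the mapping of the flags to Python's re module is approximate.

```python
import re

# Approximate Python counterparts of the Perl-style flags:
#   global           -> re.sub replaces every match unless count=1 is given
#   dot matches "\n" -> re.DOTALL
#   multi-line       -> re.MULTILINE
#   case insensitive -> re.IGNORECASE

TAG = re.compile(r"<[^>]+>")             # any HTML tag
UP_TO_COMMA = re.compile(r"^[^,]*,\s*")  # marathon name plus the comma


def extract_city(row_html):
    """Strip the tags, then drop everything up to the first comma."""
    text = TAG.sub("", row_html)               # replaces all tags
    text = UP_TO_COMMA.sub("", text, count=1)  # replaces the first match only
    return text.strip()


# Made-up row in the "name, city" form described above.
city = extract_city("<td>Spring Marathon, Helsinki</td>")
print(city)
```

If a row contains no comma, the second pattern does not match and the whole text is returned, which is one reason a later filtering step is useful.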


loop block

Now let's give meaning to each city name. The loop block iterates over the set of rows and applies the inner block to each item. We use the Yahoo! Shortcuts block as the inner block to try to find a place with that name.
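
The loop block behaves much like a map over the set. In the sketch below, the inner block is replaced by a hypothetical lookup table with approximate coordinates; it only stands in for a real place-finding service, and the names loop, find_place and location are invented for this example.

```python
# Hypothetical stand-in for a place-finding service (approximate lat, lon).
KNOWN_PLACES = {
    "helsinki": (60.17, 24.94),
    "tampere": (61.50, 23.76),
}


def find_place(item):
    """Inner block: look up the item's city, returning None if unknown."""
    return KNOWN_PLACES.get(item["city"].strip().lower())


def loop(items, inner, assign_to):
    """Apply `inner` to every item and store the result under `assign_to`."""
    return [{**item, assign_to: inner(item)} for item in items]


rows = [{"city": "Helsinki"}, {"city": "Tampere"}, {"city": "Results"}]
located = loop(rows, find_place, "location")
for item in located:
    print(item["city"], item["location"])
```

The third row shows what happens when the extracted text is not a place name: the lookup returns nothing, and the item carries an empty location.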


filter block

It is possible that the extracted city name does not make any sense. The filter block removes the entries whose location was not found.
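
As a sketch, the filtering rule can be written as a one-line list comprehension. The attribute name location and the sample items are assumptions made for this example.

```python
def keep_located(items, field="location"):
    """Drop the items for which no location was found."""
    return [item for item in items if item.get(field) is not None]


items = [
    {"city": "Helsinki", "location": (60.17, 24.94)},
    {"city": "Results", "location": None},  # not a real place name
]

found = keep_located(items)
print([item["city"] for item in found])
```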


loop block

What we have at this point is a set which we need to transform into an RSS feed. The loop block has an Item Builder block inside, which converts a set item into an RSS item. To be displayed in an RSS viewer, either in Yahoo Pipes or in Google Reader, each item of the RSS feed should include the following attributes:

  1. title - the name of the feed item.
  2. link - the URL of the item. Think of a blog – each story has a title. If the user clicks on it, the story content is loaded using this URL.
  3. description - a brief summary of the item.
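
To make the three attributes concrete, here is a minimal Python sketch that turns set items into RSS item elements using the standard library. It is not what the Item Builder block does internally; the function build_rss, the channel title and the sample item are assumptions, and the output is a bare-bones feed without the channel-level link and description that a complete RSS 2.0 feed would carry.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, items):
    """Build a bare-bones RSS document from a list of dictionaries."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    for item in items:
        node = ET.SubElement(channel, "item")
        # Emit only the attributes that are present and non-empty.
        for field in ("title", "link", "description"):
            if item.get(field):
                ET.SubElement(node, field).text = item[field]
    return ET.tostring(rss, encoding="unicode")


# Made-up item without a link, as in the map-only case.
feed_xml = build_rss(
    "Marathon schedule",
    [{"title": "Helsinki", "description": "Spring Marathon, Helsinki"}],
)
print(feed_xml)
```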


location extractor block

In this example we do not have a link attribute because we will display the results only on a map. We are almost done. The RSS feed includes the names of the cities, their geo information, and the original description that we copied from the schedule at the beginning. The Location Extractor block prepares the data for display on a map.

pipe output block

We connect it to the Pipe Output block and off we go.