
I recently had to work with data visualization in JavaScript. The obvious choice for the visualization was d3, but I needed something for the data manipulation, preferably something declarative. Since I have substantial experience with LINQ, using something similar came to mind. query-js is the npm module that resulted from my efforts: a series of methods that let you perform sequence operations on arrays and sequences. This post is meant as a getting-started guide for using it.
Instead of giving a theoretical description of the module, I’m going to work through a few simple examples of how to use some of the more important methods.

There’s a lot of public data available through REST APIs, and often the data is in an easy-to-use JSON format. One of these sources is EU statistics on Gross District Product. GDP is the district version of the Gross National Product or, in other words, an indication of prosperity in the district.

To get the data we need to make an HTTP request to the endpoint of the service, and we are also going to use query-js (surprise!). So start by installing both request and query-js, and then require both:

 

npm install request
npm install query-js

var request = require("request"),
 Query = require("query-js"),
 url = "https://inforegio.azure-westeurope-prod.socrata.com/resource/j8wb-jxec.json?$offset=0";

The URL that I sneaked in is the URL for the service endpoint. It returns an array of objects. Each object has the following format:

{
  "ipps28_2011" : "221.7",
  "nuts_id" : "BE10",
  "nuts_name" : "Région de Bruxelles-Capitale / Brussels Hoofdstedelijk Gewest"
}

The first property (ipps28_2011) is the actual GDP figure. The second one (nuts_id) is an identification of the district, where the first two letters identify the country. With that information, let’s see what it would require to get all the districts and find all countries that have at least one poor region.

//get the data
request.get(url,function(error, response, body){
 var data = JSON.parse(body),
     query = new Query(data),
     //group by the first two letters in the district code aka the country
     byCountry = query.groupBy(function(d){ return d.nuts_id.substr(0,2); });
});

In the code we first request the data and parse it, and then we can start on the querying part. The first query groups the districts by country (or, more precisely, by the first two letters of the district identification). The result is an object that can be treated either as a regular object or as a sequence of key-value pairs. In other words, all the sequence operations of query-js are available on it. So we could find the values for the Nordic countries like this:


    var nordics = byCountry.where(function(country){
        var countryId = country.key;
        return countryId === "SE" || 
               countryId === "FI" ||
               countryId === "DK";
    });

That keeps only the groups for Sweden (SE), Finland (FI) and Denmark (DK). Norway is part of the Nordics but is not part of the EU, so there is no data for it.
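To make the shape of the grouped result concrete without a network call, here is a minimal sketch in plain JavaScript rather than query-js. The helper groupIntoPairs is hypothetical and only mimics the "sequence of key-value pairs" described above. Apart from the BE10 record, the sample figures and district names are made up for illustration.

```javascript
// Records in the same shape as the service response.
// Only the BE10 record comes from the post; the others are made up.
const sampleDistricts = [
  { ipps28_2011: "221.7", nuts_id: "BE10", nuts_name: "Région de Bruxelles-Capitale / Brussels Hoofdstedelijk Gewest" },
  { ipps28_2011: "120.0", nuts_id: "SE11", nuts_name: "Sample Swedish district" },
  { ipps28_2011: "110.0", nuts_id: "SE12", nuts_name: "Another sample Swedish district" },
  { ipps28_2011: "100.0", nuts_id: "DK01", nuts_name: "Sample Danish district" }
];

// Hypothetical helper (not part of query-js): group items into
// { key, value } pairs, where value is the array of items sharing the key.
function groupIntoPairs(items, keySelector) {
  const groups = new Map();
  for (const item of items) {
    const key = keySelector(item);
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(item);
  }
  return Array.from(groups, ([key, value]) => ({ key, value }));
}

// Group by the first two letters of the district code, then keep the Nordics.
const pairsByCountry = groupIntoPairs(sampleDistricts, d => d.nuts_id.slice(0, 2));
const nordicPairs = pairsByCountry.filter(pair => ["SE", "FI", "DK"].includes(pair.key));

console.log(nordicPairs.map(pair => pair.key)); // [ 'SE', 'DK' ]
```

Each pair keeps the full district records in its value, which is what the later examples rely on when they read country.value.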

We could also look for all regions in the lowest category (ipps28_2011 < 50)


    var lowIncomeDistricts = query.where(function(gdp){return gdp.ipps28_2011 < 50;});

or what if we wanted to get all countries with at least one poor region?


     var countryWithLowIncomeDistricts = byCountry.where(function(country){
          return country.any(function(gdp){ return gdp.ipps28_2011 < 50; });
     });

That uses another of the sequence operations that query-js provides: the any(predicate) method. It returns true if at least one element in the sequence satisfies the predicate. So in this case it returns true if at least one district in a given country has an ipps of less than 50.
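For readers more used to plain arrays, the same "at least one element satisfies the predicate" idea corresponds to Array.prototype.some. The sketch below uses plain JavaScript, not query-js. The country codes, the figures and the isLowIncome helper are all made up for illustration.

```javascript
// Made-up grouped data in { key, value } form; not real statistics.
const groupedSample = [
  { key: "AA", value: [{ nuts_id: "AA01", ipps28_2011: "45.5" }, { nuts_id: "AA02", ipps28_2011: "80.0" }] },
  { key: "BB", value: [{ nuts_id: "BB01", ipps28_2011: "130.2" }] }
];

// Hypothetical predicate: the figure arrives as a string, so parse it first.
function isLowIncome(district) {
  return parseFloat(district.ipps28_2011) < 50;
}

// Keep the countries where at least one district is below the threshold.
const countriesWithLowIncome = groupedSample
  .filter(country => country.value.some(isLowIncome))
  .map(country => country.key);

console.log(countriesWithLowIncome); // [ 'AA' ]
```

As with some, an empty sequence never satisfies the predicate, so a country without districts would not be included.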

Now, with a bit of query-js dirt under our fingernails, let’s tackle a slightly more complex task. How about we find the average ipps28 for all Nordic countries in the EU? We’ve already seen how to get the values for all Nordic countries, so that should be easy. However, this time we want to have all entries in the same collection instead of grouped by country. Then we want to extract the ipps28 from each item, and lastly we want to compute the average of them all. In list form:

  1. Filter on country
  2. Extract ipps28
  3. Compute average

We are going to use where for the filtering. Extracting data from an object is a projection, so for step two we use select, and for step three there’s an average method.


   var avg = query.where(function(gdp){
      var countryId = gdp.nuts_id.substring(0,2);
      return countryId === "SE" || 
             countryId === "FI" ||
             countryId === "DK";
   }).select(function(gdp){
      return parseFloat(gdp.ipps28_2011);
   }).average();

You should be able to see the three steps from above well represented in the code. However, we can shorten it a bit if we’d like. The execution will actually be exactly the same in both scenarios (that is, the performance will be the same).


   var avg = query.where(function(gdp){
      var countryId = gdp.nuts_id.substring(0,2);
      return countryId === "SE" || 
             countryId === "FI" ||
             countryId === "DK";
   }).average(function(gdp){
      return parseFloat(gdp.ipps28_2011);
   });

As you can see, the only difference is that the projection is now passed to the average method instead of having a specific projection step.
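The filter, project, average pipeline can also be sketched with plain array methods, which may help when comparing the two forms. This is not query-js code. The averageOf helper is hypothetical and the figures are made up so that the arithmetic is easy to follow.

```javascript
// Made-up sample records; not real statistics.
const sampleGdp = [
  { nuts_id: "SE11", ipps28_2011: "120.0" },
  { nuts_id: "FI1B", ipps28_2011: "100.0" },
  { nuts_id: "DK01", ipps28_2011: "110.0" },
  { nuts_id: "BE10", ipps28_2011: "221.7" }
];

// Hypothetical helper: average of an array, with an optional projection
// applied to each element first. Returns NaN for an empty array.
function averageOf(items, projection = x => x) {
  if (items.length === 0) {
    return NaN;
  }
  const sum = items.reduce((total, item) => total + projection(item), 0);
  return sum / items.length;
}

// Step 1: filter on country.
const nordicCodes = ["SE", "FI", "DK"];
const nordicOnly = sampleGdp.filter(gdp => nordicCodes.includes(gdp.nuts_id.substring(0, 2)));

// Long form: explicit projection step, then average.
const avgLong = averageOf(nordicOnly.map(gdp => parseFloat(gdp.ipps28_2011)));

// Short form: hand the projection to the averaging step.
const avgShort = averageOf(nordicOnly, gdp => parseFloat(gdp.ipps28_2011));

console.log(avgLong, avgShort); // 110 110
```

Note that with plain arrays the long form builds an intermediate array in the map step, so this sketch only illustrates that the two forms give the same result. It says nothing about how query-js executes them.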

There are many ways to skin a cat. We’ve looked at two different approaches for finding the average GDP in the Nordic countries of the EU. There’s a third one that will let me introduce another important method, namely concat.


    nordics.select(function(country){ return country.value; })
           .concat()
           .select(function(gdp){return gdp.ipps28_2011;})
           .average();

concat comes in three flavours. We will look at two of them. The first takes no arguments and concatenates the elements of the sequence (the elements themselves have to be sequences); it is the one used above. The second, used below, is shorthand for first projecting and then concatenating. The performance of the two is the same, since the second version is implemented as a select followed by a concatenation.


    nordics.concat(function(country){ return country.value; })
           .select(function(gdp){return gdp.ipps28_2011;})
           .average();
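In plain array terms, the two concat flavours described above resemble map followed by flat, and flatMap, respectively. The following is a minimal sketch using those built-in array methods (available in Node 11 and later), not query-js itself. The grouped data is made up for illustration.

```javascript
// Made-up grouped data in { key, value } form; not real statistics.
const nordicGroups = [
  { key: "SE", value: [{ ipps28_2011: "120.0" }, { ipps28_2011: "100.0" }] },
  { key: "DK", value: [{ ipps28_2011: "110.0" }] }
];

// Two steps: project each pair to its inner array, then flatten one level.
const flattenedTwoSteps = nordicGroups.map(country => country.value).flat();

// One step: project and flatten together.
const flattenedOneStep = nordicGroups.flatMap(country => country.value);

// Parse the string figures before averaging them.
const figures = flattenedOneStep.map(gdp => parseFloat(gdp.ipps28_2011));
const flatAverage = figures.reduce((sum, n) => sum + n, 0) / figures.length;

console.log(flattenedOneStep.length, flatAverage); // 3 110
```

Both variants produce the same flat list of district records, which mirrors the point that the shorthand is just a projection followed by a concatenation.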

There are still more ways to skin this cat. Often you want to project the elements of a sequence of sequences and concatenate the result. The method selectMany does just that. Instead of iterating over the elements of the sequence, it iterates over the elements of the elements of the sequence and produces a new sequence. So the above could also be written as


    nordics.select(function(country){ return country.value; })
           .selectMany(function(gdp){return gdp.ipps28_2011;})
           .average();

First we have to project the sequence we already have, because nordics holds a sequence of key-value pairs where each value is another sequence. By selecting the value of each pair we end up with a sequence of sequences, on which we then call selectMany, and finally we take the average of the projected values.

We can actually shorten this slightly. selectMany can accept two projections: the first one is applied to each value of the outer sequence, and the second one to the elements of the inner sequences. That is, the above could also be written as follows


    nordics.selectMany(function(country){ return country.value; }, function(gdp){return gdp.ipps28_2011;})
           .average();

Since select has a shorter form, we can rewrite this slightly. If the argument provided to select is not a function but a string, select will treat it as a simple projection returning the value of the property with the name given by the string. That is

sequence.select(function(e){ return e.value;});

is semantically equivalent to

sequence.select("value");

and since selectMany internally uses select we can write our example as follows


    var averageForDistricsInTheNordics = nordics.selectMany("value", "ipps28_2011")
                                                .average();
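To illustrate how a string can stand in for a projection function, here is a minimal sketch in plain JavaScript. It is not the query-js implementation. Both toProjection and selectManySketch are hypothetical names introduced only for this example, and the data is made up.

```javascript
// Hypothetical helper: a string becomes a property lookup,
// a function is used as-is.
function toProjection(projection) {
  return typeof projection === "string"
    ? element => element[projection]
    : projection;
}

// Hypothetical stand-in for a two-projection selectMany: the first
// projection yields the inner sequence of each outer element, the second
// is applied to every inner element, and the results are flattened.
function selectManySketch(items, outerProjection, innerProjection) {
  const outer = toProjection(outerProjection);
  const inner = toProjection(innerProjection);
  return items.flatMap(item => outer(item).map(element => inner(element)));
}

// Made-up grouped data in { key, value } form; not real statistics.
const groupsSample = [
  { key: "SE", value: [{ ipps28_2011: "120.0" }, { ipps28_2011: "100.0" }] },
  { key: "DK", value: [{ ipps28_2011: "110.0" }] }
];

const viaStrings = selectManySketch(groupsSample, "value", "ipps28_2011");
const viaFunctions = selectManySketch(groupsSample, c => c.value, g => g.ipps28_2011);

console.log(viaStrings); // [ '120.0', '100.0', '110.0' ]
```

In this sketch the projected figures are still strings, so plain JavaScript would need a parseFloat step before averaging them. The sketch makes no assumption about whether the average method in query-js does that conversion for you.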

(*) This is likely going to change, so that you will have to explicitly add them.
