data visualization for two-way tables /1

introduction

A bar chart is in many cases the simplest and most effective way to visualize a discrete data set. But when working with more than one data set, is there a graph that is better than the simple clustered bar chart, at least in some cases?

I will try to answer this question over several posts, introducing a different graph type in each of them.

As an example, I will use the following table, which has already been discussed on some other professional sites. It shows the energy quantities (in quadrillions of BTU) consumed in the US during 1995, by source and by use (the approximations are mine, from this scheme):

uses                          oil    gas   coal  nuclear  renewable  totals
transportation               27.0    0.6    0.0      0.0        0.4    28.0
industrial                    9.6    7.9    2.0      0.0        1.4    20.9
residential and commercial    2.4    8.1    0.1      0.0        0.6    11.2
public services               1.2    6.0   20.8      8.1        3.8    39.9
totals                       40.2   22.6   22.9      8.1        6.2   100.0

For such a table, one can consider the following views (interactive version):

  1. the inner series of column percentages
  2. the total series of column percentages (the column of normalized row totals)
  3. the inner series of row percentages
  4. the total series of row percentages (the row of normalized column totals)
  5. the set of all inner values

but it is difficult to imagine a graph that can display all these views at the same time in a simple way. Let us see.
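
To make the five views concrete, here is a minimal Python sketch that computes them from the table above. The variable names are mine, and normalizing view 5 by the grand total is my assumption (here the grand total is 100, so the shares match the raw values).

```python
# Energy table from the text: rows are uses, columns are sources (quadrillion BTU).
sources = ["oil", "gas", "coal", "nuclear", "renewable"]
uses = ["transportation", "industrial",
        "residential and commercial", "public services"]
table = [
    [27.0, 0.6, 0.0, 0.0, 0.4],
    [9.6, 7.9, 2.0, 0.0, 1.4],
    [2.4, 8.1, 0.1, 0.0, 0.6],
    [1.2, 6.0, 20.8, 8.1, 3.8],
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

# View 1: each cell as a share of its column total.
col_pct = [[v / col_totals[j] for j, v in enumerate(row)] for row in table]
# View 2: the column of normalized row totals.
row_total_pct = [t / grand_total for t in row_totals]
# View 3: each cell as a share of its row total.
row_pct = [[v / row_totals[i] for v in row] for i, row in enumerate(table)]
# View 4: the row of normalized column totals.
col_total_pct = [t / grand_total for t in col_totals]
# View 5: all inner values (assumption: expressed as shares of the grand total).
cell_pct = [[v / grand_total for v in row] for row in table]

# Print view 1 as a small table of percentages.
for use, shares in zip(uses, col_pct):
    print(f"{use:28s}", "  ".join(f"{s:6.1%}" for s in shares))
```

Views 1 and 3 are the "inner" series, while views 2 and 4 are the margins; a single graph has to encode all of them with only positions, lengths and areas.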

mosaic graph

In an article on The Skeptical Optimist, the data above are shown in a Marimekko (also called mosaic or matrix) graph:

energy2005.gif

The graph emphasizes view 1. (the data of each column become a stacked bar in a vertical section of the box) and view 4. (the row of column totals becomes the horizontal stacked bar formed by the vertical sections of the box), while the other views can be derived only indirectly, by comparing the areas of different rectangles.
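
The geometry just described can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `mosaic_rects` is hypothetical, the box is the unit square, every column total is nonzero, and the first row of the table is stacked at the bottom of each section.

```python
# Rows are uses, columns are sources (values from the table in the text).
table = [
    [27.0, 0.6, 0.0, 0.0, 0.4],
    [9.6, 7.9, 2.0, 0.0, 1.4],
    [2.4, 8.1, 0.1, 0.0, 0.6],
    [1.2, 6.0, 20.8, 8.1, 3.8],
]

def mosaic_rects(table):
    """Return {(row, col): (x, y, width, height)} inside the unit square."""
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(col_totals)
    rects = {}
    x = 0.0
    for j, col_total in enumerate(col_totals):
        width = col_total / grand_total   # view 4: weight of the column
        y = 0.0
        for i, row in enumerate(table):
            height = row[j] / col_total   # view 1: share of the cell in its column
            rects[(i, j)] = (x, y, width, height)
            y += height                   # stack the next cell on top
        x += width                        # move to the next vertical section
    return rects

rects = mosaic_rects(table)
x, y, w, h = rects[(0, 0)]                # oil used for transportation
print(f"width={w:.3f} height={h:.3f} area={w * h:.3f}")
```

Each area equals the cell's share of the grand total, which is why row-wise readings (views 2 and 3) can only be recovered by comparing areas.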

The mosaic graph draws three main criticisms.

First. The reader's attention is captured by the areas of the rectangles, but it is hard to understand the meaning of their dimensions (the width is proportional to the column weight, the height to the weight of the cell within its column). As a result, the graph carries an information overload [1].

Second. It is difficult to read the data along the rows, because that requires comparing the areas of rectangles with different widths and heights. Similarly, it is difficult to get a sense of the weight of each row. To correct this drawback, a special version of the mosaic graph has been proposed, which adds a separate vertical section stacking the row weights.

GraphicalEquity1-300x251.png

However, in this graph the width of the separate vertical section has no meaning, unlike the widths of the other sections.

Third. As in all stacked bar charts, it is difficult to judge the heights of rectangles that do not start from the same level.

On the grounds of these arguments, some authors advise strongly against the mosaic graph [2], and some alternative representations have been suggested [3][4][5]. I will briefly present them next time.

While I agree with nearly all of these arguments, I have to note that there are usually several approaches to data analysis, and that each of them is usually equally valid, in the sense that each answers a different question. So it is natural that, since the mosaic graph emphasizes a vertical reading of the table, the other views appear less straightforward (note that there is a version of the mosaic graph that draws the rows as horizontal sections of the box, so that the data are read horizontally).

In the original graph I appreciate the formatting device of making the vertical borders stronger than the horizontal ones. This points out that each vertical section should be viewed as a separate entity, or at least warns that comparing rectangles within the same vertical section, which only involves judging heights, is not the same as comparing rectangles in different vertical sections, which involves judging areas.

Furthermore, I think it is difficult to choose once and for all between the information overload of a single graph and the information dispersion of a set of four different graphs, which has been proposed as a replacement for the mosaic graph [6]. Both have strengths and drawbacks.

weighted cluster bar chart

Focusing in particular on the third criticism, I propose an alternative to the mosaic graph, which I call the weighted cluster bar chart. It is in essence a grouped bar chart where groups are weighted by width. Here it is applied to the current data:

energy_cols.png

The weighted cluster bar chart is built so that:

  • the bar series in the background represents the column of row totals (view 2.);
  • the weighted bar series in the foreground represent the single columns (view 1.).

Bars are weighted in the sense that their widths are proportional to the column totals (the totals row of the table). Therefore the full basis of each cluster represents the row of column totals (view 4.), horizontally stacked.
This chart contains as much information as the mosaic graph, but it is simpler to interpret.
All the heights describe percentages, and all the bars rise from the same baseline. This avoids the shortcomings of stacked bar charts and makes measurements and comparisons easier.

Each bar area is also meaningful, being proportional to the corresponding cell value. This is not meant to help in comparing bars of different colors, which would run into the same problems already described. Rather, it highlights a special property of the graph: in each cluster, the sum of the foreground bar areas equals the background bar area. So it should be easy to see how much each item weighs on the total. For example, in the leftmost cluster, oil accounts for almost all of the consumption for transportation, while in the rightmost cluster nuclear represents a small part of public consumption even though it is used entirely for it.
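
The area property can be checked numerically with a small Python sketch. It rests on my reading of the chart, not on a published specification: one cluster per use, foreground heights equal to column percentages, foreground widths proportional to the column totals, and a background bar as wide as the whole cluster. The function name `weighted_cluster_bars` is hypothetical.

```python
# Rows are uses, columns are sources (values from the table in the text).
table = [
    [27.0, 0.6, 0.0, 0.0, 0.4],
    [9.6, 7.9, 2.0, 0.0, 1.4],
    [2.4, 8.1, 0.1, 0.0, 0.6],
    [1.2, 6.0, 20.8, 8.1, 3.8],
]

def weighted_cluster_bars(table):
    """Return one dict per row (cluster) with background and foreground bars.

    Widths are expressed so that the basis of each cluster has length 1.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    widths = [c / grand_total for c in col_totals]     # view 4
    clusters = []
    for i, row in enumerate(table):
        # Background bar: full cluster width, height = normalized row total (view 2).
        background = (1.0, row_totals[i] / grand_total)
        foreground = []
        x = 0.0
        for j, value in enumerate(row):
            height = value / col_totals[j]             # view 1
            foreground.append((x, widths[j], height))  # (offset, width, height)
            x += widths[j]
        clusters.append({"background": background, "foreground": foreground})
    return clusters

for cluster in weighted_cluster_bars(table):
    bg_width, bg_height = cluster["background"]
    fg_area = sum(w * h for _, w, h in cluster["foreground"])
    # The two areas should agree up to floating-point rounding.
    print(f"background={bg_width * bg_height:.3f} foreground={fg_area:.3f}")
```

Under these assumptions the identity holds because each foreground area is (column total / grand total) × (cell / column total) = cell / grand total, and summing over a row gives the normalized row total. Calling the function on the transposed table, `[list(c) for c in zip(*table)]`, would presumably give the geometry of the companion version with one cluster per source.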

The weighted cluster bar chart can easily be built in Excel using the same method described here for the mosaic graph.
For the sake of completeness, let me also show the version of the graph that reports, for each source, the percentages of the different uses:

energy_rows.png

Obviously, since it requires as many colours as there are data series, just like the mosaic graph, the weighted cluster bar chart is recommended for tables with a limited number of rows and columns, like the one considered so far. For larger tables it is convenient to look for other solutions. Next time. :)
