Notice: Even if the main subject (batch data processing) of this article is still up-to-date, it describes two outmoded services: Google Trends for Websites, which is no longer available, and Google Image Charts API, which is now deprecated.
Several times I have happened to read an interesting work or news and to have an idea on how to delve deeper into the matter, then to renounce because such a task would have required to collect and analyse a lot of online data and I haven’t had the time and the right tool to do it.
It was so also the first time I have seen the World map of social networks. The chart is nice but the information is a bit poor. Which is the penetration of Facebook or the other most popular social network in each country? Which is the variation occurred in each country between two successive dates? Which is the difference about the number of users between the most popular social network and the second one? Indeed, the situation where the users of the two most used social networks are in a ratio of about 1:1 is very different from the one where the ratio is about 1000:1. Anyway, among the two declared sources of data, Alexa offers freely only rank statistics, as far as I know, and Google Trends for Websites requires to make hundreds of queries to obtain the necessary data, and this was very discouraging.
No time and no tool, I have said. But in the last year I have recognized that often Google spreadsheet functions for external data can save time and offer the right tool. For example, when disputing CounterPunch article on Fukushima accident effects in the USA, these functions have spared me the boring job to copy down all the data involved. The case of social network maps is even better to show how to take advantage of Google tools.
Trying to answer my previous questions I have build five different maps which are shown in the gallery below and at the bottom of this post, too.
Obviously my analysis is far from perfect. For example, I haven’t considered those social networks which are most used only in one or few countries. Such a task would have made my work too much complicated. Nevertheless, I think that my maps bring out some interesting facts: the growth of Facebook and its differential penetration among the countries, the widespread collapse of MySpace, the popularity of Twitter compared to Facebook in Japan.
Indeed, here my purpose is to explain how Google resources allow to deal the case of social network statistics with little efforts and great flexibility, and more generally how they are ideal to carry out some mechanical tasks in data analysis work, in particular online data collection and chart generation. In other words, I think that my work has a general interest: you can simply consider this as a case study to take cue from and apply to your interest area.
In the step-by-step description of my work, I refer to this Google Spreadspeet which contains all the necessary data and calculations. The key elements are represented by the function =IMPORTXML()
, which allow to retrieve specific portions of html code for a given page, the Google Apps Script, used to process a great amount of data, and the Google Image Charts API which generates the maps. Here are the main operations carried out to build the spreadsheet.
- Traffic data mining from Google Trends. Fisrt it’s necessary to understand how Google Trends for WebSite works. Each query for a given domain and a given country generates a specific url and returns a page containing a chart dynamically created by Google Image Charts API. For example, the query for wikipedia.org in the United States load the page http://trends.google.com/websites?q=wikipedia.org&geo=US Its html code contains one
tag whose
src
parameter http://chart.googleapis.com/chart?cht=lc&chd=e:2Y2Y2Y2Y2Y2Y2Y2Y,B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6B6,fvYveWgFgJgrfNf6fvfhdUb5cRcXdkdjftfQgcgpfyc6aOcMgTg7iofyeBfzhOe6cNbCaebebVczfCeaimeFaCcjZHdffnf1fXelc4aeaFYBXuVJT5Ol,&chds=0.0,2.19083205E7&chs=580×188&chco=ffffff00,ffffff00,4684eeff,4684eeff&chls=1.0,1.0,0.0%7C1.0,1.0,0.0%7C1.75,1.0,0.0%7C1.5,3.0,3.0&chxt=x&chxr=0,0.0,100.0&chxl=0:%7C%7CJan+2009%7C%7C%7CApr+2009%7C%7C%7CJul+2009%7C%7C%7COct+2009%7C%7C%7CJan+2010%7C%7C%7CApr+2010%7C%7C%7CJul+2010%7C%7C%7COct+2010%7C%7C%7CJan+2011%7C%7C%7CApr+2011%7C%7C%7CJul+2011%7C&chxs=0,443322ff,9.0,0.0&chm=v,443322ff,1,-1,1%7Ct+Daily+Unique+Visitors,676767ff,0,0,10,1%7Ct+Google+Trends,676767ff,0,6,10,1&chg=9.375,33.33,1.0,1.0,3.125 displays the dynamic version of the following chart:The most recent number of daily unique visitors can be derived from this last url according to data formats syntax. The sheetsingle_data
reports the details of such tasks. I consider the three most known social networks, facebook.com, twitter.com and myspace.com: for each of these domains, one=IMPORTXML()
instance retrieves the chart source url, and then its string is decoded to obtain the corresponding number of unique daily visitors. - Getting some other propaedeutic data. Another instance of
=IMPORTXML()
is used in sheetpop_data
to return (a) the list of all the world countries from the region drop-dopwn list displayed in any page on trends.google.com/websites, while a simple copy+paste operation from page http://www.internetworldstats.com/list2.htm to sheetstates_data
provides (b) some basic demographic statistics (population and internet users amount) for all the countries in the world. I thank Enrique De Argaez, Miniwatts Marketing Group CEO, who has grant me the use of such data I use later to derive some simple indexes. Then information (a) and (b) is used to set up the table in sheetsummary_data
designed to gather all the necessary data. Countries entitle rows while number of visitors/people and relative indexes entitle columns. - Batch data collection. Google set a limit of 25 (or 50, as somewhere reported)
=IMPORTXML()
instances per spreadsheet, while the numbers to retrieve are hundreds. So, it is not possible to fill the table in sheet summary_data all at once. One could complete one row at a time, copying cells fromsingle_data
sheet and pasting values intosummary_data
sheet. Fortunately, Google App Script allows to automate such a task. I have written a script of less than a dozen code lines (seeTools > Script Editor
) which is very general and therefore can be recycled without any change to many oher situations because it assumes only the creation of two named ranges:rows_to_process
is the range containing the row headings in the summary table, andprocessed_row
is the range containing the values to fill each row in summary table. Looping over each row ofrows_to_process
, it copies its values to the first cells ofrow_processed
and then pastes back the full set of its updated values. - Chart generation. In the last columns of the table in summary_data sheet, I have combined the data so far collected to build as many normalized indexes ranging from 0 to 1. Then, just at the bottom of each one of these columns, =ARRAYFORMULA() mode is used to concatenate and process all the valid values according to Google Image Chart API sintax into a string which returns the url to generate the relative thematic map. Since image width cannot be greater than 500 px, final world maps are composed by two contiguos maps: one drawing the western hemisphere and the other drawing the eastern hemisphere.
-
Facebook penetration
(ratio between numbers of unique visitors and population size)lowest rate ◼, highest rate ◼
-
Facebook penetration
(ratio between numbers of unique visitors and number of internet users)lowest rate ◼, highest rate ◼
-
Relative variation of Facebook visitors in the last year
(ratio between number of visitors at June 30, 2010 and June 30, 2011)highest drop ◼, none variation ◼, highest growth ◼
-
Relative variation of MySpace visitors in the last year
(ratio between number of visitors at June 30, 2010 and June 30, 2011)highest drop ◼, none variation ◼), highest growth ◼
-
Ratio between Facebook and Twitter visitors
highest ratio in favour of Twitter ◼ 1:1 ratio ◼, highest ratio in favour of Facebook ◼
This work is released underthe CC BY-NC-SA license.
Pingback: sei-uno-zero-nove » Blog Archive » creazione automatica di mappe tematiche sui social network
Pingback: sei-uno-zero-nove » Blog Archive » use google spreadsheet as a proxy
Good stuff!
However, when I access your spreadsheet and login with a google-account I cannot edit and play with the data. Making it editable (without save ;-) would be a great option.
Why don’t you simply make your own copy and then apply all the edits you want?
Open the spreadsheet, then do File > Make a copy.
Uuups. The only excuse for missing this is a general reluctance to copy paste operations in Germany due to some of last year’s political events. ;-)
Txs!
Pingback: use google spreadsheet as a proxy | FreeShareHere
Pingback: campione non solo d’inverno, forse | sei-uno-zero-nove