Zipf’s law, Kleiber’s law, and finding interesting patterns in browsing history

In the chapter The Long Tail of the Law, in Alex Bellos’ book The Grapes of Math, he shows some different statistical patterns that share the same properties:

When some researchers counted out word frequencies in the book Ulysses in 1940, they found an interesting pattern:

WordRankFrequency
I102653
say100265
bag100026
orangefiery100002

As you can see, 10 times farther down the list has 10 times fewer usages. Among other things, they found about 50 percent of words are used only once – with 50 percent of words re-used multiple times this at least gives English/language learners a good chance of understanding the gist of a book, while all readers often end up having at least one or two odd words they have to look up in any given book.

Step down on the (Logarithmic) scale

When the above is charted out, it looks like a line if you use a logarithmic scale on both axes.

Another surprising pattern is the #1 metro district, New York has exactly 10x the population as the #10 metropolitan district, Cleveland. Hamilton/Middletown, the #100th ranking district had 1/100 of New York’s population, 0.11 million compared to 12 million.

This is explained in “Zipf’s law”, stating population = k/ranka where a is near 1.

A similar pattern was noticed by a Swiss zoologist, Max Kleiber: “Kleiber’s law” relates to metabolic rate, efficiency of producing energy for larger mammals:

metabolic rate = 70(mass)3/4

This also is linear on a logarithmic axes graph. Physicists and biologists have found that it seems these may share similar mathematics being the optimal combination of large transport mechanism – large roads and large cities going to smaller streets and towns they branch into, and biological arteries branching out into narrower and smaller blood vessels.

As they noted, “hysteria” happens whenever a Google service goes down, but when the author’s website is down no one may notice… In a similar vein, anyone using an alternative social media network or operating system or ham radio repeater app, may be painfully aware of the fact that the 10th most popular one might have 1/10th… or less other users and support when they need to get help. Now after reading this you are probably wondering, does this pattern also arise in web browsing patterns? Well, thankfully we can analyze this, analyze it with a simple Python script right on your computer!

Firefox History Analysis

The first step is to find the history file in your computer: In the ~/.mozilla/firefox directory you will find a randomly named directory ending in “.default”. Copy the “places.sqlite” in this directory into a working directory along with a python script in the same directory:

#!/usr/bin/python3

import sqlite3
con = sqlite3.connect('places.sqlite');
cur = con.cursor()
nth = 0;
for row in cur.execute(' SELECT rev_host, count(*) as count from moz_places group by rev_host order by count desc;'):
    nth += 1
    print( '%s: %s %s times' % (nth, row[0][::-1], row[1]))

Before writing this script I browsed the Sqlite using DB Browser for SQLITE (available in Ubuntu store for free) and found the Url, ids, and “rev_host” which looks like the host domain reversed. A string is reversed back in Python with [::-1] slice notation. Selecting with count and group by is a good way to count rows with the same value in a certain sql column.

In the output of this script you may see something like:

1: .o************ 16005 times
...
10: .w************* 1166 times
...
100: .f************** 98 times
...
1000: .s************ 8 times

It’s interesting to see which domain hostnames you visit more urls at on a regular basis (redacted here) and see that the pattern also generally follows the same as above!

Chrome/Chromium History Analysis

Now let’s say you want to check Chrome/Chromium. The History sqlite is in ~/.config/chromium/Default/History and you should copy it somewhere and rename it “History.sqlite” so that the Sqlite browser can open it.

The urls don’t have a domain listed here so we can use the urllib to parse it. For example:

>>> from urllib.parse import urlparse
>>> urlparse('https://stackoverflow.com/questions/3150275/jquery-input-select-all-on-focus').netloc
'stackoverflow.com'

So to count these by most viewed domains you can count these from the urls table:

#!/usr/bin/python3
from urllib.parse import urlparse
import sqlite3
con = sqlite3.connect('History.sqlite');
cur = con.cursor()
nth = 0;
domaincounts = {}
for row in cur.execute('SELECT url from urls;'):
    url = row[0]
    domain = urlparse(url).netloc
    if domain in domaincounts:
        domaincounts[domain] += 1
    else:
        domaincounts[domain] = 1
print(domaincounts)

Now this counts the domains you have visited – but not in order, just outputting a dictionary. Use the counter to get the top results:

#!/usr/bin/python3
from urllib.parse import urlparse
from collections import Counter
import sqlite3
con = sqlite3.connect('History.sqlite');
cur = con.cursor()
nth = 0;
domaincounts = {}
for row in cur.execute('SELECT url from urls;'):
    url = row[0]
    domain = urlparse(url).netloc
    if domain in domaincounts:
        domaincounts[domain] += 1
    else:
        domaincounts[domain] = 1
counts = Counter(domaincounts)
nth = 0
for c in counts.most_common(10000):
    nth += 1
    print("%s: %s %s times" % (nth, c[0], c[1] ))

Leave a Reply

Your email address will not be published. Required fields are marked *

× 4 = forty