SLOC the Web

Published: Saturday, Jul 4, 2020 Last modified: Monday, Oct 26, 2020

Since my video on measuring the SLOC using ohcount, I’ve been wanting to tackle the elephant in the room. The Web!!

You can moan about the Web all day, but if you are not looking at the codebase, then you’re just a consumer and you’re not really helping. Don’t worry, I’ve been guilty of this too, though I’m trying to change.

The gist of SLOC is that more code means more complexity. As The Notorious B.I.G. said: Mo Code, Mo Problems.

There are two thorny issues to be aware of when counting, before we get started:

  1. Tests: I made a testbloat.sh script to rip them out
  2. Dependencies (look at Arch’s PKGBUILD for some clues)
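testbloat.sh itself isn't shown here, so the following is only a rough sketch of the idea and not the actual script. It assumes that "tests" means any directory named test, tests or testing, which is a guess, and the function names are made up for illustration. It counts bytes rather than disk blocks, so the numbers will differ a little from du.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the real testbloat.sh.
# Assumption: a test directory is any directory named test, tests or testing.

# Total bytes of all regular files under $1, tests included.
bytes_with_tests() {
    find "$1" -type f -exec cat {} + | wc -c | tr -d ' '
}

# Same total, but without descending into the assumed test directories.
bytes_without_tests() {
    find "$1" -type d \( -name test -o -name tests -o -name testing \) -prune \
        -o -type f -exec cat {} + | wc -c | tr -d ' '
}

# Usage (path is an example):
#   bytes_with_tests    src/chromium-83.0.4103.116
#   bytes_without_tests src/chromium-83.0.4103.116
```

Real projects also keep tests in files next to the source, so a directory-name rule like this will undercount them.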

Firefox & Chromium actually have few dependencies compared to surf’s Webkit. This suggests that {Firefox,Chromium} pull a lot of dependencies into their source distributions and compile them in. Hence we should see a much bigger SLOC for {Firefox,Chromium}…
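One rough way to get those clues out of a PKGBUILD is to count the entries in its depends and makedepends arrays. This is a sketch under assumptions: count_deps is a made-up helper name, and it sources the PKGBUILD in a separate bash process, which runs whatever code the file contains (much as makepkg does), so only point it at files you trust.

```shell
#!/bin/sh
# Hypothetical helper: print "<depends> <makedepends>" counts for a PKGBUILD.
# PKGBUILDs are bash scripts, so the arrays are read by sourcing the file in bash.
count_deps() {
    bash -c '
        source "$1" >/dev/null 2>&1
        # An unset array simply has zero elements.
        echo "${#depends[@]} ${#makedepends[@]}"
    ' _ "$1"
}

# Usage (paths are examples):
#   count_deps firefox/trunk/PKGBUILD
#   count_deps webkit2gtk/trunk/PKGBUILD
```

A low count does not mean less code; it may just mean the code is bundled in the source tree instead.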

testbloat.sh chromium:

6.2G src with tests
3.7G src without tests

Flamegraph via https://github.com/brendangregg/FlameGraph

Webkit

testbloat.sh webkit2gtk:

227M src with tests
201M src without tests

Gecko

testbloat.sh firefox:

2.7G src with tests
1.2G src without tests

More than half the code base is… tests!

Source distribution

Using Archlinux, I grabbed the {firefox,webkit2gtk,chromium} distributions like so:

asp checkout $1
cd $1/trunk
makepkg -os --skippgpcheck
cd src/$1*

Let’s look at:

I chose cloc over ohcount since it was faster. It ignores files that it doesn’t deem to be source. Tbh, I think every file in a source distribution should be considered source, since pruning non-source files, tests and documentation is a slippery slope!
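To get a feel for what any "is this source?" filter has to decide on, it helps to see which file types a tree contains. The sketch below is my own illustration, not what cloc does internally: ext_histogram is a made-up name, and files without an extension are simply dropped.

```shell
#!/bin/sh
# Hypothetical helper: count the files under $1 per extension, most common first.
ext_histogram() {
    find "$1" -type f |
        # Strip the directory part, then print only the text after the last dot.
        sed -n -e 's|.*/||' -e 's/.*\.\([A-Za-z0-9_+-]\{1,\}\)$/\1/p' |
        sort | uniq -c | sort -rn
}

# Usage (path is an example):
#   ext_histogram src/firefox-78.0.1 | head -n 20
```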

major browser files count

files.gpt source

major browser lines count

lines.gpt source

I was surprised to see that Chromium/Blink has more LOC than Firefox!

Thank you to Stack Overflow for the tips on how to plot the above.

Git checkout

From which the source distribution (x86 target) is somehow 🤷 derived!

So with the git history, the projects weigh in at:

hendry@knuckles ~/sloc $ du -sh *
11G     WebKit
26G     blink
5.6G    gecko

bar chart comparison between webkit, gecko, blink

$Data <<EOD
browser srcurl rev commitcount files lines
WebKit git://git.webkit.org/WebKit-https.git 3a2f99102aca947abcf9f70d0785dc3e5c073560 226748 310036 43304162
blink https://chromium.googlesource.com/chromium/src fa66724154f74bab505fe38c4b6d0d31b5a83ed0 906439 330146 42244206
gecko git@github.com:mozilla/gecko-dev.git 668686ae0504450a8c93501d1eb115f201eb982d 716280 281292 43547855
EOD
set terminal svg
set datafile separator ' '
set title 'git source'
set yrange [1000:*]
set logscale y
set ytics format "%.0s%c"
set style data histogram
set style histogram cluster gap 1
set style fill solid border -1
set boxwidth 0.9
set key left top
plot $Data u 4:xtic(1) ti col,\
     '' u 5 ti col,\
     '' u 6 ti col

It’s an incredible coincidence how the sloc (without .git) is roughly the same between the three git checkouts!

blink/ 42244206
gecko/ 43547855
WebKit/ 43298218
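To put a number on "roughly the same", here is a small sketch that takes "name count" lines like the three above and prints how far the largest count is above the smallest, as a percentage. spread_pct is a made-up name for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "name count" lines on stdin and print
# (max - min) / min as a percentage with one decimal place.
spread_pct() {
    awk 'NR == 1 { min = max = $2 + 0 }
         { n = $2 + 0; if (n < min) min = n; if (n > max) max = n }
         END { if (NR) printf "%.1f\n", (max - min) / min * 100 }'
}

# Usage with the three totals listed above:
#   printf '%s\n' 'blink/ 42244206' 'gecko/ 43547855' 'WebKit/ 43298218' | spread_pct
#   # prints 3.1
```

So the three checkouts are within about 3% of each other by raw line count.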

Data was generated by:

hendry@knuckles ~/sloc $ cat wc.sh
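# Print one row per checkout: name, origin URL, HEAD commit, commit count, files, lines
echo browser srcurl rev commitcount files lines
for i in WebKit blink gecko
do
        commit=$(git --git-dir "./$i/.git" rev-parse HEAD)
        commitcount=$(git --git-dir "./$i/.git" rev-list --count "$commit")
        srcurl=$(git --git-dir "./$i/.git" config --get remote.origin.url)
        # Count every regular file outside .git, and every newline in those files
        files=$(find "$i/" -not -path '*/\.git/*' -type f | wc -l)
        lines=$(find "$i/" -not -path '*/\.git/*' -type f -exec cat {} + | wc -l)
        echo "$i" "$srcurl" "$commit" "$commitcount" "$files" "$lines"
done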

TODO: Check above

Webkit

Gecko

LOC over time

I’m also keen to examine their sloc over time using a tool I wrote https://github.com/kaihendry/graphsloc

Can anyone make collect-stats.sh faster? It took DAYS to gather this data.

2010-2020 blink lines changes from git

Adding up all the line changes across all the commits gives 45M lines of code. That falls considerably short of the 100M reported by find chromium-83.0.4103.116 -type f -exec cat {} + | wc -l, but it’s pretty close to the 49M from cloc’s analysis!
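As a cross-check on that kind of total, lines added and removed can be summed from a single pass of git log --numstat. This is a sketch of the general idea and not a description of collect-stats.sh; sum_numstat is a made-up name. By default git log shows no diff for merge commits, so only non-merge commits contribute.

```shell
#!/bin/sh
# Hypothetical helper: read `git log --numstat --format=` output on stdin
# and print "added removed net" line totals.
sum_numstat() {
    awk -F '\t' '
        # numstat rows are "added<TAB>removed<TAB>path"; binary files show "-".
        NF == 3 && $1 != "-" { added += $1; removed += $2 }
        END { printf "%d %d %d\n", added, removed, added - removed }'
}

# Usage (repository path is an example):
#   git -C blink log --numstat --format= | sum_numstat
```

The net figure counts lines in every text file git tracks, so it is closer in spirit to the find | wc -l numbers than to cloc.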

Gecko

gecko lines changes from git

I’m not quite sure why there are horizontal lines. Do please look at the gecko source CSV and collect-stats.sh.

The total is ~130M, but when I look at just the firefox-78.0.1 source distribution… it’s just 43M, or 23M if you only count source files with cloc. I suspect the git-mercurial bridge is problematic.

Webkit

2010-2020 webkit lines changes from git

40M doesn’t come close to the 5-3M of code I got from the https://webkitgtk.org sources. That’s because a release is cut from the git repository, as documented at https://trac.webkit.org/wiki/WebKitGTK/Releasing

Concluding remarks

Firefox has the cleanest git to source mapping. Firefox has Rust / Python toolchains in the source that bloat it quite a bit. Firefox has the largest amount of sloc in git, but Chromium has more in its source distribution because Blink’s git repo is not a “monorepo” like Firefox’s. For example, their JS runtime (V8?) is not in the Blink git repo.

Webkit is just a kit/library, and whilst it is the smallest code base, there are a lot of dynamically linked dependencies that are difficult to sloc / measure.

Blink’s source, especially when you take gclient et al. into consideration, is expansive and firmly under the grip of Google.

Sidenote: I wanted to follow the Webkit source’s history through Blink, for example git log --follow -- ./third_party/blink/renderer/core/layout/layout_table_cell.h | tig, but it appears firmly forked!