Which language provides the most random alphabetically sorted sequence?
Data
Sourced from comments in the thread (English from image, Dutch from [email protected], German from [email protected], Turkish from some rando, Chinese from [email protected], Lexicographical from [email protected])
Plot with Correlation Scores
We will compute the Pearson correlation (r statistic) between the base number (column 1) and each language column. We will also compute the serial correlation: each language column is paired with a staggered copy of itself, shifted down by one row, which measures how closely each number in a sequence tracks the one before it.
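As a rough cross-check on the scores, the r statistic can be sketched directly in awk. The `pearson` helper below is a hypothetical name and not part of the original analysis; it assumes whitespace-separated numeric columns with no header row, and it skips any row too short to contain both requested columns.

```shell
# pearson FILE X Y
# Print the Pearson correlation between columns X and Y of FILE.
# Hypothetical helper for illustration; assumes numeric data, no header row.
pearson() {
  awk -v x="$2" -v y="$3" '
    # Skip rows that do not reach both requested columns.
    NF < x || NF < y { next }
    {
      n++
      sx  += $x;      sy  += $y
      sxx += $x * $x; syy += $y * $y
      sxy += $x * $y
    }
    END {
      num = n * sxy - sx * sy
      den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
      # A constant column (or no data) has no defined correlation.
      if (den == 0) { print "nan"; exit }
      printf "%+.3f\n", num / den
    }' "$1"
}

# usage: pearson alphabetic.tab.stagger 1 2
```

Called with columns 1 and 2 this gives a base score for the first language column; called with a language column and its staggered copy (for example 2 and 8) it gives the serial score. Small rounding differences from gnuplot's output are possible.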
Staggered Table
cat alphabetic.tab \
| awk '{print $0"\t"prE"\t"prD"\t"prG"\t"prT"\t"prC"\t"prL;prE=$2;prD=$3;prG=$4;prT=$5;prC=$6;prL=$7}' \
| tee alphabetic.tab.stagger
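To see what the staggering does, here is a minimal sketch of the same idea on a single column of made-up numbers. The `stagger` function name and the toy values are assumptions for illustration only; the real command above does this for all six language columns at once.

```shell
# stagger: append the previous row's column-2 value as a new last column.
# Hypothetical one-column version of the six-column awk command.
stagger() {
  awk '{ print $0 "\t" prev; prev = $2 }' "$@"
}

# Toy input: base number in column 1, a made-up sequence in column 2.
# Row 1 gets an empty third column (there is no previous row),
# row 2 gets 8, and row 3 gets 5.
printf '1\t8\n2\t5\n3\t4\n' | stagger
```

Because the first row has no predecessor, its staggered field is empty, so the serial correlation is effectively computed over one fewer row than the base correlation.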
Plot Code
gnuplot -p -e '
set xlabel "Base Sequence";
set ylabel "Alphabetic";
set xtics 1,1,12;
set ytics 1,1,12;
set title "Alphabetic Number Plot with Correlation Score";
set rmargin 25; set key at graph 1.5,0.9;
set size ratio 0.45;
stats "alphabetic.tab.stagger" using 1:2 name "E";
stats "" using 1:3 name "D";
stats "" using 1:4 name "G";
stats "" using 1:5 name "T";
stats "" using 1:6 name "C";
stats "" using 1:7 name "L";
stats "" using 2:8 name "ES";
stats "" using 3:9 name "DS";
stats "" using 4:10 name "GS";
stats "" using 5:11 name "TS";
stats "" using 6:12 name "CS";
stats "" using 7:13 name "LS";
set label 1 sprintf("%10s %6s %6s", "", "Base", "Stagger") at graph 1.07,0.95;
plot "" using 1:2 with lines lw 3 title sprintf("%10s %+.3f %+.3f", "English", E_correlation, ES_correlation),
"" using 1:3 with lines lw 3 title sprintf("%10s %+.3f %+.3f", "Dutch", D_correlation, DS_correlation),
"" using 1:4 with lines lw 3 title sprintf("%10s %+.3f %+.3f", "German", G_correlation, GS_correlation),
"" using 1:5 with lines lw 3 title sprintf("%10s %+.3f %+.3f", "Turkish", T_correlation, TS_correlation),
"" using 1:6 with lines lw 3 title sprintf("%10s %+.3f %+.3f", "Chinese", C_correlation, CS_correlation),
"" using 1:7 with lines lw 1 title sprintf("%10s %+.3f %+.3f", "Lexicon", L_correlation, LS_correlation)
'
It looks like Dutch has the lowest (near 0) correlation to both the base sequence and its own staggered sequence, with Turkish somewhat mirroring its staggered randomness.
The least random alphabetic sequences are English and German.
Updated: Added Chinese and staggered analysis.
c/dataisbeautiful
You put a lot of work into this.
Thank you for doing and sharing this
This is the second comment I’ve seen like this from you.
Please never stop.
I didn’t expect someone to put that much effort into it.
Thanks! This is awesome!