Generate Statistics about Listserv Mailing Lists
How to visualise the dynamics of number of subscribers and postings per year and mailing list? At AEGEE this is done at https://mail.aegee.org/stats.html .
Statistics of mailing lists can show trends, e.g. if people write less, or if changes in the subscription procedure impacts the number of subscribers. In AEGEE when members apply for events, they have a checkbox whether they want to be subscribed to the major mailing lists. When this checkbox was removed, the number of new subscribers decreased.
https://mail.aegee.org/stats.html demonstrates visually the statistics and puts the results in machine readable csv format. Below is explained how to create such statistics for your Listserv instance. The data is extracted from the .CHANGELOG files.
What needs to be adjusted:
- The path where the output should be: lines 2-4, variables F, S, P - set something which is served by your web server.
- The path to Listserv in the variable HOME_LISTSERV and possibly all other relative paths.
- The set of lists, for which statistics should be generated: in the MLS variable and in charts.py.
File ml-stats-html #
#!/bin/bash
F=./stats.html.bak
S=./subscribers.csv.bak
P=./postings.csv.bak
HOME_LISTSERV=/home/listserv/
MLS=(AEGEE-L ANNOUNCE-L AEGEE-EVENT-L AEGEE-SU-L AEGEENEWS-L AEGEE-AGORA-L BOARDINF-L)
rm -f $F $P $S
YEAR=$(date +%Y)
if [[ $(($YEAR / 400 * 400)) -eq $i || ( $(($YEAR / 4 * 4 )) -eq $i && $(($YEAR / 100 * 100)) -ne $i ) ]]; then
DAYS=366
else
DAYS=365
fi
declare -A LAST_VALUES
echo "<!DOCTYPE html><html lang='en'><head><title>Statistics about AEGEE Mailing Lists</title><style>td {text-align:right}
tr:nth-child(even){background-color:#f0f0f0}</style></head><body>
<h1>Statistics on AEGEE Mailing Lists</h1>
<h2>Table of Content</h2>
<ul>
<li><a href='#subscribers'>Statistics on Subscriptions and Unsubscriptions</a></li>
<li><a href='#postings'>Statistics on Postings</a></li>
<li><a href='#sub-links'>Subscription Links to the Mailing Lists</a></li>
</ul>
<p>The extr lines extrapolate the data from the current year to the whole year. The current number is multiplied by ${DAYS} and divided by the number of days which passed this year ($(date +%j|sed 's/^0*//')). The lines after extr present data from 1st January until $(date +'%a %d %b %Y %H:%M %Z').</p>
<h2 id='subscribers'>Statistics on Subscriptions and Unsubscriptions</h2>
<p>The SUBSCRIBERS column is the number of subscribers on 1st January. ADDED is the number of new subscribers to a mailing list during the year. REMOVED is the number of removed addresses from a mailing list for any reason.</p>
<table><tr><th></th>" >> $F
for ml in ${MLS[@]}; do
echo "<th colspan='3' style='border-left:solid'>$ml</th>" >> $F
done
echo "</tr><tr><th>Year</th>" >> $F
printf "Year" >> $S
for ml in ${MLS[@]}; do
echo "<th style='border-left:solid;hyphens:manual'>Sub­scribers</th><th>Added</th><th style='hyphens:manual'>Re­moved</th>" >> $F
printf ",$ml Subscribers,$ml Added,$ml Removed" >> $S
done
echo "</tr>" >> $F
echo >> $S
i=0
for year in extr $(seq $YEAR -1 2002); do
echo "<tr><td>$year</td>" >> $F
printf $year >> $S
for ml in ${MLS[@]}; do
l=${ml,,}
if [ $year = extr ]; then
ADD=`grep -ahE "^$YEAR.{10} (ADD|SUBSCRIBE) " ${HOME_LISTSERV}home/${l}.changelog*|wc -l`
REMOVE=`grep -ahE "^$YEAR.{10} (DELETE|SIGNOFF|AUTODEL|EXPIRE) " ${HOME_LISTSERV}home/${l}.changelog*|wc -l`
SUBS=`grep -ahE "^$YEAR.{10} SUBCOUNT " ${HOME_LISTSERV}home/${l}.changelog*`
SUBS=${SUBS:24}
IMPORT=`grep -ahE "^$YEAR.{10} IMPORT " ${HOME_LISTSERV}home/${l}.changelog*|sed 's/.* IMPORT \(.*\) .* .*/\1/'`
if [ "$IMPORT" != '' ] ; then
for imp in $IMPORT; do
ADD=$((ADD + imp))
done
fi
ADD=$(echo "scale=3; $ADD * $DAYS / $(date +%j)" |bc|awk '{printf("%d\n",$1 + 0.5)}')
REMOVE=$(echo "scale=3; $REMOVE * $DAYS / $(date +%j)" |bc|awk '{printf("%d\n",$1 + 0.5)}')
SUBS=$((SUBS + ADD - REMOVE))
else
ADD=`grep -ahE "^$year.{10} (ADD|SUBSCRIBE) " ${HOME_LISTSERV}home/${l}.changelog*|wc -l`
REMOVE=`grep -ahE "^$year.{10} (DELETE|SIGNOFF|AUTODEL|EXPIRE) " ${HOME_LISTSERV}home/${l}.changelog*|wc -l`
SUBS=`grep -ahE "^$year.{10} SUBCOUNT " ${HOME_LISTSERV}home/${l}.changelog*`
IMPORT=`grep -ahE "^$year.{10} IMPORT " ${HOME_LISTSERV}home/${l}.changelog*|sed 's/.* IMPORT \(.*\) .* .*/\1/'`
if [ "$IMPORT" != '' ] ; then
for imp in $IMPORT; do
ADD=$((ADD + imp))
done
fi
if [ "${SUBS}" != '' ] ; then
SUBS=${SUBS:24}
X=$(( ${LAST_VALUES[$ml]} - ADD + REMOVE))
if [ "${LAST_VALUES[$ml]}" != "" -a "$X" != "$SUBS" -a $year != extr -a $year != $YEAR ]; then
SUBS="=$SUBS"
fi
else
if [ "$ADD" = "0" -a "$REMOVE" = "0" ]; then
SUBS=""
ADD=""
REMOVE=""
else
SUBS=${LAST_VALUES[$ml]}
SUBS=$((SUBS - ADD + REMOVE))
fi
fi
fi
if [ "$SUBS" != "" ]; then
if [ ${SUBS:0:1} = '=' ]; then
LAST_VALUES[$ml]=${SUBS:1}
else
LAST_VALUES[$ml]=$SUBS
fi
fi
echo "<td style='border-left:solid'>$SUBS</td><td>$ADD</td><td>$REMOVE</td>" >> $F
printf ,$SUBS,$ADD,$REMOVE >> $S
i=$((i+1))
if [ $(( $i % ${#MLS[@]} )) = 0 ]; then
echo "</tr>" >> $F
echo >> $S
fi
done
done
echo "</table>
<p>The cells with = show the real logged number of subscribers, when it is different from the calculated number. Same data in CSV machine readable format: <a href='./subscribers.csv'>./subscribers.csv</a>.</p>
<div><img src='./subscribers.png'><br><em>Chart on the subscribers per year on 1st January by Michele Marchi</em></div>
<p>Missing triples means there is no data. Lowest SUBSCRIBERS value in each column is for part of a year, not necessary from 1st January.</p>
<h2 id='postings'>Statistics on Postings to AEGEE Mailing Lists</h2>
<p>The following table presents the amount of postings (new emails) distributed over AEGEE-wide mailing lists on https://lists.aegee.org.</p>
<p>The SUM column is the total of all postings from the previous columns for the year, not counting ALL-OF-BEST-L.</p>
" >> $F
MLS=(AEGEE-L ANNOUNCE-L AEGEE-EVENT-L AEGEE-SU-L AEGEENEWS-L AEGEE-AGORA-L AEGEE-ALL-L BOARDINF-L)
printf "<table><tr><th>YEAR</th>" >> $F
printf Year, >> $P
for ml in ${MLS[@]}; do
printf "<th>%s</th>" ${ml} >> $F
printf $ml, >> $P
done
printf "<th>SUM</th><th>ALL-OF-BEST-L</th></tr>\n" >> $F
printf "SUM,ALL—OF—BEST-L\n" >> $P
for year in extr $(seq $YEAR -1 2002); do
printf "<tr><td>%4s</td>" $year >> $F
printf $year, >> $P
SUMP=0
BEST=`grep -ahE "^$year.{10} POST " ${HOME_LISTSERV}home/all-of-best-l.changelog*|wc -l`
if [ $year = extr ] ; then
BEST=`grep -ahE "^${YEAR}.{10} POST " ${HOME_LISTSERV}home/all-of-best-l.changelog*|wc -l`
BEST=$(echo "scale=3; $BEST * $DAYS / $(date +%j)" |bc|awk '{printf("%d\n",$1 + 0.5)}')
else
BEST=`grep -ahE "^$year.{10} POST " ${HOME_LISTSERV}home/all-of-best-l.changelog*|wc -l`
fi
for ml in ${MLS[@]}; do
l=${ml,,}
j=${l:0:-2}
if [ $year = extr ] ; then
POST=`grep -ahE "^${YEAR}.{10} POST " ${HOME_LISTSERV}home/${l}.changelog*|wc -l`
POST=$(echo "scale=3; $POST * $DAYS / $(date +%j)" |bc|awk '{printf("%d\n",$1 + 0.5)}')
else
POST=`grep -ahE "^$year.{10} POST " ${HOME_LISTSERV}home/${l}.changelog*|wc -l`
fi
SUMP=$((SUMP+POST))
printf "<td>%s</td>" $POST >> $F
printf $POST, >> $P
done
printf "<td>%5s</td><td>%4s</td>\n" $SUMP $BEST >> $F
printf "$SUMP,$BEST\n" >> $P
done
echo "</table><p>Same data in CSV machine readable format: <a href='./postings.csv'>./postings.csv</a>.</p><h3 id='sub-links'>Subscription Links to the Mailing Lists</h3><ul>" >> $F
for ml in ${MLS[@]}; do
printf "<li><a href='/cgi-bin/wa?SUBED1=%s&A=1'>/cgi-bin/wa?SUBED1=%s&A=1</a></li>" ${ml} ${ml}>> $F
done
echo "<hr/></html>" >> $F
for i in $F $P $S; do
mv $i ${i::-4}
gzip -fk9 ${i::-4}
done
./charts.py
File charts.py #
#!/usr/bin/python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
df = pd.read_csv("./subscribers.csv")
###################
# converts a Series into numeric after extracting the numbers from each element
def clean_cell(column_series):
# for some reason it considers "=2901" as float instead of string
if type(column_series[1]) == str or type(column_series[1]) == float:
column_series = column_series.str.extract(r'-?(\d+)')
column_series = pd.to_numeric(column_series.iloc[:, 0])
return column_series
###################
df = df.filter(regex='Subscribers|Year')
df = df[df["Year"] != "extr"]
df["Year"] = pd.to_numeric(df["Year"])
df = df.apply(lambda x: clean_cell(x) if ("Subscribers" in x.name) else x)
plt.plot(df["Year"], df["AEGEE-L Subscribers"], label = "AEGEE-L")
plt.plot(df["Year"], df["ANNOUNCE-L Subscribers"], label = "ANNOUNCE-L")
plt.plot(df["Year"], df["AEGEE-EVENT-L Subscribers"], label = "AEGEE-EVENT-L")
plt.plot(df["Year"], df["AEGEE-SU-L Subscribers"], label = "AEGEE-SU-L")
plt.plot(df["Year"], df["AEGEENEWS-L Subscribers"], label = "AEGEENEWS-L")
plt.plot(df["Year"], df["AEGEE-AGORA-L Subscribers"], label = "AEGEE-AGORA-L")
plt.plot(df["Year"], df["BOARDINF-L Subscribers"], label = "BOARDINF-L")
plt.xticks(np.arange(2002, max(df["Year"])+3, 3))
plt.legend()
plt.savefig('./subscribers.png')