Generate Statistics about Listserv Mailing Lists

Generate Statistics about Listserv Mailing Lists

How to visualise the dynamics of number of subscribers and postings per year and mailing list? At AEGEE this is done at https://mail.aegee.org/stats.html .

Statistics of mailing lists can show trends, e.g. if people write less, or if changes in the subscription procedure impacts the number of subscribers. In AEGEE when members apply for events, they have a checkbox whether they want to be subscribed to the major mailing lists. When this checkbox was removed, the number of new subscribers decreased.

https://mail.aegee.org/stats.html demonstrates visually the statistics and puts the results in machine readable csv format. Below is explained how to create such statistics for your Listserv instance. The data is extracted from the .CHANGELOG files.

What needs to be adjusted:

  • The path where the output should be: lines 2-4, variables F, S, P - set something which is served by your web server.
  • The path to Listserv in the variable HOME_LISTSERV and possibly all other relative paths.
  • The set of lists, for which statistics should be generated: in the MLS variable and in charts.py.

File ml-stats-html #

#!/bin/bash
F=./stats.html.bak
S=./subscribers.csv.bak
P=./postings.csv.bak
HOME_LISTSERV=/home/listserv/
MLS=(AEGEE-L ANNOUNCE-L AEGEE-EVENT-L AEGEE-SU-L AEGEENEWS-L AEGEE-AGORA-L BOARDINF-L)

rm -f $F $P $S

YEAR=$(date +%Y)
if [[ $(($YEAR / 400 * 400)) -eq $i || ( $(($YEAR / 4 * 4 )) -eq $i  &&  $(($YEAR / 100 * 100)) -ne $i ) ]]; then
    DAYS=366
else
    DAYS=365
fi
declare -A LAST_VALUES

echo "<!DOCTYPE html><html lang='en'><head><title>Statistics about AEGEE Mailing Lists</title><style>td {text-align:right}
tr:nth-child(even){background-color:#f0f0f0}</style></head><body>
<h1>Statistics on AEGEE Mailing Lists</h1>
<h2>Table of Content</h2>
<ul>
  <li><a href='#subscribers'>Statistics on Subscriptions and Unsubscriptions</a></li>
  <li><a href='#postings'>Statistics on Postings</a></li>
  <li><a href='#sub-links'>Subscription Links to the Mailing Lists</a></li>
</ul>
<p>The extr lines extrapolate the data from the current year to the whole year.  The current number is multiplied by ${DAYS} and divided by the number of days which passed this year ($(date +%j|sed 's/^0*//')).  The lines after extr present data from 1st January until $(date +'%a %d %b %Y %H:%M %Z').</p>

<h2 id='subscribers'>Statistics on Subscriptions and Unsubscriptions</h2>
<p>The SUBSCRIBERS column is the number of subscribers on 1st January.  ADDED is the number of new subscribers to a mailing list during the year.  REMOVED is the number of removed addresses from a mailing list for any reason.</p>

<table><tr><th></th>" >> $F
for ml in ${MLS[@]}; do
  echo "<th colspan='3' style='border-left:solid'>$ml</th>" >> $F
done
echo "</tr><tr><th>Year</th>" >> $F
printf "Year" >> $S
for ml in ${MLS[@]}; do
    echo "<th style='border-left:solid;hyphens:manual'>Sub&shy;scribers</th><th>Added</th><th style='hyphens:manual'>Re&shy;moved</th>" >> $F
    printf ",$ml Subscribers,$ml Added,$ml Removed" >> $S
done
echo "</tr>" >> $F
echo >> $S

i=0
for year in extr $(seq $YEAR -1 2002); do
  echo "<tr><td>$year</td>" >> $F
  printf $year >> $S
  for ml in ${MLS[@]}; do
     l=${ml,,}
     if [ $year = extr ]; then
       ADD=`grep -ahE "^$YEAR.{10} (ADD|SUBSCRIBE) " ${HOME_LISTSERV}home/${l}.changelog*|wc -l`
       REMOVE=`grep -ahE "^$YEAR.{10} (DELETE|SIGNOFF|AUTODEL|EXPIRE) " ${HOME_LISTSERV}home/${l}.changelog*|wc -l`
       SUBS=`grep -ahE "^$YEAR.{10} SUBCOUNT " ${HOME_LISTSERV}home/${l}.changelog*`
       SUBS=${SUBS:24}
       IMPORT=`grep -ahE "^$YEAR.{10} IMPORT " ${HOME_LISTSERV}home/${l}.changelog*|sed 's/.* IMPORT \(.*\) .* .*/\1/'`
       if [ "$IMPORT" != '' ] ; then
           for imp in $IMPORT; do
             ADD=$((ADD + imp))
           done
       fi
       ADD=$(echo "scale=3; $ADD * $DAYS / $(date +%j)" |bc|awk '{printf("%d\n",$1 + 0.5)}')
       REMOVE=$(echo "scale=3; $REMOVE * $DAYS / $(date +%j)" |bc|awk '{printf("%d\n",$1 + 0.5)}')
       SUBS=$((SUBS + ADD - REMOVE))
     else
       ADD=`grep -ahE "^$year.{10} (ADD|SUBSCRIBE) " ${HOME_LISTSERV}home/${l}.changelog*|wc -l`
       REMOVE=`grep -ahE "^$year.{10} (DELETE|SIGNOFF|AUTODEL|EXPIRE) " ${HOME_LISTSERV}home/${l}.changelog*|wc -l`
       SUBS=`grep -ahE "^$year.{10} SUBCOUNT " ${HOME_LISTSERV}home/${l}.changelog*`
       IMPORT=`grep -ahE "^$year.{10} IMPORT " ${HOME_LISTSERV}home/${l}.changelog*|sed 's/.* IMPORT \(.*\) .* .*/\1/'`
       if [ "$IMPORT" != '' ] ; then
           for imp in $IMPORT; do
             ADD=$((ADD + imp))
           done
       fi
       if [ "${SUBS}" != '' ] ; then
         SUBS=${SUBS:24}
         X=$(( ${LAST_VALUES[$ml]} - ADD + REMOVE))
         if [ "${LAST_VALUES[$ml]}" != "" -a "$X" != "$SUBS" -a $year != extr -a $year != $YEAR ]; then
             SUBS="=$SUBS"
         fi
       else
           if [ "$ADD" = "0" -a "$REMOVE" = "0" ]; then
             SUBS=""
             ADD=""
             REMOVE=""
           else
             SUBS=${LAST_VALUES[$ml]}
             SUBS=$((SUBS - ADD + REMOVE))
           fi
       fi
     fi
     if [ "$SUBS" != "" ]; then
       if [ ${SUBS:0:1} = '=' ]; then
         LAST_VALUES[$ml]=${SUBS:1}
       else
         LAST_VALUES[$ml]=$SUBS
       fi
     fi
     echo "<td style='border-left:solid'>$SUBS</td><td>$ADD</td><td>$REMOVE</td>" >> $F
     printf ,$SUBS,$ADD,$REMOVE >> $S
     i=$((i+1))

    if [ $(( $i % ${#MLS[@]} )) = 0 ]; then
        echo "</tr>" >> $F
        echo >> $S
    fi
  done
done

echo "</table>
<p>The cells with = show the real logged number of subscribers, when it is different from the calculated number. Same data in CSV machine readable format: <a href='./subscribers.csv'>./subscribers.csv</a>.</p>
<div><img src='./subscribers.png'><br><em>Chart on the subscribers per year on 1st January by Michele Marchi</em></div>
<p>Missing triples means there is no data.  Lowest SUBSCRIBERS value in each column is for part of a year, not necessary from 1st January.</p>

<h2 id='postings'>Statistics on Postings to AEGEE Mailing Lists</h2>
<p>The following table presents the amount of postings (new emails) distributed over AEGEE-wide mailing lists on https://lists.aegee.org.</p>

<p>The SUM column is the total of all postings from the previous columns for the year, not counting ALL-OF-BEST-L.</p>
" >> $F
MLS=(AEGEE-L ANNOUNCE-L AEGEE-EVENT-L AEGEE-SU-L AEGEENEWS-L AEGEE-AGORA-L AEGEE-ALL-L BOARDINF-L)
printf "<table><tr><th>YEAR</th>" >> $F
printf Year, >> $P
for ml in ${MLS[@]}; do
  printf "<th>%s</th>" ${ml} >> $F
  printf $ml, >> $P
done
printf "<th>SUM</th><th>ALL-OF-BEST-L</th></tr>\n" >> $F
printf "SUM,ALL—OF—BEST-L\n" >> $P
for year in extr $(seq $YEAR -1 2002); do
  printf "<tr><td>%4s</td>" $year >> $F
  printf $year, >> $P
  SUMP=0
  BEST=`grep -ahE "^$year.{10} POST " ${HOME_LISTSERV}home/all-of-best-l.changelog*|wc -l`

  if [ $year = extr ] ; then
     BEST=`grep -ahE "^${YEAR}.{10} POST " ${HOME_LISTSERV}home/all-of-best-l.changelog*|wc -l`
     BEST=$(echo "scale=3; $BEST * $DAYS / $(date +%j)" |bc|awk '{printf("%d\n",$1 + 0.5)}')
   else
     BEST=`grep -ahE "^$year.{10} POST " ${HOME_LISTSERV}home/all-of-best-l.changelog*|wc -l`
   fi
  for ml in ${MLS[@]}; do
     l=${ml,,}
     j=${l:0:-2}
     if [ $year = extr ] ; then
       POST=`grep -ahE "^${YEAR}.{10} POST " ${HOME_LISTSERV}home/${l}.changelog*|wc -l`
       POST=$(echo "scale=3; $POST * $DAYS / $(date +%j)" |bc|awk '{printf("%d\n",$1 + 0.5)}')
     else
       POST=`grep -ahE "^$year.{10} POST " ${HOME_LISTSERV}home/${l}.changelog*|wc -l`
     fi
     SUMP=$((SUMP+POST))
     printf "<td>%s</td>" $POST >> $F
     printf $POST, >> $P
  done
  printf "<td>%5s</td><td>%4s</td>\n" $SUMP $BEST >> $F
  printf "$SUMP,$BEST\n" >> $P
done

echo "</table><p>Same data in CSV machine readable format: <a href='./postings.csv'>./postings.csv</a>.</p><h3 id='sub-links'>Subscription Links to the Mailing Lists</h3><ul>" >> $F

for ml in ${MLS[@]}; do
  printf "<li><a href='/cgi-bin/wa?SUBED1=%s&A=1'>/cgi-bin/wa?SUBED1=%s&A=1</a></li>" ${ml} ${ml}>> $F
done

echo "<hr/></html>" >> $F
for i in $F $P $S; do
    mv $i ${i::-4}
    gzip -fk9 ${i::-4}
done 
./charts.py

File charts.py #

#!/usr/bin/python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

df = pd.read_csv("./subscribers.csv")

###################

# converts a Series into numeric after extracting the numbers from each element
def clean_cell(column_series):
    # for some reason it considers "=2901" as float instead of string
    if type(column_series[1]) == str or type(column_series[1]) == float:
        column_series = column_series.str.extract(r'-?(\d+)')
        column_series = pd.to_numeric(column_series.iloc[:, 0])
    return column_series

###################

df = df.filter(regex='Subscribers|Year')
df = df[df["Year"] != "extr"]
df["Year"] = pd.to_numeric(df["Year"])

df = df.apply(lambda x: clean_cell(x) if ("Subscribers" in x.name) else x)

plt.plot(df["Year"], df["AEGEE-L Subscribers"], label = "AEGEE-L")
plt.plot(df["Year"], df["ANNOUNCE-L Subscribers"], label = "ANNOUNCE-L")
plt.plot(df["Year"], df["AEGEE-EVENT-L Subscribers"], label = "AEGEE-EVENT-L")
plt.plot(df["Year"], df["AEGEE-SU-L Subscribers"], label = "AEGEE-SU-L")
plt.plot(df["Year"], df["AEGEENEWS-L Subscribers"], label = "AEGEENEWS-L")
plt.plot(df["Year"], df["AEGEE-AGORA-L Subscribers"], label = "AEGEE-AGORA-L")
plt.plot(df["Year"], df["BOARDINF-L Subscribers"], label = "BOARDINF-L")
plt.xticks(np.arange(2002, max(df["Year"])+3, 3))
plt.legend()
plt.savefig('./subscribers.png')