The Squeak History Project

A project for Squeak 5.3 (and above) to explore and learn about the history of Squeak and Squeak-related projects.

How to Install

  1. Get a Squeak image (5.3 or newer).

  2. Load the Squeak History Project:

    Metacello new
    	baseline: 'SqueakHistory';
    	repository: 'github://hpi-swa/squeak-history/packages';
    	load.
    

After the code has been loaded, the mailing list archives are downloaded. See BaselineOfSqueakHistory >> #loadData for details.

How to Use

Use SqhMailmanAggregator to enumerate all mail messages. Example queries can be found in the queries protocol. The message cache keeps metadata for each message in the image as instances of SqhMailWrapper. Reading a message body, in contrast, requires access to the archive's file contents on disk, which is slower. Here is an example query that requires disk access:

countMessageLines
	| count |
	count := 0.
	self messagesCachedDo: [:wrapper |
		count := count + wrapper mailMessage bodyText lineCount].
	^ count

You should also derive rules for author-key normalization to improve the quality of query results. To do so, run:

SqhMailmanAggregator new
	showProgress: true; "optional"
	deriveRulesForAuthorKeyNormalization. "ignore the warning"

For more information on normalization, see below.

Notes on Normalization

There are rules to normalize different kinds of information: author names, timestamps, and mail addresses. The goal is to identify contributors and, eventually, relevant discussions. Hand-selected rules can be found in SqhMailmanAggregator >> #rulesForAuthorKeyNormalization and #rulesForAuthorKeyClarification. Here is an excerpt:

"rulesForAuthorKeyNormalization"
'alankay' -> 'alancurtiskay'.
'hhirzel' -> 'hanneshirzel'.

"rulesForAuthorKeyClarification"
'squeakdev@reider.net' -> ('squeakdev' -> 'alanreider').
'squeak@bike-nomad.com' -> ('squeak' -> 'nedkonz').
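
To illustrate how such normalization rules might be applied to an author key, here is a minimal workspace sketch. It does not use the project's own code: the `rules` dictionary and the `normalize` block are hypothetical names, and following chained rules with a visited set is an assumption about how lookups could work, not a description of SqhMailmanAggregator.

```smalltalk
"Hypothetical sketch: apply normalization rules to an author key.
The names rules and normalize are illustrative, not part of the project."
| rules normalize |
rules := Dictionary new.
rules at: 'alankay' put: 'alancurtiskay'.
rules at: 'hhirzel' put: 'hanneshirzel'.

"Follow rules until no rule matches. The visited set stops the loop
if the rules ever contain a cycle."
normalize := [:key | | current visited |
	current := key.
	visited := Set new.
	[(rules includesKey: current) and: [(visited includes: current) not]]
		whileTrue: [
			visited add: current.
			current := rules at: current].
	current].

normalize value: 'hhirzel'. "answers 'hanneshirzel'"
normalize value: 'someoneelse'. "no rule applies, answers the key unchanged"
```

Keys without a matching rule pass through unchanged, so the block can be applied to every author key.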

Then, there is a simple algorithm to derive more normalization rules using e-mail addresses as identifiers. If two messages came from the same address, the author names on those messages are assumed to refer to the same contributor. There is also a check to avoid mapping cycles. See SqhMailmanAggregator >> #deriveRulesForAuthorKeyNormalization for details. Generic addresses are filtered out, for example:

notifications@github.com
noreply@github.com
no-reply@appveyor.com
builds@travis-ci.org

There is also a hand-crafted list of generic author names (or keys) in #genericAuthorKeys, which includes entries such as github and travisci.
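
The derivation idea described in this section can be sketched in a workspace as follows. This is not the code of #deriveRulesForAuthorKeyNormalization. The sample `messages`, the example.org addresses, the `reaches` block, and the choice of the longest key as the canonical one are all assumptions made for illustration.

```smalltalk
"Hypothetical sketch: derive normalization rules from shared addresses.
Sample data and the longest-key heuristic are assumptions."
| messages genericAddresses keysByAddress derived reaches |
messages := {
	'hannes@example.org' -> 'hhirzel'.
	'hannes@example.org' -> 'hanneshirzel'.
	'noreply@github.com' -> 'github'.
	'noreply@github.com' -> 'someuser'.
	'ned@example.org' -> 'nedkonz'}.
genericAddresses := Set withAll: #('notifications@github.com' 'noreply@github.com' 'no-reply@appveyor.com' 'builds@travis-ci.org').

"Group the distinct author keys by address, skipping generic addresses."
keysByAddress := Dictionary new.
messages do: [:each |
	(genericAddresses includes: each key) ifFalse: [
		(keysByAddress at: each key ifAbsentPut: [Set new]) add: each value]].

derived := Dictionary new.

"Answer whether following the derived rules from start leads to target."
reaches := [:start :target | | current seen found |
	current := start.
	seen := Set new.
	found := false.
	[found not and: [(derived includesKey: current) and: [(seen includes: current) not]]]
		whileTrue: [
			seen add: current.
			current := derived at: current.
			found := current = target].
	found].

"Map every other key of an address to the longest key, unless that
would close a mapping cycle."
keysByAddress do: [:keys | | canonical |
	canonical := keys detectMax: [:key | key size].
	keys do: [:key |
		(key = canonical or: [reaches value: canonical value: key])
			ifFalse: [derived at: key put: canonical]]].

derived at: 'hhirzel'. "answers 'hanneshirzel'"
```

With this sample data, only one rule is derived: the generic address contributes nothing, and an address with a single author key needs no rule.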