[Evolution] feature=remove_double_mail(); exist(feature)?where:feature.new(); banner(' :-)');

Dan Stromberg strombrg@dcs.nac.uci.edu
Tue, 01 Jun 2004 16:56:17 -0700


On Mon, 2004-05-03 at 06:55, guenther wrote:
> > > If you really are using third party tools like fetchmail, you should not
> > > be afraid of using third party tools for eliminating dupes, in this case
> > > formail... ;)
> > >
> > > $ cat mbox | formail -D 8192 .msgid -s >> dedupe.mbox
> > > $ mv -f dedupe.mbox mbox
> >
> > Anything like that for email stored in Maildirs (specifically, Courier
> > IMAP)?
>
> Oops, yes, should have mentioned the above is for mbox files only -- as
> one could guess by the file names... ;)
>
> No maildir solution OTOH unfortunately, but 'man formail' will be
> informative. As formail will store the cache on disk after it is done
> processing the current job, filtering on unique Message-IDs should be
> possible with single mail files in maildir format as well.
>
> Maybe someone else already has done this?
>=20
> ...guenther

You can do this relatively easily with "classify" (available somewhere on
the net), or with my sequivs program (available on request).  Both divide
files up into equivalence classes (groups of identical or similar files).
classify is more flexible, but slower than sequivs, and it may be tough
to google for "classify".  Both are O(n^2), unfortunately, but I came up
with a heuristic that speeds things up a lot anyway.  I keep thinking I
should make it O(n log n) someday, but I haven't found a real need for it.
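
For what it's worth, if exact byte-for-byte duplicates are all you care
about, grouping by a content hash avoids the pairwise comparisons
entirely.  Below is a minimal Python sketch of that idea.  It is not how
sequivs or classify work, the function names are made up for
illustration, and copies that differ in any header will not be caught:

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def equivalence_classes(root):
    """Group regular files under root by the SHA-256 of their contents."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only classes with more than one member, i.e. actual duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]


def redundant_copies(classes):
    """Yield every file except the first one in each class."""
    for paths in classes:
        yield from paths[1:]
```

Printing the result of redundant_copies() gives a list of candidates for
removal; the first file of each class is kept.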

Usage with sequivs would be a bit like:

find ~/Maildir/.folder -type f -print | sequivs | sed -e 's/^[^ ]* //' \
    -e '/^$/d' > /tmp/.folder.dups
xargs rm < /tmp/.folder.dups

Or you could combine it into one step if you trust my off-the-cuff
scripting too much.  :)  I'd really suggest going over .folder.dups
first to make sure those really are all duplicated files.
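
In the same review-first spirit, a dry run keyed on Message-ID (the
criterion guenther suggested for formail) could be sketched with
Python's standard mailbox module.  This is only an illustration under
the assumption that the folder is a plain Maildir directory; the
function name is hypothetical, and nothing is deleted:

```python
import mailbox
from collections import defaultdict


def duplicate_message_ids(maildir_path):
    """Map each repeated Message-ID to the keys of the messages carrying it."""
    box = mailbox.Maildir(maildir_path, create=False)
    seen = defaultdict(list)
    for key in sorted(box.keys()):
        msg_id = box[key].get("Message-ID")
        if msg_id:  # skip messages without the header rather than guess
            seen[msg_id.strip()].append(key)
    # Only Message-IDs seen more than once are reported.
    return {mid: keys for mid, keys in seen.items() if len(keys) > 1}
```

Each returned key identifies one message in the folder, so the output
can be inspected by hand before anything is removed.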

If it's a big folder, you could be waiting a while.  However, I did it
recently overnight (probably less than overnight, but I don't know by
how much) with 40k+ messages, so it's not that bad.

If I get enough requests, I'll add sequivs to
http://dcs.nac.uci.edu/~strombrg/software/index.html.  equivs is already
there, but its usage is slightly different, and it won't handle
collections of potentially duplicated files as large as sequivs can.

-- 
Dan Stromberg DCS/NACS/UCI <strombrg@dcs.nac.uci.edu>

