Re: [SLUG] Is there an archive for the list?

From: Ed Centanni (ecentan1@tampabay.rr.com)
Date: Mon Sep 15 2003 - 00:52:51 EDT


Well, so far the project consists of a python script that populates an
SQL database (using sqlite for speed) with file offset pointers into one
or more unix-style mailbox files. A search requires both the database and
the original parsed mailbox file. It's quite specific to email and
couldn't be general purpose by any stretch of the imagination. The
search engine (as such) is just straight SQL queries into a database of
file pointers. Hopefully we'll have some sort of user-friendly front end
"real soon now".
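
To make the offset-pointer idea concrete, here is a minimal sketch. It is
not the actual emine code: the table layout and the function names
index_mbox and fetch_message are assumptions made up for illustration, and
it uses the sqlite3 module from a current Python standard library.

```python
import sqlite3

def index_mbox(mbox_path, db_path=":memory:"):
    """Record the byte offset and length of every message in an mbox file."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY, mailbox TEXT, offset INTEGER, length INTEGER)"
    )
    starts = []
    pos = 0
    with open(mbox_path, "rb") as f:
        for line in f:
            # Simplification: treat every line beginning with "From " as a
            # message separator. A real parser must also handle body lines
            # that happen to start with "From ".
            if line.startswith(b"From "):
                starts.append(pos)
            pos += len(line)
    ends = starts[1:] + [pos]
    rows = [(mbox_path, s, e - s) for s, e in zip(starts, ends)]
    with db:  # single transaction for all inserts
        db.executemany(
            "INSERT INTO messages (mailbox, offset, length) VALUES (?, ?, ?)",
            rows,
        )
    return db

def fetch_message(db, msg_id):
    """Look up the pointer in the database, then seek into the mailbox file."""
    mailbox, offset, length = db.execute(
        "SELECT mailbox, offset, length FROM messages WHERE id = ?", (msg_id,)
    ).fetchone()
    with open(mailbox, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

The database holds only locations and sizes, so fetch_message needs the
original mailbox file to be present and unchanged since it was indexed.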

It should be able to do a search such as: "Find every email message that
contains the phrases 'viagra for sale' and 'chasing three-legged
chickens' in the second attachment, sent from 'acme.com' using a
user-agent containing the words 'mozilla' and 'commodore64' between the
dates of Nov 1, 1999 10:20:44 GMT and Dec 24, 2000 01:22:34 EDT, sent to
'wiley.coyote@wb.com', has 235 lines in the message body whose
content-type is text/html and has a return-path containing
'tampabay.rr.com' and 'feathers'." Don't laugh. I bet you have some odd
stuff in your mailboxes too!
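
A cut-down version of that kind of search might look like the sketch
below. Everything in it is a stand-in: the two tables, their columns, and
the search helper are hypothetical, and header values are stored as plain
text here for brevity, whereas the real index stores offsets into the
mailbox file. Dates are assumed to be normalized to UTC strings so that
they compare correctly as text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE messages (id INTEGER PRIMARY KEY, sent_utc TEXT, body_lines INTEGER);
CREATE TABLE headers (message_id INTEGER, name TEXT, value TEXT);
""")
db.executemany("INSERT INTO messages VALUES (?, ?, ?)", [
    (1, "2000-03-01 12:00:00", 235),   # matches every condition
    (2, "2001-06-01 08:00:00", 235),   # outside the date range
    (3, "2000-05-05 09:30:00", 10),    # wrong number of body lines
])
db.executemany("INSERT INTO headers VALUES (?, ?, ?)", [
    (1, "from", "sales@acme.com"), (1, "to", "wiley.coyote@wb.com"),
    (2, "from", "sales@acme.com"), (2, "to", "wiley.coyote@wb.com"),
    (3, "from", "sales@acme.com"), (3, "to", "wiley.coyote@wb.com"),
])

# One join against the headers table per header being tested.
QUERY = """
SELECT m.id
FROM messages AS m
JOIN headers AS f ON f.message_id = m.id AND f.name = 'from'
JOIN headers AS t ON t.message_id = m.id AND t.name = 'to'
WHERE f.value LIKE '%' || :sender || '%'
  AND t.value = :recipient
  AND m.body_lines = :lines
  AND m.sent_utc BETWEEN :start AND :end
ORDER BY m.id
"""

def search(db, **params):
    """Run the query with named parameters and return matching message ids."""
    return [row[0] for row in db.execute(QUERY, params)]

matches = search(
    db,
    sender="acme.com",
    recipient="wiley.coyote@wb.com",
    lines=235,
    start="1999-11-01 10:20:44",   # Nov 1, 1999 10:20:44 GMT
    end="2000-12-24 05:22:34",     # Dec 24, 2000 01:22:34 EDT, taken as UTC-4
)
print(matches)  # [1]
```

Phrase, attachment, and user-agent conditions would add further joins in
the same style.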

I'm quite familiar with Perl, but the project started with python just
because I wanted to learn python by doing a non-trivial project with it,
and it seems to fit, with the exception of the typical scripting-language
performance hits. I have no doubt it could be ported to Perl but after
working with python, Perl looks like a write-only language to me. (hey,
that sounds perfect for the CIA!)

Just this weekend I finished up the word and word location code and the
program took a huge performance hit. I realized that python (like most
scripting languages) just doesn't have good speed at some types of
processing. I admit to being a C/C++ programmer at heart. Although I
keep trying to learn and use scripting languages, I always seem to
run up against some performance issue that NEVER comes up with a
fully-compiled (as opposed to script pseudo-code compiled) language.
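
For reference, a word-location pass can be sketched as follows. This is
not the emine code; the word_locations table, the regular expression, and
the index_words helper are assumptions for illustration. It leaves the
scanning to the re module and hands all rows for a message to sqlite in
one executemany call inside a single transaction, which is one common way
to keep per-word overhead down. Whether that would help with the slowdown
described above is untested.

```python
import re
import sqlite3

# Assumed definition of a "word": a run of ASCII letters, digits, or apostrophes.
WORD_RE = re.compile(rb"[A-Za-z0-9']+")

def index_words(db, message_id, body, body_offset):
    """Store one (message, word, absolute file offset) row per word in body."""
    rows = [
        (message_id, m.group().lower().decode("ascii"), body_offset + m.start())
        for m in WORD_RE.finditer(body)
    ]
    db.executemany("INSERT INTO word_locations VALUES (?, ?, ?)", rows)
    return len(rows)

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE word_locations (message_id INTEGER, word TEXT, offset INTEGER)"
)
with db:  # commit once for the whole batch rather than once per word
    count = index_words(db, 1, b"Chasing three-legged chickens", 1000)

print(count)  # 4
```

Here body_offset is the position of the message body within the mailbox
file, so each stored offset points straight at the word on disk.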

Anyway, I plan to put it up as a project at sourceforge.net in the very
near future. It's early alpha code from a non-guru python wonk (me!) so
hopefully some python gurus will lend a hand in optimizing it. You're
most welcome to take a look at it and see if there are similarities to
the CIA stuff. ;-}

Ed.

thor_consulting@yahoo.com wrote:

> Ed - sorry i missed the original post
>
> many moons ago i did this type of thing with Perl (Perl was made for this
> type of data-mining).
>
> i wrote a general purpose data-mining tool that works with everything from
> positional and tagged text data to dynamic format data.
>
> our database used proper noun (PN) and key phrase (KP) indices along with
> tf-idf ranking to improve search results.
>
> the search engine was developed at Syracuse U. and it's at the core of IBM's
> patent search system and part of the CIA's repertoire
>
> very cool stuff but i digress
>
> might be overkill but let me know if you want to use Perl to do the mining
>
> thor
>
> mailto:thor_consulting@yahoo.com
> http://www.geocities.com/thor_consulting/
> ----- Original Message -----
> From: "Michael Manchester" <mchester@yahoo.com>
> To: <slug@nks.net>
> Sent: Friday, September 12, 2003 06:25
> Subject: Re: [SLUG] Is there an archive for the list?
>
>
>
>>Ed;
>>Sounds interesting. That might be fun to work on. As
>>long as there are no deadlines. I get enough of those
>>at work :) You can sign me up for the front end. I'm
>>thinking a browser front end. What do you think?
>>Mike M.
>>--- Ed Centanni <ecentan1@tampabay.rr.com> wrote:
>>
>>>I have a personal mail archive of the SLUG list that dates from
>>>11/30/1999 to the present. The file size is 77,130,688 bytes.
>>>Realizing what a great knowledge-base it represents and desiring to make
>>>use of it, I started a small project to create a searchable index
>>>database of ALL aspects of the SLUG list. I call it "emine", short for
>>>"E-mail Data Mining".
>>>
>>>It's a python script that parses a standard unix-style email mailbox
>>>file and builds an SQL database of string lengths and file offset
>>>pointers into the mail file. The database doesn't contain the actual
>>>email, it just stores the location and size of everything in the mailbox
>>>file. The index database is normalized and has 9 related tables that
>>>contain information for headertypes, mimetypes, mailboxes, messages,
>>>headers, words, attachments, word locations, and phrases from 2 to 4 words.
>>>
>>>It's a work in progress. At the moment it can populate all the tables
>>>except words, word locations, and phrases. It shouldn't take me more
>>>than a few evenings to finish that up. It would need a user friendly
>>>frontend to be useful as a search engine but that shouldn't be rocket
>>>science once the database is fully populated. The front end would get
>>>user input, query the database, fseek to the offset(s) in the mailbox
>>>file and output the results.
>>>
>>>If this seems interesting to any of you, I'm willing to put it up as an
>>>open source project for anyone to work on and use. If you know of a
>>>similar project already available I'd like to know.
>>>
>>>Ed.
>>>
>>>
>>>Michael Manchester wrote:
>>>
>>>
>>>>I thought at one time there was an archive of the list
>>>>or at least talk about having an archive of the list.
>>>>Mike M.
>>>>
>>>>=====
>>>>
>>><snip>
>>>
>>>
>>>
>>>-----------------------------------------------------------------------
>>>This list is provided as an unmoderated internet service by Networked
>>>Knowledge Systems (NKS). Views and opinions expressed in messages
>>>posted are those of the author and do not necessarily reflect the
>>>official policy or position of NKS or any of its employees.
>>
>>
>>=====
>>---------------------------------
>>The requirements said
>>"Windows 95/98/NT or better"
>>So I installed Linux
>>---------------------------------
>>
>>
>
>
>
>

-----------------------------------------------------------------------
This list is provided as an unmoderated internet service by Networked
Knowledge Systems (NKS). Views and opinions expressed in messages
posted are those of the author and do not necessarily reflect the
official policy or position of NKS or any of its employees.



This archive was generated by hypermail 2.1.3 : Fri Aug 01 2014 - 20:39:54 EDT