Re: [SLUG] Web Page Storage

From: Eben King (eben1@tampabay.rr.com)
Date: Thu May 11 2006 - 12:59:50 EDT


On Thu, 11 May 2006, Kwan Lowe wrote:

>> So my question is how can one save what is open on a website say as a MS
>> Office .doc or OpenOffice OO file which is searchable, has the same
>> information, and does not take 10 minutes or so per page to do? [I am assuming
>> of course that the website is something like HTML not something like Adobe
>> Acrobat.]
>
> If you're mainly concerned about the text of the document, I'd suggest
> using wget or curl to pull the web page, then process the HTML to create a
> text document using links, i.e.:
>
>   wget URL; links -dump file.html > file.txt

Sometimes text laid out in columns gets mangled. A layout like

Col 1 line 1    Col 2 line 1
Col 1 line 2    Col 2 line 2
Col 1 line 3    Col 2 line 3
Col 1 line 4    Col 2 line 4

gets saved as

Col 1 line 1<tab>Col 2 line 1
Col 1 line 2<tab>Col 2 line 2
Col 1 line 3<tab>Col 2 line 3
Col 1 line 4<tab>Col 2 line 4

or worse as

Col 1 line 1
Col 2 line 1
Col 1 line 2
Col 2 line 2
Col 1 line 3
Col 2 line 3
Col 1 line 4
Col 2 line 4

when logically it should be

Col 1 line 1
Col 1 line 2
Col 1 line 3
Col 1 line 4
Col 2 line 1
Col 2 line 2
Col 2 line 3
Col 2 line 4

Insets are the same way.
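
If the dump does come out in one of the mangled forms above, the reading
order can often be restored afterwards. Here is a rough Python sketch, not
anything links itself provides. The function names are made up for
illustration, and it assumes the column count is known and that every row
was split the same way, which real pages won't always honor.

```python
def deinterleave(lines, columns=2):
    """Regroup row-interleaved lines into column order.

    Assumption: every `columns` consecutive lines make up one visual row,
    as in the "or worse" case above.
    """
    # lines[c::columns] picks every line belonging to column c.
    return [line for c in range(columns) for line in lines[c::columns]]


def split_tabbed(lines, sep="\t"):
    """Turn 'cell<sep>cell' rows into column-order lines.

    Assumption: a literal tab (or `sep`) separates the columns.
    """
    rows = [line.split(sep) for line in lines]
    width = max((len(row) for row in rows), default=0)
    # Walk column by column, skipping rows that are too short.
    return [row[c] for c in range(width) for row in rows if c < len(row)]


if __name__ == "__main__":
    mangled = [
        "Col 1 line 1", "Col 2 line 1",
        "Col 1 line 2", "Col 2 line 2",
    ]
    for line in deinterleave(mangled):
        print(line)
```

Either way you have to eyeball the result, since nothing in the text dump
says how many columns there were.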

Be careful with frames -- make sure you have the URL of the frame you want,
not the URL of the document containing the frame. In Opera, forcing the URL
to appear on e.g. shy popups is easy (^F8), but I haven't found a way in
Firefox.
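
One way around the browser is to pull the frame URLs out of the containing
document yourself. A minimal Python sketch using only the standard library
follows. The class and function names are hypothetical, and it assumes you
have already saved the outer page's HTML and know the URL it came from.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameFinder(HTMLParser):
    """Collect the src of every <frame> and <iframe> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frames = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Resolve relative src values against the outer page's URL.
                self.frames.append(urljoin(self.base_url, src))


def frame_urls(html, base_url):
    """Return absolute URLs of the frames found in `html`."""
    finder = FrameFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.frames


if __name__ == "__main__":
    page = '<frameset><frame src="nav.html"><frame src="main.html"></frameset>'
    for url in frame_urls(page, "http://example.com/docs/index.html"):
        print(url)
```

Each URL it prints can then be fetched and dumped on its own. It won't see
frames that a script creates after the page loads.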

-- 
-eben    ebQenW1@EtaRmpTabYayU.rIr.OcoPm    home.tampabay.rr.com/hactar

Every normal man must be tempted at times to spit upon his hands, hoist the
black flag, and begin slitting throats. -- H.L. Mencken


