MAFF (a shit-show, unsustained)
Firefox used to have an in-house format called MAFF (Mozilla Archive File Format), which boiled down to a ZIP file containing the HTML and a tree of media. I saved several web pages that way. It worked well. Then Mozilla dropped the ball and completely abandoned their own format. WTF. They did not even give people a MAFF→MHTML conversion tool. They just abandoned people, failing to grasp the meaning and purpose of archival. (At least the ZIP underpinnings mean old MAFF files can still be pried open by hand; see the sketch after this list.) Firefox today has no replacement. No MHTML. The choices are:
- HTML only
- HTML complete (but not as a single file; it's an HTML file plus a tree of support files)
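The silver lining is that MAFF is just ZIP underneath, so old saves are not lost. A minimal recovery sketch, assuming a file named saved-page.maff (each saved tab should appear as its own subdirectory containing index.html, the media files, and an index.rdf metadata file, though the exact layout may vary):
$ unzip -l saved-page.maff
$ unzip saved-page.maff -d saved-page
$ firefox saved-page/*/index.html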
MHTML (a shit-show due to non-portable browser dependency)
Chromium-based browsers can save a whole web page to a single MHTML file. Seems like a good move, but then if you open an MHTML file in Firefox, you just get a raw text dump of the contents: something resembling a fake email header, followed by MIME parts encoded as base64 or similar. So that's a shit-show too.
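For the curious, that dump is just MHTML's MIME container laid bare. The top of a Chromium-saved MHTML file looks roughly like this (the boundary string and exact headers here are illustrative, not verbatim):

MIME-Version: 1.0
Content-Type: multipart/related; type="text/html"; boundary="----MultipartBoundary--1234"

------MultipartBoundary--1234
Content-Type: text/html
Content-Transfer-Encoding: quoted-printable
Content-Location: https://example.com/

...the page's HTML...

------MultipartBoundary--1234
Content-Type: image/png
Content-Transfer-Encoding: base64
Content-Location: https://example.com/logo.png

iVBORw0KGgo...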
Exceptionally portable approach: a Firefox plugin adds a right-click option called “Save Page WE”. That extension produces an MHTML file that both Chromium and Firefox can open.
PDF (lossy)
Saving or printing a web page to PDF mostly guarantees that the content and its representation can be reasonably reproduced well into the future. The problem is that PDF inherently forces the content to be laid out at a fixed width matching a physical paper geometry (A4, US letter, etc.). So you lose some data: the information about how to re-render it on different devices with different widths. You might save at A4 geometry and later need to print on US letter paper, which comes out a bit sloppy and messy.
PDF+MHTML hybrid
First use Firefox with the “Save Page WE” plugin to produce an MHTML file. But relying on this alone is foolish considering how unstable the HTML specs still are in 2024, with a duopoly of browser makers doing whatever the fuck they want and abusing their power. So you should also print the web page to a PDF file. The PDF ensures you have a reliable way to reproduce the content in the future. Then embed the MHTML file in the PDF (PDF is a container format). Use this command:
$ pdfattach webpage.pdf webpage.mhtml webpage_with_HTML.pdf
The PDF will just work the way you expect a PDF to, but you also have the option of extracting the MHTML file if the need arises to re-render the content on a different device.
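Extraction is done with pdfdetach, from the same poppler-utils package that provides pdfattach:
$ pdfdetach -list webpage_with_HTML.pdf
$ pdfdetach -saveall webpage_with_HTML.pdf
The -saveall option dumps every attachment into the current directory; -save 1 would grab just the first one.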
The downside is duplication. Every image has one copy stored in the MHTML file and another copy stored separately in the PDF next to it. So it's shitty from a storage-space standpoint. The other downside is plugin dependency. Mozilla has proven that browser extensions are unsustainable: they kicked some of them out of their protectionist official repository and made it painful for the exiled projects to reach their users. There is also the mere fact that plugins are less likely to be maintained than a browser's built-in functions.
We need to evolve
What we need is a way to save the web page as a sprawled-out tree of files the way Firefox does, then a way to stuff that whole tree into a PDF, while also producing PDF vector graphics that reference those same embedded images. I think it's theoretically possible, but no such tool exists. PDF has no concept of directories AFAIK, so the HTML tree would likely have to be flattened before stuffing it into the PDF.
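You can crudely approximate half of this today with stock tools: since ZIP preserves the directory tree and pdfattach takes any single file, stuff the whole tree into one ZIP and attach that. A sketch, assuming Firefox's “Web Page, complete” produced webpage.html plus a webpage_files/ directory, and you printed webpage.pdf separately:
$ zip -r webpage_tree.zip webpage.html webpage_files/
$ pdfattach webpage.pdf webpage_tree.zip webpage_with_tree.pdf
What this does not get you is the second half of the idea: PDF page content that references the attached images directly, so that nothing is stored twice.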
Other approaches I have overlooked? I'm not up on all the ereader formats, but I think they are made for variable widths. So saving a web page to an ereader format of some kind might be more sensible, if possible.
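EPUB, for instance, is reflowable (and, fittingly, also a ZIP container underneath). One possible route, assuming pandoc is installed and the page is already saved locally as webpage.html; complex pages will likely lose fidelity:
$ pandoc webpage.html --metadata title="some saved page" -o webpage.epub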
IIUC you are referring to this extension, which is Firefox-only (like Save Page WE is).
Indeed, the beauty of ZIP is stability. But the contents are not stable. HTML changes so rapidly that I bet the contents of an old MAFF file would not have stood the test of time well. That's why I like the PDF wrapper. Nonetheless, this WebScrapBook could stand in place of the MHTML from the Save Page WE extension. In fact, Save Page WE usually fails to save all objects for some reason, so WebScrapBook is probably more complete.
(edit) Apparently WebScrapBook gives a choice between htz and maff. I like that it timestamps the content, which is a good idea for archived docs.
(edit2) Do you know what happens with JavaScript? I think JS can be quite disruptive to archival. If WebScrapBook saves the JS, it's saving an app, in effect, and that language changes. The JS may also depend on being able to access the web, which makes a shit-show of archival because obviously you must be online and all the same external URLs must still be reachable. OTOH, saving the JS is probably desirable when doing the hybrid PDF save, because the PDF version would contain the result, not the JS. Yet the JS could still be useful to have a copy of.
It saves the rendered page. It also has a built-in rough DOM editor so that you can edit the document before saving. The way I have it set up is to remove all JavaScript from pages.
In principle the ideal archive would contain the JavaScript for forensic (and similar) use cases, as there is both a document (HTML) and an app (JS) involved. But then we would want the choice of whether to run the app (or at least inspect it), while also having the option to faithfully restore the original rendering offline. You seem to imply that saving JS is an option. I wonder: if you choose to save the JS, does it then save the stock skeleton of the HTML, or the rendered result?