I was intrigued to see how the demo GIF in the README was generated: https://github.com/tamnd/kage/blob/01e75b87ecc893bbba7943c63...
Turns out it's using another project by the same author: https://github.com/tamnd/ascii-gif
The script used for the demo is at https://github.com/tamnd/kage/blob/01e75b87ecc893bbba7943c63... and has a comment showing how to run it:
ascii-gif render docs/demo/kage.tape -o docs/static/demo.gif
Looks like it's an opinionated wrapper around https://github.com/charmbracelet/vhs
Have you heard the good news about the terminal savior asciinema -- https://asciinema.org/
It's a cool tool/platform, but very different. Asciinema tries to make the "multimedia" itself better by making it actual text instead of being video/images, while the CLI command above turns actual text into multimedia supported by platforms already. Both are useful, both have their use cases :)
You can also do an animated svg which is way smaller than a gif because it's just text keyframes (https://github.com/vytskalt/pseudoc/blob/main/assets/factori...)
Very cool, never thought of that! "way smaller" is almost an understatement, when it's 50kb :P Neat that it loads in GitHub READMEs as well, which is probably a large reason people use .gif today.
I have a bunch of opinionated/personal-use binaries like this in my $HOME/bin/, like delete-all-npm, clean-rust-cache, download-youtube-playlist, and get-markdown <url>. It feels good, and I don't need to remember any commands. Sometimes my coding agent can figure out how to call some of those tools too ;))
FYI, on other platforms (Windows/macOS), LICEcap, by the author of Winamp and the REAPER DAW, is a fantastic tool for recording the screen into compact GIFs.
One use I'd have for this is company wikis that you want to give folks easy offline access to (maybe the wiki has documentation that's useful at sites that don't have cellular coverage).
Cool!
It would be especially cool to have a version that didn't require the separate serving process - even though it's nifty you can package up a whole site as a single binary.
Maybe a single HTML entrypoint shim with a bit of javascript that could index into an archive (potentially embedded) of the site's content?
Submitting this to Hacker News is the right place! Thanks for your idea. I will consider implementing that :)
Also, in my mind, I already have a script/program to convert HTML to Markdown, so it could actually store everything on disk as a folder of Markdown files, and then commit them to a Git repo.
Not to load you up with too many ideas, but a markdown folder sounds a lot like obsidian, which has a plugin system now.
Epub would also be a great target.
I think the zim flow was perfect for offline use. I know I will be making use of it as soon as I can figure out how to pass chrome the cookies so I can be signed into the site. Didn't see it in the page, but I didn't look closely yet.
Not yet supporting cookies, since I created this tool for shadowing public websites first. I will add options to pass cookies later. It will pass them to the underlying Chrome/Chromium process, so it should not be hard to do.
I would use the shit out of this. I'm a heavy user of Logseq (OG, the md file-based version). Would LOVE to save my favorite web resources this way.
> Maybe a single HTML entrypoint shim with a bit of javascript that could index into an archive (potentially embedded) of the site's content?
So something like SingleFileZ https://github.com/gildas-lormeau/SingleFileZ or Gwtar https://gwern.net/gwtar ?
> kage serve $HOME/data/kage/paulgraham.com
If the result is static why does it need a server? Isn't it possible to make it so that it can simply be opened by the browser? Like:
$ firefox $HOME/data/kage/paulgraham.com
Then the result would be usable on machines without kage installed.
Usually JavaScript is blocked when you load pages that way.
Not all JavaScript, but a lot of APIs are restricted
Since when? You won't be able to make HTTP requests to localhost, as it'd be a different Origin, but I don't think any mainstream browser blocks JS outright when you use file:// to load and view HTML files.
Somewhere around 2019, each document loaded from file:// became its own origin in Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=1500453 (I didn't check when this happened in Chromium)
Related WHATWG discussion: https://github.com/whatwg/html/issues/3099
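A minimal sketch of the origin point above, using only the standard URL API. The helper name `sameOrigin` is made up for illustration, and real browsers apply further file:// rules that this check does not model:

```javascript
// Hypothetical helper: compare the origins of two URLs the way a
// same-origin check would, treating opaque origins as never equal.
function sameOrigin(a, b) {
  const originA = new URL(a).origin;
  const originB = new URL(b).origin;
  // file:// URLs have an opaque origin, which serializes as the string "null".
  // An opaque origin matches nothing, not even another opaque origin.
  if (originA === "null" || originB === "null") return false;
  return originA === originB;
}

console.log(sameOrigin("file:///site/index.html", "file:///site/data.json"));
console.log(sameOrigin("http://localhost:8000/index.html", "http://localhost:8000/data.json"));
```

This is why a page opened from disk can still run inline scripts while requests to its sibling files are treated as cross-origin.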
I thought all the JS was stripped?
I am quite familiar with this and it is factually false
JS modules don't work on file URLs (classic JS does).
You could use python -m http.server instead. I haven't tried it yet, but it should work.
Actually, Kage has two parts: a crawler that crawls pages and converts them to clean HTML by capturing the DOM after rendering in Chrome/Chromium, and a pack/serve component that packages the result as either a ZIM file for Kiwix or an executable file.
You'll likely run into a ton of CORS issues doing that.
I don't think so. No HTTP requests are made from JS, since it's stripped away, and all the other resources are pulled down (and, I assume, their references made relative), so there really shouldn't be any CORS issues at all.
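To illustrate what "references made relative" could mean, here is a rough sketch. It is not Kage's code; `relativeLink` is a hypothetical helper, and it ignores query strings and fragments for brevity:

```javascript
// Hypothetical helper: rewrite a same-site link so it resolves relative
// to the page that contains it. External links are returned unchanged.
function relativeLink(pageUrl, targetUrl) {
  const page = new URL(pageUrl);
  const target = new URL(targetUrl, page);
  if (target.origin !== page.origin) return target.href;
  // Directory segments of the page (drop the leading "" and the file name).
  const from = page.pathname.split("/").slice(1, -1);
  // All segments of the target, including its file name.
  const to = target.pathname.split("/").slice(1);
  // Skip the directories both paths share.
  let i = 0;
  while (i < from.length && i < to.length - 1 && from[i] === to[i]) i++;
  const up = from.slice(i).map(() => "..");
  const down = to.slice(i);
  return [...up, ...down].join("/") || "./";
}

console.log(relativeLink("https://example.com/essays/a.html", "https://example.com/img/logo.png"));
console.log(relativeLink("https://example.com/essays/a.html", "/essays/b.html"));
```

With links rewritten like this, a mirrored folder resolves its own assets wherever it is served from.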
I find SingleFile [0] to be a much more robust version of this.
It strips out all the JavaScript too, but also packs everything into a single HTML file that is easy to transfer. Binary assets (like web fonts and images) are packed as base64 strings.
They also offer a CLI powered by Puppeteer. [1]
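For anyone curious what packing a binary asset as a base64 string looks like, here is a minimal sketch. It is not SingleFile's actual code, and `toDataUri` is a made-up name:

```javascript
// Hypothetical helper: turn raw bytes into a data: URI that can replace
// an <img src> or a CSS url(...) reference inside a single HTML file.
function toDataUri(bytes, mimeType) {
  // btoa expects a "binary string" with one character per byte.
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return `data:${mimeType};base64,${btoa(binary)}`;
}

console.log(toDataUri(new Uint8Array([72, 105]), "text/plain"));
```

Base64 inflates each asset by roughly a third, which is the price of having everything in one file.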
It seems this repo only saves one web page?
What I'm implementing here is mirroring a whole website, with all its subpages, so you can browse it all offline. For example, all essays from paulgraham.com.
Oh, I see. In that case, feature-wise, it is actually a modern alternative to HTTrack.
I think the misunderstanding stems from the browser's "Save As" reference in the description. It is misleading. You use "Save As" to save a single page, not an entire website.
Also, the description lacks a clear explanation of the project's purpose. It would be helpful to include a sentence explaining that the program downloads an entire website, not just a single page.
Singlefile supports scoped recursive crawls too: https://github.com/gildas-lormeau/single-file-cli#:~:text=an...
I highly recommend reading the singlefile source or https://archiveweb.page/ to see how they handle closed shadow DOMs, cross-origin iframes, websockets, media urls, deduping large assets, etc.
> For example, all essays from paulgraham.com
Not the same thing, but I made a clone of pg's website which can be used for exactly that: https://github.com/shawwn/pg
If you want to read all essays, just clone the repo and open any of the .html files. Or any of the .page files which generated them.
[flagged]
Um. Whose website are you on right now?
Love love love SingleFile too. The FF extension works pretty well for a clean save.
That said, Kage looks promising if OP can combine SingleFile's reproduction quality with the HTTrack spidering approach. SPAs are kinda tricky to archive, and I do wonder how well Kage would handle that.
I've seen this option in IE: .mhtml.
For some reason it displays better in IE, but I don't recall seeing this option in Chrome or Firefox recently.
And thanks for the link. Let me implement this single HTML feature, it looks nice to have!
Yeah. An idea on top of that is to bundle an entire website into a single HTML page, with vendored JavaScript to enable client-side routing (all of the original pages' JS is still stripped out).
That way, the page is self-contained as it is, but requires no bundled binary code to serve the site. It is actually safer security-wise.
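The vendored script can be as simple as this:
const site = {
  "path-1": "<!DOCTYPE html><html> ... </html>",
  "path-2": "<!DOCTYPE html><html> ... </html>",
  // More paths
}
// Wire every link whose href matches a stored path to swap in that page's HTML.
function attachListeners() {
  for (const [path, html] of Object.entries(site)) {
    // Quote the attribute value; querySelectorAll also tolerates zero matches.
    for (const link of document.querySelectorAll(`a[href="${path}"]`)) {
      link.onclick = (event) => {
        event.preventDefault() // stay on this document instead of navigating
        // outerHTML cannot be set on the root element, so replace its contents.
        document.documentElement.innerHTML = html
        attachListeners()
      }
    }
  }
}
document.addEventListener("DOMContentLoaded", attachListeners)
What's the difference from File -> Save As in any web browser?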
That's for a single page, this handles the whole site. Also the browser Save As options often work poorly.
Save As works fine for simple websites with static content.
Let's say you have a site that fetches content from a database. If you Save As, then at best you'll get a local copy of an HTML page with JS that loads the content from the same remote database. It might not work (since the local copy has a different origin), or if it does, it requires you to be online, which defeats half of the purpose.
What this project, and SingleFile, both do is save a snapshot of what the rendered page actually looks like at that moment in time. The scripts are stripped out so it runs locally and has no external dependencies.
I've been using httrack (https://www.httrack.com) to download wikis to read on flights, which isn't perfect but better than I'd found previously. I'll try this out, I'd be delighted to have good results. Thanks for the post.
Specifically for wikis, is there a reason you wouldn't use Kiwix? For non "official" releases it's more complicated, but there are some services to generate the ZIM files. The desktop reader app is pretty good in my experience.
https://wiki.openzim.org/wiki/Build_your_ZIM_file
EDIT: https://get.kiwix.org/en/solutions/applications/kiwix-reader...
Kiwix has readers for almost every platform: Android, desktop, iPhone. That's why I made Kage produce a ZIM file.
The executable file is mostly for people who don't have Kiwix installed yet, or just want to run the archive directly.
Thanks, never knew about this and great to hear about it.
This brings back memories. Around twenty years ago, internet was still expensive dial-up, so I used to go to an internet cafe, run HTTrack to download websites and manga, copy everything onto my tiny 128MB USB stick (felt very large at that time), then bring it home and read offline ;))
https://github.com/archiveteam/grab-site or browsertrix may be easier to use for some, it's what was used to save a lot of the data.gov stuff before it got taken down.
This is awesome, we wanted an offline copy of someone's prototype (as built on Lovable, etc) so we could do version control and sharing in an easier format. Wrote our approach here: https://productnow.ai/blogs/extracting-html-from-ai-prototyp...
But will look into this now, see if we can swap some stuff out. We've really liked the idea of an offline mirror, makes a lot of collaboration use cases simpler
This seems like it has the potential to create a lot of load on a site. Are there settings to limit how fast it clones, or to skip images/videos? Is there a way to only get a subset of a website?
Could you help create a new issue for that? I will do it later. It is already 1:00 AM my time, but I am happy that anyone is interested in it. : )
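As a rough idea of what the throttling and scoping asked about above could look like, here is a sketch. None of this is Kage's API. `inScope`, `politeFetchAll`, the extension list, and the `fetchPage` callback are all assumptions for illustration:

```javascript
// Hypothetical scope filter: keep only URLs on the same origin, under the
// root path, and not ending in a skipped media extension.
function inScope(url, root, skipExtensions = [".jpg", ".png", ".gif", ".mp4", ".webm"]) {
  const target = new URL(url, root);
  const base = new URL(root);
  if (target.origin !== base.origin) return false;
  if (!target.pathname.startsWith(base.pathname)) return false;
  const path = target.pathname.toLowerCase();
  return !skipExtensions.some((ext) => path.endsWith(ext));
}

// Hypothetical throttle: fetch pages one at a time, pausing delayMs
// between requests. fetchPage is supplied by the caller.
async function politeFetchAll(urls, delayMs, fetchPage) {
  const results = [];
  for (const [i, url] of urls.entries()) {
    if (i > 0) await new Promise((resolve) => setTimeout(resolve, delayMs));
    results.push(await fetchPage(url));
  }
  return results;
}

console.log(inScope("https://example.com/docs/a.html", "https://example.com/docs/"));
console.log(inScope("https://example.com/docs/pic.PNG", "https://example.com/docs/"));
```

Fetching sequentially with a fixed pause is the bluntest possible throttle, but it keeps the load on the target site predictable.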
Just pretend you're an AI crawler, problem solved.
I've accumulated a bunch of old website archives over the years. The funny thing is the ugly HTML dumps have been more useful than the "perfect" archive.
It's one of the reasons I've become a bigger fan of RSS over time. A feed from 10-ish years ago is often more usable today than a carefully preserved (application) website.
I have a project for creating and archiving RSS feeds, keeping the full history from the time the crawler starts. I need to clean up a bit, then will open source it soon.