off-the-shelf crawling and indexing

28 Jan 2025

i started the last post rambling about making a search engine, and i want to make a start on a little bit of that now.

a crawler to find content on the open web

apache nutch is basically the standard crawler if you don’t write your own. elastic has also started one, simply called “crawler”, that works pretty well and plugs into elasticsearch easily. you still have to “seed” either of them correctly… anyway, neither here nor there.

an indexer for that content

something to query the index

elasticsearch easily handles both of those, and for now i have no reason not to use it. tantivy is a lucene-inspired full-text search library written in rust, and it’s what i’d reach for if i needed more control over the indices themselves. for this toy, i don’t.
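to make the indexing half concrete, here's a rough sketch of what gets sent to elasticsearch's `_bulk` endpoint: newline-delimited json, one action line followed by one document line per page. the index name `pages`, the `url`/`title`/`body` fields, and using the url as the document id are all my assumptions for illustration, not anything a particular crawler produces.

```python
import json


def build_bulk_payload(index, pages):
    """Serialize crawled pages into the NDJSON body that _bulk expects."""
    lines = []
    for page in pages:
        # Action line: index this document. Using the URL as _id is an
        # assumption; it means re-crawling a page overwrites the old copy.
        lines.append(json.dumps({"index": {"_index": index, "_id": page["url"]}}))
        # Source line: the document itself (hypothetical field names).
        lines.append(json.dumps({
            "url": page["url"],
            "title": page["title"],
            "body": page["body"],
        }))
    if not lines:
        return ""
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [{"url": "https://example.org/", "title": "example", "body": "hello"}]
    print(build_bulk_payload("pages", demo), end="")
```

the resulting string would be POSTed to `/_bulk` with `Content-Type: application/x-ndjson`. in practice the crawler usually does this part for you, so this is mostly useful for seeing what ends up in the index.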

a web portal to display everything

i probably just have to write a small backend that sends queries to ES and turns the responses into simple json that a small frontend can read. there are probably better, purpose-built frontends, e.g. kibana, but this isn’t a huge deal. over time, i’m going to need to customize this anyway.
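a minimal sketch of that backend idea, stdlib only. it builds a `multi_match` query, posts it to the `_search` endpoint, and flattens the response into a short list a frontend could render. the url `http://localhost:9200`, the index name `pages`, and the `url`/`title`/`body` fields are assumptions, and it assumes a local node with security turned off (a default 8.x install wants https and credentials).

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, security disabled
INDEX = "pages"                   # hypothetical index name


def build_search_body(text, size=10):
    """Build a multi_match query over title and body, weighting title higher."""
    return {
        "size": size,
        "query": {
            "multi_match": {"query": text, "fields": ["title^2", "body"]}
        },
    }


def simplify(response):
    """Flatten an ES search response into a list of small dicts."""
    hits = response.get("hits", {}).get("hits", [])
    return [
        {
            "url": hit.get("_source", {}).get("url"),
            "title": hit.get("_source", {}).get("title"),
            "score": hit.get("_score"),
        }
        for hit in hits
    ]


def search(text):
    """Send the query to ES and return the simplified results."""
    request = urllib.request.Request(
        f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(build_search_body(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return simplify(json.load(resp))
```

wrapping `search` in whatever http framework is at hand is the rest of the backend. keeping `simplify` separate means the frontend never sees the raw ES response shape, which should make the later customizing easier.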

i think i could get a lot of that done in a single docker-compose.yaml. yea, actually something like this.

anyway, discovering fediverse instances and content. that’s gonna take a little more thinking…