let's make a search engine

24 Jan 2025

i am not sure i’m going to actually spec out a full idea for a search engine, but i do want to think about it a little bit. search is weird: google has made it so ubiquitous that it’s easy to take for granted. but it is still kind of hard, even though “information retrieval” is fairly old as a practice.

i really like mastodon and the idea of the fediverse, as a concept. i also really don’t like facebook, twitter, etc. but one of the things they did really well was make discoverability of people’s posts very easy (arguably too easy to our detriment). but mastodon doesn’t have that problem because its search is garbage. it provides a rudimentary tag search, and that’s about it.

am i going to fix that? no. but i’ve worked on search in the past, and recently i was given some motivation to create a search engine. but i think i want to take a bit of a twist here.

on the surface, there are just a few components of a search engine that you need:

a crawler to find content on the open web
an indexer for that content
something to query the index
a web portal to display everything

“that’s it.”

the crawler and indexer are the “hard” parts, and that’s where all the computer sciencey stuff typically goes. the querier is a little less complex. the web portal is just a bunch of html and javascript to tie everything together.
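to make the indexer and querier pieces a bit more concrete, here is a minimal sketch of an inverted index in python. everything in it (the `tokenize`, `build_index`, and `query` names, the sample posts) is made up for illustration, and a real engine would add ranking, stemming, and a lot more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def query(index, text):
    """Return ids of documents that contain every term in the query (AND)."""
    terms = tokenize(text)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# hypothetical posts, just for illustration
posts = {
    "post-1": "Searching the fediverse is hard",
    "post-2": "Mastodon has a rudimentary tag search",
    "post-3": "The fediverse needs better search",
}
index = build_index(posts)
matches = query(index, "fediverse search")
```

note that there is no stemming here, so “searching” and “search” are different terms, which is exactly the kind of detail that makes the indexer one of the hard parts.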

so the first hard thing is finding fediverse instances out there on the internet, and once you find them, reading their content. there is neither an api to list all “actors” on an instance nor a way to list all instances out there. this is why centralized social media providers win almost every time, because they have all that information.
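one way to picture the discovery problem is a breadth-first walk outward from a few known instances. this is only a sketch under a big assumption: that there is some way to ask an instance which other instances it knows about. `get_peers` is a hypothetical callable standing in for that, and the network below is a made-up dictionary rather than real requests.

```python
from collections import deque


def discover_instances(seeds, get_peers, limit=1000):
    """Breadth-first walk outward from seed instances.

    get_peers is a hypothetical callable: domain -> iterable of domains.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(seen) < limit:
        domain = queue.popleft()
        try:
            peers = get_peers(domain)
        except Exception:
            continue  # unreachable or uncooperative instance: skip it
        for peer in peers:
            if peer not in seen and len(seen) < limit:
                seen.add(peer)
                queue.append(peer)
    return seen


# stand-in for the network: a made-up graph of who knows whom
fake_network = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example", "d.example"],
    "c.example": [],
    "d.example": ["e.example"],
}
found = discover_instances(["a.example"], lambda d: fake_network[d])
```

the catch is that an instance nobody in the walk knows about is never found, which is the gap a centralized provider doesn’t have.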

we could use a regular web crawler like apache nutch or elastic crawler and hope for the best. once the content is found, we have to index it, and you could do a lot worse than elasticsearch for that.

i feel like that’s a personal challenge. i’m not going to solve any major issues, but i am just going to brain dump a handful of ideas together—maybe they’ll congeal into something useful, maybe not.

typically, indexing creates immutable indices. we exploit that at work by sending off completed indices to AWS S3. we could do a similar thing with IPFS… but that sounds more like something fun to do than something that serves a purpose.

in order to scale search, you have queries that search multiple indices, then reassemble those results into a final listing. this can be done many times, in many layers, and it can be really expensive to do a big search over many indices, which is another thing we experience at work.
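the fan-out-and-reassemble step can be sketched roughly like this. the shard layout (a dict of term to `{doc_id: score}`), the function names, and the scores are all assumptions for illustration. it also assumes scores from different indices are directly comparable, which isn’t a given in a real system.

```python
import heapq
from itertools import islice


def search_shard(shard, term):
    """Return (score, doc_id) hits for one index, best first.

    A shard here is a plain dict: term -> {doc_id: score}.
    """
    hits = shard.get(term, {})
    return sorted(((score, doc_id) for doc_id, score in hits.items()), reverse=True)


def search_all(shards, term, k=3):
    """Query every shard, then merge the sorted per-shard lists into one top-k."""
    per_shard = [search_shard(shard, term) for shard in shards]
    merged = heapq.merge(*per_shard, reverse=True)  # inputs are already sorted descending
    return list(islice(merged, k))


# made-up shards and scores
shards = [
    {"search": {"a1": 0.9, "a2": 0.2}},
    {"search": {"b1": 0.7}},
    {"fediverse": {"c1": 0.5}},
]
top = search_all(shards, "search", k=2)
```

every shard gets queried even if it has nothing to contribute, and that is where the cost comes from as the number of indices grows.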

what could be done, though, is that an instance could produce its own search indices and publish them at some regular interval. this wouldn’t be something practical that i, myself, would do. instead, it would be something baked into whatever software is running on the fediverse instance.

dumping those indices into IPFS would be an easy thing to do, too. then essentially, if you wanted to search an instance, you could pull down the CIDs of its indices, then do your search.
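a toy version of that flow, with loud caveats: the `store` dict is a stand-in for IPFS, the ids are plain sha-256 hex digests rather than real CIDs, and the “manifest” (the list of ids an instance has published) is a made-up concept that leaves open how a searcher would actually find it.

```python
import hashlib
import json

store = {}  # stand-in for IPFS: content id -> bytes


def put(data: bytes) -> str:
    """Store bytes under the hash of their content and return that id."""
    cid = hashlib.sha256(data).hexdigest()  # not a real IPFS CID, just a hash
    store[cid] = data
    return cid


def publish_index(index: dict, manifest: list) -> None:
    """Serialize one finished index, store it, and record its id in the manifest."""
    data = json.dumps(index, sort_keys=True).encode()
    manifest.append(put(data))


def search_instance(manifest: list, term: str) -> list:
    """Pull down every published index and collect the matching post ids."""
    results = []
    for cid in manifest:
        index = json.loads(store[cid])
        results.extend(index.get(term, []))
    return results


# an instance publishes two finished indices over time (made-up data)
manifest = []
publish_index({"search": ["post-1"], "mastodon": ["post-2"]}, manifest)
publish_index({"search": ["post-3"]}, manifest)
hits = search_instance(manifest, "search")
```

because the id is derived from the content, an index that never changes always has the same id, which is why immutable indices are a natural fit for this kind of storage.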

i’m not entirely sure that’s something i’m going to be able to pull off any time in the near future. i will just have to stick to using a web crawler and my own indexer lol.

i think i have enough of an idea of what i want to do here. stay tuned.