
marginalia_nu on Sept 21, 2021, on: A search engine that favors text-heavy sites and p...

No, I actually count the n-grams as distinct words (up to 4-grams). The main limiter for that is space, so I only extract "canned" n-grams from some tags. I would first search for the bigram hello_world, which is an O(1) array lookup, and then for documents merely containing the words hello and world (usually not a good search result); that's the algorithm I'm describing in the parent comment.
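
Roughly, the lookup order could be sketched like this. This is only a toy illustration, not the engine's actual code: `search`, `ngram_key`, and the dict-based `index` are made-up names, and a Python dict stands in for the O(1) array lookup.

```python
MAX_NGRAM = 4  # n-grams are only indexed up to 4 words


def ngram_key(words):
    """Join query words into a single n-gram term, e.g. hello_world."""
    return "_".join(w.lower() for w in words)


def search(query, index):
    """Return document ids: exact n-gram hits first, then word-intersection hits.

    `index` is a hypothetical mapping from term to a list of document ids.
    """
    words = query.split()
    if not words:
        return []

    # Step 1: treat the whole query as one term (a single lookup).
    results = []
    if len(words) <= MAX_NGRAM:
        results = list(index.get(ngram_key(words), []))

    # Step 2: fall back to documents that merely contain every word.
    if len(words) > 1:
        postings = [set(index.get(w.lower(), [])) for w in words]
        common = set.intersection(*postings)
        results += sorted(common - set(results))

    return results
```

With an index of `{"hello_world": [3], "hello": [1, 3, 5], "world": [3, 5, 7]}`, the query `hello world` would put document 3 (the bigram match) ahead of document 5 (which just contains both words).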

Makes sense. Every time you insert a new URL for a word you have to update the ranges for every other word since the URL file will be shifted?
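
To make the question concrete, here is a toy sketch of the shifting, assuming the layout is one flat list of URL ids plus a (start, end) range per word. That layout is a guess from this thread, and `insert_url`, `urls`, and `ranges` are hypothetical names, not the real index format.

```python
def insert_url(urls, ranges, word, url_id):
    """Insert url_id at the end of word's range in the flat urls list.

    `ranges` maps each word to a half-open (start, end) slice of `urls`.
    Every range that starts at or after the insertion point moves by one.
    """
    start, end = ranges.get(word, (len(urls), len(urls)))
    urls.insert(end, url_id)
    ranges[word] = (start, end + 1)

    # All words stored after the insertion point are shifted right by one.
    for other, (s, e) in list(ranges.items()):
        if other != word and s >= end:
            ranges[other] = (s + 1, e + 1)
```

Starting from `urls = [10, 11, 20]` with `hello` at (0, 2) and `world` at (2, 3), adding a URL for `hello` moves `world` to (3, 4), which is the per-insert update cost being asked about.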