How do I use a search engine?

Published: 2023-12-10

I was reading about Kagi, a search engine you pay a monthly subscription to use. Our world is so fucked up that explicitly paying with money for a service is the only hope of avoiding having your data slurped for profit.

So I asked myself: how do I actually use a search engine?

§ Gathering the data

Open Firefox, press CTRL+H (to open the history), copy the past month's data (e.g. November 2023), dump it into a text editor and sort the results:

cat nov-searches.txt | sort

I only use DuckDuckGo so I can easily filter out everything else. How many queries? wc says about 300:

cat nov-searches.txt | sort | rg duckduckgo.com | wc
    294     294   39068

Let's remove duplicated queries (I tend to refine searches by running multiple variants of the query terms), clean up the urlencoding a little, and filter out the other, uninteresting search parameters. Let's isolate just the search terms; for example, in the following URL I want only the value of the q parameter:

https://duckduckgo.com/?q=steez+posters&iar=images

cat nov-searches.txt | sort | rg duckduckgo.com |
    # isolate only the search terms, discard all the rest
    rg --pcre2 -o -e 'q=(.+?)(?=&)' -r '$1' |
    # remove any kind of spaces (urlencoded or not); tr works on
    # characters, not strings, so sed handles the %20 sequence
    sed 's/%20//g' | tr -d '+ '
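To sanity-check the extraction, the same regex can be run over a single made-up history line; here I'm using GNU grep's -P, which also speaks PCRE, in case rg isn't at hand:

```shell
# hypothetical sample history line, modeled on the example URL above
url='https://duckduckgo.com/?q=steez+posters&iar=images'

# \K drops the "q=" prefix from the match; (?=&) stops before the next
# parameter without consuming the "&"
echo "$url" | grep -oP 'q=\K.+?(?=&)' | tr '+' ' '
# prints: steez posters
```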

Before proceeding, a little pause to disentangle the ripgrep black magic I just learned:

- --pcre2 switches ripgrep to the PCRE2 engine, which supports lookarounds;
- -o prints only the matching part of each line instead of the whole line;
- -e 'q=(.+?)(?=&)' matches q= followed by a lazy capture of everything up to the next &, which the (?=&) lookahead asserts without consuming;
- -r '$1' replaces each match with the first capture group, i.e. just the query terms.

Reminder that Regex 101 is a good resource for experimenting.

Finally, I'd like an idea of how many queries I run, excluding some of the variants (refinements over the same search):

cat nov-searches.txt | sort | rg duckduckgo.com \
    | rg --pcre2 -o -e 'q=(.+?)(?=&)' -r '$1' \
    | sed 's/%20//g' | tr -d '+ ' \
    | sort | uniq -c | sort -rn
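On a toy input (made-up queries, not from my real history), the sort | uniq -c | sort -rn grouping yields a repeat count followed by the query, most repeated first:

```shell
# toy data standing in for the cleaned-up query list
printf '%s\n' steezposters steezposters nixinstall \
    | sort | uniq -c | sort -rn
# the first column (the repeat count) is the interesting one
```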

The above grouping is a little imprecise because I'm not filtering out similar matches. I need a tool to calculate proximity, for example the good old Levenshtein distance. I've tried agrep, but it has some limitations. For the moment this will suffice, given the small sample.
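For reference, Levenshtein distance itself fits in a short bash function; this is just a sketch of the classic two-row dynamic-programming formulation, not something I'd run over a big history:

```shell
#!/usr/bin/env bash
# Levenshtein distance between two strings, two-row dynamic programming
levenshtein() {
    local a=$1 b=$2
    local -i i j cost del ins sub min
    local -a prev curr
    # distance from the empty string to each prefix of b
    for ((j = 0; j <= ${#b}; j++)); do prev[j]=$j; done
    for ((i = 1; i <= ${#a}; i++)); do
        curr[0]=$i
        for ((j = 1; j <= ${#b}; j++)); do
            [[ ${a:i-1:1} == "${b:j-1:1}" ]] && cost=0 || cost=1
            del=$((prev[j] + 1))       # delete a character from a
            ins=$((curr[j-1] + 1))     # insert a character into a
            sub=$((prev[j-1] + cost))  # substitute (free if chars match)
            min=$del
            ((ins < min)) && min=$ins
            ((sub < min)) && min=$sub
            curr[j]=$min
        done
        prev=("${curr[@]}")
    done
    echo "${prev[${#b}]}"
}

levenshtein "steez posters" "steez poster"   # prints 1
```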

§ The stats

The stats for the last 3 months:

month       total queries   unique queries   repeated queries (avg)
Nov 2023    294             70               4.20
Oct 2023    234             50               4.68
Sep 2023    226             43               5.25
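The last column is just total divided by unique; taking November's numbers as an example, a quick awk check:

```shell
# avg repeats per query = total queries / unique queries (Nov 2023)
echo '294 70' | awk '{ printf "%.2f\n", $1 / $2 }'
# prints: 4.20
```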

I'll save the output of the previous script into a file, input.dat, and plot the values using another tool I have a love/hate relationship with: gnuplot.

#!/usr/bin/env gnuplot
set terminal png size 800, 600
set output "searches.png"
set key center top
set style fill solid
set title "How do I use a search engine?"
unset xtics
plot 'input.dat' using 1 t "# of variants" with lp

This script will produce the following image for the November 2023 dataset.

x = one search, y = # of variants for that search

I infer that I might not be very good at searching stuff, because on average I repeat the same query about 4 times with slight variations.

Also, for some reason I expected to be using a search engine more. Part of the reason is probably that I often use DDG bangs and some quick custom searches: in these cases I skip DDG and get redirected directly to the website I want to query.