How do I use a search engine?

Published: 2023-12-10

I was reading about Kagi, a search engine you pay a monthly subscription to use. Our world is so fucked up that explicitly paying money for a service is the only hope of avoiding having your data slurped for profit.

So I asked myself: how do I actually use a search engine?

§ Gathering the data

Open Firefox, press CTRL+H (to open the history), copy the past month's data (e.g. November 2023), dump it into a text editor (saved here as nov-searches.txt) and sort the results:

cat nov-searches.txt | sort

I only use DuckDuckGo so I can easily filter out everything else. How many queries? wc says about 300:

cat nov-searches.txt | sort | rg duckduckgo.com | wc
    294     294   39068

Let's remove duplicate queries (I tend to refine searches by running multiple variants of the query terms), clean up the URL encoding a little and drop the other search parameters, which are not interesting. Let's isolate just the search terms; for example, in the following query I want to isolate steez+posters:

https://duckduckgo.com/?q=steez+posters&iar=images

cat nov-searches.txt | sort | rg duckduckgo.com |
    # isolate only the search terms, discard all the rest
    rg --pcre2 -o -e 'q=(.+?)(?=&)' -r '$1' |
    # remove any kind of spaces (urlencoded or not);
    # sed handles %20 because tr maps single characters, not strings
    sed 's/%20//g' | tr -d '+ '

Before proceeding, a little pause to disentangle that little ripgrep black magic that I just learned:

--pcre2: use the PCRE2 engine, needed for backreferences and look-around
-o: only show the matched content
-e 'q=(.+?)(?=&)': the regular expression isolates everything between q= and the first & (so a URL with nothing after the query does not match)
-r '$1': print only the content of the first capture group (in the previous example, (.+?) captures steez+posters)

Reminder that Regex 101 is a good resource for experimenting.
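For a machine without ripgrep, the same extraction can be sketched with plain sed. This is only an approximation of the pipeline above, not a replacement for it: extract_query is a made-up helper name, and it only decodes + and %20, leaving any other percent-escape untouched. It also accepts a q= value that ends the URL, and it keeps the decoded spaces (append tr -d ' ' to drop them).

```shell
#!/bin/sh
# extract_query: hypothetical helper, reads URLs on stdin and prints the
# decoded value of the q= parameter, one per line.
extract_query() {
    # print the value of q=, up to the next & or the end of the line
    sed -n 's/.*[?&]q=\([^&]*\).*/\1/p' |
        # in a query string, + stands for a space
        tr '+' ' ' |
        # tr maps single characters, so the three-character %20 needs sed
        sed 's/%20/ /g'
}

printf '%s\n' 'https://duckduckgo.com/?q=steez+posters&iar=images' | extract_query
# prints: steez posters
```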

Finally, I'd like an idea of how many queries I run, excluding some of the variants (refinements over the same search):

cat nov-searches.txt | sort | rg duckduckgo.com \
    | rg --pcre2 -o -e 'q=(.+?)(?=&)' -r '$1' \
    | sed 's/%20//g' | tr -d '+ ' \
    | sort | uniq -c | sort -rn

The above grouping is a little imprecise because I'm not merging near-identical queries. I need a tool to calculate proximity, for example the good old Levenshtein distance. I've tried using agrep but it has some limitations. For the moment this will suffice, given the small sample.
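As a rough idea of what such a proximity check could look like, here is a minimal sketch of the Levenshtein distance written in awk and wrapped in a shell function. The name levenshtein is my own; it is not something agrep or ripgrep provides, and nothing here was used for the stats below.

```shell
#!/bin/sh
# levenshtein A B: hypothetical helper, prints the edit distance between A and B.
# Caveat: awk -v interprets backslash escapes, so avoid backslashes in the inputs.
levenshtein() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        la = length(a); lb = length(b)
        # prev[j]: distance between the first i-1 chars of a and the first j chars of b
        for (j = 0; j <= lb; j++) prev[j] = j
        for (i = 1; i <= la; i++) {
            cur[0] = i
            for (j = 1; j <= lb; j++) {
                cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
                d = prev[j - 1] + cost                      # substitution (or match)
                if (prev[j] + 1 < d) d = prev[j] + 1        # deletion
                if (cur[j - 1] + 1 < d) d = cur[j - 1] + 1  # insertion
                cur[j] = d
            }
            # the current row becomes the previous row for the next character of a
            for (j = 0; j <= lb; j++) prev[j] = cur[j]
        }
        print prev[lb]
    }'
}

levenshtein kitten sitting   # prints 3
```

Two queries could then be treated as variants of the same search when the distance falls below some threshold; picking that threshold is left open here.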

§ The stats

The stats for the last three months:

month	total queries	unique queries	avg repeats per query (total / unique)
Nov 2023	294	70	4.20
Oct 2023	234	50	4.68
Sep 2023	226	43	5.26
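The last column looks like the total divided by the number of unique queries (294 / 70 = 4.20 for November). Assuming that reading is right, the three numbers of a row can be sketched from a list of cleaned queries, one per line. query_stats is a hypothetical helper name.

```shell
#!/bin/sh
# query_stats: hypothetical helper, reads one cleaned query per line on stdin
# and prints "total<TAB>unique<TAB>average repeats per unique query".
query_stats() {
    # uniq -c prefixes each distinct query with its number of occurrences
    sort | uniq -c |
        awk '{ total += $1; unique++ }
             END { if (unique) printf "%d\t%d\t%.2f\n", total, unique, total / unique }'
}

printf '%s\n' steezposters steezposters rustbook | query_stats
# prints: 3	2	1.50
```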

I'll save the output from the previous script into a file input.dat and plot the values using another tool I have a love/hate relationship with: gnuplot.

#!/usr/bin/env gnuplot
set terminal png size 800, 600
set output "searches.png"
set key center top
set style fill solid
set title "How do I use a search engine?"
unset xtics
plot 'input.dat' using 1 t "# of variants" with lp

This script will produce the following image for the November 2023 dataset.

x = one search, y = # of variants for that search

I infer that I might not be very good at searching, because on average I repeat the same query about 4 times with slight variations.

Also, for some reason I expected to be using a search engine more. Part of the reason is probably that I often use DDG bangs and some quick custom searches; in these cases I skip DDG and am redirected directly to the website that I want to query.