Marcus Müller

If the CPU is not working, it's waiting on I/O or just sleeping. This factually, but uselessly, answers your question.

So. Why is your indexer software slow?

Swish hasn't seen an indexing-affecting update since ca. 2005-2006. The assumptions it makes about storage – which are pretty important to an indexing engine – will simply not apply to your disk; a lot of things have happened in the last 20 years.

What specifically are you using Swish++ for? I hadn't heard of it before, so I downloaded the source code, and quite honestly, um, that's some not-very-performant C++ by modern standards.

Only you can figure out what your software is actually stuck on. The act of doing so is called profiling. It's not hard!

  1. Make sure you have debug symbols for your specific swish++ build. So, probably you want a build with CCFLAGS="-O2 -g" set.
  2. Use a profiler. Linux comes with perf (it's probably available as a package for your Linux distro!). A simple perf record -ag ${your command line here} records how long the CPU spent in any particular function, including call graphs (perf will tell you what you need to adjust to be able to take that capture). perf report will then show you the statistics interactively.

Profiling can get much more intricate than just "stuck where for how many % of total time", but that breakdown is usually a good indication of which code you'd need to optimize to get faster.
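
Before reaching for perf, a cruder check can already tell you whether the process is burning CPU or mostly waiting. The following is only a sketch under my own assumptions (the function names `measure` and `cpu_fraction` are made up for illustration, and it relies on the Unix-only `resource` module): it runs a command and compares the CPU time the child used against the wall-clock time that passed.

```python
import resource
import subprocess
import time


def measure(cmd):
    """Run cmd (a list of arguments) and return (wall_seconds, cpu_seconds).

    cpu_seconds is user + system CPU time consumed by the child process,
    as accounted by the kernel once the child has been waited for.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(cmd, check=False)
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return wall, cpu


def cpu_fraction(wall, cpu):
    """Fraction of wall-clock time spent on the CPU (0.0 if wall is zero)."""
    return cpu / wall if wall > 0 else 0.0
```

Call it as `wall, cpu = measure([...your indexer command line...])`. A fraction close to 1 suggests the program is CPU-bound (for a single-threaded program), and a fraction far below 1 means it spends most of its time waiting, typically on I/O. It won't tell you where in the code that happens; that's what the profiler is for.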

Now, again, this is 20-year-old C++ code. It doesn't follow any modern coding standards, and it makes multiple questionable statements on performance in the Readme. If you're using this for modern HTML, it will not be properly equipped for that, either. Maybe it's time to replace it.

Just for comparison:

I hacked together an HTML search thing in Python (you will need BeautifulSoup4, e.g. <your-package-manager-here> install python3-beautifulsoup4) in 15 minutes. It's certainly not the fastest parser you could use, and the regex engine used for matching the contents of tags is by far not the fastest regex engine, but it searches 11 MB of git documentation in < 1s. That might be good enough for everyday work. Here you go: searchhtml.py.

Usage is simply (after making it executable once, `chmod 755 /path/to/htmlsearch.py`):

/path/to/htmlsearch.py REGEX FILE1 [FILE2…]
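
To give an idea of how little code this takes, here is a minimal sketch of the same approach. It is not the linked script: to keep it free of extra packages, it swaps BeautifulSoup for the standard library's `html.parser`, and the names `TextCollector`, `search_html` and `main` are made up for illustration.

```python
import re
import sys
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text between tags, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.chunks.append(text)


def search_html(markup, pattern):
    """Return every text chunk of the HTML string that matches the regex."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    regex = re.compile(pattern)
    return [chunk for chunk in collector.chunks if regex.search(chunk)]


def main(argv):
    """argv is [REGEX, FILE1, FILE2, ...]; prints 'file: matching text'."""
    pattern, *paths = argv
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for hit in search_html(handle.read(), pattern):
                print(f"{path}: {hit}")
    return 0


# Only act as a command-line tool when arguments were actually given.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

This matches the regex against each run of text between two tags, which is cruder than matching per element, but it shows the shape of the thing: parse, collect text, filter with `re`.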

Is that the GUI you wanted? Nope. But neither is your swish or dwww approach. Bolting a bit of the Textual Python TUI framework in front of the same logic wouldn't be hard, though.

You could also modify it to generate a valid HTML document (using the same BeautifulSoup module), write that to a temporary file and open it in your browser, e.g. in zsh: firefox =(htmlsearch_html_overview.py REGEX **/*.html).
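
A rough sketch of that HTML overview idea, again under my own assumptions: it builds the page with plain string formatting and `html.escape` instead of BeautifulSoup, and `render_overview` and `open_overview` are hypothetical names. The `hits` argument is assumed to map each file path to the list of matching text snippets found in it.

```python
import html
import tempfile
import webbrowser
from pathlib import Path


def render_overview(pattern, hits):
    """Build an HTML page listing the matches per file.

    hits maps a file path (str) to a list of matching text snippets.
    """
    parts = [
        "<!DOCTYPE html>",
        '<html lang="en">',
        "<head>",
        '<meta charset="utf-8">',
        f"<title>Matches for {html.escape(pattern)}</title>",
        "</head>",
        "<body>",
    ]
    for path, snippets in hits.items():
        uri = Path(path).resolve().as_uri()  # file:// link to the document
        parts.append(f'<h2><a href="{html.escape(uri)}">{html.escape(path)}</a></h2>')
        parts.append("<ul>")
        parts.extend(f"<li>{html.escape(snippet)}</li>" for snippet in snippets)
        parts.append("</ul>")
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


def open_overview(page):
    """Write the page to a temporary .html file and open it in the browser."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(page)
    webbrowser.open(Path(tmp.name).as_uri())
    return tmp.name
```

With this variant, Python takes care of the temporary file and of launching the default browser, so it doesn't depend on zsh's =(...) substitution; printing the result of `render_overview` to stdout instead would work with the zsh one-liner.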
