[Koha] Koha Wiki migrated and upgraded

Thomas Dukleth kohalist at agogme.com
Fri Oct 28 01:06:04 NZDT 2022

The Koha Wiki, now running MediaWiki Canasta, has been up for a few hours at
the usual DNS subdomain https://wiki.koha-community.org .  The wiki
is now up to date with MediaWiki 1.35.07 long term stable using a MySQL
database and ElasticSearch with many fine enhancements such as
VisualEditor, customised AdvancedSearch, and dynamic archiving of obsolete
pages (which often still have useful information).  Please see further
below for details.

Unfortunately, the mail system on the wiki server, which is needed for
resetting wiki login passwords, creating new login users, etc., had
previously become broken and was overlooked amidst all the work which
people have been doing for releasing a new Koha version.  Someone will send
a message when the mail system is fixed.  If you know your previous Koha
wiki login username and password, they will still work.

There are probably many problems with result set relevance for search
queries within the wiki.  We will fix them over time and relevance ranking
should automatically improve with use, although maintenance changes may
often count as use, eroding recently updated relevance.  The wiki is now
using ElasticSearch which is used by Wikipedia and has better extension
support than database based indexing.  See some details about the
customised AdvancedSearch and the need for careful consideration in
improving search query indexing further below.

Sitemap creation, which assists Google and other web indexing systems in
indexing the Koha wiki, may not be working correctly in the way in which
we have configured the MediaWiki Canasta Docker container.  Google and
others can still index the content without the sitemap but the process
functions better with a sitemap.  The Canasta Docker container does some
things differently than the way they would function in a standard
environment such that less effort should be required for maintenance tasks
but we need a little more time to examine how some things such as sitemap
creation are intended to function in Canasta.  We should always be able to
use methods ordinarily used for sitemap creation in a standard environment
if necessary.

The Koha MediaWiki Canasta test instance should continue to be available
for first testing significant changes and bug fixes, at
https://wiki.test.koha-community.org .  Please do not make wiki
contributions that you want to save in the MediaWiki Canasta test instance
as they will not be carried over to the production wiki.  Continue to make
lasting contributions to the production wiki at
https://wiki.koha-community.org .

Please read below for an understanding of what to expect before reporting
issues of which we are already aware, such as that the test database is
not a current copy of the wiki and that the mail system for resetting
login passwords and creating new login users is not working.

You may report bugs to the bug "wiki needs updating to a later version",
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=23073 .


Migrating the Koha MediaWiki database from Postgres to MySQL and upgrading
to MediaWiki 1.35.07, the current long term stable version, used a
repeatable process managed with a set of scripts which I developed in
bash, Perl, and Python as appropriate for the task and, in the case of
Python, for the previous code.  Choosing Postgres as the database for a
test instance of MediaWiki left us with a mistaken database choice,
complicating compatibility and future upgrades, when the MediaWiki test
instance suddenly became the only Koha wiki running after the previous
Koha wiki went down in the midst of a community schism with LibLime long
ago.  The database migration and upgrade process has been developed and
progressively tested since 2019 to ensure that the database is migrated
correctly, etc.  I built the database migration process upon the
originally incomplete and sometimes mistaken Python script of Philipp
Spitzer, which was a fantastic proven starting point without which the
task might have been too much.

Mason James ran a web crawl and diff test to verify that the production
and another test database migration and upgrade of the wiki had the same
content except for evident changes where the production wiki had been
updated with new content.
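
As an illustration only, the comparison step of such a crawl and diff
test could be sketched in Python as follows.  This is not the tooling
which was actually used; the function name differing_pages and the idea of
holding each crawl as a dictionary of page title to page text are
assumptions made for the example.

```python
import difflib


def differing_pages(old_pages, new_pages):
    """Map each page title whose text differs between two crawls to a diff.

    Both arguments are dicts of page title -> page text (hypothetical
    structure).  Pages present in only one crawl are compared against
    an empty string.
    """
    report = {}
    for title in sorted(set(old_pages) | set(new_pages)):
        old_text = old_pages.get(title, "")
        new_text = new_pages.get(title, "")
        if old_text != new_text:
            diff = difflib.unified_diff(
                old_text.splitlines(),
                new_text.splitlines(),
                fromfile="production/" + title,
                tofile="migrated/" + title,
                lineterm="",
            )
            report[title] = "\n".join(diff)
    return report
```

An empty result would mean the two crawls had identical content; any
titles reported would then be checked by hand against known new edits.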

The database was imported to MediaWiki Canasta which Tomás Cohen Arazi
identified and customised to connect to the Koha Portainer Docker
container management to provide MediaWiki in a Docker container with a
large set of important extensions to help make managing the MediaWiki
software easier.  See https://github.com/CanastaWiki for more about
MediaWiki Canasta.

After a minimal final testing period with Canasta in Koha Portainer, I
marked the old wiki instance, which had been using Postgres, read-only and
proceeded with the database migration and upgrade using an up to date copy
of the database.  Tomás made further modifications in Portainer and Chris
Cormack redirected the DNS record for wiki.koha-community.org to the
current server for the wiki.

You may never notice an interruption in service for the wiki, although we
may have to restart it to fix things which function a little differently
and are a little more complicated to fix in the MediaWiki Canasta Docker
container than in a standard operating system environment.  Once fixed,
maintenance of a Docker container should be easier than a standard
environment.  We may even move the server or some functions back to the
server where the wiki had been hosted for years thanks to Galen Charlton
and Equinox.

See further below for a little about other modifications which I made to
support dynamic archiving, etc.


The mail system on the server for the wiki for resetting wiki login
passwords, or creating new logins, etc. had previously become broken and
needs fixing as a matter of priority.

The test instance at https://wiki.test.koha-community.org has a copy of the
database which will become outdated over time.  In future, we may set up a
process to update the database periodically.  The purpose of the test
instance is to support testing significant wiki changes or wiki bug fixes
first without the hazard of harming the production wiki.  Bug fixes need
testing and can at least temporarily break the wiki.  Significant changes
may fail to work as expected and might not be easily undone particularly
if the changes have been created by a script for mass editing.  Having the
latest wiki revisions is not usually needed for testing.

There have been bugs specific to MediaWiki Canasta rearranging some
standard files for the Docker container, which have been addressed.
However, there are at least some remaining bugs specific to the Canasta
Docker container environment.  Please report any instances of "Error
creating thumbnail: Unable to save thumbnail to destination" which I found
in the Koha History page https://wiki.koha-community.org/wiki/History .
Instances of the bug can be fixed with command line shell access by
removing the images/thumb/$buggy_image_name subdirectory for the image and
making a non-changing edit to the page, which allows MediaWiki to recreate
the images/thumb subdirectory without a problem, and the bug goes away.  We
should probably remove all the subdirectories in the images/thumb
directory proactively.  Yet, why is there a special problem for the Docker
container which does not exist in other test instances not using a Docker
container?  The container environment runs as the root user, as is
standard for Docker containers, and the root user should have all the
permissions necessary to access or create a thumbnail directory.  Changing
the ownership of the images directory and subdirectories back and forth to
test the effect temporarily broke a test instance of the wiki until the
container was restarted.
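
A minimal Python sketch of the thumbnail directory removal follows, for
illustration only.  The function name remove_thumbnail_dirs is
hypothetical, and the sketch assumes that the thumbnail subdirectory for
an image may sit at any depth below images/thumb, so it searches for
directories with the image's name rather than relying on one fixed path.
It does not make the non-changing edit to the page, which is still needed
afterwards for MediaWiki to recreate the thumbnails.

```python
import shutil
from pathlib import Path


def remove_thumbnail_dirs(images_dir, image_name):
    """Remove every directory named image_name below images_dir/thumb.

    Returns the list of removed paths.  Original uploads outside the
    thumb directory are never touched.
    """
    thumb_root = Path(images_dir) / "thumb"
    removed = []
    if not thumb_root.is_dir():
        return removed
    # Collect matches first so the tree is not modified while walking it.
    candidates = [
        path for path in thumb_root.rglob("*")
        if path.is_dir() and path.name == image_name
    ]
    for path in sorted(candidates):
        if path.exists():
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

Such a function should be tried on the test instance first, for example
with a call such as remove_thumbnail_dirs("images", "Example.png"), where
both arguments are placeholders.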


The VisualEditor extension used by Wikipedia is a WYSIWYG and guided forms
aid for visually editing the underlying wikitext for a page and using
guided forms for adding some features to a page.  Users can switch back
and forth between source editing in all wikitext syntax and VisualEditor;
however, it may be best to save the current edit before switching back and
forth to avoid problems of imperfect correspondence between wikitext
syntax and the VisualEditor model of wikitext.

The AdvancedSearch extension used by Wikipedia provides a user friendly
interface for constructing search queries and modifying them by removing
terms, each of which appears in a bubble with an [x] to remove the term.
AdvancedSearch depends on ElasticSearch which performs remarkably well in
testing and allows the wiki to be reindexed in a couple of minutes if
necessary.  See further below for modifications to the AdvancedSearch
extension.

SemanticMediaWiki was reinstalled after copying the upgraded database. 
Modifying the AdvancedSearch extension in conjunction with special
AdvancedSearch navigation links and custom queries using carefully managed
standard wiki categories may be more helpful than SemanticMediaWiki. 
Furthermore, anyone experimenting with SemanticMediaWiki should be aware
that verbose syntax is required to avoid the breakage expected for most
wikis using SemanticMediaWiki after forthcoming MediaWiki updates, in
which a deprecated hook commonly relied upon by SemanticMediaWiki will
be removed.  Wikipedia does not use SemanticMediaWiki and thus some
MediaWiki developers may not have given sufficient consideration to
managing the issue.  The workaround may involve a potential performance
deficit when using SemanticMediaWiki search queries.

The MassEditRegex extension has the power which its name suggests for
using regular expressions to modify a list of pages.  However, given its
power, it remains commented out in LocalSettings.php for the production
system.  Use is intended to be for some special group of users such as
wiki administrators; however, even they should be most strongly cautioned
to first test their process on a test instance of the wiki.  Furthermore,
use should be with a bot subaccount set up by the user so that the edits
may be identified as the work of a bot process and those mass changes may
avoid adversely affecting page modification priorities in search result
sets.  The creation of user bot subaccounts should be documented.  In testing,
MassEditRegex works fantastically well for adding categories to the bottom
of pages and templates to the top of pages which can be done without risk
of an inadequately debugged regular expression breaking page content in
the middle.
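
The kind of edit described above, appending to the bottom or prepending to
the top without touching the middle of the page, can be sketched in Python
as follows.  This is only an illustration of the transformation and not
MediaWiki or MassEditRegex code; the function names add_category and
add_template are hypothetical.

```python
def add_category(wikitext, category):
    """Append a category link to the bottom of a page unless already present."""
    tag = "[[Category:%s]]" % category
    if tag in wikitext:
        return wikitext
    return wikitext.rstrip("\n") + "\n" + tag + "\n"


def add_template(wikitext, template):
    """Prepend a template call to the top of a page unless already present."""
    tag = "{{%s}}" % template
    if tag in wikitext:
        return wikitext
    return tag + "\n" + wikitext
```

Because both functions leave the existing text untouched and do nothing
when the tag is already there, running them twice over the same page is
harmless, which is the property that makes this sort of mass edit low in
risk.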


I modified the following to support dynamic archiving in which obsolete
content does not appear by default for search results unless the user goes
directly to the advanced search page without following provided navigation
links or changes the default VectorMod skin affecting the basic search


The AdvancedSearch extension has been modified to include two additional
form elements: one for excluding particular categories and another for
excluding particular templates.  These additional elements appear in the
user friendly AdvancedSearch term bubbles which can be individually
removed from a query by clicking on the [x] for the particular bubble.

Editing the non-English localisation files is still pending.  For
languages for which a non-English localisation file has not been edited,
the custom fields for category and template exclusion display a
description in English.

DeepCategory searching of subcategories of a category is disabled because
it requires a SPARQL database and is only updated on a weekly basis for
Wikipedia.  Searching subcategories of a category should be less of an
issue with faceted use of categories which we should be carefully moving

Excluding particular categories supports dynamic archiving by supporting
search queries excluding obsolete pages with -incategory:"Obsolete", which
is automatically invoked from the navigation link "Advanced Search
current" or from the simple search box when using the modified Vector skin,
VectorMod.  Obsolete pages are also noted with a prominent notice using
the Obsolete template.  Such pages should be updated if they can be, but
are otherwise available to consult most importantly for valuable
information they often contain which is not yet present in current pages. 
Archived obsolete pages can be found exclusively by following the
navigation link
"Advanced search obsolete archive" which includes incategory:"Obsolete"

The result set for search queries with incategory:"Obsolete" can be used
to identify the type of pages which should have the Obsolete category and
Obsolete template but do not yet, such as installation information for
some particular old Debian versions.  Various combinations of including
and excluding categories and templates can be easily used in the modified
AdvancedSearch to find pages which only have one of either the Obsolete
category or Obsolete template which should be used together or both
removed if the page has been updated to be current.
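
For illustration, the search strings behind such combinations can be
assembled as in the following Python sketch.  The function name
build_query is hypothetical, and the sketch assumes that the category and
template fields correspond to the incategory: and hastemplate: search
keywords with a leading minus sign for exclusion; the exact strings which
the modified AdvancedSearch generates may differ.

```python
def build_query(terms="", in_categories=(), not_in_categories=(),
                with_templates=(), without_templates=()):
    """Assemble a search string from free text and category/template filters."""
    parts = [terms] if terms else []
    parts += ['incategory:"%s"' % name for name in in_categories]
    parts += ['-incategory:"%s"' % name for name in not_in_categories]
    parts += ['hastemplate:"%s"' % name for name in with_templates]
    parts += ['-hastemplate:"%s"' % name for name in without_templates]
    return " ".join(parts)


# Pages in the Obsolete category which lack the Obsolete template.
category_only = build_query(in_categories=["Obsolete"],
                            without_templates=["Obsolete"])

# Pages with the Obsolete template which lack the Obsolete category.
template_only = build_query(with_templates=["Obsolete"],
                            not_in_categories=["Obsolete"])
```

Either result set lists pages where the category and template are out of
step and one or the other needs adding or removing.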

All wiki pages should have some category even if it may be
[[Category:Empty]] for people uncertain of what may be appropriate in the
moment.  Pages missing categories may not be disappearing from query
results by category when using ElasticSearch indexing as they had been
when using database based search indexing.  We can also query for pages
missing categories using
and correct the issue which has been neglected due to loss of time where
migrating and upgrading the wiki has been the priority with much less time
available otherwise especially since the pandemic.

We should take some care when thinking about faceted category use as no
wiki software uses fielded categories.  Thus there may be no concise way
to query for pages which address a topic in a general way or supplement
other documentation on a topic containing a lone category such as
[[Category:Circulation]], if we then have many other pages with
[[Category:RFCs]] and [[Category:Circulation]] but no longer
[[Category:Circulation RFCs]] as a possible change for faceting.  In such
an example, the search results of a query for incategory:"Circulation"
might have a result set in which pages for RFCs relating to circulation
issues containing both [[Category:RFCs]] and [[Category:Circulation]]
might crowd out more generally helpful pages with [[Category:Circulation]]
alone.  The problem may indicate a need for a navigation link to exclude
RFCs from a search query; designating old RFCs as obsolete; or both. 
Alternatively or additionally, we may be able to adjust the weighting of
the ElasticSearch indexing options such that pages containing
[[Category:RFCs]] have a lower weight and appear further down the result
set or pages with a single category such as [[Category:Circulation]] alone
or some particular additional categories such as
[[Category:Documentation]] have higher weight and appear further up the
result set.
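
The intended effect of such weighting can be shown with a toy Python
sketch.  It is not ElasticSearch configuration and does not use any
ElasticSearch API; the function name rescore, the weights, and the scores
are all invented for the example.

```python
def rescore(results, category_weights, lone_category_boost=1.0):
    """Re-rank (title, base_score, categories) tuples by category weights.

    Each category multiplies the score by its weight (default 1.0), and
    pages with exactly one category get an extra multiplier.
    """
    rescored = []
    for title, base_score, categories in results:
        score = base_score
        for category in categories:
            score *= category_weights.get(category, 1.0)
        if len(categories) == 1:
            score *= lone_category_boost
        rescored.append((title, score))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


results = [
    ("Circulation RFC", 10.0, ["RFCs", "Circulation"]),
    ("Circulation overview", 8.0, ["Circulation"]),
]
ranked = rescore(results, {"RFCs": 0.5}, lone_category_boost=1.5)
```

With these invented numbers the general page moves above the RFC page
even though its starting score was lower, which is the behaviour which a
real weighting adjustment would aim for.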


Users are free to choose their own preferred MediaWiki skin and we can add
others.  VectorMod is merely set as the default to help people avoid
obsolete pages when submitting search queries from the simple search box
which appears on every page.

VectorMod is a custom version of the Vector skin which includes a modified
version of Vector/includes/templates/SearchBox.mustache supporting dynamic
archiving of obsolete content by excluding pages which have been
designated obsolete by automatically adding -incategory:"Obsolete" to
basic search queries.  The incategory syntax requires using
ElasticSearch.  Previously, I replaced the SearchBox.mustache file in the
Vector skin directly, which certainly worked without the extra effort of
creating a custom skin.

Automatically inserting -incategory:"Obsolete" in the basic search box is
now somewhat elegant in conjunction with the modified AdvancedSearch
extension as it uses explanatory language labels with a bubble which has a
removal [x] and allows autocompletion of query terms.

Significant renaming of references to Vector as VectorMod and vector as
vectormod has been scripted, which allows both Vector and VectorMod to be
loaded and available to users.
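
A renaming pass of this kind can be sketched in Python as follows.  This
is a simplified illustration and not the script which was actually used;
the function name rename_skin_references is hypothetical, and a real pass
over a skin's files would need review for names which must stay unchanged.

```python
import re


def rename_skin_references(text):
    """Rename Vector to VectorMod and vector to vectormod in a string.

    The negative lookaheads skip names which are already renamed, so
    applying the function twice gives the same result as applying it once.
    """
    text = re.sub(r"Vector(?!Mod)", "VectorMod", text)
    text = re.sub(r"vector(?!mod)", "vectormod", text)
    return text
```

Keeping the two capitalisations as separate substitutions preserves the
case convention of each identifier, so class names and lower case module
names are both renamed consistently.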

Thomas Dukleth
109 E 9th Street, 3D
New York, NY  10003
+1 212-674-3783
