Nova Resource:Wikisource/Wikimedia OCR
This page documents how to set up the Wikimedia OCR tool that is used by the Wikisource extension.
Web server
Install and configure Apache and PHP.
sudo apt -y install php php-bcmath php-common php-cli php-fpm php-gd php-json php-xml php-intl php-curl apache2 libapache2-mod-php
Create the web server configuration file at /etc/apache2/sites-available/wikimediaocr.conf with the following:
<VirtualHost *:80>
DocumentRoot /var/www/tool/public
ServerName ocr.wmcloud.org
php_value memory_limit 512M
# Requests with these user agents are denied.
SetEnvIfNoCase User-Agent "(uCrawler|Baiduspider|CCBot|scrapy\.org|kinshoobot|YisouSpider|Sogou web spider|yandex\.com\/bots|twitterbot|TweetmemeBot|SeznamBot|datasift\.com\/bot|Googlebot|Yahoo! Slurp|Python-urllib|BehloolBot|MJ13bot|SemrushBot|facebookexternalhit|rcdtokyo\.com|Pcore-HTTP|yacybot|ltx71|RyteBot|bingbot|python-requests|Cloudflare-AMP|Mr\.4x3|MSIE 7\.0; AOL 9\.5|Acoo Browser|AcooBrowser|MSIE 6\.0; Windows NT 5\.1; SV1; QQDownload|\.NET CLR 2\.0\.50727|MSIE 7\.0; Windows NT 5\.1; Trident\/4\.0; SV1; QQDownload|Frontera|tigerbot|Slackbot|Discordbot|LinkedInBot|BLEXBot|filterdb\.iss\.net|SemanticScholarBot|FemtosearchBot|BrandVerity|Zuuk crawler|archive\.org_bot|mediawords bot|Qwantify\/Bleriot|Pinterestbot|EarwigBot|Citoid \(Wikimedia|GuzzleHttp|PageFreezer|Java\/|SiteCheckerBot|Re\-re Studio|^R \(|GoogleDocs|WinHTTP|cis455crawler|WhatsApp|Archive\-It|lua\-resty\-http|crawler4j|libcurl|dygg\-robot|GarlikCrawler|Gluten Free Crawler|WordPress|Paracrawl|7Siters|Microsoft Office Excel|MJ12bot|AhrefsBot|dotbot|amp-cloud|naver\.me\/spd|Adsbot|linkfluence|coccocbot|sqlmap)" bad_bot=yes
CustomLog ${APACHE_LOG_DIR}/access.log combined expr=!(reqenv('bad_bot')=='yes'||reqenv('dontlog')=='yes')
CustomLog ${APACHE_LOG_DIR}/denied.log combined expr=(reqenv('bad_bot')=='yes')
ErrorLog ${APACHE_LOG_DIR}/error.log
<Directory /var/www/tool/public/>
Options Indexes FollowSymLinks
AllowOverride All
Require all granted
DirectoryIndex index.php
RewriteEngine On
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</Directory>
<Directory /var/www/tool/>
Options Indexes FollowSymLinks
AllowOverride None
Require all granted
Deny from env=bad_bot
</Directory>
ErrorDocument 403 "Access denied"
RewriteCond "%{HTTP_REFERER}" "^http://127\.0\.0\.1:(5500|8002)/index\.html" [NC]
RewriteRule .* - [R=403,L]
RewriteCond "%{HTTP_USER_AGENT}" "^[Ww]get"
RewriteRule .* - [R=403,L]
RewriteEngine On
RewriteCond %{HTTP:X-Forwarded-Proto} !https
RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R=301,L]
</VirtualHost>
Set PHP configuration in /etc/php/8.2/mods-available/wikimediaocr.ini:
max_execution_time = 60;
Then enable it:
sudo phpenmod wikimediaocr
Enable the required Apache modules and the web server configuration, disable the default site (which isn't used), and reload Apache:
sudo a2enmod php8.2 rewrite
sudo a2ensite wikimediaocr
sudo a2dissite 000-default
sudo apache2ctl graceful
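After the reload, a quick request can confirm that the rewrite rules from the virtual host are active. The following is only a sketch: check_wget_blocked is a hypothetical helper name (it is not part of the tool), and it assumes curl is installed and that the instance answers on the URL you pass in. It sends a Wget-style user agent, which the configuration above should answer with a 403.

```shell
# Hypothetical helper (not part of wikimedia-ocr): report whether a
# Wget-style user agent is rejected, as the RewriteRule above intends.
# Assumes curl is installed; pass the base URL of the instance.
check_wget_blocked() {
    url="${1:-http://localhost/}"
    # -s: quiet, -o /dev/null: discard the body,
    # -w: print only the HTTP status code, -A: set the User-Agent header.
    status=$(curl -s -o /dev/null -w '%{http_code}' -A 'Wget/1.21' "$url")
    if [ "$status" = "403" ]; then
        echo "blocked"
    else
        echo "not blocked (HTTP $status)"
        return 1
    fi
}

# Usage: check_wget_blocked http://localhost/
```

A non-403 answer suggests that the site is not enabled or that mod_rewrite is not loaded.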
Tool
Install dependencies:
sudo apt install git composer npm
Clone the repository, first removing the html/ directory created by Apache.
cd /var/www && sudo rm -rf html
sudo git clone https://github.com/wikimedia/wikimedia-ocr.git tool
cd /var/www/tool
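Create .env.local with the relevant values (see the Google OCR and Transkribus sections below), then install the PHP and JavaScript dependencies and build the front-end assets: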
sudo composer install --no-dev --optimize-autoloader
sudo npm install
# "npm audit fix" is required to get newer code with fixes for security issues.
sudo npm audit fix
sudo npm run build
Change ownership of all application files to www-data:
sudo chown -R www-data:www-data .
Add a cron job that updates the app whenever there is a new tagged release. Run sudo crontab -e -u www-data and add:
MAILTO=tools.ocr@tools.wmflabs.org
*/10 * * * * /var/www/tool/vendor/wikimedia/toolforge-bundle/bin/deploy.sh prod /var/www/tool
Tesseract
The only configuration for Tesseract is to install it with all available OCR models (languages and scripts):
sudo apt install tesseract-ocr-all
At the time of writing, Tesseract 5 is the stable version which will be installed with that command.
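To confirm what was installed, the version and the list of models can be read back from the tesseract command itself. This is a minimal sketch with hypothetical helper names; it assumes the first line of tesseract --version looks like "tesseract 5.3.0" and that tesseract --list-langs prints one header line followed by one model per line.

```shell
# Hypothetical helper: print the major version of the installed Tesseract.
# The first line of `tesseract --version` is expected to look like
# "tesseract 5.3.0"; stderr is merged in case the output goes there.
tesseract_major_version() {
    tesseract --version 2>&1 | head -n 1 | sed -E 's/^tesseract v?([0-9]+).*/\1/'
}

# Hypothetical helper: count the installed OCR models.
# `tesseract --list-langs` is expected to print a header line followed by
# one model name per line, so the header is skipped before counting.
tesseract_model_count() {
    tesseract --list-langs 2>&1 | tail -n +2 | grep -c .
}

# Usage:
# tesseract_major_version
# tesseract_model_count
```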
Latest Tesseract
If for some reason the very latest Tesseract is required, it can also be installed from source.
- Install the required packages:
sudo apt-get install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
- Optionally install the man pages:
sudo apt-get install --no-install-recommends asciidoc docbook-xsl xsltproc
- Clone the Tesseract repo (home directory is fine):
git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
- Check out the latest tag for Tesseract 5, which at the time of writing is 5.3.2:
git checkout 5.3.2
- Build from source:
./autogen.sh
./configure
make
- Now remove the old Tesseract package, if present:
sudo apt purge tesseract-ocr
This will also remove the trained data files, which we'll re-add later. The make step above takes the longest, so it's important not to remove the old Tesseract until afterwards, to minimize downtime of the tool. Note that parallelization (make -j8) doesn't seem to make any difference.
- Install the new version:
sudo make install
sudo ldconfig
- Clone the trained data files:
cd ~
git clone https://github.com/tesseract-ocr/tessdata_fast.git
- Copy them to /usr/local/share/tessdata:
sudo cp tessdata_fast/*.traineddata /usr/local/share/tessdata
- Make sure all is well by running the check_tesseract script:
cd /var/www/tool/
./check_tesseract.sh
Upgrading Tesseract 5 to git master or another branch
Assuming Tesseract 5 is already installed and you only need to upgrade it to a newer version, follow these simplified steps:
- cd to the tesseract directory in your home dir (see above for cloning)
- Check out the version you want to upgrade to, e.g. for git master:
git checkout master && git pull
- Run sudo make clean to clear any previously compiled files (this is probably not required, but it runs very quickly and gives an additional guarantee)
- Then follow the normal installation steps:
./autogen.sh
./configure
make -j8
sudo make install
sudo ldconfig
- Check the output of tesseract --version: if you upgraded to git master, it should contain a commit hash, and that hash should match the latest commit on the master branch.
- Run the check_tesseract script for the final checks:
cd /var/www/tool/
./check_tesseract.sh
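The commit comparison in the steps above can be done by hand, or roughly automated as sketched below. version_matches_head is a hypothetical helper name. It assumes that a build from git reports a git-describe style version such as "tesseract 5.3.2-34-g1a2b3c4", where the part after "-g" is an abbreviated commit hash, and that the clone lives in ~/tesseract as described above.

```shell
# Hypothetical helper: check that the installed Tesseract was built from the
# commit currently checked out in the given clone (default: ~/tesseract).
version_matches_head() {
    repo="${1:-$HOME/tesseract}"
    # Assumed version format for git builds: "tesseract 5.3.2-34-g1a2b3c4".
    # Extract the abbreviated hash that follows "-g", if there is one.
    abbrev=$(tesseract --version 2>&1 | head -n 1 \
        | sed -n 's/.*-g\([0-9a-f]\{4,\}\).*/\1/p')
    if [ -z "$abbrev" ]; then
        echo "no commit hash in version string"
        return 1
    fi
    head_hash=$(git -C "$repo" rev-parse HEAD) || return 1
    # The full HEAD hash must start with the abbreviated hash.
    case "$head_hash" in
        "$abbrev"*) echo "match: $abbrev" ;;
        *) echo "mismatch: built from $abbrev, HEAD is $head_hash"; return 1 ;;
    esac
}

# Usage: version_matches_head ~/tesseract
```

A tagged release build has no "-g" suffix, so the helper reports that no commit hash is present instead of a mismatch.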
Google OCR
Add the php-bcmath package:
sudo apt install php-bcmath
Download the Google Cloud Vision API keyfile to your local system (see CONTRIBUTING.md for info on obtaining a keyfile), then use scp to copy it to the VPS instance and, on the instance, move it to /var/www/:
scp keyfile.json username@ocr-prod01.wikisource.eqiad1.wikimedia.cloud:/home/username
sudo mv keyfile.json /var/www/
Make sure the .env.local file points to the right place:
APP_GOOGLE_KEYFILE=/var/www/keyfile.json
You may also need to restart Apache:
sudo service apache2 restart
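If Google OCR still fails after the restart, it can help to confirm that the path in .env.local really names a readable file. The sketch below uses a hypothetical helper, check_google_keyfile. It assumes .env.local holds a plain APP_GOOGLE_KEYFILE=/path line as shown above, and it checks readability for whichever user runs it, which is not necessarily www-data.

```shell
# Hypothetical helper: read APP_GOOGLE_KEYFILE from an env file and check
# that the file it names exists, is readable by the current user, and is
# not empty. It does not validate the key itself.
check_google_keyfile() {
    env_file="${1:-/var/www/tool/.env.local}"
    # Use the last assignment in the file and strip any quote characters.
    keyfile=$(sed -n 's/^APP_GOOGLE_KEYFILE=//p' "$env_file" | tail -n 1 | tr -d "\"'")
    if [ -z "$keyfile" ]; then
        echo "APP_GOOGLE_KEYFILE is not set in $env_file"
        return 1
    fi
    if [ ! -r "$keyfile" ] || [ ! -s "$keyfile" ]; then
        echo "keyfile missing, unreadable or empty: $keyfile"
        return 1
    fi
    echo "ok: $keyfile"
}

# Usage: check_google_keyfile /var/www/tool/.env.local
```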
Transkribus
Wikimedia OCR also offers Transkribus as an OCR engine.
To configure it, set the following two environment variables in .env.local:
APP_TRANSKRIBUS_USERNAME=
APP_TRANSKRIBUS_PASSWORD=
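Both variables need a value for this to work, and an empty assignment is easy to overlook. As a rough sketch, the hypothetical helper missing_env_vars below lists which of the named variables have no value in an env file. It only looks for plain NAME=value lines and does not know how the tool itself parses .env.local.

```shell
# Hypothetical helper: print the names of the given variables that have no
# value in an env file. Prints nothing and returns 0 when all are set.
missing_env_vars() {
    env_file="$1"
    shift
    missing=0
    for name in "$@"; do
        # A variable counts as set when a line "NAME=<something>" exists.
        if ! grep -q "^${name}=." "$env_file"; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
# missing_env_vars /var/www/tool/.env.local \
#     APP_TRANSKRIBUS_USERNAME APP_TRANSKRIBUS_PASSWORD
```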