Aspell: Invalid UTF-8 Sequence (Linux)
RoDen
13/07/2022 12:15pm
Encoding is hard-coded in SpellCheckAspell.cpp

s += "iso-8859-1.cmap";


It should be taken from the language's .dat file instead:

charset	       cp1251
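
To make the suggestion concrete, here is a minimal sketch of reading the `charset` key from a .dat stream and deriving a cmap file name from it. `ReadCharset` and `CmapFileName` are hypothetical helpers, not functions from Scribe, and the `<charset>.cmap` naming is only inferred from the `iso-8859-1.cmap` line above.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Return the value of the "charset" key from an Aspell language data
// stream, or an empty string if the key is absent.
// Hypothetical helper; not part of Scribe.
std::string ReadCharset(std::istream &dat)
{
    std::string line;
    while (std::getline(dat, line))
    {
        std::istringstream fields(line);
        std::string key, value;
        // Key and value are separated by any run of spaces or tabs.
        if (fields >> key >> value && key == "charset")
            return value;
    }
    return std::string();
}

// Build a character-map file name from a charset, e.g. "cp1251.cmap".
// Assumes the "<charset>.cmap" naming pattern (not confirmed here).
std::string CmapFileName(const std::string &charset)
{
    assert(!charset.empty());
    return charset + ".cmap";
}
```

With the `charset	       cp1251` line shown above, `ReadCharset` would return `cp1251`, and the hard-coded name could be replaced by `s += CmapFileName(charset);`.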


Otherwise non-ISO dictionaries are processed incorrectly:

../Code/SpellCheckAspell.cpp:970 - Dictionary for 'bg' install attempt 1
../Code/SpellCheckAspell.cpp:770 - Downloading 'ftp://ftp.gnu.org/gnu/aspell/dict/bg/aspell6-bg-4.1-0.tar.bz2' to '/home/user/src/scribe/trunk/Linux/Aspell/dict/bg/aspell6-bg-4.1-0.tar.bz2'
../Code/SpellCheckAspell.cpp:868 - Decompressing '/home/user/src/scribe/trunk/Linux/Aspell/dict/bg/aspell6-bg-4.1-0.tar.bz2'
../Code/SpellCheckAspell.cpp:868 - Decompressing '/home/user/src/scribe/trunk/Linux/Aspell/dict/bg/aspell6-bg-4.1-0.tar'
../Code/SpellCheckAspell.cpp:868 - Decompressing '/home/user/src/scribe/trunk/Linux/Aspell/dict/bg/bg.cwl'
Warning: The string "��������" is invalid. Invalid UTF-8 sequence at position 1. Skipping string.
Warning: The string "���������" is invalid. Invalid UTF-8 sequence at position 1. Skipping string.
Warning: The string "�����" is invalid. Invalid UTF-8 sequence at position 1. Skipping string.
Warning: The string "�������" is invalid. Invalid UTF-8 sequence at position 1. Skipping string.
Warning: The string "�������" is invalid. Invalid UTF-8 sequence at position 1. Skipping string.
Warning: The string "��������" is invalid. Invalid UTF-8 sequence at position 1. Skipping string.
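
For anyone wondering why the failure is always at position 1: in cp1251 every Cyrillic letter is a single byte in the range 0xC0-0xFF, so two letters in a row can never form a valid UTF-8 sequence. The sketch below is a simplified well-formedness check that shows this. `FirstInvalidUtf8` is a hypothetical function, not Aspell's own validator, the 1-based position is an assumption made to mirror the log wording, and the sample word is illustrative rather than taken from the dictionary.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return the 1-based position of the first byte that starts a malformed
// UTF-8 sequence, or 0 if the string passes this check.
// Simplified: it does not reject every overlong form (e.g. E0 80 80)
// or UTF-16 surrogates.
std::size_t FirstInvalidUtf8(const std::string &s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra; // number of continuation bytes expected
        if (c < 0x80)                    extra = 0;
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return i + 1; // stray continuation or invalid lead byte

        for (std::size_t k = 1; k <= extra; ++k)
        {
            if (i + k >= s.size())
                return i + 1; // sequence is truncated
            unsigned char n = static_cast<unsigned char>(s[i + k]);
            if ((n & 0xC0) != 0x80)
                return i + 1; // not a continuation byte
        }
        i += extra + 1;
    }
    return 0;
}
```

The bytes `E4 F3 EC E0` (an illustrative Cyrillic word encoded in cp1251) fail at position 1, because `E4` announces a three-byte sequence and `F3` is not a continuation byte. The same word encoded as UTF-8 (`D0 B4 D1 83 D0 BC D0 B0`) passes.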


Also, most Aspell dictionaries (except Danish, English, German, Greek & Portuguese) are outdated. Where possible, it would be better to use the precompiled dictionaries shipped by the local distribution (which are better maintained), or to advise users to do so.
fret
13/07/2022 12:25pm
Encoding is hard-coded in SpellCheckAspell.cpp

s += "iso-8859-1.cmap";

This is in SetupPaths(), which just checks for the presence of cmap files. I just chose a random one to test for (as opposed to all of them, I guess). That 'iso-8859-1' doesn't affect the encoding or decoding at all.
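
To illustrate what a presence check of that kind amounts to, here is a rough sketch. `HasCmap` is a hypothetical helper and not the actual code in SetupPaths(); it only tests whether one cmap file can be opened and never uses the file to encode or decode anything.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Return true if "<dir>/<charset>.cmap" exists and can be opened for
// reading. The file's contents are never read.
// Hypothetical helper; not the real SetupPaths() logic.
bool HasCmap(const std::string &dir, const std::string &charset)
{
    assert(!charset.empty());
    std::string path = dir;
    if (!path.empty() && path[path.size() - 1] != '/')
        path += '/';
    path += charset + ".cmap";

    std::ifstream f(path.c_str());
    return f.good();
}
```

A call such as `HasCmap(dataDir, "iso-8859-1")` (with `dataDir` standing in for wherever the Aspell data files live) would only answer "is this folder a usable data directory?".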

The warnings you're seeing are more likely due to bad input than to issues with Aspell. Are they easily reproducible? Maybe one particular email? Like, what are you doing at the time they show up in the console?
RoDen
13/07/2022 6:54pm
The warnings show up when I try to install any non-ISO dictionary. After the installation all such dictionaries are empty because all the strings with non-ISO symbols are skipped.
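
One possible way around the skipped strings, assuming the decompressed word list is in the charset named in the .dat file and the consumer expects UTF-8, would be to convert each word before passing it on. The sketch below uses POSIX `iconv` for that. `ToUtf8` is a hypothetical helper, and this is not a description of what Scribe currently does.

```cpp
#include <cassert>
#include <cerrno>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert `in` from the named legacy charset to UTF-8 using POSIX iconv.
// Throws std::runtime_error if the charset is unknown or the input
// cannot be converted. Hypothetical helper; not part of Scribe.
std::string ToUtf8(const std::string &in, const char *fromCharset)
{
    assert(fromCharset);
    iconv_t cd = iconv_open("UTF-8", fromCharset);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("unsupported charset");

    std::string out;
    char buf[256];
    // iconv takes a non-const pointer but does not modify the input.
    char *src = const_cast<char *>(in.data());
    size_t srcLeft = in.size();

    while (srcLeft > 0)
    {
        char *dst = buf;
        size_t dstLeft = sizeof(buf);
        size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
        int err = errno; // save before any other call can change it
        out.append(buf, sizeof(buf) - dstLeft);
        // E2BIG only means the output buffer filled up; keep looping.
        if (rc == (size_t)-1 && err != E2BIG)
        {
            iconv_close(cd);
            throw std::runtime_error("conversion failed");
        }
    }
    iconv_close(cd);
    return out;
}
```

For example, `ToUtf8(word, "CP1251")` would turn the cp1251 bytes `E4 F3 EC E0` into the UTF-8 bytes `D0 B4 D1 83 D0 BC D0 B0`. The charset name would come from the .dat file rather than being hard-coded.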

And I've managed to successfully install the Russian dictionary with the following changes:

s += "koi8-r.cmap";
// s += "iso-8859-1.cmap";


Also, spell checking doesn't work in the "Text" tab.