NAME
    transliterate.pl - Transliterate text files

SYNOPSIS
    transliterate.pl [options] [input file]

    Start the transliteration engine with the given file as input. The input
    file defaults to STDIN if no filename is given.

OPTIONS
    --output <filename>
            Sets the output file to print to.

            If the file exists already and --force is not set, the user is
            asked if the file should be overwritten or appended to.

            Default: "STDOUT" (print to terminal)

    --config <filename>
            Sets the configuration file to use.

            Default: "config"

    --checkduplicates
            Prints all duplicate words within single table files and across
            tables that are replaced within the same group, then exits the
            program.

            Note that this simply prints all duplicates, even ones that are
            legitimate. When duplicates are found during normal operation of
            the program, they are simply combined in exactly the same way as
            the regular word choices.

            Also note that the words are still added as possible choices,
            which may be slightly confusing. If, for instance, a word "word"
            is stored in the tables "tablea", "tableb", and "tablec" with
            the replacements "a", "b", and "c", the first duplicate message
            will say that the first occurrence was in table "tablea" with
            the replacement "a", and the second duplicate message will say
            that the first occurrence was in table "tablea" with the
            replacement "a$b" (assuming $ is the value set as choicesep in
            the config). This is just something to be aware of.

            On that note, before duplicates are checked between tables in
            the same replacement group, duplicates inside the same file are
            already replaced, so that might be a bit confusing as well.

    --nochoices
            Disables prompting for the right word when multiple replacement
            words exist.

            This can be used to "weed out" all the unknown words before
            commencing the laborious task of choosing the right word every
            time multiple options exist.

    --nounknowns
            Disables prompting for the right word when a word is not found
            in the database.

            This can be used together with --nochoices to perform a quick
            test of how well the actual engine is working without having to
            click through all the prompts.

    --debug Prints information helpful for debugging problems with the match
            and group statements.

            For each match or group statement which replaces anything, the
            original statement is printed (the format is a bit different
            than in the config) and each actual word that's replaced is
            printed.

    --debugspecial
            This option is only useful for automatic testing of the
            transliteration engine.

            If --nochoices is enabled, each word in the input with multiple
            choices will be output, along with the number of choices (can be
            used to test the proper functioning of choicesep in the config
            file).

            If --nounknowns is enabled, each unknown word in the input is
            printed (can be used to test that the ignore options are working
            correctly).

    --force Always overwrites the output and error files without asking.

    --start <line number>
            Starts at the given line number instead of the beginning of the
            file.

            Note: when "Stop processing" is pressed, the current line number
            is printed out. This is the current line that was being
            processed, so it has not been printed to the output file yet and
            thus the program must be resumed at that line, not the one
            afterwards.

    --errors <filename>
            Specifies a file to write errors in. Note that this does not
            refer to actual errors, but to any words that were temporarily
            ignored (i.e. words for which "Ignore: This run" was clicked).

            If no file is specified, nothing is written. If a file is
            specified that already exists and --force is not set, the user
            is prompted for action.

    --help  Displays the full documentation.

DESCRIPTION
    transliterate.pl will read the given input file and transliterate it
    based on the given configuration file, prompting the user for action if
    a word has multiple replacement options or is not found in the database.

    See "CONFIGURATION" for details on what is possible.

    Note that this is not some sort of advanced transliteration engine which
    understands the grammar of the language and tries to guess words based
    on that. This is only a glorified find-and-replace program with some
    extra features to make it useful for transliterating text using large
    wordlists.

    WARNING: All input data is assumed to be UTF-8!

WORD CHOICE WINDOW
    The word choice window is opened any time one word has multiple
    replacement options and prompts the user to choose one.

    For each word with multiple options, the user must choose the right
    option and then press "Accept changes" to finalize the transliteration
    of the current line. The button to accept changes is selected by
    default, so it is possible to just press enter instead of manually
    clicking it. Before the line is finalized, the user may press "Undo" to
    undo any changes on the current line.

    "Skip word" just leaves it as is. This shouldn't be needed in most cases
    since choicesep should always be set to a character that doesn't occur
    normally in the text anyways.

    "Open in unknown word window" will open the unknown word window with the
    current word selected. This is meant as a helper if you notice that
    another word choice needs to be added.

    Warning: This is very inconsistent and buggy! Since the unknown word
    window is just opened directly, it isn't modified to make more sense for
    this situation. Whenever "Add replacement" is pressed, the whole line is
    re-transliterated as usual, but the word choice window is opened again
    right afterwards. If you just want to go back to the word choice window,
    press the ignore button for "whole line" since that shouldn't break
    anything. There are weird inconsistencies, though - for instance, if you
    delete all words in the tables, then press "Reload config", the line
    will be re-transliterated and none of the words will actually be found,
    but it will still go on because control passes back to the word choice
    window no matter what. Also, none of the word choices that were already
    done on this line are saved since the line is restarted from the
    beginning. As I said, it's only there as a helper and is very
    buggy/inconsistent. Maybe I'll make everything work better in a future
    release.

    "Stop processing" will exit the program and print the line number that
    was currently being processed.

UNKNOWN WORD WINDOW
    The unknown word window is opened any time a word could not be replaced.

    Both the context from the original script and the context from the
    transliterated version (so far) are shown. If a part of the text is
    selected in one of the text boxes and "Use selection as word" is pressed
    for the appropriate box, the selected text is used for the action that
    is taken subsequently. "Reset text" resets the text in the text box to
    its original state (except for the highlight because I'm too lazy to do
    that).

    The possible actions are:

    Ignore  "This run" only ignores the word until the program exits, while
            "Permanently" saves the word in the ignore file specified in the
            configuration. "Whole line" stops asking for unknown words on
            this line and prints the line out as it originally was in the
            file. Note that any words in the original line that contain
            choicesep will still cause the word choice window to appear due
            to the way it is implemented. Just press "Skip word" if that
            happens.

    Retry without <display name>
            Removes all characters specified in the corresponding
            retrywithout statement in the config from the currently selected
            word and re-transliterates just that word. The result is then
            pasted into the text box beside "Add replacement" so it can be
            added to a table. This is only a sort of helper for languages
            like Urdu in which words often can be written with or without
            diacritics. If the "base form" without diacritics is already in
            the tables, this button can be used to quickly find the
            transliteration instead of having to type it out again. Any part
            of the word that couldn't be transliterated is just pasted
            verbatim into the text box (but after the characters have been
            removed).

            Note that the selection can still be modified after this, before
            pressing "Add to list". This could potentially be useful if a
            word is in a table that is expanded using "noroot" because for
            instance "Retry without diacritics" would only work with the
            full word (with the ending), but only the stem should be added
            to the list. If that is the case, "Retry without diacritics"
            could be pressed with the whole word selected, but the ending
            could be removed before actually pressing "Add to list".

            A separate button is shown for every retrywithout statement in
            the config.

    Add to list
            Adds the word typed in the text box beside "Add replacement" to
            the selected table file as the replacement for the word
            currently selected and re-runs the replacement on the current
            line. All table files that do not have nodisplay set are shown
            as options; see "CONFIGURATION".

            Warning: This simply appends the word and its replacement to the
            end of the file, so it will cause an error if there was no
            newline ("\n") at the end of the file before.

            Note that this always re-transliterates the entire line
            afterwards. This is to allow more flexibility. Consider, for
            instance, a compound word of which the first part is also a
            valid single word. If the entire line was not re-transliterated,
            it would be impossible to add a replacement for that entire
            compound word and have it take effect during the same run since
            the first part of the word would not even be available for
            transliteration anymore.

            One problem is that the word is just written directly to the
            file and there is no undo. This is the way it currently is and
            will probably not change very soon. If a mistake is made, the
            word can always be removed again manually from the list and
            "Reload config" pressed.

    Reload config
            Reloads the configuration file along with all tables and re-runs
            the replacement on the current line. Note that this can take a
            short while since the entire word database has to be reloaded.

    Stop processing
            Prints the current line number to the terminal and exits the
            program.

            The program can always be started again at this line number
            using the --start option if needed.

INTERNALS/EXAMPLES
    This section was added to explain to the user how the transliteration
    process works internally since that may be necessary to understand why
    certain words are replaced the way they are.

    First off, the process works line-by-line, i.e. no match statement will
    ever match anything that crosses the end of a line.

    Each line is initially stored as one chunk which is marked as
    untransliterated. Then, all match, matchignore, and replace (or, rather,
    group) statements are executed in the order they appear in the config
    file. Whenever a word/match is replaced, it is split off into a separate
    chunk which is marked as transliterated. A chunk marked as
    transliterated *is entirely ignored by any replacement statements that
    come afterwards*. Note that beginword and endword can always match at
    the boundary between an untransliterated and transliterated chunk. This
    is to facilitate automated replacement of certain grammatical
    constructions. For instance:

    If the string "a-" could be attached as a prefix to any word and needed
    to be replaced as "b-" everywhere, it would be quite trivial to add a
    match statement 'match "a-" "b-" beginword'. If run on the text
    "a-word", where "word" is some word that should be transliterated as
    "word_replaced", and the group replace statement for the word comes
    after the match statement given above, the following would happen:
    First, the match statement would replace "a-" and split the text into
    the two chunks "b-" and "word", where "b-" is already marked as
    transliterated. Since "word" is now separate, it will be matched by the
    group replace statement later, even if it has beginword set and would
    normally not match if "a-" came before it. Thus, the final output will
    be "b-word_replaced", allowing for the uniform replacement of the prefix
    instead of having to add each word twice, once with and once without the
    prefix.
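
    The prefix example above can be sketched in a few lines of Python. This
    is only an illustration of the chunk model, not the program's actual
    Perl code: the function apply_match and the (text, done) pair
    representation are made up for this sketch, and beforeword is assumed to
    be the default "\s".

```python
import re

def apply_match(chunks, pattern, replacement, beginword=False, beforeword=r"\s"):
    """Apply one replacement to a list of (text, done) chunks.

    Each replaced match is split off as its own chunk with done=True;
    chunks that are already done are never looked at again.
    """
    regex = re.compile(pattern)
    before = re.compile(beforeword)
    result = []
    for text, done in chunks:
        if done:
            result.append((text, True))
            continue
        pos = 0
        for m in regex.finditer(text):
            # beginword: position 0 of an untransliterated chunk is either
            # the start of the line or a boundary to a transliterated chunk,
            # so it always counts as a word beginning.
            if beginword and m.start() > 0 and not before.fullmatch(text[m.start() - 1]):
                continue
            if m.start() > pos:
                result.append((text[pos:m.start()], False))
            result.append((replacement, True))
            pos = m.end()
        if pos < len(text):
            result.append((text[pos:], False))
    return result

line = [("a-word", False)]
line = apply_match(line, "a-", "b-", beginword=True)
line = apply_match(line, "word", "word_replaced", beginword=True)
print("".join(text for text, _ in line))  # prints "b-word_replaced"
```

    Without the first call, "word" would not be replaced at all, because
    the character before it ("-") does not match beforeword.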

    In certain cases, this behavior may not be desired. Consider, for
    instance, a prefix "c-" which cannot be replaced uniformly as in the
    example above due to differences in the source and destination script.
    Since it cannot be replaced uniformly, two words "word1" and "word2"
    would both need to be specified separately with replacements for
    "c-word1" and "c-word2". If, however, the prefix "c-" has an alternate
    spelling "c " (without the hyphen), it would be very useful to be able
    to automatically recognize that as well. This is where the nofinal
    attribute for the match statements comes in. If there is a match
    statement 'match "c " "c-" beginword nofinal', the replaced chunk is not
    marked as transliterated, so after executing this statement on the text
    "c word1", there will still only be one chunk, "c-word1", allowing for
    the regular word replacements to function properly.
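
    The difference that nofinal makes can be shown with the same kind of
    Python sketch. Again, this is a hypothetical model and not the program's
    code: apply_match and the (text, done) pairs are invented for
    illustration, and the beginword check is left out to keep it short.

```python
import re

def apply_match(chunks, pattern, replacement, nofinal=False):
    """Apply one match statement to a list of (text, done) chunks."""
    regex = re.compile(pattern)
    result = []
    for text, done in chunks:
        if done:
            result.append((text, True))
        elif nofinal:
            # Replace in place: the chunk stays whole and untransliterated,
            # so later statements can still match across the replaced part.
            result.append((regex.sub(lambda m: replacement, text), False))
        else:
            # Split each replacement off as a separate transliterated chunk.
            pos = 0
            for m in regex.finditer(text):
                if m.start() > pos:
                    result.append((text[pos:m.start()], False))
                result.append((replacement, True))
                pos = m.end()
            if pos < len(text):
                result.append((text[pos:], False))
    return result

print(apply_match([("c word1", False)], "c ", "c-", nofinal=True))
# prints [('c-word1', False)]
print(apply_match([("c word1", False)], "c ", "c-"))
# prints [('c-', True), ('word1', False)]
```

    With nofinal, a later table entry for "c-word1" can still match the
    whole chunk; without it, only "word1" is left for later statements.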

    Once all the replacement statements have been processed, each chunk of
    text that is not marked as transliterated yet is split based on the
    split pattern specified in the config and all actual characters matched
    by the split pattern are marked as transliterated (this usually means
    all the spaces, newlines, quotation marks, etc.). Any remaining
    words/text chunks that are still marked as untransliterated are now
    processed by the unknown word window. If one of these remaining unknown
    chunks is present in the file specified by the ignore statement in the
    config, it is simply ignored and later printed out as is. After all
    untransliterated words have either had a replacement added or been
    ignored, any words with multiple replacement choices are processed by
    the word choice window. Once this is all done, the final output is
    written to the output file and the process is repeated with the next
    line. Note that the entire process is started again each time a word is
    added to a table or the config is reloaded from the unknown word window.
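
    The final splitting step described above might look roughly like this
    in Python. The names split_unknown and unknown_words are hypothetical,
    the ignore file is modelled as a plain set of words, and the default
    split pattern "\s+" is assumed.

```python
import re

def split_unknown(chunks, split_pattern=r"\s+"):
    """Split untransliterated chunks on split_pattern.

    The separators themselves are marked as transliterated; whatever is
    still untransliterated afterwards is a candidate unknown word.
    """
    regex = re.compile(split_pattern)
    result = []
    for text, done in chunks:
        if done:
            result.append((text, True))
            continue
        pos = 0
        for m in regex.finditer(text):
            if m.start() > pos:
                result.append((text[pos:m.start()], False))
            if m.end() > m.start():
                result.append((m.group(), True))
            pos = m.end()
        if pos < len(text):
            result.append((text[pos:], False))
    return result

def unknown_words(chunks, ignored):
    """Return the words that would be passed to the unknown word window."""
    return [text for text, done in chunks if not done and text not in ignored]

chunks = [("b-", True), ("word_replaced", True), (" foo  bar\n", False)]
chunks = split_unknown(chunks)
print(unknown_words(chunks, ignored={"bar"}))  # prints ['foo']
```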

CONFIGURATION
    These are the commands accepted in the configuration file. Any
    parameters in square brackets are optional. Comments are started with
    "#". Strings (filenames, regex strings, etc.) are enclosed in double
    quotes ("").

    The match, matchignore, and replace commands are executed in the order
    they are specified, except that all replace commands within the same
    group are executed together.

    The match and matchignore statements accept any RegEx strings and are
    thus very powerful. The group statements only work with the non-RegEx
    words from the tables, but are very efficient for large numbers of words
    and should thus be used for the main bulk of the words.

    Any duplicate words found will cause the user to be prompted to choose
    one option every time the word is replaced in the input text.

    Note that any regex strings specified in the config should not contain
    capture groups, as that would break the endword functionality since this
    is also implemented internally using capture groups. Capture groups are
    also entirely pointless in the config since they currently cannot be
    used as part of the replacement string in match statements. Lookaheads
    and lookbehinds are fine, though, and could be useful in certain cases.

    All tables must be loaded before they are used, or there will be an
    error that the table does not exist.

    Warning: If a replace statement is located before an expand statement
    that would have impacted the table used, there will be no error but the
    expand statement won't have any impact.

    Basic rule of thumb: Always put the table statements before the expand
    statements and the expand statements before the replace statements.

    split <regex string>
            Sets the RegEx string to be used for splitting words. This is
            only used for splitting the words which couldn't be replaced
            after all replacement has been done, before prompting the user
            for unknown words.

            Note that split should probably always contain at least "\n",
            since otherwise all of the newlines will be marked as unknown
            words. Usually, this will be included anyways through "\s".

            Note also that split should probably include the "+"
            RegEx-quantifier since that allows the splitting function in the
            end to ignore several splitting characters right after each
            other (e.g. several spaces) in one go instead of splitting the
            string again for every single one of them. This shouldn't
            actually make any difference functionality-wise, though.

            Default: "\s+" (all whitespace)

    beforeword <regex string>
            Sets the RegEx string to be matched before a word if beginword
            is set.

            Default: "\s"

    afterword <regex string>
            Sets the RegEx string to be matched after a word if endword is
            set.

            Note that afterword should probably always contain at least
            "\n", since otherwise words with endword set will not be matched
            at the end of a line.

            beforeword and afterword will often be exactly the same, but
            they are left as separate options in case more fine-tuning is
            needed.

            Default: "\s"

    tablesep <string>
            Sets the separator used to split the lines in the table files
            into the original and replacement word.

            Default: "Tab"

    choicesep <string>
            Sets the separator used to split replacement words into multiple
            choices for prompting the user.

            Default: "$"

    comment <string>
            If enabled, anything after "<string>" will be ignored on all
            lines in the input file. This will not be displayed in the
            unknown word window or word choice window but will still be
            printed in the end, with the comment character removed (that
            seems to be the most sensible thing to do).

            Note that this is really just a "dumb replacement", so there's
            no way to prevent a line with the comment character from being
            ignored. Just try to always set this to a character that does
            not occur anywhere in the text (or don't use the option at all).
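
            The behavior of comment can be pictured with a small Python
            sketch. The function process_line and its transliterate
            argument are hypothetical stand-ins, and it is an assumption
            here that only the first occurrence of the comment string on a
            line counts.

```python
def process_line(line, comment, transliterate):
    """Transliterate only the part before `comment`.

    The rest of the line is appended unchanged, with the comment string
    itself dropped.
    """
    if comment and comment in line:
        text, _, rest = line.partition(comment)
        return transliterate(text) + rest
    return transliterate(line)

# str.upper stands in for the real transliteration of a line.
print(repr(process_line("abc #def\n", "#", str.upper)))  # prints 'ABC def\n'
```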

    ignore <filename>
            Sets the file of words to ignore.

            This has to be set even if the file is just empty because the
            user can add words to it from the unknown word window.

    table <table identifier> <filename> [nodisplay] [revert]
            Load the table from "<filename>", making it available for later
            use in the expand and replace commands using the identifier
            "<table identifier>".

            If nodisplay is set, the filename for this table is not shown in
            the unknown word window. If, however, the same filename is
            loaded again for another table that does not have nodisplay set,
            it is still displayed.

            If revert is set, the original and replacement words are
            switched. This can be useful for creating a config for
            transliterating in the opposite direction with the same
            database. I don't know why I called it "revert" since it should
            actually be called "reverse". I guess I was a bit confused.

            Note that if "<filename>" is not an absolute path, it is taken
            to be relative to the location of the configuration file.

            The table files simply consist of tablesep-separated values,
            with the word in the original script first and the replacement
            word second. Both the original and replacement word can
            optionally have several parts separated by choicesep. If the
            original word has multiple parts, it is separated and each of
            the parts is added to the table with the replacement. If the
            replacement has multiple parts, the user will be prompted to
            choose one of the options during the transliteration process. If
            the same word occurs multiple times in the same table with
            different replacements, the replacements are automatically added
            as choices that will be handled by the word choice window.

            If, for whatever reason, the same table is needed twice, but
            with different endings, the table can simply be loaded twice
            with different IDs. If the same path is loaded, the table that
            has already been loaded will be reused. Note that this feature
            was added before adding revert, so the old table is used even if
            it had revert set and the new one doesn't. This is technically a
            problem, but I don't know of any real-world case where it would
            be a problem, so I'm too lazy to change it. Tell me if it
            actually becomes a problem for you.

            WARNING: Don't load the same table file both with and without
            revert in the same config! When a replacement word is added
            through the GUI, the program has to know which way to write the
            words. Currently, whenever a table file is loaded with revert
            anywhere in the config (even if it is loaded without revert in a
            different place), words will automatically be written as if
            revert was on. I cannot currently think of any reason why
            someone would want to load a file both with and without revert
            in the same config, but I still wanted to add this warning just
            in case.
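
            The table file format described above can be illustrated with
            a short Python sketch. This is not how the program stores its
            tables (it uses a trie internally, see "BUGS"); the function
            load_table and the dictionary layout are made up here. Skipping
            empty lines and dropping identical repeated choices are
            assumptions, as is applying revert before the choicesep split.

```python
def load_table(lines, tablesep="\t", choicesep="$", revert=False):
    """Parse table lines into {original word: [replacement choices]}."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: empty lines are skipped
        original, replacement = line.split(tablesep, 1)
        if revert:
            original, replacement = replacement, original
        # Every part of the original gets every part of the replacement.
        for word in original.split(choicesep):
            choices = table.setdefault(word, [])
            for choice in replacement.split(choicesep):
                if choice not in choices:
                    choices.append(choice)
    return table

lines = ["a$b\tx\n", "c\ty$z\n", "c\tw\n"]
print(load_table(lines))
# prints {'a': ['x'], 'b': ['x'], 'c': ['y', 'z', 'w']}
```

            Here "c" ends up with three choices: two from choicesep in the
            second line and one from the duplicate entry in the third.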

    expand <table identifier> <word ending table> [noroot]
            Expand the table "<table identifier>", i.e. generate all the
            word forms using the word endings in "<word ending table>",
            saving the result back into the table "<table identifier>".

            Note: There used to be a "<new table identifier>" argument to
            create a new table in case one table had to be expanded with
            different endings. This has been removed because it was a bit
            ugly, especially since there wasn't a proper mapping from table
            IDs to filenames anymore. If this functionality is needed, the
            same table file can simply be loaded multiple times. See the
            table section above.

            If noroot is set, the root forms of the words are not kept.

            If the replacement for a word ending contains choicesep, it is
            split and each part is combined with the root form separately
            and the user is prompted to choose one of the options later. It
            is thus possible to allow multiple choices for the ending if
            there is a distinction in the replacement script but not in the
            source script. Note that each of the root words is also split
            into its choices (if necessary) during the expanding, so it is
            possible to use choicesep in both the endings and root words.
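
            As a rough Python sketch of what expanding means (the function
            expand and the dictionary layout are hypothetical, with both
            tables already split into their choices):

```python
def expand(table, endings, noroot=False):
    """Combine every root word with every ending.

    Both arguments map a word (or ending) in the source script to a list
    of replacement choices. All combinations of root choices and ending
    choices become choices of the expanded word.
    """
    expanded = {} if noroot else {word: list(c) for word, c in table.items()}
    for root, root_choices in table.items():
        for ending, ending_choices in endings.items():
            forms = expanded.setdefault(root + ending, [])
            for root_choice in root_choices:
                for ending_choice in ending_choices:
                    if root_choice + ending_choice not in forms:
                        forms.append(root_choice + ending_choice)
    return expanded

table = {"root": ["R1", "R2"]}
endings = {"s": ["S"], "ed": ["E1", "E2"]}
print(expand(table, endings, noroot=True))
# prints {'roots': ['R1S', 'R2S'], 'rooted': ['R1E1', 'R1E2', 'R2E1', 'R2E2']}
```

            Without noroot, the entry for "root" itself would be kept as
            well.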

    match <regex string> <replacement string> [beginword] [endword]
    [nofinal]
            Perform a RegEx match using the given "<regex string>",
            replacing it with "<replacement string>". Note that the
            replacement cannot contain any RegEx (e.g. groups) in it.
            beginword and endword specify whether the match must be at the
            beginning or ending of a word, respectively, using the RegEx
            specified in beforeword and afterword. If nofinal is set, the
            string is not marked as transliterated after the replacement,
            allowing it to be modified by subsequent match or replace
            commands.

    matchignore <regex string> [beginword] [endword]
            Performs a RegEx match in the same manner as match, except that
            the original match is used as the replacement instead of
            specifying a replacement string, i.e. whatever is matched is
            just marked as transliterated without changing it.

    group [beginword] [endword]
            Begins a replacement group. All replace commands must occur
            between group and endgroup, since they are then grouped together
            and replaced in one go. beginword and endword act in the same
            way as specified for match and apply to all replace statements
            in this group.

    replace <table identifier> [override]
            Replace all words in the table with the identifier "<table
            identifier>", using the beginword and endword settings specified
            by the current group.

            Unless override is set on the latter table, if the same word
            occurs in two tables with different replacements, both are
            automatically added as choices. See "WORD CHOICE WINDOW".

            override can be useful if the same database is used for both
            directions and one direction maps multiple words to one word,
            but in the other direction this word should always default to
            one of the choices. In that case, a small table with these
            special cases can be created and put at the end of the main
            group statement with override set. This is technically redundant
            since you could just add a special group with only the override
            table in it earlier in the config, but it somehow seems cleaner
            this way.

            Note that a table must have been loaded before being used in a
            replace statement.
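
            How the tables of one group are combined, and what override
            changes, can be sketched in Python as follows. The function
            build_group and the dictionary layout are invented for this
            illustration; the program itself works differently internally.

```python
def build_group(tables):
    """Merge the tables of one group.

    `tables` is a list of (table, override) pairs in the order of the
    replace statements; each table maps a word to its list of choices.
    """
    group = {}
    for table, override in tables:
        for word, choices in table.items():
            if override or word not in group:
                # First occurrence, or a later table that overrides.
                group[word] = list(choices)
            else:
                # Same word in an earlier table: add the new choices.
                group[word] += [c for c in choices if c not in group[word]]
    return group

main = {"w": ["a", "b"], "v": ["c"]}
special = {"w": ["b"]}
print(build_group([(main, False), (special, True)]))
# prints {'w': ['b'], 'v': ['c']}
print(build_group([(main, False), ({"v": ["d"]}, False)]))
# prints {'w': ['a', 'b'], 'v': ['c', 'd']}
```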

    endgroup
            End a replacement group.

    retrywithout <display name> [character] [...]
            Adds a button to the unknown word window to retry the
            replacements on the selected word, first removing the given
            characters. The button is named "<display name>" and located
            after the "Retry without" label. Whatever is found with the
            replacements is pasted into the regular text box for the "Add
            replacement" functionality.

            This can be used as an aid when, for instance, words can be
            written with or without certain diacritics. If the actual word
            without diacritics is already in the database and there is a
            retrywithout statement for all the diacritics, the button can be
            used to quickly find the replacement for the word instead of
            having to type it out manually. The same goes for compound words
            that can be written with or without a space.

            It is also possible to specify retrywithout without any
            characters, which just adds a button that takes whatever word is
            selected and retries the replacements on it. This can be useful
            if you want to manually edit words and quickly see if they are
            found with the edits in place.

            Note that all input text is first normalized to the Unicode
            canonical decomposition form so that diacritics can be removed
            individually.

            Also note that all buttons are currently just dumped in the GUI
            without any sort of wrapping, so they'll run off the screen if
            there are too many. Tell me if this becomes a problem. I'm just
            too lazy to change it right now.

            Small warning: This only removes the given characters from the
            word selected in the GUI, not from the tables. Thus, this only
            works if the version of the word without any of the characters
            is already present in the tables. It would be useful when
            handling diacritics if the program could simply make a
            comparison while completely ignoring diacritics, but I haven't
            figured out a nice way to implement that yet.

            Historical note: This was called diacritics in a previous
            version and only allowed removal of diacritics. This is exactly
            the same functionality, just generalized to allow removal of any
            characters with different buttons.
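
            The effect of a retrywithout button can be approximated in
            Python like this. It is a simplified sketch: strip_for_retry
            and retry_without are hypothetical names, and the lookup in a
            single dictionary stands in for re-running all replacements on
            the word.

```python
import unicodedata

def strip_for_retry(word, characters):
    """Decompose the word (NFD) and drop the given characters."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if ch not in characters)

def retry_without(word, characters, table):
    """Look up the stripped word; paste it verbatim if it is not found."""
    stripped = strip_for_retry(word, characters)
    return table.get(stripped, stripped)

# U+0301 is the combining acute accent; "caf\u00e9" contains the
# precomposed letter, which NFD splits into "e" plus U+0301.
print(retry_without("caf\u00e9", ["\u0301"], {"cafe": "KAFE"}))  # prints "KAFE"
```

            The decomposition is what makes this work: without it, the
            precomposed letter would not contain a separate accent
            character to remove.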

    targetdiacritics <diacritic> [...]
            This was only added to simplify transliteration from Hindi to
            Urdu with the same database. When this is set, the choices in
            the word choice window are sorted in descending order based on
            the number of diacritics from this list that are matched in each
            choice. This is so that when transliterating from Hindi to Urdu,
            the choice with the most diacritics is always at the top.

            Additionally, if there are *exactly* two choices for a word and
            one of them contains diacritics but the other one doesn't, the
            one containing diacritics is automatically taken without ever
            prompting the user. This is, admittedly, a very
            language-specific feature, but I couldn't think of a simple way
            of adding it without building it directly into the actual
            program.

            Note that due to the way this is implemented, it will not take
            any effect if --nochoices is enabled.

            The attentive reader will notice at this point that most of the
            features in this program were added specifically for dealing
            with Urdu and Hindi, which does appear to make sense,
            considering that this program was written specifically for
            transliterating Urdu to Hindi and vice versa (although not quite
            as much vice versa).
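
            The two effects of targetdiacritics can be sketched in Python
            as follows. The function names are made up, "^" and "~" stand
            in for real diacritics to keep the example readable, and
            counting every occurrence of a diacritic (rather than each
            distinct diacritic once) is an assumption.

```python
def order_choices(choices, diacritics):
    """Sort choices so that the ones with the most diacritics come first."""
    def count(choice):
        return sum(choice.count(d) for d in diacritics)
    return sorted(choices, key=count, reverse=True)

def auto_choice(choices, diacritics):
    """Return the choice to take without prompting, or None.

    Only applies if there are exactly two choices and exactly one of
    them contains a diacritic from the list.
    """
    if len(choices) == 2:
        has = [any(d in choice for d in diacritics) for choice in choices]
        if has[0] != has[1]:
            return choices[0] if has[0] else choices[1]
    return None

diacritics = ["^", "~"]
print(order_choices(["ab", "a^b~", "a^b"], diacritics))
# prints ['a^b~', 'a^b', 'ab']
print(auto_choice(["ab", "a^b"], diacritics))   # prints "a^b"
print(auto_choice(["a^b", "a~b"], diacritics))  # prints "None"
```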

BUGS
    Although it may not seem like it, one of the ugliest parts of the
    program is the GUI functionality that allows the user to add a
    replacement word. The problem is that all information about the expand
    and replace statements has to be kept in order to properly handle adding
    a word to one of the files and simultaneously adding it to the currently
    loaded tables *without reloading the entire config*. The way it
    currently works, the replacement word is directly written to the file,
    then all expand statements that would have impacted the words from this
    file are redone (just for the newly added word) and the resulting words
    are added to the appropriate tables (or, technically, the appropriate
    'trie'). Since a file can be mapped to multiple table IDs and a table ID
    can occur in multiple replace statements, this is more complicated than
    it sounds, and thus it is very likely that there are bugs lurking here
    somewhere. Do note that "Reload config" will always reload the entire
    configuration, so that's safe to do even if the on-the-fly replacing
    doesn't work.

    In general, I have tested the GUI code much less than the rest since you
    can't really test it automatically very well.

    The code is generally quite nasty, especially the parts belonging to the
    GUI. Don't look at it.

    Tell me if you find any bugs.

SEE ALSO
    perlre, perlretut

LICENSE
    Copyright (c) 2019, 2020 lumidify <nobody[at]lumidify.org>

    Permission to use, copy, modify, and distribute this software for any
    purpose with or without fee is hereby granted, provided that the above
    copyright notice and this permission notice appear in all copies.

    THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
    WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
    MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
    ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
    WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
    ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
    OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

