Tafseer Algorithm

This section consists of two parts:

The algorithm’s internal details, with an explanation of each step, and
its implementation in the romanalfaz package.

Algorithm

Roman-to-Roman Transliteration

The following Jupyter notebook walks through the 12 steps of the algorithm described in Tafseer Ahmed’s paper. The accompanying descriptions are verbatim from the aforementioned article.

import re

from romanalfaz.algorithm import (
    permuteAllOccurrences, replaceEnding, permuteAllEndings, permuteConsecutiveVowels
)

Example Words

The following eight example words are carried through every step to show the transformation at each stage.

rmWords00 = [
    'alag',
    'ullo',
    'bukhar',
    'bhai',
    'hai',
    'bhayi',
    'shohrat',
    'kya',
]

for i, w in enumerate(rmWords00, start=1):
    print(f'{i:02}. "{w}"')

"alag"
"ullo"
"bukhar"
"bhai"
"hai"
"bhayi"
"shohrat"
"kya"

Step 1

Except a, e, i, o, u, y and h, change the case of all the characters of rom_word into capital. This transformed encoded word is termed as enc_rom_word.

Explanation

As vowel mapping needs more complex processing than one-to-one replacement, the encoding is applied on consonants only. A character in capital case means a rule has already been applied to it, so the following lower-priority rules will not be accidentally applied to it.

Example Words

aLaG
uLLo
BuKhaR
Bhai
hai
Bhayi
ShohRaT

def step01(rmWord: str) -> str:
    """Vowel separation"""
    vowels = ['a', 'e', 'i', 'o', 'u', 'y', 'h']
    transWord = ''
    for ch in rmWord:
        transWord += ch.upper() if ch not in vowels else ch
    return transWord

rmWords01 = [step01(w) for w in rmWords00]

for i, (w1, w2) in enumerate(zip(rmWords00, rmWords01), start=1):
    print(f'{i:02}. "{w1}" -> "{w2}"')

"alag" -> "aLaG"
"ullo" -> "uLLo"
"bukhar" -> "BuKhaR"
"bhai" -> "Bhai"
"hai" -> "hai"
"bhayi" -> "Bhayi"
"shohrat" -> "ShohRaT"
"kya" -> "Kya"

Step 2

If the two consequent capital letters are the same, delete one of those double letters.

Explanation

The rule deals with the gemination/tashdeed as explained in 2.4. It deletes one of the double consonants because the geminated consonant is written only once in Urdu script.

Example Words

aLaG
uLo
BuKhaR
Bhai
hai
Bhayi
ShohRaT

def step02(rmWord: str) -> str:
    """Tashdeed - Remove double consonants"""
    return re.sub(r'([A-Z])\1+', r'\1', rmWord)

rmWords02 = [step02(w) for w in rmWords01]

for i, (w1, w2) in enumerate(zip(rmWords01, rmWords02), start=1):
    print(f'{i:02}. "{w1}" -> "{w2}"')

"aLaG" -> "aLaG"
"uLLo" -> "uLo"
"BuKhaR" -> "BuKhaR"
"Bhai" -> "Bhai"
"hai" -> "hai"
"Bhayi" -> "Bhayi"
"ShohRaT" -> "ShohRaT"
"Kya" -> "Kya"

Step 3

If the word begins with a vowel (any roman letter or letter sequence present in Table 2), append A at the beginning of the word.

Explanation

We need an extra a sound character before a vowel at the start of the word in Urdu script. For this purpose, we introduce A in enc_rom_word that will get matched with Urdu equivalent. For example, Urdu word اينٹ ‘brick’ is written as eent in roman script. In the roman word, ee stands for Urdu “ی” but we need to put a “ا” at the start.

Example Words

AaLaG
AuLo
BuKhaR
Bhai
hai
Bhayi
ShohRaT

def step03(rmWord: str) -> str:
    """Add alef to words starting with vowel"""
    vowels = ('a', 'e', 'i', 'o', 'u')
    # startswith() is safe on an empty string, unlike rmWord[0]
    transWord = 'A' if rmWord.startswith(vowels) else ''
    transWord += rmWord
    return transWord

rmWords03 = [step03(w) for w in rmWords02]

for i, (w1, w2) in enumerate(zip(rmWords02, rmWords03), start=1):
    print(f'{i:02}. "{w1}" -> "{w2}"')

"aLaG" -> "AaLaG"
"uLo" -> "AuLo"
"BuKhaR" -> "BuKhaR"
"Bhai" -> "Bhai"
"hai" -> "hai"
"Bhayi" -> "Bhayi"
"ShohRaT" -> "ShohRaT"
"Kya" -> "Kya"

Step 4

For the sequences eh and oh, do the following replacements. Consider the longest match at left hand side.

ehe = eHe, H
eh = eH, H
oh = oH, H
h = H

Explanation

In our mapping rules, if there are more than one potential replacements at right hand side corresponding to a single left hand side, then the system makes n number of copies of enc_rom_word. Each of the possible right hand side replacement is applied on one of the copies, and the subsequent steps of algorithm are applied on each of those. The above mapping rules deal with the unwritten long vowel before a “ہ” (gol-hay) in Urdu script. It is discussed in 2.5.

Example Words

AaLaG
AuLo
BuKHaR
BHai
Hai
BHayi
SHoHRaT / SHHRaT

hVowelCombos = {
    'ehe': ['eHe', 'H'],
    'eh': ['eH', 'H'],
    'oh': ['oH', 'H'],
    'h': ['H'],
}

def step04(rmWord: str) -> list:
    """Process eh and oh vowels"""
    return permuteAllOccurrences([rmWord], hVowelCombos)

rmWords04 = [step04(w) for w in rmWords03]

for i, (w1, w2) in enumerate(zip(rmWords03, rmWords04), start=1):
    print(f'{i:02}. "{w1}" -> {w2}')

"AaLaG" -> ['AaLaG']
"AuLo" -> ['AuLo']
"BuKhaR" -> ['BuKHaR']
"Bhai" -> ['BHai']
"hai" -> ['Hai']
"Bhayi" -> ['BHayi']
"ShohRaT" -> ['SHoHRaT', 'SHHRaT']
"Kya" -> ['Kya']

Step 5

If y is the last character of the word and is preceded by e or a, then

ey = Y
ay = E

Explanation

As discussed in 2.8, both chooti-ye and bari-ye are written as “ی” (chooti-ye) Y in medial position, but bari-ye is written as “ے” (bari-ye) E only at the final position.

Example Words

AaLaG
AuLo
BuKHaR
BHai
Hai
BHayi
SHoHRaT / SHHRaT

def step05(rmWords: list[str]) -> list[str]:
    """Process ending yeh"""
    yehEndings = {
        'ey': 'Y',
        'ay': "E",
    }
    return [replaceEnding(rmWord, yehEndings, 2) for rmWord in rmWords]

rmWords05 = [step05(ww) for ww in rmWords04]

for i, (w1, w2) in enumerate(zip(rmWords04, rmWords05), start=1):
    print(f'{i:02}. {w1} -> {w2}')

['AaLaG'] -> ['AaLaG']
['AuLo'] -> ['AuLo']
['BuKHaR'] -> ['BuKHaR']
['BHai'] -> ['BHai']
['Hai'] -> ['Hai']
['BHayi'] -> ['BHayi']
['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
['Kya'] -> ['Kya']

Step 6

If y is preceded by e or a and followed by a vowel then

ey = Y, eY
ay = Y, aY

Explanation

The y in this case can act either as consonant or as part of a vowel sequence. For example, in gayi گئ “(she) went”, y acts as consonant. Step 8 gives more details about the rules of this type.

Example Words

AaLaG
AuLo
BuKHaR
BHai
Hai
BHYi / BHaYi
SHoHRaT / SHHRaT

yVowelCombos = {
    'ey': ['Y', 'eY'],
    'ay': ['Y', 'aY'],
}

def step06(rmWords: list) -> list:
    """Process ey and ay vowels"""
    return permuteAllOccurrences(rmWords, yVowelCombos)

rmWords06 = [step06(ww) for ww in rmWords05]

for i, (w1, w2) in enumerate(zip(rmWords05, rmWords06), start=1):
    print(f'{i:02}. {w1} -> {w2}')

['AaLaG'] -> ['AaLaG']
['AuLo'] -> ['AuLo']
['BuKHaR'] -> ['BuKHaR']
['BHai'] -> ['BHai']
['Hai'] -> ['Hai']
['BHayi'] -> ['BHYi', 'BHaYi']
['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
['Kya'] -> ['Kya']

Step 7

Change the case of y as capital.

y = Y

Explanation

As we have dealt with all the special cases of y, this general rule changes the case of the remaining ones as capital.

Example Words

AaLaG
AuLo
BuKHaR
BHai
Hai
BHYi / BHaYi
SHoHRaT / SHHRaT

def step07(rmWords: list[str]) -> list:
    """Process y consonants to Y."""
    return [w.replace('y', 'Y') for w in rmWords]

rmWords07 = [step07(ww) for ww in rmWords06]

for i, (w1, w2) in enumerate(zip(rmWords06, rmWords07), start=1):
    print(f'{i:02}. {w1} -> {w2}')

['AaLaG'] -> ['AaLaG']
['AuLo'] -> ['AuLo']
['BuKHaR'] -> ['BuKHaR']
['BHai'] -> ['BHai']
['Hai'] -> ['Hai']
['BHYi', 'BHaYi'] -> ['BHYi', 'BHaYi']
['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
['Kya'] -> ['KYa']

Step 8

If the vowel sequence ai or ei is present at the end of the word, then apply following replacement.

ai = E, aYi, aAi
ei = E, eYi, eAi

Explanation

As discussed in 2.10, the character sequence ai either corresponds to the single letter “ے” (bari-ye) or is a sequence of two vowels in two different syllables. In the latter case, a is in the first syllable and i is the start of the second syllable. As we need “ء” (hamza) or “ع” (ain) before the vowel at syllable-initial position, Y and A are introduced to represent these characters.

Example Words

AaLaG
AuLo
BuKHaR
BHE / BHaYi / BHaAi
HE / HaYi / HaAi
BHYi / BHaYi
SHoHRaT / SHHRaT

def step08(rmWords: list[str]) -> list[str]:
    """Process ending of ai and ei"""
    iEndings = {
        'ai': ['E', 'aYi', 'aAi'],
        'ei': ['E', 'eYi', 'eAi'],
    }
    return permuteAllEndings(rmWords, iEndings)

rmWords08 = [step08(ww) for ww in rmWords07]

for i, (w1, w2) in enumerate(zip(rmWords07, rmWords08), start=1):
    print(f'{i:02}. {w1} -> {w2}')

['AaLaG'] -> ['AaLaG']
['AuLo'] -> ['AuLo']
['BuKHaR'] -> ['BuKHaR']
['BHai'] -> ['BHE', 'BHaYi', 'BHaAi']
['Hai'] -> ['HE', 'HaYi', 'HaAi']
['BHYi', 'BHaYi'] -> ['BHYi', 'BHaYi']
['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
['KYa'] -> ['KYa']

Step 9

If there is a sequence (seq) of two or more vowels, then find all the combinations of valid vowel sequences seq1, seq2, …, seqn, and put A for “ع” (ain) or Y for “ء” (hamza) between these valid sequences.

Explanation

This rule is a generalized form of rule 8. It deals with all possible interpretations of sequence of vowels. An example of two letter sequences is ai that has two valid vowel combinations a-i and ai (as given in table 2). By applying the rule, we get aAi, aYi and ai for further processing.

Another two letter sequence is ua that has only one valid combination u-a. The other possibility ua is not a valid vowel combination because ua does not map on any single Urdu vowel. Hence, we get uYa and uAa for further processing in subsequent steps.

An example of three vowel letters in a row is aai آئ ‘(she) came’. It has three valid sequences: a-ai, aa-i and a-a-i. By applying the rule, we get aAai, aYai, aaAi, aaYi, aAaAi, aAaYi, aYaYi and aYaAi for further processing.

Example Words

AaLaG
AuLo
BuKHaR
BHE / BHaYi / BHaAi
HE / HaYi / HaAi
BHYi / BHaYi
SHoHRaT / SHHRaT

def step09(rmWords: list[str]) -> list[str]:
    """Process two or more consecutive vowels"""
    return permuteConsecutiveVowels(rmWords)

samples = ["sait", "ua", "aai"]
for w in samples:
    print(f'{w} -> {step09([w])}')
print('')

rmWords09 = [step09(ww) for ww in rmWords08]

for i, (w1, w2) in enumerate(zip(rmWords08, rmWords09), start=1):
    print(f'{i:02}. {w1} -> {w2}')

sait -> ['saAit', 'saYit', 'sait']
ua -> ['uAa', 'uYa']
aai -> ['aAaAi', 'aAaYi', 'aYaAi', 'aYaYi', 'aAai', 'aYai', 'aaAi', 'aaYi']

['AaLaG'] -> ['AaLaG']
['AuLo'] -> ['AuLo']
['BuKHaR'] -> ['BuKHaR']
['BHE', 'BHaYi', 'BHaAi'] -> ['BHE', 'BHaYi', 'BHaAi']
['HE', 'HaYi', 'HaAi'] -> ['HE', 'HaYi', 'HaAi']
['BHYi', 'BHaYi'] -> ['BHYi', 'BHaYi']
['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
['KYa'] -> ['KYa']

Step 10

For two vowel sequence, do the following replacements.

aa = A
ai = Y
ei = Y
ee = Y
ie = Y
oo = O
au = O
ou = O

Explanation

It is simple one to one mapping of vowel sequence with encoding of corresponding Urdu letter.

Example Words

AaLaG
AuLo
BuKHaR
BHE / BHaYi / BHaAi
HE / HaYi / HaAi
BHYi / BHaYi
SHoHRaT / SHHRaT

DoubleVowels = {
    'aa': ['A'],
    'ai': ['Y'],
    'ei': ['Y'],
    'ee': ['Y'],
    'ie': ['Y'],
    'oo': ['O'],
    'au': ['O'],
    'ou': ['O'],
}
def step10(rmWords: list[str]) -> list:
    """Process two vowel sequences."""
    return permuteAllOccurrences(rmWords, DoubleVowels)

rmWords10 = [step10(ww) for ww in rmWords09]

for i, (w1, w2) in enumerate(zip(rmWords09, rmWords10), start=1):
    print(f'{i:02}. {w1} -> {w2}')

['AaLaG'] -> ['AaLaG']
['AuLo'] -> ['AuLo']
['BuKHaR'] -> ['BuKHaR']
['BHE', 'BHaYi', 'BHaAi'] -> ['BHE', 'BHaYi', 'BHaAi']
['HE', 'HaYi', 'HaAi'] -> ['HE', 'HaYi', 'HaAi']
['BHYi', 'BHaYi'] -> ['BHYi', 'BHaYi']
['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
['KYa'] -> ['KYa']

Step 11

Search the following vowels at word’s final position and make the substitutions accordingly.

e = E
a = A, H
i = Y
u = O

Explanation

In Urdu script, the word’s final vowel is always written as a long vowel. For example, the final i of aadmi ‘man’ is not ambiguous between a short vowel (the unwritten diacritic zer) and a long vowel (chooti-ye). Only chooti-ye can appear at the end of any word.

The final a can map on “ہ” (gol-hay) too. For example, the final a of roman sada سادہ ‘simple’ stands for gol-hay.

Example Words

AaLaG
AuLo
BuKHaR
BHE / BHaYY / BHaAY
HE / HaYY / HaAY
BHYY / BHaYY
SHoHRaT / SHHRaT

VowelEndings = {
    'e': ['E'],
    'a': ['A', 'H'],
    'i': ['Y'],
    'u': ['O']
}

def step11(rmWords: list[str]) -> list:
    """Process word-final vowels."""
    return permuteAllEndings(rmWords, VowelEndings)

rmWords11 = [step11(ww) for ww in rmWords10]

for i, (w1, w2) in enumerate(zip(rmWords10, rmWords11), start=1):
    print(f'{i:02}. {w1} -> {w2}')

['AaLaG'] -> ['AaLaG']
['AuLo'] -> ['AuLo']
['BuKHaR'] -> ['BuKHaR']
['BHE', 'BHaYi', 'BHaAi'] -> ['BHE', 'BHaYY', 'BHaAY']
['HE', 'HaYi', 'HaAi'] -> ['HE', 'HaYY', 'HaAY']
['BHYi', 'BHaYi'] -> ['BHYY', 'BHaYY']
['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
['KYa'] -> ['KYA', 'KYH']

Step 12

Search for the following vowel sequences and make the following replacements.

a = null, A
i = null, Y
u = null, O
e = E
o = O

Explanation

After dealing with the vowels at initial and final positions, and with the special cases of vowel sequences, the general rule of vowel replacement is given.

This is the last step for encoding a roman word. The following steps search the equivalent of the encoded word in the encoded list of Urdu words.

Example Words

AALAG / ALAG / AALG / ALG
AOLO / ALO
BOKHAR / BKHAR / BOKHR / BKHR
BHE / BHAYY / BHYY / BHAAY / BHAY
HE / HAYY / HYY / HAAY / HAY
BHYY / BHAYY / BHYY
SHOHRAT / SHOHRT / SHHRAT / SHHRT

VowelReplacements = {
    'a': ['', 'A'],
    'i': ['', 'Y'],
    'u': ['', 'O'],
    'e': ['E'],
    'o': ['O'],
}

def step12(rmWords: list[str]) -> list:
    """Process the remaining single vowels."""
    return permuteAllOccurrences(rmWords, VowelReplacements)

rmWords12 = [step12(ww) for ww in rmWords11]

for i, (w1, w2) in enumerate(zip(rmWords11, rmWords12), start=1):
    print(f'{i:02}. {w1} -> {w2}')

['AaLaG'] -> ['ALG', 'ALAG', 'AALG', 'AALAG']
['AuLo'] -> ['ALO', 'AOLO']
['BuKHaR'] -> ['BKHR', 'BKHAR', 'BOKHR', 'BOKHAR']
['BHE', 'BHaYY', 'BHaAY'] -> ['BHE', 'BHYY', 'BHAYY', 'BHAY', 'BHAAY']
['HE', 'HaYY', 'HaAY'] -> ['HE', 'HYY', 'HAYY', 'HAY', 'HAAY']
['BHYY', 'BHaYY'] -> ['BHYY', 'BHYY', 'BHAYY']
['SHoHRaT', 'SHHRaT'] -> ['SHOHRT', 'SHOHRAT', 'SHHRT', 'SHHRAT']
['KYA', 'KYH'] -> ['KYA', 'KYH']
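The lookup that follows the 12 steps can be pictured with a small sketch. It is only an illustration under stated assumptions: ENCODED_WORDLIST, build_index and lookup are hypothetical names, not part of romanalfaz, and the five encodings are copied from the tafseerUrduAr2Rm examples in the Implementation part of this section.

```python
from collections import defaultdict

# Hypothetical mini wordlist: (encoded form, Urdu word).
# The encodings are taken from the tafseerUrduAr2Rm examples in this document.
ENCODED_WORDLIST = [
    ("ALG", "الگ"),
    ("ALO", "الو"),
    ("BKHAR", "بخار"),
    ("HE", "ہے"),
    ("SHHRT", "شہرت"),
]


def build_index(pairs):
    """Group Urdu words by their encoded form (several words may share one)."""
    index = defaultdict(set)
    for encoded, urdu in pairs:
        index[encoded].add(urdu)
    return index


def lookup(candidates, index):
    """Return every Urdu word whose encoding equals one of the candidates."""
    matches = set()
    for candidate in candidates:
        # .get() avoids creating empty entries in the defaultdict
        matches |= index.get(candidate, set())
    return matches


index = build_index(ENCODED_WORDLIST)

# Step 12 candidates for "alag" and "kya", as printed above
print(lookup(['ALG', 'ALAG', 'AALG', 'AALAG'], index))
print(lookup(['KYA', 'KYH'], index))  # nothing in the mini wordlist matches
```

Only one of the four candidates for alag needs to hit the wordlist; the others are discarded. This is why the algorithm can afford to over-generate candidates in the earlier steps.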

Implementation

This module contains the Urdu language roman-to-arabic script transliteration algorithm and its helper functions.

romanalfaz.algorithm.tafseerUrduAr2Rm(word: str, keep: bool = True) → tuple[str, int]

Converts a string containing Urdu text in arabic-script to intermediate roman-script.

The input word is normalized to base characters using romanalfaz.utils.normalize() before processing it for romanization.

This function also handles a specific orthographic rule of the Urdu language: it treats an initial ‘Waw’ (و) as a consonant sound (‘W’) rather than a vowel marker, and maps it separately from the predefined encoding map.

Parameters:

word (str) – The input string containing Urdu text in arabic-script.
keep (bool) – To keep the undefined characters in the output text, or skip them. Default is True.

Returns:

A tuple containing two elements:

str: The fully romanized string with tokens separated by spaces.
int: The count of undefined characters encountered during conversion. Characters not found in the encoding map can optionally be kept as-is in the output; either way, this counter reflects how many were present in the input after normalization.

Return type:

tuple[str, int]

Raises:

AssertionError – If the input word is not a str instance.

Example

>>> tafseerUrduAr2Rm("السلام")
('ALSLAM', 0)
>>> tafseerUrduAr2Rm("والسلام")
('WALSLAM', 0)
>>> tafseerUrduAr2Rm('الگ')
('ALG', 0)
>>> tafseerUrduAr2Rm('الو')
('ALO', 0)
>>> tafseerUrduAr2Rm('بخار')
('BKHAR', 0)
>>> tafseerUrduAr2Rm('بهائ')
('BHAY', 0)
>>> tafseerUrduAr2Rm('ہے')
('HE', 0)
>>> tafseerUrduAr2Rm('شہرت')
('SHHRT', 0)
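The behaviour described above can be approximated with a short sketch. This is not the package's implementation: encode_sketch and SUBSET_MAP are hypothetical names, the map is a small subset of URDU_ARABIC2ROMAN_ENCODING_MAP, and the normalization performed by romanalfaz.utils.normalize() is deliberately left out.

```python
# Small subset of URDU_ARABIC2ROMAN_ENCODING_MAP, enough for the demo below.
SUBSET_MAP = {
    '\u0627': 'A',   # ARABIC LETTER ALEF
    '\u0628': 'B',   # ARABIC LETTER BEH
    '\u062e': 'KH',  # ARABIC LETTER KHAH
    '\u0631': 'R',   # ARABIC LETTER REH
    '\u0633': 'S',   # ARABIC LETTER SEEN
    '\u0644': 'L',   # ARABIC LETTER LAM
    '\u0645': 'M',   # ARABIC LETTER MEEM
    '\u0648': 'O',   # ARABIC LETTER WAW
}
WAW = '\u0648'


def encode_sketch(word: str, keep: bool = True) -> tuple[str, int]:
    """Encode character by character; no normalization is applied here."""
    out = []
    undefined = 0
    for i, ch in enumerate(word):
        if i == 0 and ch == WAW:
            out.append('W')             # word-initial waw acts as a consonant
        elif ch in SUBSET_MAP:
            out.append(SUBSET_MAP[ch])
        else:
            undefined += 1              # counted whether or not it is kept
            if keep:
                out.append(ch)
    return ''.join(out), undefined


salam = '\u0627\u0644\u0633\u0644\u0627\u0645'   # alef lam seen lam alef meem
print(encode_sketch(salam))                 # ('ALSLAM', 0)
print(encode_sketch(WAW + salam))           # ('WALSLAM', 0)
print(encode_sketch('\u0627?\u0644', keep=False))  # ('AL', 1)
```

The last call shows the role of keep: the unknown character is dropped from the output, but it is still counted.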

romanalfaz.algorithm.tafseerUrduRm2Rm(word: str) → set[str]

Transforms a roman-script Urdu input into an intermediate normalized representation.

This function addresses common transliteration inconsistencies caused by:

The absence of vowels in standard arabic-script writing (leading to ambiguous consonant clusters).
Multiple distinct Arabic consonants that map to the same Roman letter (e.g., ‘kaf’ vs ‘qaf’).

It executes a 12-step sequential algorithm designed to resolve these ambiguities, producing one or more candidate words in an intermediate Roman encoding format. These candidates can then be matched against a pre-defined dictionary of known words to predict the intended Arabic word.

Parameters:

word (str) – The input string containing Roman-script Urdu text.

Returns:

A set of candidate words in the intermediate Roman encoding. For an empty string, returns an empty set.

Return type:

set[str]

Raises:

AssertionError – If the input word is not a str instance.

Example

>>> tafseerUrduRm2Rm('khalq') # Ambiguous, can be خالق or خلق
{'KHALQ', 'KHLQ'}  # Generated possible candidates
>>> tafseerUrduRm2Rm('alag')  # الگ
{'AALAG', 'AALG', 'ALAG', 'ALG'}
>>> tafseerUrduRm2Rm('ullo')  # الو
{'ALO', 'AOLO'}
>>> tafseerUrduRm2Rm('bukhar')  # بخار
{'BKHAR', 'BKHR', 'BOKHAR', 'BOKHR'}
>>> tafseerUrduRm2Rm('bhai')  # بهائ
{'BHAAY', 'BHAY', 'BHAYY', 'BHE', 'BHYY'}
>>> tafseerUrduRm2Rm('hai')   # ہے
{'HAAY', 'HAY', 'HAYY', 'HE', 'HYY'}
>>> tafseerUrduRm2Rm('bhayi')  # بهائ
{'BHAYY', 'BHYY'}
>>> tafseerUrduRm2Rm('shohrat')  # شہرت
{'SHHRAT', 'SHHRT', 'SHOHRAT', 'SHOHRT'}

romanalfaz.algorithm.permuteFirstOccurrence(token: str, mapping: dict[str, list[str]]) → list[str]

Generates variations of a token by replacing its leftmost matching substring.

This function scans the input token from left to right to find the earliest occurrence of any key defined in the mapping dictionary. Once found, it creates a list of new strings where only that first occurrence is replaced by each of its corresponding mapped values.

Parameters:

token – The input roman Urdu token to process.
mapping – A dictionary where keys are target character patterns and values are lists of allowed replacement variations.

Returns:

A list of modified strings containing the variations. Returns an empty list if none of the mapping keys are found within the token.

Example

>>> rule_map = {"aa": ["a", "e"], "kh": ["x"]}
>>> permuteFirstOccurrence("baakh", rule_map)
['bakh', 'bekh']
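A minimal sketch of the leftmost-match behaviour described above is given below. The name first_occurrence_sketch is hypothetical, and breaking ties between keys that start at the same position in favour of the longest key is an assumption borrowed from the "longest match" wording of Step 4; the package may resolve such ties differently.

```python
def first_occurrence_sketch(token, mapping):
    """Replace only the leftmost match; ties go to the longest key (assumption)."""
    best = None  # (position, key) of the best match found so far
    for key in mapping:
        pos = token.find(key)
        if pos == -1:
            continue
        if (best is None
                or pos < best[0]
                or (pos == best[0] and len(key) > len(best[1]))):
            best = (pos, key)
    if best is None:
        return []  # no key occurs in the token
    pos, key = best
    # One variant per allowed replacement of that single occurrence
    return [token[:pos] + rep + token[pos + len(key):] for rep in mapping[key]]


rule_map = {"aa": ["a", "e"], "kh": ["x"]}
print(first_occurrence_sketch("baakh", rule_map))   # ['bakh', 'bekh']

# The Step 4 rules applied to an intermediate form of "shohrat"
h_rules = {'ehe': ['eHe', 'H'], 'eh': ['eH', 'H'], 'oh': ['oH', 'H'], 'h': ['H']}
print(first_occurrence_sketch("SHohRaT", h_rules))  # ['SHoHRaT', 'SHHRaT']
```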

romanalfaz.algorithm.permuteAllOccurrences(tokenList: list[str], mapping: dict[str, list[str]]) → list[str]

Exhaustively generates all phonetic permutations for a list of tokens.

This function processes each string in the input list, running an iterative queue-based cascade. It repeatedly targets and replaces the leftmost matching character patterns using permuteFirstOccurrence until a string contains absolutely no keys from the mapping dictionary.

Parameters:

tokenList – A list of roman Urdu tokens to be fully permuted.
mapping – A dictionary where keys are target character patterns and values are lists of allowed replacement variations.

Returns:

A list containing all fully transformed variations of the input tokens. If a token has no matches, it is returned as-is.

Example

>>> rule_map = {"oo": ["u"], "ee": ["i"]}
>>> permuteAllOccurrences(["khooshee"], rule_map)
['khushi']
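The queue-based cascade can be sketched as follows. Both function names are hypothetical, and the helper _first is a compact stand-in for permuteFirstOccurrence (leftmost match, longest key on ties, which is an assumption). The loop only terminates when the replacements eventually remove every key, as is the case for the rule tables used in Steps 4 to 12.

```python
from collections import deque


def _first(token, mapping):
    """Variants for the leftmost match; the longest key wins ties (assumption)."""
    hits = [(token.find(k), -len(k), k) for k in mapping if k in token]
    if not hits:
        return []
    pos, _, key = min(hits)
    return [token[:pos] + rep + token[pos + len(key):] for rep in mapping[key]]


def all_occurrences_sketch(tokens, mapping):
    """Expand every token until none of the mapping keys is left in it."""
    done = []
    queue = deque(tokens)
    while queue:
        token = queue.popleft()
        variants = _first(token, mapping)
        if variants:
            queue.extend(variants)   # still has matches: process again later
        else:
            done.append(token)       # fully transformed (or never matched)
    return done


print(all_occurrences_sketch(["khooshee"], {"oo": ["u"], "ee": ["i"]}))
# ['khushi']

# The Step 12 rules applied to the Step 11 form of "alag"
vowel_rules = {'a': ['', 'A'], 'i': ['', 'Y'], 'u': ['', 'O'], 'e': ['E'], 'o': ['O']}
print(all_occurrences_sketch(['AaLaG'], vowel_rules))
# ['ALG', 'ALAG', 'AALG', 'AALAG']
```

Each rule with two replacements doubles the number of pending variants, so a word with k ambiguous vowels yields up to 2^k candidates.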

romanalfaz.algorithm.permuteEnding(token: str, mapping: dict[str, list[str]]) → list[str]

Replaces the trailing suffix of a string with mapped variations.

This function evaluates the end of the input string against keys in the mapping dictionary. When a trailing match is detected, it strips that specific suffix and returns a list of new strings combining the unchanged prefix with every allowed substitution tail.

Parameters:

token – The input Roman Urdu string to inspect for trailing patterns.
mapping – A dictionary where keys are target trailing character sequences (suffixes) and values are lists of substitution variants.

Returns:

A list of strings containing the modified suffix variations. Returns an empty list if the text does not end with any of the mapping keys.

Example

>>> suffix_rules = {"ein": ["ain", "en"], "iy": ["i"]}
>>> permuteEnding("jaalein", suffix_rules)
['jaalain', 'jaalen']

romanalfaz.algorithm.permuteAllEndings(tokenList: list[str], mapping: dict[str, list[str]]) → list[str]

Exhaustively generates all suffix variations for a list of tokens.

This function steps through each string in the input token list, executing a queue-driven transformation loop. It continuously strips and replaces trailing character suffixes using permuteEnding until the strings contain absolutely no matching suffix keys left in the mapping dataset.

Parameters:

tokenList – A list of raw Roman Urdu tokens to evaluate for suffix changes.
mapping – A dictionary where keys are target suffix patterns and values are lists of allowed trailing replacement variations.

Returns:

A list of all fully processed suffix variations across the input tokens. Tokens with no suffix matches are preserved in the list exactly as-is.

Example

>>> suffixRules = {"ein": ["ain"], "ain": ["en"]}
>>> permuteAllEndings(["karein"], suffixRules)
['karen']
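The two suffix helpers can be sketched together. The names ending_sketch and all_endings_sketch are hypothetical, and preferring the longest matching suffix when several keys match is an assumption, not documented behaviour.

```python
from collections import deque


def ending_sketch(token, mapping):
    """Swap the trailing suffix for each allowed replacement."""
    # Try longer suffixes first so that 'ein' would win over 'in' (assumption)
    for key in sorted(mapping, key=len, reverse=True):
        if key and token.endswith(key):
            stem = token[:-len(key)]
            return [stem + rep for rep in mapping[key]]
    return []  # the token ends with none of the keys


def all_endings_sketch(tokens, mapping):
    """Keep rewriting endings until no suffix key matches any more."""
    done = []
    queue = deque(tokens)
    while queue:
        token = queue.popleft()
        variants = ending_sketch(token, mapping)
        if variants:
            queue.extend(variants)
        else:
            done.append(token)
    return done


print(ending_sketch("jaalein", {"ein": ["ain", "en"], "iy": ["i"]}))
# ['jaalain', 'jaalen']
print(all_endings_sketch(["karein"], {"ein": ["ain"], "ain": ["en"]}))
# ['karen']

# The Step 11 rules applied to the Step 10 form of "kya"
final_vowels = {'e': ['E'], 'a': ['A', 'H'], 'i': ['Y'], 'u': ['O']}
print(all_endings_sketch(['KYa'], final_vowels))
# ['KYA', 'KYH']
```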

romanalfaz.algorithm.replaceEnding(token: str, mapping: dict[str, str], n: int) → str

Replaces a suffix of length n with its mapped string substitution.

This function extracts the final n characters of a token and checks if that suffix exists within the provided mapping dictionary. If a match is found, the old suffix is stripped and swapped for the replacement value. If no match is found, or if the string is shorter than n, the original token is returned unmodified.

Parameters:

token – The input Roman Urdu string to evaluate and modify.
mapping – A dictionary where keys are target suffixes of length n and values are their single string replacements.
n – The exact character length of the suffix to isolate and check.

Returns:

The modified string with the new ending applied, or the original unaltered token if no mapping constraints were met.

Example

>>> suffixRules = {"ah": "a", "iy": "i"}
>>> replaceEnding("vaalah", suffixRules, 2)
'vaala'
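The description above translates almost directly into code. The following is a sketch with a hypothetical name, not the package source; treating a non-positive n as "no change" is an assumption.

```python
def replace_ending_sketch(token, mapping, n):
    """Replace the last n characters when they appear as a key in mapping."""
    if n <= 0 or len(token) < n:
        return token              # too short (or invalid n): leave untouched
    suffix = token[-n:]
    if suffix in mapping:
        return token[:-n] + mapping[suffix]
    return token                  # suffix is not a key: leave untouched


print(replace_ending_sketch("vaalah", {"ah": "a", "iy": "i"}, 2))  # vaala

# The Step 5 rules: only a word-final "ey" or "ay" is rewritten
yeh_endings = {'ey': 'Y', 'ay': 'E'}
print(replace_ending_sketch("Hai", yeh_endings, 2))    # Hai (ends in "ai")
print(replace_ending_sketch("BHayi", yeh_endings, 2))  # BHayi ("ay" is not final)
```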

romanalfaz.algorithm.findVowelCombos(token: str) → list[list[str]]

Segments a string of vowels into all possible valid sub-pattern breakdowns.

This function uses a recursive backtracking depth-first search to find every combination of 1-character and 2-character vowel tokens that can perfectly reconstruct the input string based on a predefined set of roman Urdu vowel patterns.

Parameters:

token – A string consisting of vowel characters to partition.

Returns:

A list of lists, where each sub-list represents a valid sequence of parsed vowel patterns that exactly make up the input word.

Example

>>> findVowelCombos("aa")
[['a', 'a'], ['aa']]
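The backtracking search can be sketched as below. The predefined pattern set is not listed in this document, so the sketch assumes it consists of the five single vowels plus the two-letter sequences from Step 10; vowel_combos_sketch and PATTERNS are hypothetical names.

```python
# Assumed pattern set: single vowels plus the two-letter sequences of Step 10.
SINGLE = {"a", "e", "i", "o", "u"}
DOUBLE = {"aa", "ai", "ei", "ee", "ie", "oo", "au", "ou"}
PATTERNS = SINGLE | DOUBLE


def vowel_combos_sketch(token):
    """Return every way to split token into 1- and 2-character patterns."""
    results = []

    def walk(start, parts):
        if start == len(token):
            results.append(list(parts))   # consumed the whole token
            return
        for size in (1, 2):
            piece = token[start:start + size]
            if len(piece) == size and piece in PATTERNS:
                parts.append(piece)
                walk(start + size, parts)
                parts.pop()               # backtrack and try the next size

    walk(0, [])
    return results


print(vowel_combos_sketch("aa"))   # [['a', 'a'], ['aa']]
print(vowel_combos_sketch("ua"))   # [['u', 'a']]
print(vowel_combos_sketch("aai"))  # [['a', 'a', 'i'], ['a', 'ai'], ['aa', 'i']]
```

Under this assumption "ua" has a single split, which agrees with the Step 9 remark that ua does not map to any single Urdu vowel.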

romanalfaz.algorithm.concatenateCombos(subStrings: list[str], charset: set[str] | list[str]) → list[str]

Concatenates strings using all combinations of characters from a charset.

This function pieces together an array of fragmented sub-strings by interleaving every possible permutation sequence of characters from the given charset into the interstitial gaps between elements.

Parameters:

subStrings – A list of string segments to stitch together.
charset – A collection (set or list) of single-character strings to act as variation separators between the text segments.

Returns:

A list of all possible interleaved concatenated string variants.

Example

>>> segments = ["b", "n", "n"]
>>> vowels = {"a", "o"}
>>> concatenateCombos(segments, vowels)
['banan', 'banon', 'bonan', 'bonon']
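One way to produce these interleavings is itertools.product over the gaps between segments. The sketch below uses a hypothetical name and sorts the charset so that the output order is deterministic even for a set; the package may order its results differently.

```python
from itertools import product


def concatenate_sketch(sub_strings, charset):
    """Join the segments, trying every separator choice in every gap."""
    if not sub_strings:
        return []
    separators = sorted(charset)      # fixed order, also for set input
    gaps = len(sub_strings) - 1
    out = []
    for combo in product(separators, repeat=gaps):
        word = sub_strings[0]
        for sep, part in zip(combo, sub_strings[1:]):
            word += sep + part
        out.append(word)
    return out


print(concatenate_sketch(["b", "n", "n"], {"a", "o"}))
# ['banan', 'banon', 'bonan', 'bonon']

# Step 9 style usage: A (ain) or Y (hamza) between two vowel segments
print(concatenate_sketch(["a", "ai"], "AY"))
# ['aAai', 'aYai']
```

With c separators and g gaps the result has c^g entries; a single segment has no gaps and is returned unchanged.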

romanalfaz.algorithm.permuteConsecutiveVowels(inList: list[str]) → list[str]

Processes tokens containing two or more consecutive vowels into variations.

This function scans a list of string tokens to find consecutive vowel sequences. It isolates each vowel cluster, segments it into valid sub-patterns, and interleaves structural placeholders (“A” and “Y”) between multi-token vowel combinations before reconstructing the final permuted words.

Parameters:

inList – A list of Roman Urdu string tokens to inspect and permute.

Returns:

A list of all newly generated word variations. If a word contains no consecutive vowels, it is returned in the output list unmodified.

Example

>>> permuteConsecutiveVowels(["koo"])
['koo', 'koAo', 'koYo']
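The three stages described above (isolate each vowel cluster, segment it, interleave A and Y) can be combined in one self-contained sketch. All names are hypothetical, the pattern set is the same assumption as for findVowelCombos (single vowels plus the Step 10 pairs), and the order of the results may differ from the package's.

```python
import re
from itertools import product

# Assumed valid patterns: single vowels plus the two-letter sequences of Step 10.
PATTERNS = {"a", "e", "i", "o", "u",
            "aa", "ai", "ei", "ee", "ie", "oo", "au", "ou"}


def segmentations(run):
    """All ways to split a vowel run into valid 1- or 2-letter patterns."""
    if not run:
        return [[]]
    out = []
    for size in (1, 2):
        head = run[:size]
        if len(head) == size and head in PATTERNS:
            out += [[head] + rest for rest in segmentations(run[size:])]
    return out


def cluster_variants(run):
    """Put A or Y between the segments of every segmentation of one run."""
    out = []
    for parts in segmentations(run):
        for seps in product("AY", repeat=len(parts) - 1):
            text = parts[0]
            for sep, part in zip(seps, parts[1:]):
                text += sep + part
            out.append(text)
    return out


def consecutive_vowels_sketch(words):
    result = []
    for word in words:
        # The capture group keeps the vowel runs; they land at odd indices
        pieces = re.split(r'([aeiou]{2,})', word)
        options = [cluster_variants(p) if i % 2 else [p]
                   for i, p in enumerate(pieces)]
        result += [''.join(combo) for combo in product(*options)]
    return result


print(consecutive_vowels_sketch(["sait"]))   # ['saAit', 'saYit', 'sait']
print(consecutive_vowels_sketch(["ua"]))     # ['uAa', 'uYa']
print(consecutive_vowels_sketch(["BHaYi"]))  # ['BHaYi'] (no lowercase vowel run)
```

Capital letters are never part of a vowel run here, so the A and Y introduced by earlier steps act as boundaries and are left alone.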

romanalfaz.algorithm.URDU_ARABIC2ROMAN_ENCODING_MAP: dict[str, str]: This is a character-by-character transliteration conversion map for converting Urdu text from arabic-script to roman-script for word list lookup. Several arabic-script letters share the same roman encoding.

Todo

Clean up the input characters to map only the normalized characters.

URDU_ARABIC2ROMAN_ENCODING_MAP: dict[str, str] = {
    '\u0627': 'A',   # ARABIC LETTER ALEF
    '\u0639': 'A',   # ARABIC LETTER AIN
    '\u0622': 'AA',  # ARABIC LETTER ALEF WITH MADDA ABOVE
    '\u0623': 'A',   # ARABIC LETTER ALEF WITH HAMZA ABOVE
    '\u0628': 'B',   # ARABIC LETTER BEH
    '\u067E': 'P',   # ARABIC LETTER PEH
    '\u062a': 'T',   # ARABIC LETTER TEH
    '\u0637': 'T',   # ARABIC LETTER TAH
    '\u0679': 'T',   # ARABIC LETTER TTEH
    '\u06C3': 'T',   # ARABIC LETTER TEH MARBUTA GOAL
    '\u062c': 'J',   # ARABIC LETTER JEEM
    '\u062b': 'S',   # ARABIC LETTER THEH
    '\u0633': 'S',   # ARABIC LETTER SEEN
    '\u0635': 'S',   # ARABIC LETTER SAD
    '\u0686': 'CH',  # ARABIC LETTER TCHEH
    '\u062d': 'H',   # ARABIC LETTER HAH
    '\u06c1': 'H',   # ARABIC LETTER HEH GOAL
    '\u06c2': 'H',   # ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
    '\u06be': 'H',   # ARABIC LETTER HEH DOACHASHMEE
    '\u0647': 'H',   # ARABIC LETTER HEH
    '\u062e': 'KH',  # ARABIC LETTER KHAH
    '\u062f': 'D',   # ARABIC LETTER DAL
    '\u0688': 'D',   # ARABIC LETTER DDAL
    '\u0630': 'Z',   # ARABIC LETTER THAL
    '\u0632': 'Z',   # ARABIC LETTER ZAIN
    '\u0636': 'Z',   # ARABIC LETTER DAD
    '\u0638': 'Z',   # ARABIC LETTER ZAH
    '\u0698': 'Z',   # ARABIC LETTER JEH
    '\u0631': 'R',   # ARABIC LETTER REH
    '\u0691': 'R',   # ARABIC LETTER RREH
    '\u0634': 'SH',  # ARABIC LETTER SHEEN
    '\u063a': 'GH',  # ARABIC LETTER GHAIN
    '\u0641': 'F',   # ARABIC LETTER FEH
    '\u06A9': 'K',   # ARABIC LETTER KEHEH
    '\u0642': 'Q',   # ARABIC LETTER QAF
    '\u06af': 'G',   # ARABIC LETTER GAF
    '\u0644': 'L',   # ARABIC LETTER LAM
    '\u0645': 'M',   # ARABIC LETTER MEEM
    '\u0646': 'N',   # ARABIC LETTER NOON
    '\u06ba': 'N',   # ARABIC LETTER NOON GHUNNA
    '\u0648': 'O',   # ARABIC LETTER WAW
    '\u0624': 'O',   # ARABIC LETTER WAW WITH HAMZA ABOVE
    '\u06CC': 'Y',   # ARABIC LETTER FARSI YEH
    '\u0621': 'Y',   # ARABIC LETTER HAMZA
    '\u0626': 'Y',   # ARABIC LETTER YEH WITH HAMZA ABOVE
    '\u064A': 'Y',   # ARABIC LETTER YEH
    '\u06d2': 'E',   # ARABIC LETTER YEH BARREE
}