Tafseer Algorithm
This section consists of two parts,
The algorithm’s internal details with explanations of each step, and
its implementation in the romanalfaz package.
Algorithm
Roman-to-Roman Transliteration
The following Jupyter notebook exhibits the 12 steps given in the algorithm described in Tafseer Ahmed’s paper. The accompanying description is verbetim from the aforemtnioned article.
import re
from romanalfaz.algorithm import (
permuteAllOccurrences, replaceEnding, permuteAllEndings, permuteConsecutiveVowels
)
Example Words
The following seven example words will be walked through on each step to showcase the transformation at each stage.
rmWords00 = [
'alag',
'ullo',
'bukhar',
'bhai',
'hai',
'bhayi',
'shohrat',
'kya',
]
for i, w in enumerate(rmWords00, start=1):
print(f'{i:02}. "{w}"')
01. "alag"
02. "ullo"
03. "bukhar"
04. "bhai"
05. "hai"
06. "bhayi"
07. "shohrat"
08. "kya"
Step 1
Except a, e, i, o, u, y and h, change the
case of all the characters of rom_word into capital.
This transformed encoded word is termed as
enc_rom_word.
Explanation
As vowel mapping need more complex processing than one to one replacement, the encoding is applied on consonants only. A character in capital case means a rule is already applied on it and the following low priority rule will not be accidentally applied on it.
Example Words
aLaGuLLoBuKhaRBhaihaiBhayiShohRaT
def step01(rmWord: str) -> str:
"""Vowel separation"""
vowels = ['a', 'e', 'i', 'o', 'u', 'y', 'h']
transWord = ''
for ch in rmWord:
transWord += ch.upper() if ch not in vowels else ch
return transWord
rmWords01 = [step01(w) for w in rmWords00]
for i, (w1, w2) in enumerate(zip(rmWords00, rmWords01), start=1):
print(f'{i:02}. "{w1}" -> "{w2}"')
01. "alag" -> "aLaG"
02. "ullo" -> "uLLo"
03. "bukhar" -> "BuKhaR"
04. "bhai" -> "Bhai"
05. "hai" -> "hai"
06. "bhayi" -> "Bhayi"
07. "shohrat" -> "ShohRaT"
08. "kya" -> "Kya"
Step 2
If the two consequent capital letters are the same, delete one of those double letters.
Explanation
The rule deals the germination/tashdeed as explained in 2.4. it deletes one of the double consonants because the germinated consonant is written only once in Urdu script.
Example Words
aLaGuLoBuKhaRBhaihaiBhayiShohRaT
def step02(rmWord: str) -> str:
"""Tashdeed - Remove double consonants"""
return re.sub(r'([A-Z])\1+', r'\1', rmWord)
rmWords02 = [step02(w) for w in rmWords01]
for i, (w1, w2) in enumerate(zip(rmWords01, rmWords02), start=1):
print(f'{i:02}. "{w1}" -> "{w2}"')
01. "aLaG" -> "aLaG"
02. "uLLo" -> "uLo"
03. "BuKhaR" -> "BuKhaR"
04. "Bhai" -> "Bhai"
05. "hai" -> "hai"
06. "Bhayi" -> "Bhayi"
07. "ShohRaT" -> "ShohRaT"
08. "Kya" -> "Kya"
Step 3
If the word begins with a vowel (any roman letter
or letter sequence present in Table 2), append A at the
beginning of the word.
Explanation
We need an
extra a sound character before a vowel at the start of
the word in Urdu script. For this purpose, we introduce
A in enc_rom_word that will get matched with Urdu
equivalent. For example, Urdu word اينٹ ‘brick’ is
written as eent in roman script. In the roman word,
ee stands for Urdu “ی” but we need to put a “ا” at the
start.
Example Words
AaLaGAuLoBuKhaRBhaihaiBhayiShohRaT
def step03(rmWord: str) -> str:
"""Add alef to words starting with vowel"""
vowels = ['a', 'e', 'i', 'o', 'u']
transWord = 'A' if rmWord[0] in vowels else ''
transWord += rmWord
return transWord
rmWords03 = [step03(w) for w in rmWords02]
for i, (w1, w2) in enumerate(zip(rmWords02, rmWords03), start=1):
print(f'{i:02}. "{w1}" -> "{w2}"')
01. "aLaG" -> "AaLaG"
02. "uLo" -> "AuLo"
03. "BuKhaR" -> "BuKhaR"
04. "Bhai" -> "Bhai"
05. "hai" -> "hai"
06. "Bhayi" -> "Bhayi"
07. "ShohRaT" -> "ShohRaT"
08. "Kya" -> "Kya"
Step 4
For the sequences eh and oh, do the following
replacements. Consider the longest match at left hand
side.
ehe = eHe, H
eh = eH, H
oh = oH, H
h = H
Explanation
In our mapping rules, if there are
more than one potential replacements at right hand side
corresponding to a single left hand side, then the
system makes n number of copies of enc_rom_word.
Each of the possible right hand side replacement is
applied on one of the copies, and the subsequent steps
of algorithm are applied on each of those.
The above mapping rules deal with the unwritten
long vowel before a “ہ” (gol-hay) in Urdu script. It is
discussed in 2.5.
Example Words
AaLaGAuLoBuKHaRBHaiHaiBHayiSHoHRaT/SHHRaT
hVowelCombos = {
'ehe': ['eHe', 'H'],
'eh': ['eH', 'H'],
'oh': ['oH', 'H'],
'h': ['H'],
}
def step04(rmWord: str) -> list:
"""Process eh and oh vowels"""
return permuteAllOccurrences([rmWord], hVowelCombos)
rmWords04 = [step04(w) for w in rmWords03]
for i, (w1, w2) in enumerate(zip(rmWords03, rmWords04), start=1):
print(f'{i:02}. "{w1}" -> {w2}')
01. "AaLaG" -> ['AaLaG']
02. "AuLo" -> ['AuLo']
03. "BuKhaR" -> ['BuKHaR']
04. "Bhai" -> ['BHai']
05. "hai" -> ['Hai']
06. "Bhayi" -> ['BHayi']
07. "ShohRaT" -> ['SHoHRaT', 'SHHRaT']
08. "Kya" -> ['Kya']
Step 5
If y is the last character of the word and is
preceded by e or a, then
ey = Y
ay = E
Explanation
As discussed in 2.8, both chooti-ye
and bari-ye are written as “ی” (chooti-ye) Y in medial
position, but bari-ye is written as “ے” (bari-ye) E
only at the final position.
Example Words
AaLaGAuLoBuKHaRBHaiHaiBHayiSHoHRaT/SHHRaT
def step05(rmWords: list[str]) -> list[str]:
"""Process ending yeh"""
yehEndings = {
'ey': 'Y',
'ay': "E",
}
return [replaceEnding(rmWord, yehEndings, 2) for rmWord in rmWords]
rmWords05 = [step05(ww) for ww in rmWords04]
for i, (w1, w2) in enumerate(zip(rmWords04, rmWords05), start=1):
print(f'{i:02}. {w1} -> {w2}')
01. ['AaLaG'] -> ['AaLaG']
02. ['AuLo'] -> ['AuLo']
03. ['BuKHaR'] -> ['BuKHaR']
04. ['BHai'] -> ['BHai']
05. ['Hai'] -> ['Hai']
06. ['BHayi'] -> ['BHayi']
07. ['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
08. ['Kya'] -> ['Kya']
Step 6
If y is preceded by e or a and followed by a vowel then
ey = Y, eY
ay = Y, aY
Explanation
The y in this case can act either as
consonant or as part of a vowel sequence. For example,
in gayi گئ “(she) went”, y acts as consonant. Step 8
gives more details about the rules of this type.
Example Words
AaLaGAuLoBuKHaRBHaiHaiBHYi/BHaYiSHoHRaT/SHHRaT
yVowelCombos = {
'ey': ['Y', 'eY'],
'ay': ['Y', 'aY'],
}
def step06(rmWords: list) -> list:
"""Process ey and ay vowels"""
return permuteAllOccurrences(rmWords, yVowelCombos)
rmWords06 = [step06(ww) for ww in rmWords05]
for i, (w1, w2) in enumerate(zip(rmWords05, rmWords06), start=1):
print(f'{i:02}. {w1} -> {w2}')
01. ['AaLaG'] -> ['AaLaG']
02. ['AuLo'] -> ['AuLo']
03. ['BuKHaR'] -> ['BuKHaR']
04. ['BHai'] -> ['BHai']
05. ['Hai'] -> ['Hai']
06. ['BHayi'] -> ['BHYi', 'BHaYi']
07. ['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
08. ['Kya'] -> ['Kya']
Step 7
Change the case of y as capital.
y = Y
Explanation
As we have dealt with all the special
cases of y, this general rule changes the case of the
remaining ones as capital.
Example Words
AaLaGAuLoBuKHaRBHaiHaiBHYi/BHaYiSHoHRaT/SHHRaT
def step07(rmWords: list[str]) -> list:
"""Process y consonants to Y."""
return [w.replace('y', 'Y') for w in rmWords]
rmWords07 = [step07(ww) for ww in rmWords06]
for i, (w1, w2) in enumerate(zip(rmWords06, rmWords07), start=1):
print(f'{i:02}. {w1} -> {w2}')
01. ['AaLaG'] -> ['AaLaG']
02. ['AuLo'] -> ['AuLo']
03. ['BuKHaR'] -> ['BuKHaR']
04. ['BHai'] -> ['BHai']
05. ['Hai'] -> ['Hai']
06. ['BHYi', 'BHaYi'] -> ['BHYi', 'BHaYi']
07. ['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
08. ['Kya'] -> ['KYa']
Step 8
If the vowel sequence ai or ei is present at the
end of the word, then apply following replacement.
ai = E, aYi, aAi
ei = E, eYi, e
Explanation
As discussed in 2.10, the character
sequence ai either correspond to single letter “ے”
(bari-ye) or it is a sequence of two vowels in two
different syllables. In this case, a is in the first
syllable and i is the start of the second syllable. As
we need “ء” (hamza) or “ع” (ain) before the vowel at
syllable initial position, Y and A are introduced to
represent these characters
Example Words
AaLaGAuLoBuKHaRBHE/BHaYi/BHaAiHE/HaYi/HaAiBHYi/BHaYiSHoHRaT/SHHRaT
def step08(rmWords: list[str]) -> list[str]:
"""Process ending of ai and ei"""
iEndings = {
'ai': ['E', 'aYi', 'aAi'],
'ei': ['E', 'eYi', 'eAi'],
}
return permuteAllEndings(rmWords, iEndings)
rmWords08 = [step08(ww) for ww in rmWords07]
for i, (w1, w2) in enumerate(zip(rmWords07, rmWords08), start=1):
print(f'{i:02}. {w1} -> {w2}')
01. ['AaLaG'] -> ['AaLaG']
02. ['AuLo'] -> ['AuLo']
03. ['BuKHaR'] -> ['BuKHaR']
04. ['BHai'] -> ['BHE', 'BHaYi', 'BHaAi']
05. ['Hai'] -> ['HE', 'HaYi', 'HaAi']
06. ['BHYi', 'BHaYi'] -> ['BHYi', 'BHaYi']
07. ['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
08. ['KYa'] -> ['KYa']
Step 9
If there is a sequence (seq) of two or more vowels,
then find all the combinations of valid vowel sequences
seq1, seq2, …, seqn, and put A for “ع” (ain) or Y
for “ء” (hamza) is put between these valid sequences.
Explanation
This rule is a generalized form of
rule 8. It deals with all possible interpretations of
sequence of vowels. An example of two letter
sequences is ai that has two valid vowel combinations
a-i and ai (as given in table 2). By applying the rule,
we get aAi, aYi and ai for further processing.
Another two letter sequence is ua that has only
one valid combination u-a. The other possibility ua
is not a valid vowel combination because ua does not
map on any single Urdu vowel. Hence, we get uYa
and uAa for further processing in subsequent steps.
An example of three vowel letters in a row is aai آئ ‘(she) came’.
It has three valid sequences a-ai, aai, a-a-i.
By applying the rule, we get aAai, aYai,
aaAi, aaYi, aAaAi, aAaYi, aYaYi and aYaAi
for further processing.
Example Words
AaLaGAuLoBuKHaRBHE/BHaYi/BHaAiHE/HaYi/HaAiBHYi/BHaYiSHoHRaT/SHHRaT
def step09(rmWords: list[str]) -> list[str]:
"""Process two or more consecutive vowels"""
return permuteConsecutiveVowels(rmWords)
samples = ["sait", "ua", "aai"]
for w in samples:
print(f'{w} -> {step09([w])}')
print('')
rmWords09 = [step09(ww) for ww in rmWords08]
for i, (w1, w2) in enumerate(zip(rmWords08, rmWords09), start=1):
print(f'{i:02}. {w1} -> {w2}')
sait -> ['saAit', 'saYit', 'sait']
ua -> ['uAa', 'uYa']
aai -> ['aAaAi', 'aAaYi', 'aYaAi', 'aYaYi', 'aAai', 'aYai', 'aaAi', 'aaYi']
01. ['AaLaG'] -> ['AaLaG']
02. ['AuLo'] -> ['AuLo']
03. ['BuKHaR'] -> ['BuKHaR']
04. ['BHE', 'BHaYi', 'BHaAi'] -> ['BHE', 'BHaYi', 'BHaAi']
05. ['HE', 'HaYi', 'HaAi'] -> ['HE', 'HaYi', 'HaAi']
06. ['BHYi', 'BHaYi'] -> ['BHYi', 'BHaYi']
07. ['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
08. ['KYa'] -> ['KYa']
Step 10
For two vowel sequence, do the following replacements.
aa = A
ai = Y
ei = Y
ee = Y
ie = Y
oo = O
au = O
ou = O
Explanation
It is simple one to one mapping of vowel sequence with encoding of corresponding Urdu letter.
Example Words
AaLaGAuLoBuKHaRBHE/BHaYiHE/HaYi/HaAiBHYi/BHaYi/BHaAiSHoHRaT/SHHRaT
DoubleVowels = {
'aa': ['A'],
'ai': ['Y'],
'ei': ['Y'],
'ee': ['Y'],
'ie': ['Y'],
'oo': ['O'],
'au': ['O'],
'ou': ['O'],
}
def step10(rmWords: list[str]) -> list:
"""Process two vowel sequences."""
return permuteAllOccurrences(rmWords, DoubleVowels)
rmWords10 = [step10(ww) for ww in rmWords09]
for i, (w1, w2) in enumerate(zip(rmWords09, rmWords10), start=1):
print(f'{i:02}. {w1} -> {w2}')
01. ['AaLaG'] -> ['AaLaG']
02. ['AuLo'] -> ['AuLo']
03. ['BuKHaR'] -> ['BuKHaR']
04. ['BHE', 'BHaYi', 'BHaAi'] -> ['BHE', 'BHaYi', 'BHaAi']
05. ['HE', 'HaYi', 'HaAi'] -> ['HE', 'HaYi', 'HaAi']
06. ['BHYi', 'BHaYi'] -> ['BHYi', 'BHaYi']
07. ['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
08. ['KYa'] -> ['KYa']
Step 11
Search the following vowels at word’s final position and make the substitutions accordingly.
e = E
a = A, H
i = Y
u = O
Explanation
In Urdu script, the word’s final
vowel is always written as long vowel. For example the
final i of aadmi ‘man’ is not ambiguous between
short vowel (unwritten diacratic zer) and long vowel
(chooti-ye). Only chooti-ye can appear at the end of
any word.
The final a can map on “ہ” (gol-hay) too. For
example, the final a of roman sada سادہ ‘simple’
stands for gol-hay.
Example Words
AaLaGAuLOBuKHaRBHE/BHaYY/BHaAYHE/HaYY/HaAYBHYY/BHaYYSHoHRaT/SHHRaT
VowelEndings = {
'e': ['E'],
'a': ['A', 'H'],
'i': ['Y'],
'u': ['O']
}
def step11(rmWords: list[str]) -> list:
"""Process two vowel sequences."""
return permuteAllEndings(rmWords, VowelEndings)
rmWords11 = [step11(ww) for ww in rmWords10]
for i, (w1, w2) in enumerate(zip(rmWords10, rmWords11), start=1):
print(f'{i:02}. {w1} -> {w2}')
01. ['AaLaG'] -> ['AaLaG']
02. ['AuLo'] -> ['AuLo']
03. ['BuKHaR'] -> ['BuKHaR']
04. ['BHE', 'BHaYi', 'BHaAi'] -> ['BHE', 'BHaYY', 'BHaAY']
05. ['HE', 'HaYi', 'HaAi'] -> ['HE', 'HaYY', 'HaAY']
06. ['BHYi', 'BHaYi'] -> ['BHYY', 'BHaYY']
07. ['SHoHRaT', 'SHHRaT'] -> ['SHoHRaT', 'SHHRaT']
08. ['KYa'] -> ['KYA', 'KYH']
Step 12
Search for the following vowel sequences and make the following replacements.
a = null, A
i = null, Y
u = null, O
e = E
o = O
Explanation
After dealing the vowels at initial and final positions, and dealing the special cases of vowel sequence, the general rule of vowel sequence replacement is given.
This is the last step for encoding a roman word. The following steps search the equivalent of the encoded word in the encoded list of Urdu words.
Example Words
AALAG/ALAG/AALG/ALGAOLO/ALOBOKHAR/BKHAR/BOKHR/BKHRBHE/BHAYY/BHYY/BHAAY/BHAYHE/HAYY/HYY/HAAY/HAYBHYY/BHAYY/BHYYSHOHRAT/SHOHRT/SHHRAT/SHHRT
VowelReplacements = {
'a': ['', 'A'],
'i': ['', 'Y'],
'u': ['', 'O'],
'e': ['E'],
'o': ['O'],
}
def step12(rmWords: list[str]) -> list:
"""Process two vowel sequences."""
return permuteAllOccurrences(rmWords, VowelReplacements)
rmWords12 = [step12(ww) for ww in rmWords11]
for i, (w1, w2) in enumerate(zip(rmWords11, rmWords12), start=1):
print(f'{i:02}. {w1} -> {w2}')
01. ['AaLaG'] -> ['ALG', 'ALAG', 'AALG', 'AALAG']
02. ['AuLo'] -> ['ALO', 'AOLO']
03. ['BuKHaR'] -> ['BKHR', 'BKHAR', 'BOKHR', 'BOKHAR']
04. ['BHE', 'BHaYY', 'BHaAY'] -> ['BHE', 'BHYY', 'BHAYY', 'BHAY', 'BHAAY']
05. ['HE', 'HaYY', 'HaAY'] -> ['HE', 'HYY', 'HAYY', 'HAY', 'HAAY']
06. ['BHYY', 'BHaYY'] -> ['BHYY', 'BHYY', 'BHAYY']
07. ['SHoHRaT', 'SHHRaT'] -> ['SHOHRT', 'SHOHRAT', 'SHHRT', 'SHHRAT']
08. ['KYA', 'KYH'] -> ['KYA', 'KYH']
Implementation
This module contains the Urdu language roman-to-arabic script transliteration algorithm and its helper functions.
- romanalfaz.algorithm.tafseerUrduAr2Rm(word: str, keep: bool = True) tuple[str, int]
Converts a string containing Urdu text in arabic-script to intermediate roman-script.
The input word is normalized to base characters using
romanalfaz.utils.normalize()before processing it for romanization.This function also handles specific orthographic rule of the Urdu language, treating the initial ‘Waw’ (و) as a consonant sound (‘W’) rather than a vowel marker, and maps it specifically separate from the predefined encoding map.
- Parameters:
word (str) – The input string containing Urdu text in arabic-script.
keep (bool) – To keep the undefined characters in the output text, or skip them. Default is
True.
- Returns:
A tuple containing two elements:
str: The fully romanized string with tokens separated by spaces.
int: The count of undefined characters encountered during conversion. Characters not found in the encoding map can be optionally kept as-is in the output, however, this counter reflects how many were in the input after normalization, irrespective of the output.
- Return type:
tuple[str, int]
- Raises:
AssertionError – If the input word is not a
strinstance.
Example
>>> tafseerUrduAr2Rm("السلام") ('ALSLAM', 0) >>> tafseerUrduAr2Rm("والسلام") ('WALSLAM', 0) >>> tafseerUrduAr2Rm('الگ') ('ALG', 0) >>> tafseerUrduAr2Rm('الو') ('ALO', 0) >>> tafseerUrduAr2Rm('بخار') ('BKHAR', 0) >>> tafseerUrduAr2Rm('بهائ') ('BHAY', 0) >>> tafseerUrduAr2Rm('ہے') ('HE', 0) >>> tafseerUrduAr2Rm('شہرت') ('SHHRT', 0) :param word: :param keep:
- romanalfaz.algorithm.tafseerUrduRm2Rm(word: str) set[str]
Transforms a roman-script Urdu input into an intermediate normalized representation.
This function addresses common transliteration inconsistencies caused by:
The absence of vowels in standard arabic-script writing (leading to ambiguous consonant clusters).
Multiple distinct Arabic consonants that map to the same Roman letter (e.g., ‘kaf’ vs ‘qaf’).
It executes a 12-step sequential algorithm designed to resolve these ambiguities, producing one or more candidate words in an intermediate Roman encoding format. These candidates can then be matched against a pre-defined dictionary of known words to predict the intended Arabic word.
- Parameters:
word (str) – The input string containing Roman-script Urdu text.
- Returns:
A set of candidate words in the intermediate Roman encoding. For empty string, returns an empty set.
- Return type:
set[str]
- Raises:
AssertionError – If the input word is not a
strinstance.
Example
>>> tafseerUrduRm2Rm('khalq') # Ambiguous, can be خالق or خلق {'KHALQ', 'KHLQ'} # Generated possible candidates >>> tafseerUrduRm2Rm('alag') # الگ {'AALAG', 'AALG', 'ALAG', 'ALG'} >>> tafseerUrduRm2Rm('ullo') # الو {'ALO', 'AOLO'} >>> tafseerUrduRm2Rm('bukhar') # بخار {'BKHAR', 'BKHR', 'BOKHAR', 'BOKHR'} >>> tafseerUrduRm2Rm('bhai') # بهائ {'BHAAY', 'BHAY', 'BHAYY', 'BHE', 'BHYY'} >>> tafseerUrduRm2Rm('hai') # ہے {'HAAY', 'HAY', 'HAYY', 'HE', 'HYY'} >>> tafseerUrduRm2Rm('bhayi') # بهائ {'BHAYY', 'BHYY'} >>> tafseerUrduRm2Rm('shohrat') # شہرت {'SHHRAT', 'SHHRT', 'SHOHRAT', 'SHOHRT'}
- romanalfaz.algorithm.permuteFirstOccurrence(token: str, mapping: dict[str, list[str]]) list[str]
Generates variations of a token by replacing its leftmost matching substring.
This function scans the input token from left to right to find the earliest occurrence of any key defined in the mapping dictionary. Once found, it creates a list of new strings where only that first occurrence is replaced by each of its corresponding mapped values.
- Parameters:
token – The input roman Urdu token to process.
mapping – A dictionary where keys are target character patterns and values are lists of allowed replacement variations.
- Returns:
A list of modified strings containing the variations. Returns an empty list if none of the mapping keys are found within the token.
Example
>>> rule_map = {"aa": ["a", "e"], "kh": ["x"]} >>> permuteFirstOccurrence("baakh", rule_map) ['bakh', 'bekh']
- romanalfaz.algorithm.permuteAllOccurrences(tokenList: list[str], mapping: dict[str, list[str]]) list[str]
Exhaustively generates all phonetic permutations for a list of tokens.
This function processes each string in the input list, running an iterative queue-based cascade. It repeatedly targets and replaces the leftmost matching character patterns using permuteFirstOccurrence until a string contains absolutely no keys from the mapping dictionary.
- Parameters:
tokenList – A list of roman Urdu tokens to be fully permuted.
mapping – A dictionary where keys are target character patterns and values are lists of allowed replacement variations.
- Returns:
A list containing all fully transformed variations of the input tokens. If a token has no matches, it is returned as-is.
Example
>>> rule_map = {"oo": ["u"], "ee": ["i"]} >>> permuteAllOccurrences(["khooshee"], rule_map) ['khushi']
- romanalfaz.algorithm.permuteEnding(token: str, mapping: dict[str, list[str]]) list[str]
Replaces the trailing suffix of a string with mapped variations.
This function evaluates the end of the input string against keys in the mapping dictionary. When a trailing match is detected, it strips that specific suffix and returns a list of new strings combining the unchanged prefix with every allowed substitution tail.
- Parameters:
token – The input Roman Urdu string to inspect for trailing patterns.
mapping – A dictionary where keys are target trailing character sequences (suffixes) and values are lists of substitution variants.
- Returns:
A list of strings containing the modified suffix variations. Returns an empty list if the text does not end with any of the mapping keys.
Example
>>> suffix_rules = {"ein": ["ain", "en"], "iy": ["i"]} >>> permuteEnding("jaalein", suffix_rules) ['jaalain', 'jaalen']
- romanalfaz.algorithm.permuteAllEndings(tokenList: list[str], mapping: dict[str, list[str]]) list[str]
Exhaustively generates all suffix variations for a list of tokens.
This function steps through each string in the input token list, executing a queue-driven transformation loop. It continuously strips and replaces trailing character suffixes using permuteEnding until the strings contain absolutely no matching suffix keys left in the mapping dataset.
- Parameters:
tokenList – A list of raw Roman Urdu tokens to evaluate for suffix changes.
mapping – A dictionary where keys are target suffix patterns and values are lists of allowed trailing replacement variations.
- Returns:
A list of all fully processed suffix variations across the input tokens. Tokens with no suffix matches are preserved in the list exactly as-is.
Example
>>> suffixRules = {"ein": ["ain"], "ain": ["en"]} >>> permuteAllEndings(["karein"], suffixRules) ['karen']
- romanalfaz.algorithm.replaceEnding(token: str, mapping: dict[str, str], n: int) str
Replaces a suffix of length n with its mapped string substitution.
This function extracts the final n characters of a token and checks if that suffix exists within the provided mapping dictionary. If a match is found, the old suffix is stripped and swapped for the replacement value. If no match is found, or if the string is shorter than n, the original token is returned unmodified.
- Parameters:
token – The input Roman Urdu string to evaluate and modify.
mapping – A dictionary where keys are target suffixes of length n and values are their single string replacements.
n – The exact character length of the suffix to isolate and check.
- Returns:
The modified string with the new ending applied, or the original unaltered token if no mapping constraints were met.
Example
>>> suffixRules = {"ah": "a", "iy": "i"} >>> replaceEnding("vaalah", suffixRules, 2) 'vaala'
- romanalfaz.algorithm.findVowelCombos(token: str) list[list[str]]
Segments a string of vowels into all possible valid sub-pattern breakdowns.
This function uses a recursive backtracking depth-first search to find every combination of 1-character and 2-character vowel tokens that can perfectly reconstruct the input string based on a predefined set of roman Urdu vowel patterns.
- Parameters:
token – A string consisting of vowel characters to partition.
- Returns:
A list of lists, where each sub-list represents a valid sequence of parsed vowel patterns that exactly make up the input word.
Example
>>> findVowelCombos("aa") [['a', 'a'], ['aa']]
- romanalfaz.algorithm.concatenateCombos(subStrings: list[str], charset: set[str] | list[str]) list[str]
Concatenates strings using all combinations of characters from a charset.
This function pieces together an array of fragmented sub-strings by interleaving every possible permutation sequence of characters from the given charset into the interstitial gaps between elements.
- Parameters:
subStrings – A list of string segments to stitch together.
charset – A collection (set or list) of single-character strings to act as variation separators between the text segments.
- Returns:
A list of all possible interleaved concatenated string variants.
Example
>>> segments = ["b", "n", "n"] >>> vowels = {"a", "o"} >>> concatenateCombos(segments, vowels) ['banan', 'banon', 'bonan', 'bonon']
- romanalfaz.algorithm.permuteConsecutiveVowels(inList: list[str]) list[str]
Processes tokens containing two or more consecutive vowels into variations.
This function scans a list of string tokens to find consecutive vowel sequences. It isolates each vowel cluster, segments it into valid sub-patterns, and interleaves structural placeholders (“A” and “Y”) between multi-token vowel combinations before reconstructing the final permuted words.
- Parameters:
inList – A list of Roman Urdu string tokens to inspect and permute.
- Returns:
A list of all newly generated word variations. If a word contains no consecutive vowels, it is returned in the output list unmodified.
Example
>>> permuteConsecutiveVowels(["koo"]) ['koo', 'koAo', 'koYo']
- romanalfaz.algorithm.URDU_ARABIC2ROMAN_ENCODING_MAP: dict[str, str]
This is one-to-one transliteration convertion map for converting Urdu text from arabic-script to roman-script for word list lookup.
Todo
Clean up the input characters to map only the normalized characters.
URDU_ARABIC2ROMAN_ENCODING_MAP: dict[str, str] = {
'\u0627': 'A', # ARABIC LETTER ALEF
'\u0639': 'A', # ARABIC LETTER AIN
'\u0622': 'AA', # ARABIC LETTER ALEF WITH MADDA ABOVE
'\u0623': 'A', # ARABIC LETTER ALEF WITH HAMZA ABOVE
'\u0628': 'B', # ARABIC LETTER BEH
'\u067E': 'P', # ARABIC LETTER PEH
'\u062a': 'T', # ARABIC LETTER TEH
'\u0637': 'T', # ARABIC LETTER TAH
'\u0679': 'T', # ARABIC LETTER TTEH
'\u06C3': 'T', # ARABIC LETTER TEH MARBUTA GOAL
'\u062c': 'J', # ARABIC LETTER JEEM
'\u062b': 'S', # ARABIC LETTER THEH
'\u0633': 'S', # ARABIC LETTER SEEN
'\u0635': 'S', # ARABIC LETTER SAD
'\u0686': 'CH', # ARABIC LETTER TCHEH
'\u062d': 'H', # ARABIC LETTER HAH
'\u06c1': 'H', # ARABIC LETTER HEH GOAL
'\u06c2': 'H', # ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
'\u06be': 'H', # ARABIC LETTER HEH DOACHASHMEE
'\u0647': 'H', # ARABIC LETTER HEH
'\u062e': 'KH', # ARABIC LETTER KHAH
'\u062f': 'D', # ARABIC LETTER DAL
'\u0688': 'D', # ARABIC LETTER DDAL
'\u0630': 'Z', # ARABIC LETTER THAL
'\u0632': 'Z', # ARABIC LETTER ZAIN
'\u0636': 'Z', # ARABIC LETTER DAD
'\u0638': 'Z', # ARABIC LETTER ZAH
'\u0698': 'Z', # ARABIC LETTER JEH
'\u0631': 'R', # ARABIC LETTER REH
'\u0691': 'R', # ARABIC LETTER RREH
'\u0634': 'SH', # ARABIC LETTER SHEEN
'\u063a': 'GH', # ARABIC LETTER GHAIN
'\u0641': 'F', # ARABIC LETTER FEH
'\u06A9': 'K', # ARABIC LETTER KEHEH
'\u0642': 'Q', # ARABIC LETTER QAF
'\u06af': 'G', # ARABIC LETTER GAF
'\u0644': 'L', # ARABIC LETTER LAM
'\u0645': 'M', # ARABIC LETTER MEEM
'\u0646': 'N', # ARABIC LETTER NOON
'\u06ba': 'N', # ARABIC LETTER NOON GHUNNA
'\u0648': 'O', # ARABIC LETTER WAW
'\u0624': 'O', # ARABIC LETTER WAW WITH HAMZA ABOVE
'\u06CC': 'Y', # ARABIC LETTER FARSI YEH
'\u0621': 'Y', # ARABIC LETTER HAMZA
'\u0626': 'Y', # ARABIC LETTER YEH WITH HAMZA ABOVE
'\u064A': 'Y', # ARABIC LETTER YEH
'\u06d2': 'E', # ARABIC LETTER YEH BARREE
}