Information Retrieval of Jumbled Words
📝 Original Info
- Title: Information Retrieval of Jumbled Words
- ArXiv ID: 1101.0766
- Date: 2011-01-05
- Authors: Venkata Ravinder Paruchuri
📝 Abstract
It is known that humans can easily read words where the letters have been jumbled in a certain way. This paper examines this problem by associating a distance measure with the jumbling process. Modifications to text were generated according to the Damerau-Levenshtein distance, and it was checked whether users were able to read them. Graphical representations of the results are provided.

📄 Full Content
The above text has circulated on the Web for several years to show how powerful the human mind is in making sense of jumbled spellings. It may be viewed from the perspective of joint error correction and coding [1], performed simultaneously and automatically by the mind, or from the point of view of approximate string matching [2]-[6].
It has been proposed that the human brain is able to read words even when they are jumbled because of the following properties:
1. The grammatical structure of the sentence is not disturbed; that is, the small words (of 2 or 3 letters) and the function words (by, the, is, etc.) are not jumbled. Since the grammatical structure is preserved, the reader is able to predict the next word in the sentence. The jumbled text not only preserves the grammatical structure, it also leaves almost 45-50% of the words unchanged (in the paragraph above, 46% of the words are unchanged).
2. People generally notice the first and last letters more easily than the middle letters, so errors in the middle letters are less likely to be noticed than errors in the initial and final letters.
3. Although the words in the paragraph are jumbled, the jumbled words are not new words, which makes the reader's task easier.
4. The sound of the original word is preserved in the jumbled word. This also makes reading easy, as people tend to read a word by its sound.
5. People can read the jumbled text because of the context of the sentence.
The two things that interested me in this paper are the use of function words and the role that context plays in guessing the next word in a sentence. I decided to remove the function words from the paragraph and then apply the same jumbling technique to study the effect of this change. Also, to break the context of the sentence, I took 100 independent words that are commonly used in everyday life and applied the jumbling technique to them; a sketch of such a jumbling procedure is given below.
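The paper does not spell out the jumbling procedure as code. The following is a minimal sketch under the assumptions stated in the properties above: the first and last letters stay fixed, only the interior letters are shuffled, and short or function words are left untouched. The function-word list is our own illustrative choice.

```python
import random

# Hypothetical function-word list; the paper does not enumerate one.
FUNCTION_WORDS = {"by", "the", "is", "of", "a", "an", "to", "in", "and"}

def jumble_word(word, rng=random):
    """Shuffle the interior letters of a word, keeping the first and last letters fixed."""
    if len(word) <= 3 or word.lower() in FUNCTION_WORDS:
        return word  # short words and function words are not jumbled
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def jumble_text(text, rng=random):
    """Apply the jumbling to every word of a sentence."""
    return " ".join(jumble_word(w, rng) for w in text.split())

if __name__ == "__main__":
    print(jumble_text("reading jumbled words is surprisingly easy"))
```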
Approximate string matching is the technique of matching a pattern against text approximately rather than exactly. The quality of a match is measured by the number of operations needed to turn one string into the other. The most common operations are insertion, deletion and substitution, and the number of operations required is measured in terms of edit distance [13].
Examples of the operations are shown below:
Insertion: monkey → monkeys
Deletion: monkey → money
Substitution: monkey → donkey
In each of the above examples the edit distance is one. Some string matchers also consider the transposition of two adjacent letters in the string [14].
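As a concrete illustration that each operation changes the string by exactly one character, here is a small self-contained check; the helper names are ours, not from the paper.

```python
def insert(s, i, ch):
    """Insert character ch at position i."""
    return s[:i] + ch + s[i:]

def delete(s, i):
    """Delete the character at position i."""
    return s[:i] + s[i + 1:]

def substitute(s, i, ch):
    """Replace the character at position i with ch."""
    return s[:i] + ch + s[i + 1:]

assert insert("monkey", 6, "s") == "monkeys"      # insertion
assert delete("monkey", 3) == "money"             # deletion
assert substitute("monkey", 0, "d") == "donkey"   # substitution
```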
Approximate string matching has applications in many fields. Some examples are recovering the original signals after their transmission over noisy channels, finding DNA subsequences after possible mutations, and text searching where there are typing or spelling errors [6].
Most approximate string matchers assume the same cost for all operations, but some assign different weights to different operations. A more detailed description of edit distance and distance functions is given in the distance measures section.
Edit distance is the number of operations required to transform one string into another. There are several edit distance measures, such as the Levenshtein distance [7], the Damerau-Levenshtein distance, the Hamming distance, the Jaro-Winkler distance and the longest common subsequence.
Levenshtein distance is a metric used to measure the difference between two sequences. The distance between two strings is defined as the number of edit operations required to transform one string into the other, where the allowed operations are insertion, deletion and substitution of a single character, each costing one unit. Levenshtein distance has a wide range of applications in areas such as spell checking, dialect pronunciation and software for natural language translation [6].
As an example, the Levenshtein distance between Sunday and Monday is 2, since substituting S→M and u→o turns one word into the other. A minimal implementation is sketched below.
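The paper does not include code; this is the textbook dynamic-programming recurrence for the Levenshtein distance, with every operation costing one unit.

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance; every operation costs one unit."""
    m, n = len(a), len(b)
    # dp[i][j] = distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

assert levenshtein("Sunday", "Monday") == 2
```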
Damerau-Levenshtein distance is similar to the Levenshtein distance except that it adds one more edit operation, the transposition of two adjacent letters; here, too, every operation costs one unit. The Damerau-Levenshtein distance has applications in fields such as fraudulent vendor name detection, where it can detect a letter that has been deleted or substituted, and in DNA analysis, where it can measure the variation between two strands of DNA [6]. A sketch of the extended recurrence follows.
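The transposition operation can be added by extending the Levenshtein recurrence above with one extra case. This is a sketch of the standard "optimal string alignment" variant, not code from the paper.

```python
def damerau_levenshtein(a, b):
    """Optimal string alignment variant: Levenshtein plus transposition of adjacent letters."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)  # transposition
    return dp[m][n]

# "jumbled" -> "jmubled" is a single transposition of adjacent letters, so the distance is 1.
assert damerau_levenshtein("jumbled", "jmubled") == 1
```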
Hamming distance allows only the substitution of letters, each substitution costing one unit. It is applicable only to strings of equal length.
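For completeness, a minimal Hamming distance for equal-length strings; this illustration is ours, not from the paper.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    return sum(x != y for x, y in zip(a, b))

assert hamming("monkey", "donkey") == 1
```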