Saturday, December 7, 2019

Lexical Normalization of Twitter Data for Short Messages

Question: Discuss the lexical normalization of Twitter data for short messages.

Answer:

Problem Definition

A Twitter message consists of many different characters, including special characters. Before lexical analysis can be performed on a message, all of these characters have to be identified. Auto-completion, auto-correction of spellings, and acronyms are all subject to normalization. The dictionary is searched for each token, and the words not found in it, the out-of-vocabulary (OOV) words, are the ones subjected to normalization. The dictionary in this case is a .txt document containing a set of words; the program looks into it to find a match for the word being typed and then offers that match, either as an auto-completion or as an auto-correction of the typed characters. Symbols and special characters such as # and @ are categorized as special tokens (non-candidates) and are not subjected to the normalization process.

Many words are alike in spelling and pronunciation, which calls for an algorithm that can identify the phonetically similar words for a query in the dictionary and produce the best match for the typed word. One algorithm on its own cannot narrow the scope down to the most appropriate word. Instead a series of algorithms is used: the typed word is run against the dictionary and classified as a candidate or a non-candidate, and candidates are then grouped according to their sound. Normalization, then, is the process of transforming text into a form consistent with the dictionary, which in this case is the .txt file containing the list of words the query is compared against.

Techniques

The pipeline consists of the following steps; illustrative code sketches of each step are given after the descriptions.

Levenshtein distance (1)

When a candidate has been identified, the Levenshtein distance technique is used to find matches for it in the dictionary, and these matches are stored in an array. This is the first set of matches, selected by edit distance from the typed query, and it generally captures misspelt words in the query.

Soundex algorithm (2)

Here two different words with a similar pronunciation are mapped to the same code, which makes the algorithm useful for spelling correction. Refined Soundex divides words into more groups according to their sounds and provides a better approach to phonetic matching. Words coming out of the Levenshtein step (step 1) are analysed and phonetically matched immediately, which produces an array containing both the phonetically similar words and the Levenshtein-distance matches.

Peter Norvig's algorithm (3)

This step generates all possible single-edit variants of the query: inserts, deletes, replaces and transposes, and searches for them in the dictionary. The process is language dependent and expensive; it can take a lot of time because it produces many search terms. In finding the best match this algorithm has an accuracy of 80-90%. It then produces a large number of matches from the phonetically matched array of words from step 2 (Soundex).

Comparison (4)

In this step the outputs of step 2 and step 3 are compared and the best matches are selected.

N-Gram context match (5)

If more than one phonetic match is retrieved, context matching is done using each of the phonetic matches as the query, in the form {previous word, query, next word}. The following rules are applied: if both the word before and the word after the query are found in the dictionary, {previous word, query, next word} is used; if only the next word is found, {query, next word} is used; if only the previous word is found, {previous word, query} is used. The phonetically matching words for the query are retrieved, and the word that occurs most often in context is returned as the candidate. A sketch of this step is given below.
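The following sketch in Python shows steps that the text above describes for candidate detection and Levenshtein-distance matching. The file name dictionary.txt, the helper names and the distance threshold of 2 are illustrative assumptions, not part of any particular library:

# Sketch of step 1: candidate detection and Levenshtein-distance matching.
# Assumes a plain-text dictionary, one word per line ("dictionary.txt" is illustrative).

def load_dictionary(path="dictionary.txt"):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def is_candidate(token):
    # Tokens containing special symbols such as # or @ are non-candidates
    # and are skipped by the normalization pipeline.
    return token.isalpha()

def levenshtein(a, b):
    # Classic dynamic-programming edit distance (inserts, deletes, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_distance_matches(query, dictionary, max_distance=2):
    # First set of matches: dictionary words within the edit-distance threshold.
    return [w for w in dictionary if levenshtein(query.lower(), w) <= max_distance]

For a misspelt token such as "recieve", edit_distance_matches would typically return words like "receive", provided they appear in the dictionary file.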
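A minimal sketch of step 2 follows, using the classic four-character Soundex code; Refined Soundex uses a finer grouping of sounds, which is not reproduced here. The function names are again illustrative:

# Sketch of step 2: classic Soundex code, used to group phonetically similar words.

SOUNDEX_GROUPS = {
    "bfpv": "1", "cgjkqsxz": "2", "dt": "3",
    "l": "4", "mn": "5", "r": "6",
}

def soundex_digit(ch):
    for letters, digit in SOUNDEX_GROUPS.items():
        if ch in letters:
            return digit
    return ""  # vowels and h, w, y carry no digit

def soundex(word):
    word = word.lower()
    first = word[0].upper()
    code = ""
    prev_digit = soundex_digit(word[0])
    for ch in word[1:]:
        digit = soundex_digit(ch)
        if digit and digit != prev_digit:
            code += digit
        if ch not in "hw":  # h and w do not break a run of the same digit
            prev_digit = digit
    return (first + code + "000")[:4]

def phonetic_matches(query, candidates):
    # Keep only those edit-distance matches that also sound like the query.
    target = soundex(query)
    return [w for w in candidates if soundex(w) == target]

With this encoding, "robert" and "rupert" both map to R163, so they would be treated as phonetic matches for one another.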
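The next sketch follows the spirit of Peter Norvig's spelling corrector for steps 3 and 4: it generates every single-edit variant of the query (deletes, transposes, replaces, inserts), keeps only the variants present in the dictionary, and then compares them with the phonetic matches from step 2. The intersection-based comparison is an illustrative choice rather than a prescribed one:

# Sketch of steps 3 and 4: Norvig-style edit generation and comparison
# with the phonetic matches produced by the Soundex step.

import string

def single_edits(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def norvig_matches(query, dictionary):
    # Expensive: the number of generated terms grows with word length,
    # which is why this step is described as slow in the text above.
    return {w for w in single_edits(query.lower()) if w in dictionary}

def best_matches(phonetic, norvig):
    # Step 4: prefer words suggested by both the phonetic and the edit-based step;
    # fall back to either set if the intersection is empty.
    both = set(phonetic) & set(norvig)
    return both or set(phonetic) or set(norvig)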
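Finally, a minimal sketch of the n-gram context match (step 5). The n-gram frequency table is assumed to be built elsewhere from some corpus; the dictionary membership tests mirror the three rules described above:

# Sketch of step 5: n-gram context matching among several phonetic matches.
# `ngram_counts` maps a tuple of words to how often it occurs in some corpus;
# building that table is outside the scope of this sketch.

def context_key(prev_word, candidate, next_word, dictionary):
    # Apply the rules from the text: use whichever neighbours are in the dictionary.
    if prev_word in dictionary and next_word in dictionary:
        return (prev_word, candidate, next_word)
    if next_word in dictionary:
        return (candidate, next_word)
    if prev_word in dictionary:
        return (prev_word, candidate)
    return (candidate,)

def context_match(prev_word, next_word, phonetic_matches, dictionary, ngram_counts):
    # Score each phonetic match by how often it occurs in this context
    # and return the most frequent one as the candidate.
    def score(candidate):
        return ngram_counts.get(
            context_key(prev_word, candidate, next_word, dictionary), 0)
    return max(phonetic_matches, key=score)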
Conclusion

Incorporating several normalization techniques improves the accuracy of auto-completion and auto-correction. Although some of these algorithms are laborious, for example the N-gram step, which takes a substantial amount of time, they are worth pursuing if accuracy is to be achieved.

References

Han, B. and Baldwin, T. Lexical Normalisation of Short Text Messages: Makn Sens a #twitter.
Whitelaw, C., Hutchinson, B., Chung, G. Y. and Ellis, G. Using the Web for Language Independent Spellchecking and Autocorrection. Google Inc., Pyrmont NSW 2009, Australia.
