Methods and system for fast, adaptive correction of misspells

Item request has been placed!

Item request cannot be made.

Processing Request

Read More Add to Saved list

Publication Date:
March 03, 2020

Additional Information
- Patent Number:
  10579,729
- Appl. No:
  15/296818
- Application Filed:
  October 18, 2016
- Abstract:
  Embodiments are directed to a spellcheck module for an enterprise search engine. The spellcheck module includes a candidate suggestion generation module that generates a number of candidate words that may be the correction of the misspelled word. The candidate suggestion generation module implements an algorithm for indexing, searching, and storing terms from an index with a constrained edit distance, using words in a collection of documents. The spellcheck module further includes a candidate suggestion ranking module. In one embodiment, a non-contextual approach using a linear combination of distance and probability scores is utilized; while in another embodiment, a context sensitive approach accounting for real-word misspells and adopting deep learning models is utilized. In use, a query is provided to the spellcheck module to generate results in the form of a ranked list of generated candidate entries that may be an entry a user accidentally misspelled.
- Inventors:
  International Business Machines Corporation (Armonk, NY, US)
- Assignees:
  International Business Machines Corporation (Armonk, NY, US)
- Claim:
  1. A computer-implemented method for adaptive correction of misspelling, the method comprising: defining, by a processor coupled to one or more user devices, a maximum edit distance and a threshold frequency for words of a dataset to be added to an index; sorting, by the processor, the dataset to identify the words of the dataset to add to the index based on the threshold frequency; adding, by the processor, to the index the identified words and alternative words having character deletions in accordance with the maximum edit distance to create entries; receiving, at the processor from a first user device of the one or more user devices, a text for spelling analysis; identifying, by the processor, one or more candidate entries from the entries of the index by obtaining from the index the entries associated with the text; and ranking, by the processor, the one or more candidate entries based on a ranking score; wherein the ranking score is a linear combination of Smoothed Term Probability (STP) and Edit Similarity (ES), and the STP considers a logarithm of a frequency of each candidate entry and a logarithm of all occurrences in the index; the linear combination of STP and ES is weighted by a hyper parameter alpha comprising a validation parameter.
- Claim:
  2. The method of claim 1 , wherein adding to the index the identified words and the alternative words further comprises adding links to the identified words and the alternative words to create the entries.
- Claim:
  3. The method of claim 2 , further comprising storing in memory associated with the processor an associated link for the alternative words that are not part of the dataset.
- Claim:
  4. The method of claim 1 , further comprising ordering, by the processor, the one or more candidate entries based on the ranking to identify corrections to the text.
- Claim:
  5. The method of claim 1 , wherein the processor is part of an enterprise search engine and the dataset comprises a collection of data of the enterprise search engine.
- Claim:
  6. A system for adaptive correction of misspelling, the system comprising: a processor coupled to one or more user devices to receive user-generated search queries from the one or more user devices, the processor configured to: define a maximum edit distance and a threshold frequency for words of a dataset to be added to an index; sort the dataset to identify the words of the dataset to add to the index based on the threshold frequency; add to the index the identified words and alternative words having character deletions in accordance with the maximum edit distance to create entries; receive from a first user device of the one or more user devices a text for spelling analysis; identify one or more candidate entries from the entries of the index by obtaining from the index the entries associated with the text; and rank the one or more candidate entries based on a ranking score; wherein the ranking score is a linear combination of Smoothed Term Probability (STP) and Edit Similarity (ES), and the STP considers a logarithm of a frequency of each candidate entry and a logarithm of all occurrences in the index; the linear combination of STP and ES is weighted by a hyper parameter alpha comprising a validation parameter.
- Claim:
  7. The system of claim 6 , wherein adding to the index the identified words and the alternative words further comprises adding links to the identified words and the alternative words to create the entries.
- Claim:
  8. The system of claim 7 , wherein the processor is further configured to store in memory associated with the processor an associated link for the alternative words that are not part of the dataset.
- Claim:
  9. The system of claim 6 , wherein the processor is further configured to order the one or more candidate entries based on the ranking to identify corrections to the text.
- Claim:
  10. The system of claim 6 , wherein the processor is part of an enterprise search engine and the dataset comprises a collection of data of the enterprise search engine.
- Claim:
  11. A computer program product for adaptive correction of misspelling, the computer program product comprising: a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor coupled to one or more user devices to receive user-generated search queries from the one or more user devices to cause the processor to: define a maximum edit distance and a threshold frequency for words of a dataset to be added to an index; sort the dataset to identify the words of the dataset to add to the index based on the threshold frequency; add to the index the identified words and alternative words having character deletions in accordance with the maximum edit distance to create entries; receive from a first user device of the one or more user devices a text for spelling analysis; identify one or more candidate entries from the entries of the index by obtaining from the index the entries associated with the text; and rank the one or more candidate entries based on a ranking score; wherein the ranking score is a linear combination of Smoothed Term Probability (STP) and Edit Similarity (ES), and the STP considers a logarithm of a frequency of each candidate entry and a logarithm of all occurrences in the index; the linear combination of STP and ES is weighted by a hyper parameter alpha comprising a validation parameter.
- Claim:
  12. The computer program product of claim 11 , wherein adding to the index the identified words and the alternative words further comprises adding links to the identified words and the alternative words to create the entries.
- Claim:
  13. The computer program product of claim 12 , wherein the program instructions further cause the processor to store in memory associated with the processor an associated link for the alternative words that are not part of the dataset.
- Claim:
  14. The computer program product of claim 11 , wherein the program instructions further cause the processor to order the one or more candidate entries based on the ranking to identify corrections to the text.
- Patent References Cited:
  4674065 June 1987 Lange et al.
  5148367 September 1992 Saito et al.
  5258909 November 1993 Damerau et al.
  5572423 November 1996 Church
  5604897 February 1997 Travis
  5659771 August 1997 Golding
  5699441 December 1997 Sagawa
  5907839 May 1999 Roth
  6047300 April 2000 Walfish et al.
  6616704 September 2003 Birman et al.
  7047493 May 2006 Brill et al.
  7254774 August 2007 Cucerzan et al.
  7321892 January 2008 Vadon et al.
  7660806 February 2010 Brill et al.
  7809744 October 2010 Nevidomski et al.
  7818332 October 2010 Olds et al.
  8290968 October 2012 Jonas
  8441454 May 2013 Longe
  8621344 December 2013 Shazeer
  8775341 July 2014 Commons
  8799237 August 2014 Walsh
  9015036 April 2015 Karov Zangvil
  9020822 April 2015 Kalinli-Akbacak
  9031293 May 2015 Kalinli-Akbacak
  9037967 May 2015 Al-Jefri
  9245052 January 2016 Brewer et al.
  9557916 January 2017 Robinson
  9740767 August 2017 Quinion
  2002/0087604 July 2002 Bernth et al.
  2003/0145285 July 2003 Miyahira et al.
  2007/0016616 January 2007 Brill et al.
  2008/0077396 March 2008 Hsu
  2008/0155398 June 2008 Bodin et al.
  2008/0167858 July 2008 Christie et al.
  2009/0089666 April 2009 White et al.
  2010/0076972 March 2010 Baron
  2011/0193797 August 2011 Unruh
  2012/0191357 July 2012 Qiu
  2012/0229388 September 2012 Oh et al.
  2013/0066896 March 2013 Mehanna
  2013/0238584 September 2013 Hendry
  2013/0262096 October 2013 Wilhelms-Tricarico
  2013/0283156 October 2013 Badrashiny et al.
  2014/0104175 April 2014 Ouyang
  2014/0249799 September 2014 Yih
  2014/0358831 December 2014 Adams
  2015/0332670 November 2015 Akbacak et al.
  2016/0110343 April 2016 Kumar
  2016/0210551 July 2016 Lee
  2016/0306876 October 2016 Nichols
  2016/0350653 December 2016 Socher et al.
  2017/0032243 February 2017 Corrado
  2017/0358293 December 2017 Chua
  2017/0372200 December 2017 Chen et al.
  2018/0061439 March 2018 Diamos
  2018/0150605 May 2018 Co
  1104568 January 2012
  WO-2009040790 April 2009
  WO-2012151255 November 2012
- Other References:
  Tyler Garaas, Mei Xiao, Marc Pomplun: Personalized Spell Checking using Neural Networks Tyler Garaas. University of Boston: 1-5. cited by applicant
  Eric Brill, Robert C. Moore: An Improved Error Model for Noisy Channel Spelling Correction. ACL 2000. cited by applicant
  Silviu Cucerzan, Eric Bill: Spelling Correction as an Itrative Process that Exploits the Collective Knowledge of Web Users. EMNLP 2004: 293-300. cited by applicant
  Aminul Islam, Diana Inkpen: Real-word spelling correction using Google web 1Tn-gram data set. CIKM 2009: 1689-1692. cited by applicant
  Andrew R. Golding: In Proceedings of the Third Workshop on Very Large Corpora (1995): 39-53. cited by applicant
  Andres R. Golding, Yves Schabes: Combining Trigram-based and Feature-based Methods for Context-Sensitive Spelling Correction. ACL 1996: 71-78. cited by applicant
  Andrew R. Golding, Dan Roth: A Winnow-Based Approach to Context-Sensitive Spelling Correction. Machine Learning 34(1-3): 107-130 (1999). cited by applicant
  Mark D. Kernighan, Kenneth Ward Church, William A. Gale: A Spelling Correction Program Based on a Noisy Channel Model. COLING 1990: 205-210. cited by applicant
  Karen Kukich: Techniques for Automatically Correcting Words in Text. ACM Comput. Surv. 24(4): 377-439 (1992). cited by applicant
  Lidia Mangu. Eric Brill. Automatic Rule Acqusition for Spelling Correction ICML 1997: 187-194. cited by applicant
  Herman Stehouwer, Menno van Zaanen: Language models for contextual error detection and correction. CLAGI 2009: 41-48. cited by applicant
  Kristina Toutanova, Robert C. Moore: Pronunciation Modeling for Improved Spelling Correction. ACL 2002: 144-151. cited by applicant
  Casey Whitelaw, Ben Hutchinson, Grace Chung, Ged Ellis: Using the Web for Language Independent Spellchecking and Autocorrection. EMNLP 2009: 890-899. cited by applicant
  “SpellChecker” Lucene Java-Wiki. Sep. 20, 2009. Web. Oct. 13, 2016. . cited by applicant
  Eric Mays, Fred J. Damerau, Robert L. Mercer: Context based spelling correction. Inf. Process. Manage. 27(5): 517-522 (1991). cited by applicant
  Peiling Wang, Michael W. Berry, Yiheng Yang: Mining longitudinal web queries: Trends and patterns. JASIST 54(8): 743-758 (2003). cited by applicant
  Non-Final Office Action dated Oct. 12, 2018 in corresponding U.S. Appl. No. 15/296,794. cited by applicant
  Office Action dated Aug. 2, 2017 in related U.S. Appl. No. 15/296,794. cited by applicant
  U.S. Appl. No. 15/296,794, filed Oct. 18, 2016. cited by applicant
  Notice of Allowance dated Mar. 25, 2019 in corresponding U.S. Appl. No. 15/296,794. cited by applicant
- Primary Examiner:
  Chau, Dung K
- Attorney, Agent or Firm:
  Pepper Hamilton LLP
- Accession Number:
  edspgr.10579729

Comments

No Comments.

Methods and system for fast, adaptive correction of misspells

Contact

Follow us