Item request has been placed! ×
Item request cannot be made. ×
loading  Processing Request

Methods and system for fast, adaptive correction of misspells

Item request has been placed! ×
Item request cannot be made. ×
loading   Processing Request
  • Publication Date:
    March 03, 2020
  • Additional Information
    • Patent Number:
      10579,729
    • Appl. No:
      15/296818
    • Application Filed:
      October 18, 2016
    • Abstract:
      Embodiments are directed to a spellcheck module for an enterprise search engine. The spellcheck module includes a candidate suggestion generation module that generates a number of candidate words that may be the correction of the misspelled word. The candidate suggestion generation module implements an algorithm for indexing, searching, and storing terms from an index with a constrained edit distance, using words in a collection of documents. The spellcheck module further includes a candidate suggestion ranking module. In one embodiment, a non-contextual approach using a linear combination of distance and probability scores is utilized; while in another embodiment, a context sensitive approach accounting for real-word misspells and adopting deep learning models is utilized. In use, a query is provided to the spellcheck module to generate results in the form of a ranked list of generated candidate entries that may be an entry a user accidentally misspelled.
    • Inventors:
      International Business Machines Corporation (Armonk, NY, US)
    • Assignees:
      International Business Machines Corporation (Armonk, NY, US)
    • Claim:
      1. A computer-implemented method for adaptive correction of misspelling, the method comprising: defining, by a processor coupled to one or more user devices, a maximum edit distance and a threshold frequency for words of a dataset to be added to an index; sorting, by the processor, the dataset to identify the words of the dataset to add to the index based on the threshold frequency; adding, by the processor, to the index the identified words and alternative words having character deletions in accordance with the maximum edit distance to create entries; receiving, at the processor from a first user device of the one or more user devices, a text for spelling analysis; identifying, by the processor, one or more candidate entries from the entries of the index by obtaining from the index the entries associated with the text; and ranking, by the processor, the one or more candidate entries based on a ranking score; wherein the ranking score is a linear combination of Smoothed Term Probability (STP) and Edit Similarity (ES), and the STP considers a logarithm of a frequency of each candidate entry and a logarithm of all occurrences in the index; the linear combination of STP and ES is weighted by a hyper parameter alpha comprising a validation parameter.
    • Claim:
      2. The method of claim 1 , wherein adding to the index the identified words and the alternative words further comprises adding links to the identified words and the alternative words to create the entries.
    • Claim:
      3. The method of claim 2 , further comprising storing in memory associated with the processor an associated link for the alternative words that are not part of the dataset.
    • Claim:
      4. The method of claim 1 , further comprising ordering, by the processor, the one or more candidate entries based on the ranking to identify corrections to the text.
    • Claim:
      5. The method of claim 1 , wherein the processor is part of an enterprise search engine and the dataset comprises a collection of data of the enterprise search engine.
    • Claim:
      6. A system for adaptive correction of misspelling, the system comprising: a processor coupled to one or more user devices to receive user-generated search queries from the one or more user devices, the processor configured to: define a maximum edit distance and a threshold frequency for words of a dataset to be added to an index; sort the dataset to identify the words of the dataset to add to the index based on the threshold frequency; add to the index the identified words and alternative words having character deletions in accordance with the maximum edit distance to create entries; receive from a first user device of the one or more user devices a text for spelling analysis; identify one or more candidate entries from the entries of the index by obtaining from the index the entries associated with the text; and rank the one or more candidate entries based on a ranking score; wherein the ranking score is a linear combination of Smoothed Term Probability (STP) and Edit Similarity (ES), and the STP considers a logarithm of a frequency of each candidate entry and a logarithm of all occurrences in the index; the linear combination of STP and ES is weighted by a hyper parameter alpha comprising a validation parameter.
    • Claim:
      7. The system of claim 6 , wherein adding to the index the identified words and the alternative words further comprises adding links to the identified words and the alternative words to create the entries.
    • Claim:
      8. The system of claim 7 , wherein the processor is further configured to store in memory associated with the processor an associated link for the alternative words that are not part of the dataset.
    • Claim:
      9. The system of claim 6 , wherein the processor is further configured to order the one or more candidate entries based on the ranking to identify corrections to the text.
    • Claim:
      10. The system of claim 6 , wherein the processor is part of an enterprise search engine and the dataset comprises a collection of data of the enterprise search engine.
    • Claim:
      11. A computer program product for adaptive correction of misspelling, the computer program product comprising: a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor coupled to one or more user devices to receive user-generated search queries from the one or more user devices to cause the processor to: define a maximum edit distance and a threshold frequency for words of a dataset to be added to an index; sort the dataset to identify the words of the dataset to add to the index based on the threshold frequency; add to the index the identified words and alternative words having character deletions in accordance with the maximum edit distance to create entries; receive from a first user device of the one or more user devices a text for spelling analysis; identify one or more candidate entries from the entries of the index by obtaining from the index the entries associated with the text; and rank the one or more candidate entries based on a ranking score; wherein the ranking score is a linear combination of Smoothed Term Probability (STP) and Edit Similarity (ES), and the STP considers a logarithm of a frequency of each candidate entry and a logarithm of all occurrences in the index; the linear combination of STP and ES is weighted by a hyper parameter alpha comprising a validation parameter.
    • Claim:
      12. The computer program product of claim 11 , wherein adding to the index the identified words and the alternative words further comprises adding links to the identified words and the alternative words to create the entries.
    • Claim:
      13. The computer program product of claim 12 , wherein the program instructions further cause the processor to store in memory associated with the processor an associated link for the alternative words that are not part of the dataset.
    • Claim:
      14. The computer program product of claim 11 , wherein the program instructions further cause the processor to order the one or more candidate entries based on the ranking to identify corrections to the text.
    • Patent References Cited:
      4674065 June 1987 Lange et al.
      5148367 September 1992 Saito et al.
      5258909 November 1993 Damerau et al.
      5572423 November 1996 Church
      5604897 February 1997 Travis
      5659771 August 1997 Golding
      5699441 December 1997 Sagawa
      5907839 May 1999 Roth
      6047300 April 2000 Walfish et al.
      6616704 September 2003 Birman et al.
      7047493 May 2006 Brill et al.
      7254774 August 2007 Cucerzan et al.
      7321892 January 2008 Vadon et al.
      7660806 February 2010 Brill et al.
      7809744 October 2010 Nevidomski et al.
      7818332 October 2010 Olds et al.
      8290968 October 2012 Jonas
      8441454 May 2013 Longe
      8621344 December 2013 Shazeer
      8775341 July 2014 Commons
      8799237 August 2014 Walsh
      9015036 April 2015 Karov Zangvil
      9020822 April 2015 Kalinli-Akbacak
      9031293 May 2015 Kalinli-Akbacak
      9037967 May 2015 Al-Jefri
      9245052 January 2016 Brewer et al.
      9557916 January 2017 Robinson
      9740767 August 2017 Quinion
      2002/0087604 July 2002 Bernth et al.
      2003/0145285 July 2003 Miyahira et al.
      2007/0016616 January 2007 Brill et al.
      2008/0077396 March 2008 Hsu
      2008/0155398 June 2008 Bodin et al.
      2008/0167858 July 2008 Christie et al.
      2009/0089666 April 2009 White et al.
      2010/0076972 March 2010 Baron
      2011/0193797 August 2011 Unruh
      2012/0191357 July 2012 Qiu
      2012/0229388 September 2012 Oh et al.
      2013/0066896 March 2013 Mehanna
      2013/0238584 September 2013 Hendry
      2013/0262096 October 2013 Wilhelms-Tricarico
      2013/0283156 October 2013 Badrashiny et al.
      2014/0104175 April 2014 Ouyang
      2014/0249799 September 2014 Yih
      2014/0358831 December 2014 Adams
      2015/0332670 November 2015 Akbacak et al.
      2016/0110343 April 2016 Kumar
      2016/0210551 July 2016 Lee
      2016/0306876 October 2016 Nichols
      2016/0350653 December 2016 Socher et al.
      2017/0032243 February 2017 Corrado
      2017/0358293 December 2017 Chua
      2017/0372200 December 2017 Chen et al.
      2018/0061439 March 2018 Diamos
      2018/0150605 May 2018 Co
      1104568 January 2012
      WO-2009040790 April 2009
      WO-2012151255 November 2012



















    • Other References:
      Tyler Garaas, Mei Xiao, Marc Pomplun: Personalized Spell Checking using Neural Networks Tyler Garaas. University of Boston: 1-5. cited by applicant
      Eric Brill, Robert C. Moore: An Improved Error Model for Noisy Channel Spelling Correction. ACL 2000. cited by applicant
      Silviu Cucerzan, Eric Bill: Spelling Correction as an Itrative Process that Exploits the Collective Knowledge of Web Users. EMNLP 2004: 293-300. cited by applicant
      Aminul Islam, Diana Inkpen: Real-word spelling correction using Google web 1Tn-gram data set. CIKM 2009: 1689-1692. cited by applicant
      Andrew R. Golding: In Proceedings of the Third Workshop on Very Large Corpora (1995): 39-53. cited by applicant
      Andres R. Golding, Yves Schabes: Combining Trigram-based and Feature-based Methods for Context-Sensitive Spelling Correction. ACL 1996: 71-78. cited by applicant
      Andrew R. Golding, Dan Roth: A Winnow-Based Approach to Context-Sensitive Spelling Correction. Machine Learning 34(1-3): 107-130 (1999). cited by applicant
      Mark D. Kernighan, Kenneth Ward Church, William A. Gale: A Spelling Correction Program Based on a Noisy Channel Model. COLING 1990: 205-210. cited by applicant
      Karen Kukich: Techniques for Automatically Correcting Words in Text. ACM Comput. Surv. 24(4): 377-439 (1992). cited by applicant
      Lidia Mangu. Eric Brill. Automatic Rule Acqusition for Spelling Correction ICML 1997: 187-194. cited by applicant
      Herman Stehouwer, Menno van Zaanen: Language models for contextual error detection and correction. CLAGI 2009: 41-48. cited by applicant
      Kristina Toutanova, Robert C. Moore: Pronunciation Modeling for Improved Spelling Correction. ACL 2002: 144-151. cited by applicant
      Casey Whitelaw, Ben Hutchinson, Grace Chung, Ged Ellis: Using the Web for Language Independent Spellchecking and Autocorrection. EMNLP 2009: 890-899. cited by applicant
      “SpellChecker” Lucene Java-Wiki. Sep. 20, 2009. Web. Oct. 13, 2016. . cited by applicant
      Eric Mays, Fred J. Damerau, Robert L. Mercer: Context based spelling correction. Inf. Process. Manage. 27(5): 517-522 (1991). cited by applicant
      Peiling Wang, Michael W. Berry, Yiheng Yang: Mining longitudinal web queries: Trends and patterns. JASIST 54(8): 743-758 (2003). cited by applicant
      Non-Final Office Action dated Oct. 12, 2018 in corresponding U.S. Appl. No. 15/296,794. cited by applicant
      Office Action dated Aug. 2, 2017 in related U.S. Appl. No. 15/296,794. cited by applicant
      U.S. Appl. No. 15/296,794, filed Oct. 18, 2016. cited by applicant
      Notice of Allowance dated Mar. 25, 2019 in corresponding U.S. Appl. No. 15/296,794. cited by applicant
    • Primary Examiner:
      Chau, Dung K
    • Attorney, Agent or Firm:
      Pepper Hamilton LLP
    • Accession Number:
      edspgr.10579729