Stop word

Stop words are the words in a stop list (also stoplist or negative dictionary) that are filtered out ("stopped") before or after the processing of natural language data (text) because they are deemed insignificant.[1] There is no single universal list of stop words used by all natural language processing tools, nor are there any agreed-upon rules for identifying stop words; indeed, not all tools use such a list. Any group of words can therefore be chosen as the stop words for a given purpose. The "general trend in [information retrieval] systems over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever".[2]
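As a minimal sketch of the filtering described above, the following Python snippet removes stop words from a tokenized text. The stop list, the tokenizer, and the function names are assumptions made for illustration; they do not correspond to any particular tool's list or API.

```python
import re

# Hypothetical stop list for illustration only; real tools use
# different lists, or no list at all.
STOP_WORDS = frozenset({"the", "is", "at", "which", "on", "a", "of"})

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list, preserving order."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The cat is on the mat")
filtered = remove_stop_words(tokens)
print(filtered)  # ['cat', 'mat']
```

Because the stop list is passed as a parameter, a different group of words can be substituted for a different purpose.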

History of stop words

A predecessor concept was used in creating some concordances. For example, the first Hebrew concordance, Isaac Nathan ben Kalonymus's Me’ir Nativ, contained a one-page list of unindexed words, with nonsubstantive prepositions and conjunctions which are similar to modern stop words.[3]

Hans Peter Luhn, one of the pioneers in information retrieval, is credited with coining the phrase and using the concept when introducing his Keyword-in-Context automatic indexing process.[4] The phrase "stop word", which is not in Luhn's 1959 presentation, and the associated terms "stop list" and "stoplist" appear in the literature shortly afterward.[5]

Although it is commonly assumed that stoplists include only the most frequent words in a language, it was C.J. Van Rijsbergen who proposed the first standardized list which was not based on word frequency information. The "Van list" included 250 English words. Martin Porter's word stemming program developed in the 1980s built on the Van list, and the Porter list is now commonly used as a default stoplist in a variety of software applications.

In 1990, Christopher Fox proposed the first general stop list based on empirical word frequency information derived from the Brown Corpus:

This paper reports an exercise in generating a stop list for general text based on the Brown corpus of 1,014,000 words drawn from a broad range of literature in English. We start with a list of tokens occurring more than 300 times in the Brown corpus. From this list of 278 words, 32 are culled on the grounds that they are too important as potential index terms. Twenty-six words are then added to the list in the belief that they may occur very frequently in certain kinds of literature. Finally, 149 words are added to the list because the finite state machine based filter in which this list is intended to be used is able to filter them at almost no cost. The final product is a list of 421 stop words that should be maximally efficient and effective in filtering the most frequently occurring and semantically neutral words in general literature in English.[6]
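The procedure in the abstract (take the tokens above a frequency threshold, cull potential index terms, then add further words) can be sketched roughly as follows. This is not Fox's code: the function name, the toy corpus, the threshold, and the culled and added words are all hypothetical, and his actual selections were made by hand on the Brown Corpus.

```python
from collections import Counter

def build_stop_list(tokens, min_count, culled=(), added=()):
    """Build a frequency-based stop list.

    1. Keep every token occurring more than `min_count` times.
    2. Remove words judged too important as index terms (`culled`).
    3. Add words chosen on other grounds (`added`).
    """
    counts = Counter(tokens)
    frequent = {word for word, n in counts.items() if n > min_count}
    return (frequent - set(culled)) | set(added)

# Toy corpus and hand-picked word sets, purely for illustration.
corpus = "the cat and the dog and the bird saw the time".split()
stop_list = build_stop_list(corpus, min_count=1, culled={"and"}, added={"of"})
print(sorted(stop_list))  # ['of', 'the']
```

With the Brown Corpus and a threshold of 300, step 1 would correspond to the 278-word starting list mentioned in the abstract.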

In SEO terminology, stop words are the most common words, which many search engines used to skip in order to save space and time when processing large volumes of data during crawling or indexing.

For some search engines, these are some of the most common, short function words, such as "the", "is", "at", "which", and "on". In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who", "The The", or "Take That". Other search engines remove some of the most common words, including lexical words such as "want", from a query in order to improve performance.[7]
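The phrase-search problem can be illustrated with a short sketch. The stop list below extends the function words mentioned above with "who" and "that", and the fallback in `safe_query_terms` is one hypothetical mitigation; neither is taken from any specific search engine.

```python
# Hypothetical stop list: the function words named in the text,
# plus "who" and "that" (an assumption for this example).
STOP_WORDS = frozenset({"the", "is", "at", "which", "on", "who", "that"})

def strip_stop_words(query):
    """Drop stop words from a whitespace-separated, case-insensitive query."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def safe_query_terms(query):
    """Assumed mitigation: if filtering removes every term,
    fall back to the unfiltered terms instead of an empty query."""
    kept = strip_stop_words(query)
    return kept if kept else query.lower().split()

print(strip_stop_words("The Who"))    # []
print(strip_stop_words("Take That"))  # ['take']
print(safe_query_terms("The The"))    # ['the', 'the']
```

With naive filtering, a query for "The Who" becomes empty and "Take That" loses half of the name, which is why such names are hard to find in engines that drop stop words.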

In recent years, SEO best practices around stop words have evolved along with the fields of machine learning and natural language processing. In February 2021, John Mueller, Webmaster Trends Analyst at Google, tweeted, "I wouldn't worry about stop words at all; write naturally. Search engines look at much, much more than individual words. 'To be or not to be' just is a collection of stop words, but stop words alone don't do it any justice."[8][9]

References

  1. Rajaraman, A.; Ullman, J. D. (2011). "Data Mining" (PDF). Mining of Massive Datasets. pp. 1–17. doi:10.1017/CBO9781139058452.002. ISBN 9781139058452.
  2. Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2008). Introduction to Information Retrieval. Cambridge University Press. p. 27.
  3. Weinberg, Bella Hass (2004). "Predecessors of scientific indexing structures in the domain of religion" (PDF). Second Conference on the History and Heritage of Scientific and Technical Information Systems: 126–134. Archived from the original (PDF) on 3 January 2016. Retrieved 17 February 2016.
  4. Luhn, H. P. (1959). "Keyword-in-Context Index for Technical Literature (KWIC Index)". American Documentation. 11 (4). Yorktown Heights, NY: International Business Machines Corp.: 288–295. doi:10.1002/asi.5090110403.
  5. Flood, Barbara J. (1999). "Historical note: The Start of a Stop List at Biological Abstracts". Journal of the American Society for Information Science. 50 (12): 1066. doi:10.1002/(SICI)1097-4571(1999)50:12<1066::AID-ASI5>3.0.CO;2-A.
  6. Fox, Christopher (1989-09-01). "A stop list for general text". ACM SIGIR Forum. 24 (1–2): 19–21. doi:10.1145/378881.378888. ISSN 0163-5840. S2CID 20240000.
  7. Stack Overflow: "One of our major performance optimizations for the "related questions" query is removing the top 10,000 most common English dictionary words (as determined by Google search) before submitting the query to the SQL Server 2008 full text engine. It’s shocking how little is left of most posts once you remove the top 10k English dictionary words. This helps limit and narrow the returned results, which makes the query dramatically faster".
  8. "Google: Stop Worrying About Stop Words Just Write Naturally". seroundtable.com. 16 February 2021. Retrieved 15 July 2022.
  9. Mueller, John (6 February 2021). "John Mueller on stop words in 2021: "I wouldn't worry about stop words at all"". Twitter. Retrieved 15 July 2022.
