Bag-of-words model

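In natural language processing and information retrieval, the bag-of-words model is a simplifying representation. Under this model, a text (such as a sentence or a document) is represented as a bag containing its words, keeping the number of times each word occurs; the representation disregards grammar and word order. More recently, the bag-of-words model has also been applied in computer vision.[1]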

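The bag-of-words model is widely used in document classification, where the frequency with which each word occurs serves as a feature for training a classifier.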

The term "bag of words" can be traced back to Zellig Harris's 1954 article "Distributional Structure".[2]

Example

The following two simple documents can each be represented as a bag of words:

(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.

From these two documents, the following list of words can be constructed:

[
    "John",
    "likes",
    "to",
    "watch",
    "movies",
    "also",
    "football",
    "games",
    "Mary",
    "too"
]

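The list contains 10 distinct words. Using the positions in the list as indices, each document can be represented by a vector of length 10: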

(1) [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
(2) [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

Each entry of a vector is the number of times the corresponding word in the list occurs in that document.

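For example, the first two entries of the first vector (document 1) are 1 and 2. The first entry corresponds to "John", the first word in the list, and its value is 1 because "John" occurs once in document 1. The second entry corresponds to "likes", which occurs twice.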

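This vector representation does not preserve the order of the words in the original sentences. It has been applied successfully in many settings, such as email filtering.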
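
The counting step described in this example can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the helper names `tokenize` and `to_vector` are hypothetical, the tokenizer simply splits on whitespace and strips trailing punctuation, and the vocabulary is copied from the list given in the example.

```python
from collections import Counter

# Vocabulary in the same order as the list in the example above.
VOCAB = ["John", "likes", "to", "watch", "movies",
         "also", "football", "games", "Mary", "too"]

def tokenize(text):
    """Hypothetical tokenizer: split on whitespace, strip simple punctuation."""
    words = (w.strip(".,!?") for w in text.split())
    return [w for w in words if w]

def to_vector(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "John also likes to watch football games."

vec1 = to_vector(doc1, VOCAB)  # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
vec2 = to_vector(doc2, VOCAB)  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

# Reordering the words leaves the vector unchanged: word order is not preserved.
reordered = "Mary likes movies too. John likes to watch movies."
same_as_doc1 = to_vector(reordered, VOCAB) == vec1
```

Because only counts are stored, `doc1` and the reordered text map to the same vector, which is exactly the loss of word order noted above.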

Term weighting

In the example above, the document vectors contain term frequencies.

In information retrieval and text classification, other methods are often used to weight terms. A common one is tf-idf.
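
As a rough illustration of tf-idf weighting, the sketch below reweights the two count vectors from the example. tf-idf has several variants; this one assumes the raw count as the term frequency and log(N / df) as the inverse document frequency, where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf` is hypothetical.

```python
import math

# Count vectors for documents (1) and (2) from the example above.
COUNTS = [
    [1, 2, 1, 1, 2, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 0, 1, 1, 1, 0, 0],
]

def tf_idf(count_vectors):
    """Weight raw counts by log(N / df); one variant among several."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for vec in count_vectors if vec[j] > 0) for j in range(n_terms)]
    # Terms that occur in no document get weight 0 to avoid dividing by zero.
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]
    return [[tf * w for tf, w in zip(vec, idf)] for vec in count_vectors]

weights = tf_idf(COUNTS)
```

Under this variant, words that occur in both documents ("John", "likes", "to", "watch") receive weight 0, while "movies", which occurs twice in document 1 and not at all in document 2, receives weight 2 · log 2 in document 1.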

Example: spam filtering

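To classify an email message, a Bayesian spam filter assumes that the message is a pile of words poured out at random from one of two bags, and then uses Bayesian probability to decide which "bag" (the "spam bag" or the "legitimate-mail bag") is the more likely source.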
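
The two-bag picture can be sketched as a small naive Bayes classifier. This is a minimal sketch under stated assumptions, not how any particular spam filter works: the training messages are invented for illustration, the function names `train` and `classify` are hypothetical, add-one (Laplace) smoothing is assumed, and words never seen in training are ignored.

```python
import math
from collections import Counter

# Hypothetical toy training data, invented purely for illustration.
TRAIN = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule for monday", "ham"),
    ("lunch on monday", "ham"),
]

def train(examples):
    """Fill one 'bag' of word counts per label."""
    word_counts = {}        # label -> Counter of words seen under that label
    doc_counts = Counter()  # label -> number of training messages
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the label whose bag is the more likely source of the words."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Log prior probability of the bag.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # assumption: skip words never seen in training
            # Add-one smoothing keeps unseen (word, label) pairs above zero.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
label_a = classify("cheap money", model)     # "spam"
label_b = classify("monday meeting", model)  # "ham"
```

The score for a message depends only on which words it contains and how often, so "cheap money" and "money cheap" are classified identically, in keeping with the bag-of-words assumption.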

References

  1. Sivic, Josef. Efficient visual search of videos cast as text retrieval (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 591–605. April 2009. Retrieved 2016-03-06. Archived (PDF) from the original on 2016-02-22.
  2. Harris, Zellig. Distributional Structure. Word. 1954, 10 (2/3): 146–62. "And this stock of combinations of elements becomes a factor in the way later choices are made ... for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use"
