Streaming algorithm

In computer science, streaming algorithms are algorithms for processing data streams in which the input is presented as a sequence of items and can be examined in only a few passes, typically just one. These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.

As a result of these constraints, streaming algorithms often produce approximate answers based on a summary or "sketch" of the data stream.

History

Though streaming algorithms had already been studied by Munro and Paterson[1] as early as 1978, as well as Philippe Flajolet and G. Nigel Martin in 1982/83,[2] the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon, Yossi Matias, and Mario Szegedy.[3] For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms." There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.

Semi-streaming algorithms were introduced in 2005 as a relaxation of streaming algorithms for graphs,[4] in which the space allowed is linear in the number of vertices n, but only logarithmic in the number of edges m. This relaxation is still meaningful for dense graphs, and can solve interesting problems (such as connectivity) that are insoluble in space.

Models

Data stream model

In the data stream model, some or all of the input is represented as a finite sequence of integers (from some finite domain) which is generally not available for random access, but instead arrives one at a time in a "stream".[5] If the stream has length n and the domain has size m, algorithms are generally constrained to use space that is logarithmic in m and n. They can generally make only some small constant number of passes over the stream, sometimes just one.[6]

Turnstile and cash register models

Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored. For this class of problems, there is a vector (initialized to the zero vector ) that has updates presented to it in a stream. The goal of these algorithms is to compute functions of using considerably less space than it would take to represent precisely. There are two common models for updating such streams, called the "cash register" and "turnstile" models.[7]

In the cash register model, each update is of the form , so that is incremented by some positive integer . A notable special case is when (only unit insertions are permitted).

In the turnstile model, each update is of the form , so that is incremented by some (possibly negative) integer . In the "strict turnstile" model, no at any time may be less than zero.

Sliding window model

Several papers also consider the "sliding window" model.[citation needed] In this model, the function of interest is computing over a fixed-size window in the stream. As the stream progresses, items from the end of the window are removed from consideration while new items from the stream take their place.

Besides the above frequency-based problems, some other types of problems have also been studied. Many graph problems are solved in the setting where the adjacency matrix or the adjacency list of the graph is streamed in some unknown order. There are also some problems that are very dependent on the order of the stream (i.e., asymmetric functions), such as counting the number of inversions in a stream and finding the longest increasing subsequence.[citation needed]

Evaluation

The performance of an algorithm that operates on data streams is measured by three basic factors:

  • The number of passes the algorithm must make over the stream.
  • The available memory.
  • The running time of the algorithm.

These algorithms have many similarities with online algorithms since they both require decisions to be made before all data are available, but they are not identical. Data stream algorithms only have limited memory available but they may be able to defer action until a group of points arrive, while online algorithms are required to take action as soon as each point arrives.

If the algorithm is an approximation algorithm then the accuracy of the answer is another key factor. The accuracy is often stated as an approximation meaning that the algorithm achieves an error of less than with probability .

Applications

Streaming algorithms have several applications in networking such as monitoring network links for elephant flows, counting the number of distinct flows, estimating the distribution of flow sizes, and so on.[8] They also have applications in databases, such as estimating the size of a join [citation needed].

Some streaming problems

Frequency moments

The kth frequency moment of a set of frequencies is defined as .

The first moment is simply the sum of the frequencies (i.e., the total count). The second moment is useful for computing statistical properties of the data, such as the Gini coefficient of variation. is defined as the frequency of the most frequent items.

The seminal paper of Alon, Matias, and Szegedy dealt with the problem of estimating the frequency moments.[citation needed]

Calculating frequency moments

A direct approach to find the frequency moments requires to maintain a register mi for all distinct elements ai ∈ (1,2,3,4,...,N) which requires at least memory of order .[3] But we have space limitations and require an algorithm that computes in much lower memory. This can be achieved by using approximations instead of exact values. An algorithm that computes an (ε,δ)approximation of Fk, where F'k is the (ε,δ)- approximated value of Fk.[9] Where ε is the approximation parameter and δ is the confidence parameter.[10]

Calculating F0 (distinct elements in a data stream)
FM-Sketch algorithm

Flajolet et al. in [2] introduced probabilistic method of counting which was inspired from a paper by Robert Morris.[11] Morris in his paper says that if the requirement of accuracy is dropped, a counter n can be replaced by a counter log n which can be stored in log log n bits.[12] Flajolet et al. in [2] improved this method by using a hash function h which is assumed to uniformly distribute the element in the hash space (a binary string of length L).

Let bit(y,k) represent the kth bit in binary representation of y

Let represents the position of least significant 1-bit in the binary representation of yi with a suitable convention for .

Let A be the sequence of data stream of length M whose cardinality need to be determined. Let BITMAP [0...L − 1] be the

hash space where the ρ(hashedvalues) are recorded. The below algorithm then determines approximate cardinality of A.

Procedure FM-Sketch:

    for i in 0 to L − 1 do
        BITMAP[i] := 0 
    end for
    for x in A: do
        Index := ρ(hash(x))
        if BITMAP[index] = 0 then
            BITMAP[index] := 1
        end if
    end for
    B := Position of left most 0 bit of BITMAP[] 
    return 2 ^ B

If there are N distinct elements in a data stream.

  • For then BITMAP[i] is certainly 0
  • For then BITMAP[i] is certainly 1
  • For then BITMAP[i] is a fringes of 0's and 1's
K-minimum value algorithm

The previous algorithm describes the first attempt to approximate F0 in the data stream by Flajolet and Martin. Their algorithm picks a random hash function which they assume to uniformly distribute the hash values in hash space.

Bar-Yossef et al. in [10] introduced k-minimum value algorithm for determining number of distinct elements in data stream. They used a similar hash function h which can be normalized to [0,1] as . But they fixed a limit t to number of values in hash space. The value of t is assumed of the order (i.e. less approximation-value ε requires more t). KMV algorithm keeps only t-smallest hash values in the hash space. After all the m values of stream have arrived, is used to calculate. That is, in a close-to uniform hash space, they expect at-least t elements to be less than .

Procedure 2 K-Minimum Value

Initialize first t values of KMV 
for a in a1 to an do
    if h(a) < Max(KMV) then
        Remove Max(KMV) from KMV set
        Insert h(a) to KMV 
    end if
end for 
return t/Max(KMV)
Complexity analysis of KMV

KMV algorithm can be implemented in memory bits space. Each hash value requires space of order memory bits. There are hash values of the order . The access time can be reduced if we store the t hash values in a binary tree. Thus the time complexity will be reduced to .

Calculating Fk

Alon et al. estimates Fk by defining random variables that can be computed within given space and time.[3] The expected value of random variables gives the approximate value of Fk.

Assume length of sequence m is known in advance. Then construct a random variable X as follows:

  • Select ap be a random member of sequence A with index at p,
  • Let , represents the number of occurrences of l within the members of the sequence A following ap.
  • Random variable .

Assume S1 be of the order and S2 be of the order . Algorithm takes S2 random variable and outputs the median . Where Yi is the average of Xij where 1 ≤ jS1.

Now calculate expectation of random variable E(X).

Complexity of Fk

From the algorithm to calculate Fk discussed above, we can see that each random variable X stores value of ap and r. So, to compute X we need to maintain only log(n) bits for storing ap and log(n) bits for storing r. Total number of random variable X will be the .

Hence the total space complexity the algorithm takes is of the order of

Simpler approach to calculate F2

The previous algorithm calculates in order of memory bits. Alon et al. in [3] simplified this algorithm using four-wise independent random variable with values mapped to .

This further reduces the complexity to calculate to

Frequent elements

In the data stream model, the frequent elements problem is to output a set of elements that constitute more than some fixed fraction of the stream. A special case is the majority problem, which is to determine whether or not any value constitutes a majority of the stream.

More formally, fix some positive constant c > 1, let the length of the stream be m, and let fi denote the frequency of value i in the stream. The frequent elements problem is to output the set { i | fi > m/c }.[13]

Some notable algorithms are:

Event detection

Detecting events in data streams is often done using a heavy hitters algorithm as listed above: the most frequent items and their frequency are determined using one of these algorithms, then the largest increase over the previous time point is reported as trend. This approach can be refined by using exponentially weighted moving averages and variance for normalization.[14]

Counting distinct elements

Counting the number of distinct elements in a stream (sometimes called the F0 moment) is another problem that has been well studied. The first algorithm for it was proposed by Flajolet and Martin. In 2010, Daniel Kane, Jelani Nelson and David Woodruff found an asymptotically optimal algorithm for this problem.[15] It uses O(ε2 + log d) space, with O(1) worst-case update and reporting times, as well as universal hash functions and a r-wise independent hash family where r = Ω(log(1/ε) / log log(1/ε)).

Entropy

The (empirical) entropy of a set of frequencies is defined as , where .

Online learning

Learn a model (e.g. a classifier) by a single pass over a training set.


Lower bounds

Lower bounds have been computed for many of the data streaming problems that have been studied. By far, the most common technique for computing these lower bounds has been using communication complexity.[citation needed]

See also

Notes

  1. ^ Munro, J. Ian; Paterson, Mike (1978). "Selection and Sorting with Limited Storage". 19th Annual Symposium on Foundations of Computer Science, Ann Arbor, Michigan, USA, 16–18 October 1978. IEEE Computer Society. pp. 253–258. doi:10.1109/SFCS.1978.32.
  2. ^ a b c Flajolet & Martin (1985)
  3. ^ a b c d Alon, Matias & Szegedy (1996)
  4. ^ Feigenbaum, Joan; Sampath, Kannan (2005). "On graph problems in a semi-streaming model". Theoretical Computer Science. 348 (2): 207–216. doi:10.1016/j.tcs.2005.09.013.
  5. ^ Babcock, Brian; Babu, Shivnath; Datar, Mayur; Motwani, Rajeev; Widom, Jennifer (2002). "Models and issues in data stream systems". Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. PODS '02. New York, NY, USA: ACM. pp. 1–16. CiteSeerX 10.1.1.138.190. doi:10.1145/543613.543615. ISBN 978-1581135077. S2CID 2071130.
  6. ^ Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002-09-13). "Counting Distinct Elements in a Data Stream". Randomization and Approximation Techniques in Computer Science. Lecture Notes in Computer Science. Vol. 2483. Springer, Berlin, Heidelberg. pp. 1–10. CiteSeerX 10.1.1.12.6276. doi:10.1007/3-540-45726-7_1. ISBN 978-3540457268. S2CID 4684185.
  7. ^ Gilbert et al. (2001)
  8. ^ Xu (2007)
  9. ^ Indyk, Piotr; Woodruff, David (2005-01-01). "Optimal approximations of the frequency moments of data streams". Proceedings of the thirty-seventh annual ACM symposium on Theory of computing. STOC '05. New York, NY, USA: ACM. pp. 202–208. doi:10.1145/1060590.1060621. ISBN 978-1-58113-960-0. S2CID 7911758.
  10. ^ a b Bar-Yossef, Ziv; Jayram, T. S.; Kumar, Ravi; Sivakumar, D.; Trevisan, Luca (2002-09-13). Rolim, José D. P.; Vadhan, Salil (eds.). Counting Distinct Elements in a Data Stream. Lecture Notes in Computer Science. Springer Berlin Heidelberg. pp. 1–10. CiteSeerX 10.1.1.12.6276. doi:10.1007/3-540-45726-7_1. ISBN 978-3-540-44147-2. S2CID 4684185.
  11. ^ Morris (1978)
  12. ^ Flajolet, Philippe (1985-03-01). "Approximate counting: A detailed analysis". BIT Numerical Mathematics. 25 (1): 113–134. CiteSeerX 10.1.1.64.5320. doi:10.1007/BF01934993. ISSN 0006-3835. S2CID 2809103.
  13. ^ Cormode, Graham (2014). "Misra-Gries Summaries". In Kao, Ming-Yang (ed.). Encyclopedia of Algorithms. Springer US. pp. 1–5. doi:10.1007/978-3-642-27848-8_572-1. ISBN 9783642278488.
  14. ^ Schubert, E.; Weiler, M.; Kriegel, H. P. (2014). SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '14. pp. 871–880. doi:10.1145/2623330.2623740. ISBN 9781450329569.
  15. ^ Kane, Nelson & Woodruff (2010)

References

Read other articles:

Miconia macrothyrsa Klasifikasi ilmiah Kerajaan: Plantae Divisi: Tracheophyta Kelas: Magnoliopsida Ordo: Myrtales Famili: Melastomataceae Genus: Miconia Spesies: Miconia macrothyrsa Nama binomial Miconia macrothyrsaBenth. Miconia macrothyrsa adalah spesies tumbuhan yang tergolong ke dalam famili Melastomataceae. Spesies ini juga merupakan bagian dari ordo Myrtales. Spesies Miconia macrothyrsa sendiri merupakan bagian dari genus Miconia.[1] Nama ilmiah dari spesies ini pertama kali di...

 

 

Asghar FarhadiFarhadi di Festival Film Cannes 2013Lahir7 Mei 1972 (umur 51)[1]Homāyūnshahr, Provinsi Isfahan, IranKebangsaanIranAlmamaterUniversitas Tarbiat ModaresUniversitas TehranPekerjaanSutradara film, penulis latar, produser filmTahun aktif1997–sekarangKarya terkenalA SeparationAbout EllySuami/istriParisa BakhtavarAnakSarinaSaaghar Asghar Farhadi (Persia: اصغر فرهادیcode: fa is deprecated , pengucapan Persia: [æsɢæɾ ɛ fæɾhɑdiː]; kelahiran 7...

 

 

1994 box set by AerosmithBox of FireBox set by AerosmithReleasedNovember 22, 1994Recorded1972-1994GenreHard rock, blues rockLength493:57LabelColumbiaProducerAerosmith, Adrian Barber, Tony Bongiovi, Ray Colcord, Don DeVito, Jack Douglas, Bob Ezrin, David Krebs, Steve Leber, Gary Lyons, George Marino, Paul O'NeillAerosmith compilation chronology Big Ones(1994) Box of Fire(1994) Classic Aerosmith: The Universal Masters Collection(2000) Professional ratingsReview scoresSourceRatingAllmusi...

James IIRaja SkotlandiaBerkuasa21 Februari 1437 – 3 Agustus 1460Penobatan25 Maret 1437PendahuluJames IPenerusJames IIIInformasi pribadiKelahiran(1430-10-16)16 Oktober 1430Holyrood AbbeyKematian3 Agustus 1460(1460-08-03) (umur 29)Istana RoxburghPemakamanHolyrood AbbeyWangsaStewartAyahJames IIbuJoan BeaufortPasanganMary dari GueldersAnakselengkapnya...James IIIAlexander, Adipati Albany ke-1AgamaKatolik Roma James II (Bahasa Skotlandia Abad Pertengahan: Iames Stewart; 16 Oktober 1430 ...

 

 

This article includes a list of references, related reading, or external links, but its sources remain unclear because it lacks inline citations. Please help improve this article by introducing more precise citations. (January 2017) (Learn how and when to remove this template message) In mathematics, specifically in geometric topology, surgery theory is a collection of techniques used to produce one finite-dimensional manifold from another in a 'controlled' way, introduced by John Milnor&...

 

 

American fast food restaurant chain This article is about the Texas-based hamburger chain restaurant. For the Virginia-based hamburger chain restaurant, see What-A-Burger. WhataburgerCompany typePrivateIndustryRestaurantGenreFast foodFoundedAugust 8, 1950; 73 years ago (1950-08-08) in Corpus Christi, Texas, U.S.FounderHarmon Dobson, Paul Burton[1]HeadquartersSan Antonio, Texas, U.S.Number of locations1000 (2024)Area servedTexas, Florida, Georgia, Missouri, Kansas, So...

Comic book series This article is about the Dark Horse comic book series. For the DC Comics character, see Mask (DC Comics). This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: The Mask comics – news · newspapers · books · scholar · JSTOR (March 2024) (Learn how and when to remove this message) Comics chara...

 

 

Football league seasonLadbrokes ChampionshipSeason2017–18ChampionsSt MirrenPromotedSt MirrenLivingstonRelegatedDumbartonBrechin CityMatches played180Goals scored472 (2.62 per match)Top goalscorerStephen Dobbie(18 goals)[1]Biggest home winFalkirk 6–1 Dundee United[2](6 January 2018)St Mirren 5–0 Dumbarton[2](27 March 2018)Biggest away winBrechin City 0–5 Dundee United[2](17 April 2018)Highest scoringDunfermline Athletic 2–5 Queen of the South[...

 

 

2016年美國總統選舉 ← 2012 2016年11月8日 2020 → 538個選舉人團席位獲勝需270票民意調查投票率55.7%[1][2] ▲ 0.8 %   获提名人 唐納·川普 希拉莉·克林頓 政党 共和黨 民主党 家鄉州 紐約州 紐約州 竞选搭档 迈克·彭斯 蒂姆·凱恩 选举人票 304[3][4][註 1] 227[5] 胜出州/省 30 + 緬-2 20 + DC 民選得票 62,984,828[6] 65,853,514[6]...

Lithuanian-French linguist (1917–1992) This article includes a list of general references, but it lacks sufficient corresponding inline citations. Please help to improve this article by introducing more precise citations. (November 2014) (Learn how and when to remove this message) Algirdas Julien GreimasBornAlgirdas Julius Greimas(1917-03-09)9 March 1917Tula, Russian EmpireDied27 February 1992(1992-02-27) (aged 74)Paris, FranceCitizenshipLithuania, FranceAlma materVytautas Magnus ...

 

 

Financial penalty This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages) This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: Fine penalty – news · newspapers · books · scholar · JSTOR (November 2011) (Learn ho...

 

 

Mgr.Andreas Peter Cornelius SolM.S.C.Uskup Emeritus AmboinaGerejaGereja Katolik RomaKeuskupanAmboinaPenunjukan15 Januari 1965(49 tahun, 88 hari)Masa jabatan berakhir10 Juni 1994(78 tahun, 238 hari)PendahuluJacobus Grent, M.S.C.PenerusPetrus Canisius Mandagi, M.S.C.ImamatTahbisan imam10 Agustus 1940[1](24 tahun, 296 hari)Tahbisan uskup25 Februari 1964(48 tahun, 129 hari)oleh Jacobus Grent, M.S.C.Informasi pribadiNama lahirAndreas Peter Corne...

Second tier division of NASCAR NASCAR Xfinity SeriesCategoryStock carsCountryUnited StatesInaugural season1982ManufacturersChevrolet · Ford · ToyotaEngine suppliersChevrolet · Ford · ToyotaTire suppliersGoodyearDrivers' championCole CusterMakes' championChevroletTeams' championStewart Haas RacingOfficial websiteNASCAR Xfinity Series Current season The NASCAR Xfinity Series (NXS) is a stock car racing series organized by NASCAR. It is pro...

 

 

日本の政治家青木 精一あおき せいいち  青木精一生年月日 (1883-04-19) 1883年4月19日出生地 群馬県南勢多郡板橋村(現・群馬県桐生市)没年月日 (1945-04-14) 1945年4月14日(61歳没)出身校 正教神学校前職 ジャーナリスト所属政党 立憲政友会昭和会立憲政友会革新同盟テンプレートを表示 青木 精一(あおき せいいち、1883年(明治16年)4月19日[1] - 1945年(昭和20年)4�...

 

 

French top-level subdivision applied to certain overseas entities For the Euro-constituency, see Overseas Territories of France (European Parliament constituency). This article is part of a series on theAdministrativedivisions of France Administrative divisions Regions Departments Arrondissements Cantons Intercommunality Métropole Communauté urbaine Communauté d'agglomération Communauté de communes Communes Associated communes Municipal arrondissements Overseas France Overseas department...

Village in Nottinghamshire, England This article is about the village. For the district, see Borough of Gedling. For the constituency, see Gedling (UK Parliament constituency). This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: Gedling – news · newspapers · books · scholar · JSTOR (April 2014) (Learn how and wh...

 

 

Chilean politician and diplomat Rafael Errázuriz UrmenetaBorn(1861-08-10)August 10, 1861Santiago, ChileDiedDecember 26, 1923(1923-12-26) (aged 62)Rome, Italy Rafael Valentín Errázuriz Urmeneta (August 10, 1861 - December 26, 1923) was a Chilean politician and diplomat.[1][2] Rafael Valentín Errázuriz was born in Santiago, the son of Maximiano Errázuriz Valdivieso and of Amalia Urmeneta Quiroga. He completed his secondary studies at the Instituto Nacional, and later ...

 

 

VII CorpsVII Corps badgeActiveJuly 22, 1862 – August 1, 1863January 6, 1864 – August 1, 1865TypeArmy CorpsSizeCorpsPart ofDepartment of ArkansasEngagementsAmerican Civil WarInsignia1st Division2nd Division3rd DivisionMilitary unit United States Army Corps, 1861-1865 Previous Next VI Corps (Union Army) VIII Corps (Union Army) Two corps of the Union Army were called VII Corps during the American Civil War.Union Army 1st Division Badge, VII CorpsFlag of the VII Corps VII Corps (Departme...

River in Manitoba, Canada Whitemouth RiverLocationCountryCanadaProvinceManitobaPhysical characteristicsSourceWhitewater Lake • coordinates49°16′21″N 96°37′10″W / 49.27250°N 96.61944°W / 49.27250; -96.61944 • elevation348 m (1,142 ft) MouthWinnipeg River • locationSeven Sisters Falls, Manitoba • coordinates50°07′18″N 96°02′05″W / 50.12167°N 96.03472°W&#...

 

 

Vous lisez un « bon article » labellisé en 2018. Pour un article plus général, voir Intelligence. Femmes passant un test d'intelligence générale (Allemagne, 1931). L'intelligence humaine est caractérisée par plusieurs aptitudes, surtout cognitives, qui permettent à l'individu humain d'apprendre, de former des concepts, de comprendre, d'appliquer la logique et la raison. Elle comprend la capacité à reconnaître des tendances, comprendre les idées, planifier, résoudre d...