Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996.[1]
It is a non-parametric, density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors), and marks as outliers points that lie alone in low-density regions (those whose nearest neighbors are too far away).
DBSCAN is one of the most commonly used and cited clustering algorithms.[2]
In 2014, the algorithm was awarded the Test of Time Award (an award given to algorithms which have received substantial attention in theory and practice) at the leading data mining conference, ACM SIGKDD.[3] As of July 2020, the follow-up paper "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN"[4] appears in the list of the 8 most downloaded articles of the prestigious ACM Transactions on Database Systems (TODS) journal.[5]
Another follow-up, HDBSCAN*, was initially published by Ricardo J. G. Campello, Davoud Moulavi, and Jörg Sander in 2013,[6] then expanded upon with Arthur Zimek in 2015.[7] It revises some of the original decisions, such as the handling of border points, and produces a hierarchical instead of a flat result.
History
In 1972, Robert F. Ling published a closely related algorithm in "The Theory and Construction of k-Clusters"[8] in The Computer Journal with an estimated runtime complexity of O(n³).[8] DBSCAN has a worst-case of O(n²), and the database-oriented range-query formulation of DBSCAN allows for index acceleration. The algorithms slightly differ in their handling of border points.
Preliminary
Consider a set of points in some space to be clustered. Let ε be a parameter specifying the radius of a neighborhood with respect to some point. For the purpose of DBSCAN clustering, the points are classified as core points, (directly-) reachable points and outliers, as follows:
A point p is a core point if at least minPts points are within distance ε of it (including p).
A point q is directly reachable from p if point q is within distance ε from core point p. Points are only said to be directly reachable from core points.
A point q is reachable from p if there is a path p_1, ..., p_n with p_1 = p and p_n = q, where each p_{i+1} is directly reachable from p_i. Note that this implies that the initial point and all points on the path must be core points, with the possible exception of q.
All points not reachable from any other point are outliers or noise points.
Now if p is a core point, then it forms a cluster together with all points (core or non-core) that are reachable from it. Each cluster contains at least one core point; non-core points can be part of a cluster, but they form its "edge", since they cannot be used to reach more points.
Reachability is not a symmetric relation: by definition, only core points can reach non-core points. The opposite is not true, so a non-core point may be reachable, but nothing can be reached from it. Therefore, a further notion of connectedness is needed to formally define the extent of the clusters found by DBSCAN. Two points p and q are density-connected if there is a point o such that both p and q are reachable from o. Density-connectedness is symmetric.
A cluster then satisfies two properties:
All points within the cluster are mutually density-connected.
If a point is density-reachable from some point of the cluster, it is part of the cluster as well.
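The point classes defined above can be illustrated with a minimal Python sketch. The function name `classify_points`, the use of Euclidean distance, and the sample coordinates are assumptions made for this illustration only. A non-core point is labeled "border" when at least one core point lies within ε of it, which is exactly the condition for being reachable.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (hypothetical helper)."""
    # eps-neighborhood of every point, including the point itself
    neighborhoods = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(nbrs) >= min_pts for nbrs in neighborhoods]

    labels = []
    for i, nbrs in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in nbrs):
            # directly reachable from a core point, but not itself core
            labels.append("border")
        else:
            labels.append("noise")
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (5.0, 5.0)]
labels = classify_points(points, eps=1.0, min_pts=3)
# labels == ['core', 'core', 'core', 'core', 'border', 'noise']
```

Here (2, 1) has only itself and (1, 1) within ε, so it is not a core point, but it is directly reachable from the core point (1, 1); the isolated point (5, 5) is noise.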
Algorithm
Original query-based algorithm
DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense region[a] (minPts). It starts with an arbitrary starting point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-neighborhood of a different point and hence be made part of a cluster.
If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster. Hence, all points that are found within the ε-neighborhood are added, as is their own ε-neighborhood when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.
DBSCAN can be used with any distance function[1][4] (as well as similarity functions or other predicates).[9] The distance function (dist) can therefore be seen as an additional parameter.
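As an illustration of a pluggable distance function, the sketch below implements the great-circle (haversine) distance for geographic coordinates. The name `great_circle_km`, the (latitude, longitude) tuple convention in degrees, and the mean Earth radius of 6371 km are assumptions of this example, not part of DBSCAN itself.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def great_circle_km(a, b):
    """Haversine distance in kilometers between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))
```

With such a function, ε is expressed in kilometers rather than in coordinate units, which shows how closely the choice of distance function and the choice of ε are tied together.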
The algorithm can be expressed in pseudocode as follows:[4]
DBSCAN(DB, distFunc, eps, minPts) {
    C := 0                                                  /* Cluster counter */
    for each point P in database DB {
        if label(P) ≠ undefined then continue               /* Previously processed in inner loop */
        Neighbors N := RangeQuery(DB, distFunc, P, eps)     /* Find neighbors */
        if |N| < minPts then {                              /* Density check */
            label(P) := Noise                               /* Label as Noise */
            continue
        }
        C := C + 1                                          /* next cluster label */
        label(P) := C                                       /* Label initial point */
        SeedSet S := N \ {P}                                /* Neighbors to expand */
        for each point Q in S {                             /* Process every seed point Q */
            if label(Q) = Noise then label(Q) := C          /* Change Noise to border point */
            if label(Q) ≠ undefined then continue           /* Previously processed (e.g., border point) */
            label(Q) := C                                   /* Label neighbor */
            Neighbors N := RangeQuery(DB, distFunc, Q, eps) /* Find neighbors */
            if |N| ≥ minPts then {                          /* Density check (if Q is a core point) */
                S := S ∪ N                                  /* Add new neighbors to seed set */
            }
        }
    }
}
where RangeQuery can be implemented using a database index for better performance, or using a slow linear scan:
RangeQuery(DB, distFunc, Q, eps) {
    Neighbors N := empty list
    for each point P in database DB {        /* Scan all points in the database */
        if distFunc(Q, P) ≤ eps then {       /* Compute distance and check epsilon */
            N := N ∪ {P}                     /* Add to result */
        }
    }
    return N
}
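The pseudocode above can be transcribed into runnable Python roughly as follows. This is a sketch rather than a reference implementation: points are addressed by their index in a list, `None` stands in for the "undefined" label, -1 for noise, and the seed set is kept as a list plus a `queued` set so that it can grow while it is being processed. These representation choices are assumptions of the example.

```python
from math import dist

NOISE = -1  # label for noise; clusters are numbered 1, 2, ...


def range_query(db, dist_func, q, eps):
    """Linear scan: indices of all points within eps of db[q], including q."""
    return [p for p in range(len(db)) if dist_func(db[q], db[p]) <= eps]


def dbscan(db, dist_func, eps, min_pts):
    labels = [None] * len(db)              # None means "undefined"
    c = 0                                  # cluster counter
    for p in range(len(db)):
        if labels[p] is not None:          # previously processed in inner loop
            continue
        neighbors = range_query(db, dist_func, p, eps)
        if len(neighbors) < min_pts:       # density check
            labels[p] = NOISE
            continue
        c += 1                             # next cluster label
        labels[p] = c
        seeds = [q for q in neighbors if q != p]
        queued = set(neighbors)            # everything already in the seed set, plus p
        i = 0
        while i < len(seeds):              # seeds may grow during the loop
            q = seeds[i]
            i += 1
            if labels[q] == NOISE:         # change noise to border point
                labels[q] = c
            if labels[q] is not None:      # previously processed
                continue
            labels[q] = c
            q_neighbors = range_query(db, dist_func, q, eps)
            if len(q_neighbors) >= min_pts:    # q is a core point
                for r in q_neighbors:
                    if r not in queued:
                        queued.add(r)
                        seeds.append(r)
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (9.0, 9.0)]
labels = dbscan(points, dist, eps=1.0, min_pts=3)
# labels == [1, 1, 1, 1, 1, 2, 2, 2, -1]
```

In this sample, the second cluster has a single core point, (5, 5), with two border points, and (9, 9) remains noise.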
Abstract algorithm
The DBSCAN algorithm can be abstracted into the following steps:[4]
Find the points in the ε (eps) neighborhood of every point, and identify the core points, i.e. those with at least minPts points in their neighborhood (counting the point itself).
Find the connected components of core points on the neighbor graph, ignoring all non-core points.
Assign each non-core point to a nearby cluster if the cluster is an ε (eps) neighbor, otherwise assign it to noise.
A naive implementation of this requires storing the neighborhoods in step 1, thus requiring substantial memory. The original DBSCAN algorithm does not require this by performing these steps for one point at a time.
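The three abstract steps can be sketched in Python as below. This is the naive variant that stores every neighborhood; the union-find structure used for the connected components, the function name `dbscan_abstract`, the Euclidean distance, and the rule of assigning a border point to its first core neighbor are all assumptions of this illustration.

```python
from math import dist


def dbscan_abstract(points, eps, min_pts):
    n = len(points)
    # Step 1: eps-neighborhoods (each includes the point itself) and core points.
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [len(nbrs[i]) >= min_pts for i in range(n)]

    # Step 2: connected components of core points via union-find,
    # ignoring all non-core points.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        if core[i]:
            for j in nbrs[i]:
                if core[j]:
                    parent[find(i)] = find(j)

    # Number the components 1, 2, ... in order of first appearance.
    cluster_of_root = {}
    labels = [-1] * n                      # -1 marks noise
    for i in range(n):
        if core[i]:
            root = find(i)
            labels[i] = cluster_of_root.setdefault(root, len(cluster_of_root) + 1)

    # Step 3: attach each non-core point to the cluster of a core neighbor, if any.
    for i in range(n):
        if not core[i]:
            for j in nbrs[i]:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (9.0, 9.0)]
labels = dbscan_abstract(points, eps=1.0, min_pts=3)
# labels == [1, 1, 1, 1, 1, 2, 2, 2, -1]
```

The `nbrs` list is what makes this variant memory-hungry: it holds all neighborhoods at once, whereas the query-based formulation only ever holds the neighborhood of the point currently being expanded.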
Optimization Criterion
DBSCAN optimizes the following loss function:[10]
For any possible clustering $C = \{C_1, \ldots, C_l\}$ out of the set of all clusterings $\mathcal{C}$, it minimizes the number of clusters under the condition that every pair of points in a cluster is density-reachable, which corresponds to the original two properties "maximality" and "connectivity" of a cluster:[1]
$$\min_{C \subset \mathcal{C},\; d_{db}(p,q) \leq \varepsilon \;\forall p,q \in C_i \;\forall C_i \in C} |C|$$
where $d_{db}(p,q)$ gives the smallest $\varepsilon' \leq \varepsilon$ such that two points p and q are density-connected.
Complexity
DBSCAN visits each point of the database, possibly multiple times (e.g., as candidates to different clusters). For practical considerations, however, the time complexity is mostly governed by the number of RangeQuery invocations. DBSCAN executes exactly one such query for each point, and if an indexing structure is used that executes a neighborhood query in O(log n), an overall average runtime complexity of O(n log n) is obtained (if parameter ε is chosen in a meaningful way, i.e. such that on average only O(log n) points are returned). Without the use of an accelerating index structure, or on degenerate data (e.g. all points within a distance less than ε), the worst-case runtime complexity remains O(n²). The (n² − n)/2-sized upper triangle of the distance matrix can be materialized to avoid distance recomputations, but this needs O(n²) memory, whereas a non-matrix-based implementation of DBSCAN only needs O(n) memory.
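To illustrate how an index avoids the linear scan, the sketch below uses a uniform grid with cell width ε over 2-D points, so that a range query only inspects the 3×3 block of cells around the query point. The grid is a deliberately simple stand-in chosen for this example; it is not the tree index the complexity figures above refer to, and the class name `GridIndex` is hypothetical. Floating-point rounding exactly at cell boundaries is not handled specially here.

```python
from collections import defaultdict
from math import dist, floor


class GridIndex:
    """Uniform grid with cell width eps over 2-D points (illustrative only)."""

    def __init__(self, points, eps):
        self.points = points
        self.eps = eps
        self.cells = defaultdict(list)     # cell coordinates -> point indices
        for i, (x, y) in enumerate(points):
            self.cells[self._cell(x, y)].append(i)

    def _cell(self, x, y):
        return (floor(x / self.eps), floor(y / self.eps))

    def range_query(self, q):
        """Indices of all points within eps of points[q], including q."""
        cx, cy = self._cell(*self.points[q])
        result = []
        # Points within eps can only lie in the same or an adjacent cell.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for i in self.cells.get((cx + dx, cy + dy), ()):
                    if dist(self.points[q], self.points[i]) <= self.eps:
                        result.append(i)
        return sorted(result)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (5.0, 5.0)]
index = GridIndex(points, eps=1.0)
neighbors_of_3 = index.range_query(3)
# neighbors_of_3 == [1, 2, 3, 4]
```

On degenerate data where all points fall into a few cells, each query still touches most of the data set, which matches the O(n²) worst case described above.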
Advantages
DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means.
DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the minPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.
DBSCAN has a notion of noise, and is robust to outliers.
DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.)
DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an R* tree.
The parameters minPts and ε can be set by a domain expert, if the data is well understood.
Disadvantages
DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data are processed. For most data sets and domains, this situation does not arise often and has little impact on the clustering result:[4] both on core points and noise points, DBSCAN is deterministic. DBSCAN*[6][7] is a variation that treats border points as noise, and this way achieves a fully deterministic result as well as a more consistent statistical interpretation of density-connected components.
The quality of DBSCAN depends on the distance measure used in the function RangeQuery. The most common distance metric used is Euclidean distance. Especially for high-dimensional data, this metric can be rendered almost useless due to the so-called "curse of dimensionality", making it difficult to find an appropriate value for ε. This effect, however, is also present in any other algorithm based on Euclidean distance.
DBSCAN cannot cluster data sets well with large differences in densities, since the minPts-ε combination cannot then be chosen appropriately for all clusters.[11]
If the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.
See the section below on extensions for algorithmic modifications to handle these issues.
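The order dependence of border points can be reproduced with a small experiment. The compact 1-D variant of the algorithm below, the name `dbscan_1d`, and the sample values are assumptions made for this illustration; the values are chosen so that 3.0 is not a core point but lies within ε of two core points, 2.0 and 4.0, that belong to different clusters.

```python
def dbscan_1d(xs, eps, min_pts):
    """Compact DBSCAN on distinct 1-D values, processed in the given order.

    Returns {value: label}; -1 marks noise, clusters are numbered from 1.
    """
    def nbrs(x):
        return [y for y in xs if abs(x - y) <= eps]

    labels = {}                            # a missing key means "undefined"
    c = 0
    for p in xs:
        if p in labels:
            continue
        n = nbrs(p)
        if len(n) < min_pts:
            labels[p] = -1
            continue
        c += 1
        labels[p] = c
        seeds = [q for q in n if q != p]
        for q in seeds:                    # the list may grow while iterating
            if labels.get(q) == -1:
                labels[q] = c              # noise becomes a border point
            if q in labels:
                continue
            labels[q] = c
            qn = nbrs(q)
            if len(qn) >= min_pts:         # q is a core point
                for r in qn:
                    if r not in seeds:
                        seeds.append(r)
    return labels


xs = [1.0, 1.5, 2.0, 3.0, 4.0, 4.5, 5.0]
forward = dbscan_1d(xs, eps=1.0, min_pts=4)
backward = dbscan_1d(xs[::-1], eps=1.0, min_pts=4)

# Processed left to right, 3.0 joins the cluster of the core point 2.0;
# processed right to left, it joins the cluster of the core point 4.0.
joins_left = forward[3.0] == forward[2.0]      # True
joins_right = backward[3.0] == backward[4.0]   # True
```

In both runs the grouping of all other points is the same; only the border point 3.0 changes sides, and the cluster numbers themselves depend on which cluster is discovered first.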
Parameter estimation
Every data mining task involves choosing parameters, and every parameter influences the algorithm in specific ways. DBSCAN needs the parameters ε and minPts, which must be specified by the user. Ideally, the value of ε is given by the problem to solve (e.g. a physical distance), and minPts is then the desired minimum cluster size.[a]
MinPts: As a rule of thumb, a minimum minPts can be derived from the number of dimensions D in the data set, as minPts ≥ D + 1. The low value of minPts = 1 does not make sense, as then every point is a core point by definition. With minPts ≤ 2, the result will be the same as that of hierarchical clustering with the single-link metric, with the dendrogram cut at height ε. Therefore, minPts should be chosen to be at least 3. However, larger values are usually better for data sets with noise and will yield more significant clusters. A common starting value is minPts = 2·D,[9] but it may be necessary to choose larger values for very large data, for noisy data or for data that contains many duplicates.[4]
ε: The value for ε can then be chosen by using a k-distance graph, plotting the distance to the k = minPts-1 nearest neighbor ordered from the largest to the smallest value.[4] Good values of ε are where this plot shows an "elbow":[1][9][4] if ε is chosen much too small, a large part of the data will not be clustered; whereas for a too high value of ε, clusters will merge and the majority of objects will be in the same cluster. In general, small values of ε are preferable,[4] and as a rule of thumb only a small fraction of points should be within this distance of each other. Alternatively, an OPTICS plot can be used to choose ε,[4] but then the OPTICS algorithm itself can be used to cluster the data.
Distance function: The choice of distance function is tightly coupled to the choice of ε, and has a major impact on the results. In general, it will be necessary to first identify a reasonable measure of similarity for the data set, before the parameter ε can be chosen. There is no estimation for this parameter, but the distance function needs to be chosen appropriately for the data set. For example, on geographic data, the great-circle distance is often a good choice.
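The k-distance graph described above can be computed with a few lines of Python. This is a brute-force sketch that assumes Euclidean distance; the function name `k_distances` and the sample points are illustrative. The returned values are the y-coordinates of the plot, ordered from largest to smallest, and locating the "elbow" is left to visual inspection.

```python
from math import dist


def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, largest first."""
    result = []
    for i, p in enumerate(points):
        # distances to all other points, nearest first
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(others[k - 1])
    return sorted(result, reverse=True)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (5.0, 5.0)]
min_pts = 3
curve = k_distances(points, k=min_pts - 1)
# curve is approximately [5.66, 1.41, 1.0, 1.0, 1.0, 1.0]
```

In this tiny sample the curve drops sharply after the first value, which belongs to the isolated point (5, 5); a value of ε slightly above the flat part of the curve would keep the dense points together and leave that point as noise.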
OPTICS can be seen as a generalization of DBSCAN that replaces the ε parameter with a maximum value that mostly affects performance. MinPts then essentially becomes the minimum cluster size to find. While the algorithm is much easier to parameterize than DBSCAN, the results are a bit more difficult to use, as it will usually produce a hierarchical clustering instead of the simple data partitioning that DBSCAN produces.
One of the original authors of DBSCAN, Jörg Sander, later revisited DBSCAN and OPTICS and co-published a refined version of hierarchical DBSCAN (HDBSCAN*),[6][7] which no longer has the notion of border points. Instead, only the core points form the cluster.
Relationship to spectral clustering
A spectral implementation of DBSCAN is related to spectral clustering in the trivial case of determining connected graph components, that is, the optimal clusters with no edges cut.[12] However, it can be computationally intensive, up to O(n³). Additionally, one has to choose the number of eigenvectors to compute. For performance reasons, the original DBSCAN algorithm remains preferable to its spectral implementation.
Extensions
Generalized DBSCAN (GDBSCAN)[9][13] is a generalization by the same authors to arbitrary "neighborhood" and "dense" predicates. The ε and minPts parameters are removed from the original algorithm and moved to the predicates. For example, on polygon data, the "neighborhood" could be any intersecting polygon, whereas the density predicate uses the polygon areas instead of just the object count.
Various extensions to the DBSCAN algorithm have been proposed, including methods for parallelization, parameter estimation, and support for uncertain data. The basic idea has been extended to hierarchical clustering by the OPTICS algorithm. DBSCAN is also used as part of subspace clustering algorithms like PreDeCon and SUBCLU. HDBSCAN*[6][7] is a hierarchical version of DBSCAN which is also faster than OPTICS, and from whose hierarchy a flat partition consisting of the most prominent clusters can be extracted.[14]
Availability
Different implementations of the same algorithm were found to exhibit enormous performance differences, with the fastest on a test data set finishing in 1.4 seconds, the slowest taking 13803 seconds.[15] The differences can be attributed to implementation quality, language and compiler differences, and the use of indexes for acceleration.
Apache CommonsMath contains a Java implementation of the algorithm running in quadratic time.
ELKI offers an implementation of DBSCAN as well as GDBSCAN and other variants. This implementation can use various index structures for sub-quadratic runtime and supports arbitrary distance functions and arbitrary data types, but it may be outperformed by low-level optimized (and specialized) implementations on small data sets.
MATLAB includes an implementation of DBSCAN in its "Statistics and Machine Learning Toolbox" since release R2019a.
mlpack includes an implementation of DBSCAN accelerated with dual-tree range search techniques.
PostGIS includes ST_ClusterDBSCAN – a 2D implementation of DBSCAN that uses an R-tree index. Any geometry type is supported, e.g. Point, LineString, Polygon, etc.
R contains implementations of DBSCAN in the packages dbscan and fpc. Both packages support arbitrary distance functions via distance matrices. The package fpc does not have index support (and thus has quadratic runtime and memory complexity) and is rather slow due to the R interpreter. The package dbscan provides a fast C++ implementation using k-d trees (for Euclidean distance only) and also includes implementations of DBSCAN*, HDBSCAN*, OPTICS, OPTICSXi, and other related methods.
See also
k-means clustering – Vector quantization algorithm minimizing the sum of squared deviations
Notes
[a] While minPts intuitively is the minimum cluster size, in some cases DBSCAN can produce smaller clusters.[4] A DBSCAN cluster consists of at least one core point.[4] As other points may be border points to more than one cluster, there is no guarantee that at least minPts points are included in every cluster.
"Microsoft Academic Search: Papers". Archived from the original on April 21, 2010. Retrieved 2010-04-18. Most cited data mining articles according to Microsoft academic search; DBSCAN is on rank 24.
Campello, Ricardo J. G. B.; Moulavi, Davoud; Zimek, Arthur; Sander, Jörg (2015). "Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection". ACM Transactions on Knowledge Discovery from Data. 10 (1): 1–51. doi:10.1145/2733381. ISSN 1556-4681. S2CID 2887636.
Sander, Jörg (1998). Generalized Density-Based Clustering for Spatial Data Mining. München: Herbert Utz Verlag. ISBN 3-89675-469-6.
Campello, R. J. G. B.; Moulavi, D.; Zimek, A.; Sander, J. (2013). "A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies". Data Mining and Knowledge Discovery. 27 (3): 344. doi:10.1007/s10618-013-0311-4. S2CID 8144686.