This shows you the differences between two versions of the page.
| Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
| en:iot-reloaded:dbscan [2024/09/25 15:35] – agrisnik | en:iot-reloaded:dbscan [2024/12/10 22:44] (current) – pczekalski | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ===== DBSCAN ===== | ||
| + | |||
| + | DBSCAN (Density-Based Spatial Clustering of Applications with Noise) employs density measures to mark points in high-density regions and those in low-density regions – the noise. Because of this natural behaviour of the algorithm, it is particularly useful in signal processing and similar application domains. | ||
| + | ((Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Xu, Xiaowei (1996). Simoudis, Evangelos; Han, Jiawei; Fayyad, Usama M. (eds.). A density-based algorithm for discovering clusters in large spatial databases with noise (PDF). Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. CiteSeerX 10.1.1.121.9220. ISBN 1-57735-004-9.)). | ||
| + | |||
| + | One of the essential concepts is the point' | ||
| + | |||
| + | {{ : | ||
| + | |||
| + | where: | ||
| + | * **p** – the point of interest. | ||
| + | * **N(p)** – neighbourhood of the point p. | ||
| + | * **q** – any other point. | ||
| + | * **distance(p, | ||
| + | * **eps** – epsilon – user-defined distance constant. | ||
| + | * **D** – the initial set of points available for the algorithm. | ||
| + | |||
| + | The algorithm treats different points differently depending on density and neighbouring points distribution around the point – its neighbourhood: | ||
| + | |||
| + | - **Core Points:** | ||
| + | * A point is a core point if it has at least MinPts neighbours within a distance eps, where MinPts and eps are user-defined parameters, i.e. N(p) ≥ MinPts. | ||
| + | - **Directly Density-Reachable points:** | ||
| + | * A point is directly density-reachable from a core point if it lies within the distance eps of the core point. | ||
| + | - **Border Points:** | ||
| + | * A border point is not a core point but within the eps distance of a core point. Border points are part of a cluster but do not have enough neighbours to be core points. | ||
| + | - **Noise points:** | ||
| + | * Points that are not core and are not reachable from any core point are considered noise or outliers. | ||
| + | |||
| + | <figure DBSCANconcepts> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | * **The main steps in DBSCAN:** | ||
| + | * Selects a point from the data set – it might be the first in the data set or any randomly selected. | ||
| + | * If it is a core point, form a cluster by grouping it with all directly density-reachable points. | ||
| + | * Move to the next unvisited point and return to step 1. | ||
| + | * Border points are added to the nearest cluster, and points not reachable from any core point are marked as noise. | ||
| + | |||
| + | |||
| + | * **Advantages: | ||
| + | * Can detect clusters of arbitrary shape. | ||
| + | * Naturally identifies outliers or noise. | ||
| + | * Unlike K-means, DBSCAN does not require specifying the number of clusters upfront. | ||
| + | |||
| + | * **Disadvantages: | ||
| + | * Results depend heavily on the choice of eps and MinPts. | ||
| + | * It struggles with clusters of varying densities since eps is fixed. | ||
| + | |||
| + | DBSCAN is excellent for discovering clusters in data with noise, especially when clusters are not circular or spherical. | ||
| + | |||
| + | Some application examples (figures {{ref> | ||
| + | |||
| + | <figure DBSCANexample1> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | <figure DBSCANexample2> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | A typical application in signal processing (figure {{ref> | ||
| + | |||
| + | <figure DBSCANexample3> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | ==== Selecting eps and MinPts values ==== | ||
| + | |||
| + | Usually, MinPts is selected using some prior knowledge of the data and its internal structure. If it is done, the following steps might be applied: | ||
| + | * Calculate the average distance between every point and its k-nearest neighbours, where k = MinPts. | ||
| + | * The average distances are sorted and depicted on a chart, where x – is the index of the sorted average distance, y – is the distance value. | ||
| + | * The optimal eps value is when y increases rapidly, as shown in the following picture (figure {{ref> | ||
| + | |||
| + | <figure Selecting_MinPts> | ||
| + | {{ : | ||
| + | < | ||
| + | </ | ||
| + | |||
| + | The red horizontal line shows a possible eps value, around 0, | ||
| + | |||
| + | |||