A classification model in machine learning based on centroids
In machine learning, a nearest centroid classifier or nearest prototype classifier is a classification model that assigns to each observation the label of the class whose training samples have the mean (centroid) closest to that observation. When applied to text classification using word vectors containing tf*idf weights to represent documents, the nearest centroid classifier is known as the Rocchio classifier because of its similarity to the Rocchio algorithm for relevance feedback.[1]
An extended version of the nearest centroid classifier has found applications in the medical domain, specifically in the classification of tumors.[2]
Algorithm
Training
Given labeled training samples $\{(\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n)\}$ with class labels $y_i \in \mathbf{Y}$, compute the per-class centroids $\vec{\mu}_\ell = \frac{1}{|C_\ell|} \sum_{i \in C_\ell} \vec{x}_i$, where $C_\ell$ is the set of indices of samples belonging to class $\ell \in \mathbf{Y}$.
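The training step can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `fit_centroids`, the list-of-lists data layout, and the toy data are assumptions made for the example.

```python
from collections import defaultdict


def fit_centroids(samples, labels):
    """Return {label: centroid}, where each centroid is the mean of that class's vectors.

    Hypothetical helper for illustration; assumes all vectors have equal length.
    """
    sums = {}                   # per-class running sum of feature vectors
    counts = defaultdict(int)   # per-class sample count, i.e. |C_l|
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for j, value in enumerate(x):
            sums[y][j] += value
        counts[y] += 1
    # Divide each class sum by the class size to obtain the mean vector.
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


# Toy data (made up for the example): two classes in two dimensions.
X = [[1.0, 1.0], [3.0, 1.0], [8.0, 9.0], [10.0, 7.0]]
y = ["a", "a", "b", "b"]
centroids = fit_centroids(X, y)
print(centroids)  # {'a': [2.0, 1.0], 'b': [9.0, 8.0]}
```

Training amounts to a single pass over the data, and the fitted model stores only one vector per class.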
Prediction
The class assigned to an observation $\vec{x}$ is $\hat{y} = \arg\min_{\ell \in \mathbf{Y}} \|\vec{\mu}_\ell - \vec{x}\|$.
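The prediction rule can be sketched as follows. The norm is taken here to be Euclidean, which is an assumption of this example since the rule itself does not fix a particular norm. The function name `predict` and the centroid values are likewise hypothetical.

```python
import math


def predict(centroids, x):
    """Return the label whose centroid is nearest to x (Euclidean distance assumed).

    Ties are broken in favour of the label that appears first in the dictionary.
    """
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


# Hypothetical centroids for two classes, as a training step might produce.
centroids = {"a": [2.0, 1.0], "b": [9.0, 8.0]}

print(predict(centroids, [3.0, 2.0]))  # a
print(predict(centroids, [8.0, 8.0]))  # b
```

Each prediction costs one distance computation per class, independent of the number of training samples.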
See also
References