Data mining classification is one step in the process of data mining. It is used to group items based on certain key characteristics. There are several techniques used for data mining classification, including nearest neighbor classification, decision tree learning, and support vector machines.
Data mining is a method researchers use to extract patterns from data. Generally a representative sample is chosen from the pool of data and then manipulated and analyzed to find patterns. In addition to data mining classification, researchers may also use clustering, regression, and rule learning to analyze the data.
There are several algorithms that can be used in data mining classification. Nearest neighbor classification is one of the simplest of the data mining classification algorithms. It relies on a training set. A training set is a set of data used to train the computer into paying attention to certain variables. In nearest neighbor classification, the computer simply classifies all data as part of the group that contains data closest in value to the input.
Decision tree learning uses a branching model to classify the data. The computer basically asks a series of questions about the data. If the answer to the first question is true, it asks question 2a. If the answer is false, it asks question 2b. When drawn out, this method forms a tree of branching paths.
Naive Bayes classification relies on probability. It asks a series of questions about each piece of data and then uses the answers to determine the probability that the data belong in a particular classification. This is different from decision tree learning because the answer to the first question does not influence which question will be asked next.
More complicated methods of data mining classification include neural networks and support vector machines. These methods are computer-based models that would be difficult to do by hand. Neural networks is often used in artificial intelligence programming because it mimics the human brain. It filters information through a series of nodes that find patterns and then classify the information.
Support vector machines use training samples to build a model that will classify information, usually visualized as a scatter plot with a wide space between categories. When new information is fed into the machine, it is plotted on the graph. The data are then classified based on which category the information falls closest to on the graph. This method works only when there are two options to choose from.