
data mining





Thanks for the info on "neuroshell".  I'm still confused as to whether
neuroshell starts at the "end" of the Knowledge Discovery process, like most
indicator-based programs, or in the middle.

I'm still looking at the following model:

data, -->
target data, -->
pre-processed data, -->
transformed data, -->
if-then patterns, -->
interpretation and evaluation

= knowledge
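
To make those stages concrete, here is a rough Python sketch of that flow;
every function name is a made-up placeholder, not a reference to any
particular package.

def select_target(data, columns):
    """Target data: keep only the variables relevant to the question."""
    return [{c: row[c] for c in columns} for row in data]

def preprocess(rows):
    """Pre-processed data: drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows):
    """Transformed data: derive features, e.g., a close/open ratio."""
    return [dict(r, ratio=r["close"] / r["open"]) for r in rows]

def mine_patterns(rows):
    """If-then patterns: extract crude rules from the transformed data."""
    up = [r for r in rows if r["ratio"] > 1.0]
    return ["IF ratio > 1.0 THEN up-day (%d of %d cases)" % (len(up), len(rows))]

def evaluate(patterns):
    """Interpretation and evaluation: only here do patterns become knowledge."""
    for p in patterns:
        print(p)

raw = [{"open": 10.0, "close": 10.5, "volume": 900},
       {"open": 10.5, "close": 10.2, "volume": None}]
selected = select_target(raw, ["open", "close", "volume"])
evaluate(mine_patterns(transform(preprocess(selected))))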


I downloaded the free working program below and dug out the following data
mining information.

Best regards

Walter

==================

http://www.statsoft.com/download.html

DEMO 1: The Interactive STATISTICA Application with the Electronic Manual:
This program is a complete application (and not a self-running demo) with
many features of the Basic Statistics module of STATISTICA, and it allows
you to run your own analyses and produce customized graphs. This program
includes a comprehensive Electronic Manual and many example data files on
which you can perform various analyses (using a large selection of provided
data sets) and create different kinds of graphs. Download the Interactive
STATISTICA Application (approx. 30 minutes at 28.8 Kbps). An abbreviated
version of this demo which does not contain the Electronic Manual is also
available. Download abbreviated version (approx. 15 minutes at 28.8 Kbps).

============

From the "Electronic Manual"

StatSoft defines data mining as an analytic process designed to explore
large amounts of (typically business or market related) data in search of
consistent patterns and/or systematic relationships between variables, and
then to validate the findings by applying the detected patterns to new
subsets of data.  The process thus consists of three basic stages:
exploration, model building or pattern definition, and
validation/verification.  Ideally, if the nature of the available data
allows, the process is repeated iteratively until a "robust" model is
identified.
However, in business practice the options to validate the model at the stage
of analysis are typically limited and, thus, the initial results often have
the status of heuristics that could influence the decision process (e.g.,
"The data appear to indicate that the probability of trying sleeping pills
increases with age faster in females than in males.").
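
As a rough illustration of those three stages (not StatSoft's code, just a
sketch assuming numpy is installed): explore and fit a simple model on one
subset of the data, then validate it by applying the fitted pattern to a
held-out subset.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x plus noise.
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, 200)

# Exploration and model building on the first subset.
slope, intercept = np.polyfit(x[:100], y[:100], 1)

# Validation/verification: apply the detected pattern to a new subset.
pred = slope * x[100:] + intercept
rmse = np.sqrt(np.mean((pred - y[100:]) ** 2))
print("fitted y = %.2f * x + %.2f, hold-out RMSE = %.2f" % (slope, intercept, rmse))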

The concept of Data Mining is becoming increasingly popular as a business
information management tool where it is expected to reveal knowledge
structures that can guide decisions in conditions of limited certainty.
Recently, there has been increased interest in developing new analytic
techniques specifically designed to address the issues relevant to business
data mining (e.g., Classification Trees).  But, Data Mining is still based
on the conceptual principles of traditional Exploratory Data Analysis (EDA)
and modeling and it shares with them both general approaches and specific
techniques.

However, an important general difference in the focus and purpose between
Data Mining and the traditional Exploratory Data Analysis (EDA) is that Data
Mining is more oriented towards applications than the basic nature of the
underlying phenomena.  In other words, Data Mining is relatively less
concerned with identifying the specific relations between the involved
variables.  For example, uncovering the nature of the underlying functions
or the specific types of interactive, multivariate dependencies between
variables is not the main goal of Data Mining.  Instead, the focus is on
producing a solution that can generate useful predictions.  Therefore, Data
Mining accepts, among others, a "black box" approach to data exploration or
knowledge discovery and uses not only the traditional Exploratory Data
Analysis (EDA) techniques, but also such techniques as Neural Networks which
can generate valid predictions but are not capable of identifying the
specific nature of the interrelations between the variables on which the
predictions are based.
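
For example, such a "black box" learner can be fit purely to predict, while
its internal structure never states the underlying relationship; a minimal
sketch, assuming scikit-learn is installed (a random forest is used here
simply as one convenient black box, nothing specific to STATISTICA):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Two predictors with an interaction the analyst never specifies explicitly.
X = rng.uniform(-1, 1, size=(300, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.05, 300)

model = RandomForestRegressor(n_estimators=200, random_state=1).fit(X[:200], y[:200])

# The model predicts well on unseen rows, but its hundreds of trees do not
# state the x1*x2 dependency in any interpretable form.
print("R^2 on held-out rows:", model.score(X[200:], y[200:]))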

Data Mining is often considered to be "a blend of statistics, AI [artificial
intelligence], and data base research" (Pregibon, 1997, p. 8), which until
very recently was not commonly recognized as a field of interest for
statisticians, and was even considered by some "a dirty word in Statistics"
(Pregibon, 1997, p. 8).  Due to its applied importance, however, the field
emerges as a rapidly growing and major area (also in statistics) where
important theoretical advances are being made (see, for example, the recent
annual International Conferences on Knowledge Discovery and Data Mining,
co-hosted in 1997 by the American Statistical Association).

For information on Data Mining techniques, see Exploratory Data Analysis
(EDA) and Data Mining Techniques; see also Neural Networks; for a
comprehensive overview and discussion of Data Mining, see Fayyad,
Piatetsky-Shapiro, Smyth, and Uthurusamy (1996).  Representative selections
of articles on Data Mining can be found in Proceedings from the American
Association of Artificial Intelligence Workshops on Knowledge Discovery in
Databases published by AAAI Press (e.g., Piatetsky-Shapiro, 1993; Fayyad &
Uthurusamy, 1994).

Data mining is often treated as the natural extension of the data
warehousing concept.

================

Note:  Exploratory Data Analysis (EDA) is closely related to the concept of
Data Mining; for more information see Data Mining.

EDA vs. Hypothesis Testing.  As opposed to traditional hypothesis testing
designed to verify a priori hypotheses about relations between variables
(e.g., "There is a positive correlation between the AGE of a person and
his/her RISK TAKING disposition"), exploratory data analysis (EDA) is used
to identify systematic relations between variables when there are no (or
incomplete) a priori expectations as to the nature of those relations.  In a
typical exploratory data analysis process, many variables are taken into
account and compared, using a variety of techniques in the search for
systematic patterns.
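
The contrast can be shown in a few lines; a small sketch assuming numpy and
scipy are available (the AGE/RISK variables echo the example above, and the
data are synthetic):

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)

# Synthetic variables; only AGE and RISK are actually related.
n = 150
age = rng.uniform(20, 70, n)
risk = 0.8 - 0.01 * age + rng.normal(0.0, 0.1, n)
data = {"AGE": age, "RISK": risk,
        "V1": rng.normal(size=n), "V2": rng.normal(size=n), "V3": rng.normal(size=n)}

# Hypothesis testing: one a priori question, one test.
r, p = pearsonr(data["AGE"], data["RISK"])
print("a priori test AGE vs RISK: r = %.2f, p = %.4f" % (r, p))

# EDA: scan every pair of variables for systematic relations.
names = list(data)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r, _ = pearsonr(data[names[i]], data[names[j]])
        if abs(r) > 0.3:                      # arbitrary screening threshold
            print("candidate pattern: %s vs %s, r = %.2f" % (names[i], names[j], r))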

Computational EDA techniques.  Computational exploratory data analysis
methods include both simple basic statistics and more advanced, designated
multivariate exploratory techniques designed to identify patterns in
multivariate data sets.

Basic statistical exploratory methods.  The basic statistical exploratory
methods include such techniques as examining distributions of variables
(e.g., to identify highly skewed or non-normal, such as bi-modal, patterns),
reviewing large correlation matrices for coefficients that meet certain
thresholds, or examining multi-way frequency tables (e.g., "slice by slice"
systematically reviewing combinations of levels of control variables).
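
A few of these basic checks in code form; a sketch assuming pandas is
installed, with made-up column names:

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "returns": rng.normal(0.0, 1.0, 500),
    "volume": rng.lognormal(0.0, 1.0, 500),        # deliberately skewed
    "sector": rng.choice(["tech", "energy"], 500),
    "updown": rng.choice(["up", "down"], 500),
})

# Examine distributions: large skewness flags non-normal variables.
print(df[["returns", "volume"]].skew())

# Review a correlation matrix for coefficients that meet a threshold.
corr = df[["returns", "volume"]].corr()
print(corr[corr.abs() > 0.3])

# Multi-way frequency table ("slice by slice" over control variables).
print(pd.crosstab(df["sector"], df["updown"]))
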
Multivariate exploratory techniques.  Multivariate exploratory techniques
designed specifically to identify patterns in multivariate (or univariate,
such as sequences of measurements) data sets include: Cluster Analysis,
Factor Analysis, Discriminant Function Analysis, Multidimensional Scaling,
Log-linear Analysis, Canonical Correlation, Stepwise Linear and Nonlinear
(e.g., Logit) Regression, Correspondence Analysis, Time Series Analysis, and
Classification Trees.
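
As one example from that list, a cluster analysis takes only a few lines; a
sketch assuming scikit-learn, with synthetic data standing in for a real
data set:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Three synthetic groups in two dimensions.
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0.0, 2.0, 4.0)])

labels = KMeans(n_clusters=3, n_init=10, random_state=4).fit_predict(X)
print("cluster sizes:", np.bincount(labels))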

Neural Networks.  Analytic techniques modeled after the (hypothesized)
processes of learning in the cognitive system and the neurological functions
of the brain and capable of predicting new observations (on specific
variables) from other observations (on the same or other variables) after
executing a process of so-called learning from existing data.  For more
information, see Neural Networks; see also STATISTICA Neural Networks.
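
A toy version of that "learn from existing data, then predict new
observations" idea, sketched with scikit-learn's multilayer perceptron
rather than STATISTICA Neural Networks:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)

# Existing data: the network learns y = sin(x) from examples.
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X).ravel()

net = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=3000,
                   random_state=5).fit(X, y)

# Predicting new observations the network has never seen.
new_x = np.array([[0.5], [1.5], [2.5]])
print(net.predict(new_x))
print(np.sin(new_x).ravel())   # the values the predictions should approximate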

Graphical (data visualization) EDA techniques.  A large selection of
powerful exploratory data analytic techniques is also offered by graphical
data visualization methods that can identify relations, trends, and biases
"hidden" in unstructured data sets.

Brushing.  Perhaps the most common and historically first widely used
technique explicitly identified as graphical exploratory data analysis is
brushing, an interactive method allowing one to select on-screen specific
data points or subsets of data and identify their (e.g., common)
characteristics, or to examine their effects on relations between relevant
variables.  (For an illustration, see Brushing Mode - Animation.)  Those
relations between variables can be visualized by fitted functions (e.g., 2D
lines or 3D surfaces) and their confidence intervals; thus, for example, one
can examine changes in those functions by interactively (temporarily)
removing or adding specific subsets of data.  STATISTICA offers a
particularly comprehensive implementation of brushing techniques, including
interactive animated brushing, analytic brushing by selecting attributes of
specific data points, and others.
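
Outside STATISTICA, the basic idea behind brushing can be approximated in a
script: pick a subset of points (a condition stands in for the on-screen
selection) and watch the fitted line change; a rough sketch assuming numpy
and matplotlib are installed:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 120)
y = 1.5 * x + rng.normal(0.0, 2.0, 120)

# "Brush" a subset of points.
brushed = x > 6

fit_all = np.polyfit(x, y, 1)
fit_rest = np.polyfit(x[~brushed], y[~brushed], 1)   # brushed points removed

xs = np.linspace(0, 10, 50)
plt.scatter(x[~brushed], y[~brushed], label="kept")
plt.scatter(x[brushed], y[brushed], label="brushed")
plt.plot(xs, np.polyval(fit_all, xs), label="fit: all points")
plt.plot(xs, np.polyval(fit_rest, xs), label="fit: brushed removed")
plt.legend()
plt.show()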

Other graphical EDA techniques.  Other graphical exploratory analytic
techniques include function fitting and plotting, data smoothing, overlaying
and merging of multiple displays, categorizing data, splitting/merging
subsets of data in graphs, aggregating data in graphs, identifying and
marking subsets of data that meet specific conditions, icon plots, shading,
plotting confidence intervals and confidence areas (ellipses), generating
tessellations, spectral planes, integrated layered compressions, and
projected contours, data image reduction techniques, interactive (and
continuous) rotation with animated stratification (cross-sections) of 3D
displays, and selective highlighting of specific series and blocks of data.
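
One of the simpler items on that list, data smoothing, also reduces to a few
lines; a sketch using a plain moving average (assuming numpy and matplotlib
are installed):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

# A noisy series with a 10-point moving-average smooth overlaid on it.
t = np.arange(200)
series = np.sin(t / 15.0) + rng.normal(0.0, 0.3, 200)
window = 10
smooth = np.convolve(series, np.ones(window) / window, mode="valid")

plt.plot(t, series, alpha=0.4, label="raw")
plt.plot(t[window - 1:], smooth, label="10-point moving average")
plt.legend()
plt.show()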

Verification of results of EDA.  The exploration of data can only serve as
the first stage of data analysis and its results can be treated as tentative
at best as long as they are not confirmed, e.g., cross-validated, using a
different data set (or an independent subset).  If the result of the
exploratory stage suggests a particular model, then its validity can be
verified by applying it to a new data set and testing its fit (e.g., testing
its predictive validity).  Case selection conditions can be used to quickly
define subsets of data (e.g., for estimation and verification), and for
testing the robustness of results.
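
In code, that verification step amounts to fitting on one subset and testing
predictive validity on another; a sketch assuming scikit-learn is installed
(cross_val_score is simply one convenient way to run the cross-validation
mentioned above):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)

X = rng.uniform(0, 1, size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.1, 200)

# Case selection condition: split into estimation and verification subsets.
estimate = np.arange(200) < 150
model = LinearRegression().fit(X[estimate], y[estimate])
print("R^2 on verification subset:", model.score(X[~estimate], y[~estimate]))

# Or cross-validate: every case serves in both roles across the folds.
print("5-fold R^2 scores:", cross_val_score(LinearRegression(), X, y, cv=5))
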
See also Data Mining and Neural Networks.