# BISC: Abstract version of the Soccer Problem

Subject: BISC: Abstract version of the Soccer Problem
From: Michelle T. Lin (michlin@eecs.berkeley.edu)
Date: Tue Aug 22 2000 - 16:37:03 MET DST

*********************************************************************
Berkeley Initiative in Soft Computing (BISC)
*********************************************************************

To: The BISC Group
Subject: Abstract version of the Soccer Problem

In a message to the BISC Group dated July 18, 2000, I posed a
problem which I characterized as a challenge to data miners -- a
problem labeled The Soccer Problem, or SP for short.

In an abstract formulation, it becomes clearer that the soccer
problem is an instance of a class of problems which involve what may
be called exploratory hypothesis testing (EH testing). As described
in the following, the abstract problem is a simplified version of the
soccer problem, but it preserves its main features. Here is the
problem.

Assume that the data consist of a collection, C, of N
sequences of length L of symbols drawn from a finite alphabet, A, of
size K. Each sequence is tagged with 0 or 1. What we have, then, is a
function or a relation, R, from C to {0,1}.

The question is: What is R? On the face of it, this appears to
be a standard problem in pattern recognition, neurocomputing, and
machine learning. The difficulty is that the number of given
sequences, N, is much too small in relation to L and K, making
standard techniques inapplicable.
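To get a rough feel for the scale involved, one can compare N with the
number of distinct sequences of length L over A, which is K^L. The
values of K, L and N in the sketch below are made up for illustration;
the message itself gives no numbers.

```python
# Illustrative values only; the message does not specify K, L or N.
K = 4      # size of the alphabet A
L = 20     # length of each sequence
N = 200    # number of tagged sequences in C

possible_sequences = K ** L          # distinct sequences of length L over A
coverage = N / possible_sequences    # fraction of the sequence space observed

print(f"possible sequences: {possible_sequences}")
print(f"fraction observed:  {coverage:.3e}")
```

Even with these modest assumed values, the observed sequences cover a
vanishing fraction of the space, which is the sense in which N is too
small in relation to L and K.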

To deal with the insufficiency of data, we formulate a testable
hypothesis, H, and proceed to test it. For convenience, the approach
will be referred to as exploratory hypothesis testing, or EH testing
for short.

The hypothesis is the following. Consider a subset of A, A*,
and let r(s) be the relative count of symbols in a sequence, s, which
belong to A*, that is, the number of such symbols divided by L. For
example, if A={a,b,c,d}, A*={a,b} and s=baacbdac, then r(s)=5/8. In
this way, each sequence, s, in C is associated with an ordered pair
(r(s),v(s)), where v(s), the tag of s, is 0 or 1. The hypothesis is:

Given the data (r(s),v(s)), s in C, which depend on the choice
of A*, there exists an A* such that for most sequences, s, the larger
the value of r(s), the higher the probability (relative frequency)
that v(s)=1. Thus, if no such A* exists, the hypothesis is wrong.
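One way to make the test concrete is sketched below. The sketch
computes r(s) for a crisp A*, scores each candidate A* by how well
r(s) separates sequences tagged 1 from those tagged 0, and enumerates
all non-empty proper subsets of A. The concordance score (fraction of
tagged-1/tagged-0 pairs ordered correctly by r) is an assumption chosen
for illustration, not a criterion given in the message; the function
names and the toy data are likewise hypothetical.

```python
from itertools import combinations

def relative_count(s, a_star):
    """r(s): fraction of the symbols in s that belong to A*."""
    return sum(ch in a_star for ch in s) / len(s)

def concordance(data, a_star):
    """Fraction of (tagged-1, tagged-0) pairs in which the tagged-1
    sequence has the larger r(s); ties count as one half.
    This scoring rule is an assumption, not part of the message."""
    ones = [relative_count(s, a_star) for s, v in data if v == 1]
    zeros = [relative_count(s, a_star) for s, v in data if v == 0]
    if not ones or not zeros:
        return 0.5  # no pairs to compare: no evidence either way
    score = sum(
        1.0 if r1 > r0 else 0.5 if r1 == r0 else 0.0
        for r1 in ones
        for r0 in zeros
    )
    return score / (len(ones) * len(zeros))

def best_subset(data, alphabet):
    """Try every non-empty proper subset of the alphabet as A* and
    return one with the highest concordance."""
    symbols = sorted(alphabet)
    candidates = (
        frozenset(c)
        for k in range(1, len(symbols))
        for c in combinations(symbols, k)
    )
    return max(candidates, key=lambda a: concordance(data, a))

# The example from the text: A* = {a, b}, s = baacbdac, r(s) = 5/8.
r_example = relative_count("baacbdac", {"a", "b"})

# Toy data, invented for illustration: (sequence, tag) pairs.
data = [("aabbabab", 1), ("abacabad", 1), ("cdcdacdc", 0), ("ddccbdcd", 0)]
a_star = best_subset(data, "abcd")
print(sorted(a_star), concordance(data, a_star))
```

Because 2^K - 2 subsets are tried, a high score on a small collection
can arise by chance, so a score of this kind is better read as a
prompt for further examination than as confirmation of H.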

In summary, we have reduced the original data-mining problem
to that of exploratory hypothesis testing. It should be noted that a
significant difference between the soccer problem and its abstract
version, as formulated in the foregoing discussion, is that in the
soccer problem the set A* is fuzzy rather than crisp.
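The message does not say how r(s) would be computed when A* is fuzzy.
One plausible reading, assumed here, is to give each symbol a grade of
membership in A* between 0 and 1 and to replace the count by the sum
of the grades (a sigma-count). The membership values below are
hypothetical.

```python
def fuzzy_relative_count(s, membership):
    """Sum of the membership grades of the symbols in s, divided by
    the length of s. Symbols absent from `membership` have grade 0."""
    return sum(membership.get(ch, 0.0) for ch in s) / len(s)

s = "baacbdac"

# A crisp A* = {a, b} is the special case with grades 0 and 1 only.
crisp = {"a": 1.0, "b": 1.0}
# A hypothetical fuzzy A*: b and c belong only to a degree.
fuzzy = {"a": 1.0, "b": 0.5, "c": 0.25}

r_crisp = fuzzy_relative_count(s, crisp)
r_fuzzy = fuzzy_relative_count(s, fuzzy)
print(r_crisp, r_fuzzy)
```

With crisp grades the function reduces to the relative count used in
the abstract version, so the crisp case is recovered as a special case.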

August 21 2000

Remark: Note that in the soccer problem and its abstract version the
hypothesis is fuzzy. What this suggests is that in most data-mining
problems the hypothesis must be fuzzy in order to be realistic.

me(zadeh@cs.berkeley.edu) with cc to Michael Berthold
(berthold@cs.berkeley.edu)
