Current data collection techniques often produce massive data sets in which entities are described by a large number of features (properties, attributes). In many cases the number of features reaches the thousands. Typically, the features can be classified as relevant, redundant, irrelevant, or noisy. To understand such a complex system, it is necessary to identify the informative features and their internal dependencies, and to detect patterns and anomalies. The aim is to describe the system using as few abstract dependencies as possible at a suitably chosen level of inaccuracy. Traditional statistical techniques are of very limited use for processing data sets that contain a large number of binary and ordinal features, as is common, e.g., in the social sciences and humanities.
In this contribution, we propose a methodology based on information theory and apply it to data sets with predominantly binary and ordinal features. The process makes it possible to identify key relationships among binarized features and to discover patterns and hierarchies even when many data items are missing or very noisy. While the majority of published methods focus on identifying relevant features, our proposed technique benefits from the robust properties of redundant features. The methods can be used not only for multidimensional data but also for the detection of communities in complex networks. Although direct calculation by definition exhibits cubic complexity, sparse structures can be processed in near-linear time.
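As an illustration of the kind of information-theoretic relationship between binarized features referred to above, the following minimal sketch computes the mutual information of two binary features while skipping items with missing values. It is an assumption made for illustration only: the function name `mutual_information`, the use of `None` to mark missing items, and the choice of mutual information as the measure are ours and are not necessarily what the proposed methodology uses.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete features.

    Illustrative sketch only. Items where either value is missing
    (represented here by None, an assumption) are skipped.
    """
    # Keep only the items observed in both features.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return 0.0

    joint = Counter(pairs)               # joint counts of (a, b)
    px = Counter(a for a, _ in pairs)    # marginal counts of x
    py = Counter(b for _, b in pairs)    # marginal counts of y

    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with counts
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi


# Two features that agree on every jointly observed item.
redundant = mutual_information([1, 0, 1, 0, None, 1], [1, 0, 1, 0, 1, None])
# Two features that are independent on the observed items.
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

In this toy case the two redundant features share one full bit of information despite the missing items, while the independent pair shares none, which hints at why redundant features can remain informative when individual values are missing.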
The methodology will be demonstrated on data from ancient Egypt, specifically on a data set describing the anthropological features of selected people living in the Old Kingdom (2700–2180 BC, i.e., the time of the pyramid builders): royal family members, high-ranking dignitaries, and middle- and low-ranking officials. The proposed methodology makes it possible, for example, to evaluate the influence of an individual's social status, connected with a supposed specific habitual activity, on the manifestation of degenerative changes in joints (arthrosis) or entheseal changes. Relationships can be examined in both supervised and unsupervised (i.e., clustering) modes. Performance aspects will be briefly demonstrated on the processing of Czech News Agency (ČTK) reports, with the aim of identifying news topics across millions of unique words and millions of ČTK news items written in Czech.