ECML/PKDD 2014 workshop on

Statistically Sound Data Mining


11:00 Introduction to the workshop (after the coffee break)
11:10 Charles Elkan (University of California): Massive, sparse, efficient multilabel learning (Invited talk)
12:10 Florian Lemmerich and Frank Puppe: A Critical View on Automatic Significance-Filtering in Pattern Mining
12:35-14:00 Lunch break
14:00 Yuyi Wang, Jan Ramon and Christos Pelekis: Analysis of network-structured data using U-statistics with kernels of degree larger than one
14:25 Gitte Vanwinckelen and Hendrik Blockeel: Look before you leap: Some insights into learner evaluation with cross-validation
14:50 Jun Sese, Aika Terada, Yuki Saito and Koji Tsuda: Statistically significant subgraphs for genome-wide association study
15:15 Introduction to group work (PBL method, groups and topics)
15:30 Coffee break
16:00 Working in groups
17:00 Presenting initial results
17:30 End of workshop

Motivation and objectives

Although data mining has its roots in statistics, for a long time data miners and statisticians walked their own paths. Data miners concentrated on developing efficient algorithms that addressed the practical issues associated with huge data sets, but in doing so may sometimes have paid less attention to the reliability of patterns or even their utility. Statisticians, on the other hand, continued along their traditional line, offering well-founded and sound methods for validating statistically meaningful patterns, but they could not offer computational means to find them. Fortunately, the situation is now changing, and both data miners and statisticians are recognizing the need for cooperation.

The main impetus for this new trend is coming from a third party, the application fields. In the computerized world, it is easy to collect large data sets, but analyzing them is more difficult. Knowing the traditional statistical tests is no longer sufficient for scientists, because one must first find the most promising hidden patterns and models to be tested. This means that there is an urgent need for efficient data mining algorithms that are able to find the desired patterns without missing any significant discoveries or producing too many spurious ones. A related problem is to find a statistically justified compromise between underfitted patterns (too generic to catch all important aspects) and overfitted patterns (too specific, holding just due to chance). However, before any algorithms can be designed, one should first solve many fundamental problems, such as how to define the statistical significance of the desired patterns, how to evaluate overfitting, how to interpret the p-values when multiple patterns are tested, and so on. In addition, one should evaluate the existing data mining methods, alternative algorithms and goodness measures to see which of them produce statistically valid results.

As we can see, there are many important problems that should be worked on together by people from data mining, machine learning, and statistics, as well as from the application fields. The goal of this workshop is to offer a meeting point for this discussion. We want to bring together people from different backgrounds and schools of science, both theoretically and practically oriented, to specify problems, share solutions and brainstorm new ideas.

To encourage real workshopping of actual problems, the workshop is arranged in a novel way, containing an invited lecture and inspiring group work in addition to traditional presentations. This means that non-author participants can also contribute to the workshop results and submit a paper to the final proceedings afterwards. If you have relevant problems that you would like to have worked on together in the workshop, please send them before the workshop.

Topics of Interest

Topics of interest include but are not limited to:

We particularly encourage submissions that compare different schools of statistics, such as frequentist (Neyman-Pearsonian or Fisherian) vs. Bayesian, or analytic vs. empirical significance testing. Equally interesting are submissions introducing generic school-independent computational methods. You can also submit papers describing work in progress.


Workshop Chairs

Programme Committee

In addition to the workshop organizers:

Important Dates

Paper submission deadline: Sun, July 20, 2014 (extended; was Fri, June 20)
Paper acceptance notification: Sun, August 17, 2014 (was Fri, July 11)
Paper camera-ready deadline: Sun, August 31, 2014 (was Fri, July 25)
Problem submission: Mon, September 8, 2014 (preferably earlier)
Workshop date: Monday, September 15, 2014

Paper Submission

The papers can be either regular papers (recommended maximum length 12 pages in the LNCS format) or short papers (6 pages). These page limits are somewhat flexible.

All papers will be peer-reviewed by 2-3 reviewers. The accepted papers will be presented at the workshop and included in the workshop proceedings. The proceedings will be published in the JMLR: Workshop and Conference Proceedings series after the conference.

Submit your paper as a PDF via the EasyChair SSDM'14 submission page.

If you have good problem ideas for the group work, you can send them directly to Wilhelmiina by email.