Joint Fudan-HKBU Workshop on Data Science

May 4-7, 2016

Fudan University



Weiguo Gao (School of Mathematical Sciences)

Wei Lin (School of Mathematical Sciences)

Shuai Lu (School of Mathematical Sciences)

Xiaoyang Sean Wang (School of Computer Science)

Zongmin Wu (School of Mathematical Sciences)

Zhiguo Xiao (Department of Statistics)

Weihong Yang (School of Mathematical Sciences)

Shuqin Zhang (School of Mathematical Sciences)


Lizhi Liao (Department of Mathematics)

Hongyu Liu (Department of Mathematics)

Michael Ng (Department of Mathematics)

Henry Ngan (Department of Mathematics)

Heng Peng (Department of Mathematics)

Tiejun Tong (Department of Mathematics)

Xiang Wan (Department of Computer Science)

Yuliang Wang (Department of Mathematics)

Can Yang (Department of Mathematics)

Xiaoming Yuan (Department of Mathematics)

Tieyong Zeng (Department of Mathematics)


Wai-Ki Ching (Department of Mathematics)

Kevin Yip (Department of Computer Science and Engineering, CUHK)

Siu-Ming Yiu (Department of Computer Science)

Zhiwen Zhang (Department of Mathematics)


Room 2001, East Guanghua Main Building

08:45-09:00 Opening Remarks
09:00-09:30 Lizhi Liao Some Orthogonality Constrained Optimization Problems in Data Science

In this talk, several orthogonality constrained optimization problems that arise in data science are studied in detail. These problems occur in linear discriminant analysis (LDA), canonical correlation analysis (CCA), graph clustering, etc. We will explore the relationships among these problems and the existing solution schemes. Some new discoveries for some of these problems will also be reported. Finally, some open and challenging problems will be mentioned.

09:30-10:00 Xiaoyang Sean Wang Supporting Searches via Concept Extraction and Management

Searching on the net has become an important tool in people's daily lives. Current search technology is mostly limited to keyword matching enhanced by a sophisticated ranking system. This is far from satisfactory, and the idea of responding to searches with the “content” of documents on the net is promising. The key to achieving this is to extract concepts (or topics or knowledge) from documents and to manage the concepts in a meaningful and easy-to-access manner. This talk introduces the audience to the idea of concept extraction and management, and gives an overview of the challenges and possible solutions.

10:00-10:30 Kevin Yip Computational Modeling and Analysis of Gene Regulatory Mechanisms

Gene expression is controlled by multiple regulatory mechanisms. A thorough understanding of these mechanisms would help elucidate gene functions and the impact of their dysregulation in diseases. The large amount of data obtained by high-throughput experiments has created an unprecedented opportunity to move away from simple qualitative descriptions of these regulatory mechanisms to detailed quantitative models, which provide information for studying properties of these mechanisms in far greater detail. On the other hand, high-throughput data are noisy and could contain various types of bias, and thus should be processed and analyzed with caution. In this talk, I will describe my previous and current work on modeling and analyzing gene regulatory mechanisms, with a focus on the roles of three-dimensional genome structure and context-specific enhancer-target interactions.

10:30-11:00 Coffee Break
11:00-11:30 Wai-Ki Ching Modeling Credit Defaults by Probabilistic Boolean Networks

One of the central issues in credit risk measurement and management is modeling and predicting correlated defaults. In this talk we introduce a novel model based on Probabilistic Boolean Networks (PBNs) to investigate the relationship between correlated defaults of different industrial sectors and business cycles, as well as the impact of business cycles on modeling and predicting correlated defaults. The key idea of the PBN is to decompose a transition probability matrix describing correlated defaults of different sectors into several BN matrices which contain information about business cycles. An efficient estimation method is proposed to estimate the model parameters. Using real default data, we build a PBN for explaining the default structure and make reasonably good predictions of joint defaults in different sectors.

11:30-12:00 Henry Ngan Automatic Incident Classification for Big Traffic Data by Adaptive Boosting SVM

Modern cities regularly experience heavy traffic flows and congestion across space and time. Monitoring traffic situations has become an important challenge for Traffic Control and Surveillance Systems (TCSS). In advanced TCSS, it is helpful to automatically detect and classify different traffic incidents such as severity of congestion, abnormal driving patterns, abrupt or illegal stops on the road, etc. Although most TCSS are equipped with basic incident detection algorithms, these are too crude to be really useful as an automated tool for further classification. In the literature, there is a lack of research on Automated Incident Classification (AIC). Therefore, a novel AIC method is proposed in this talk to tackle such challenges. In the proposed method, traffic signals are first extracted from captured videos and converted into spatial-temporal (ST) signals. Based on the characteristics of the ST signals, a set of realistic simulation data is generated to construct an extended big traffic database covering a variety of traffic situations. Next, a Mean-Shift filter is introduced to suppress the effect of noise and extract significant features from the ST signals. The extracted features are then associated with various types of traffic data: one normal type (inliers) and multiple abnormal types (outliers). For the classification, an adaptive boosting classifier is trained to detect outliers in traffic data automatically. Further, a Support Vector Machine (SVM) based method is adopted to train the model for identifying the categories of outliers. In short, this hybrid approach is called the Adaptive Boosting Support Vector Machine (AB-SVM) method. Experimental results show that the proposed AB-SVM method achieves satisfactory results, with more than 92% classification accuracy on average.

12:00-14:00 Lunch (Danyuan Restaurant)

Room 1801, East Guanghua Main Building

14:00-14:30 Tieyong Zeng Recent Progress in Image Recovery (dictionary, low rank)

Sparsity has played an important role in the domain of image processing. In this talk, we present some recent progress in image recovery based on dictionary learning and low-rank priors.

14:30-15:00 Xiang Wan Integrating Pleiotropy and Tissue-specific Information for Prioritizing Risk Genes

Recent biotechnology breakthroughs have changed the world in ways never seen before by generating vast amounts of genomic data, including thousands of genome-wide association studies (GWAS) and massive gene expression data from different tissues. How to perform joint analysis of these data to gain new biological insights has become a critical step in understanding the aetiology of complex diseases. Due to the polygenic architecture of complex diseases, identification of risk genes remains challenging. In this talk, I will present our recent approach to integrating pleiotropy in multiple GWAS and tissue-specific information for prioritizing risk genes. I will also briefly show some simulation results and the results of the real data analysis on the bipolar disorder (BPD) and schizophrenia (SCZ) GWAS from the Psychiatric Genomics Consortium, along with gene expression data of multiple tissues from the Genotype-Tissue Expression project (GTEx).

15:00-15:30 Can Yang A Statistical Approach to Colocalizing Genetic Risk Variants in Multiple GWAS

It is widely agreed that complex human phenotypes (such as height, obesity, and psychiatric disorders) are highly polygenic and that a large number of risk variants with small effects remain undiscovered. Recently, accumulating evidence suggests that many genetic variants may affect multiple seemingly different phenotypes. Such a phenomenon is known as “pleiotropy”. Undoubtedly, identification of risk variants with pleiotropic effects not only helps to explain the relationship between diseases, but may also contribute to novel insights concerning the etiology of each specific disease. In this talk, we consider a statistical approach to colocalizing genetic risk variants in multiple GWAS by taking pleiotropic effects into account. An efficient algorithm was derived such that we were able to perform joint analysis of 20 GWAS within half an hour. Compared with single GWAS analysis, the statistical power of the proposed approach was improved by about 15%-30% in the real data analysis. We believe that the proposed approach will greatly facilitate colocalization of genetic risk variants.

15:30-15:50 Coffee Break
15:50-16:20 Wei Lin TBD
16:20-16:50 Xiaoming Yuan Convex Optimization Perspectives for Data Science: A Case Study on ADMM

According to Frontiers in Massive Data Analysis, edited by the National Research Council of the National Academies of the USA, optimization is one of the seven computational giants of massive data analysis (or big data). How to adapt some well-studied optimization techniques to big data scenarios, however, is by no means trivial. I will focus on two big data scenarios: high-dimensional variables and huge numbers of features (obtained via, e.g., a deep learning procedure), and discuss how to correctly use the extremely popular ADMM algorithm for these challenging scenarios. I will highlight some techniques for designing ADMM-based algorithms and some roadmaps to rigorously ensure their convergence. Some critical issues (both theoretical and algorithmic) that have long been ignored will be emphasized, and our solutions will be presented. Some more challenging questions awaiting answers will also be posed.
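
For readers unfamiliar with ADMM, the following is a minimal textbook sketch and not the speaker's algorithm. It assumes a toy problem chosen only for illustration: minimize 0.5*||x - a||^2 + lam*||z||_1 subject to x = z, solved with the standard two-block scaled-form updates. The function names, the penalty parameter rho and the iteration count are all hypothetical choices.

```python
def soft_threshold(v, t):
    """Proximal operator of t*|.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_l1_denoise(a, lam, rho=1.0, iters=200):
    """Two-block scaled ADMM for min 0.5*||x-a||^2 + lam*||z||_1 s.t. x = z.

    Illustrative only: the problem, rho and iters are assumptions.
    """
    n = len(a)
    z = [0.0] * n
    u = [0.0] * n  # scaled dual variable
    for _ in range(iters):
        # x-update: minimize the smooth quadratic part in closed form
        x = [(a[i] + rho * (z[i] - u[i])) / (1.0 + rho) for i in range(n)]
        # z-update: proximal step on the l1 term
        z = [soft_threshold(x[i] + u[i], lam / rho) for i in range(n)]
        # dual update: accumulate the constraint residual x - z
        u = [u[i] + x[i] - z[i] for i in range(n)]
    return z


result = admm_l1_denoise([3.0, -0.5, 1.2], lam=1.0)
```

For this separable toy problem the exact minimizer is known in closed form (soft-thresholding of a by lam), which makes it a convenient check that the iteration behaves as expected.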

16:50-17:20 Shuai Lu Filter Based Methods for Statistical Linear Inverse Problems

Ill-posed inverse problems are ubiquitous in applications. Understanding of algorithms for their solution has been greatly enhanced by a deep understanding of the linear inverse problem. In the applied communities ensemble-based filtering methods have recently been used to solve inverse problems by introducing an artificial dynamical system. This opens up the possibility of using a range of other filtering methods, such as 3DVAR and Kalman based methods, to solve inverse problems, again by introducing an artificial dynamical system. The aim of this talk is to analyze such methods in the context of the ill-posed linear inverse problem.

Statistical linear inverse problems are studied in the sense that the observational noise is assumed to be a realization of a Gaussian random variable. We investigate the asymptotic behavior of filter based methods for these statistical linear inverse problems. Rigorous convergence rates are established for 3DVAR and for the Kalman filters, including minimax rates in some instances. Blowup of 3DVAR and its variant form is also presented, and optimality of the Kalman filter is discussed. These analyses reveal a close connection between (iterative) regularization schemes in deterministic inverse problems and filter based methods in data assimilation.

This is joint work with Dr. M. A. Iglesias (U. of Nottingham, UK), Dr. K. Lin (Fudan U., China) and Prof. A. M. Stuart (U. of Warwick, UK).

18:00-20:00 Dinner (XingChen, Huangxing Road)

Room 2001, East Guanghua Main Building

08:45-09:15 Zongmin Wu Construction of Moving Knots Equations for Simulating Time Dependent PDEs

This paper constructs a new kind of moving-knots partial differential equation (PDE) based on the Equidistribution Principle (EP) proposed by de Boor. Under the new moving-knots PDE, the motion of the knots in each time iteration consists of two parts: one step of Newton iteration at a fixed time, which pulls the knots closer to equidistribution, and a temporal preservation step, which keeps the knots’ distribution when moving to the next time step. We prove that, for arbitrary initial knots, the new moving-knots PDE drives the knots monotonically and exponentially fast toward equidistribution. A numerical algorithm for solving the coupled system of the moving-knots equation and the original PDE is proposed. Finally, several numerical examples are shown to verify the advantages of the new method over previous methods.

09:15-09:45 Weihong Yang A Krylov Subspace Method for Large Scale SOCLCP

The second order cone linear complementarity problem (SOCLCP) can be solved by finding a positive root of a particular rational function h(s). We propose a Krylov subspace method to reduce the rational function h(s) to h'(s), as in model reduction. The zero s of h(s) can be accurately approximated by that of h'(s), whose computation can itself be cast as a small eigenvalue problem. The method is tested and compared against two state-of-the-art packages: SDPT3 and SeDuMi. Our numerical results show that the method is very efficient for both small-to-medium dense problems and large scale ones.

09:45-10:15 Michael Ng Manifold Regularization for Tensor Data

In this talk, we discuss recent results for dimension reduction of tensor data. Manifold regularization or other regularization terms can be considered. Numerical examples are reported to illustrate the results.

10:15-10:45 Coffee Break
10:45-11:15 Heng Peng Model Selection for Gaussian Mixture Models

This talk is concerned with an important issue in finite mixture modeling, namely the selection of the number of mixing components. A new penalized likelihood method is proposed for finite multivariate Gaussian mixture models, and it is shown to be statistically consistent in determining the number of components. A modified EM algorithm is developed to simultaneously select the number of components and estimate the mixing probabilities and the unknown parameters of Gaussian distributions. Simulations and a real data analysis are presented to illustrate the performance of the proposed method.

11:15-11:45 Tiejun Tong Bias and Variance Reduction in Estimating the Proportion of True Null Hypotheses

When testing a large number of hypotheses, estimating the proportion of true nulls, denoted by pi0, becomes increasingly important. This quantity has many applications in practice. For instance, a reliable estimate of pi0 can eliminate the conservative bias of the Benjamini-Hochberg procedure in controlling the false discovery rate. It is known that most methods in the literature for estimating pi0 are conservative. Recently, some attempts have been made to reduce this estimation bias. Nevertheless, they either over-correct the bias or suffer from an unacceptably large estimation variance. In this paper, we propose a new method for estimating pi0 that aims to reduce the bias and variance of the estimation simultaneously. To achieve this, we first utilize the probability density functions of false-null p-values and then propose a novel algorithm to estimate pi0. The statistical behavior of the proposed estimator is also investigated. Finally, we carry out extensive simulation studies and several real data analyses to evaluate the performance of the proposed estimator. Both simulated and real data demonstrate that the proposed method may improve significantly on existing methods.
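
As background on why pi0 matters, the sketch below shows a well-known baseline and not the estimator proposed in this talk. It assumes Storey's simple estimator pi0(lambda) = #{p_i > lambda} / (m * (1 - lambda)) and plugs the estimate into the Benjamini-Hochberg step-up rule. The function names, the choice lambda = 0.5 and the p-values are made up for illustration.

```python
def storey_pi0(pvals, lam=0.5):
    """Storey-type estimate of the proportion of true nulls (capped at 1)."""
    m = len(pvals)
    above = sum(1 for p in pvals if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def bh_reject(pvals, alpha=0.05, pi0=1.0):
    """Benjamini-Hochberg step-up rule; pi0 < 1 gives the adaptive version.

    Returns the set of indices of rejected hypotheses.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value is under its threshold
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * alpha / (m * pi0):
            k = rank
    return set(order[:k])


# Hypothetical p-values, chosen only to show the effect of pi0.
pvals = [0.001, 0.002, 0.003, 0.004, 0.005, 0.02, 0.04, 0.7, 0.8, 0.9]
pi0_hat = storey_pi0(pvals)
plain = bh_reject(pvals, alpha=0.05)
adaptive = bh_reject(pvals, alpha=0.05, pi0=pi0_hat)
```

With pi0 fixed at 1 the rule is the ordinary Benjamini-Hochberg procedure; an estimate below 1 relaxes every threshold by the factor 1/pi0, which is the sense in which a good estimate removes the procedure's conservative bias.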

11:45-14:00 Lunch (Danyuan Restaurant)
14:00-14:30 Siu-Ming Yiu The Assembly Problem (Using Short Reads) in Bioinformatics

Computer science has come to play an important role in data analysis for bioinformatics due to the advancement of next generation sequencing technology. In this talk, we will introduce the assembly problem, a critical step in analyzing a genome. We will talk about the difficulties in solving this problem and highlight some of the techniques used in existing tools.

14:30-15:00 Zhiwen Zhang A Class of Data-driven Methods for Stochastic Partial Differential Equations

We propose a data-driven stochastic method (DSM) to study stochastic partial differential equations (SPDEs) in the multiquery setting. An essential ingredient of the proposed method is to construct a data-driven stochastic basis under which the stochastic solutions to the SPDEs enjoy a compact representation for a broad range of forcing functions and/or boundary conditions. Our method consists of offline and online stages. A data-driven stochastic basis is computed in the offline stage using the Karhunen-Loève (KL) expansion. In the online stage, we solve a relatively small number of coupled deterministic PDEs by projecting the stochastic solution onto the data-driven stochastic basis constructed offline. Applications of DSM to stochastic elliptic problems show considerable computational savings over traditional methods even with a small number of queries.

15:00-15:30 Zhiguo Xiao Moving Average Barriers in Chinese Stock Markets

We find that moving averages (MA) serve as psychological barriers in Chinese stock markets. Using a uniformity test and GARCH models, we establish the following facts. First, the Shanghai stock index does not move continuously near the MA lines. Second, both the mean and the volatility of the index return change significantly when the index crosses the MA lines. To substantiate our findings, we construct a technical trading strategy using MA rules. It turns out that the technical strategy outperforms the passive buy-and-hold strategy by a large margin.
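
To make the notion of an index crossing an MA line concrete, here is a minimal sketch. It assumes a simple (unweighted) moving average and a made-up price series, and it does not reproduce the uniformity test, the GARCH models or the trading strategy from the talk. The function names and the window length are hypothetical.

```python
def simple_moving_average(prices, window):
    """Trailing simple moving average; None until a full window is available."""
    out = []
    for t in range(len(prices)):
        if t + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[t + 1 - window:t + 1]) / window)
    return out


def ma_crossings(prices, window):
    """Return (index, direction) pairs where the price crosses its MA line."""
    ma = simple_moving_average(prices, window)
    events = []
    for t in range(window, len(prices)):
        prev_gap = prices[t - 1] - ma[t - 1]
        gap = prices[t] - ma[t]
        if prev_gap > 0 and gap < 0:
            events.append((t, "down"))  # price falls below the MA line
        elif prev_gap < 0 and gap > 0:
            events.append((t, "up"))    # price rises above the MA line
    return events


# Hypothetical index levels, for illustration only.
prices = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4]
events = ma_crossings(prices, window=3)
```

Crossing dates identified this way are the kind of events around which return mean and volatility could then be compared, for example with a GARCH specification.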

15:30-16:00 Coffee Break
16:00-16:30 Hongyu Liu Mathematical Design of a Novel Gesture-based Instruction/Input Device

I will talk about a conceptual design of a novel gesture-based instruction/input device using acoustic or electromagnetic waves. The gestures are modeled as the shapes of certain impenetrable obstacles or penetrable medium scatterers. The device uses time-harmonic point signals to recognize/detect the gestures and, based on the result, gives the specific orders or inputs to the computing end. In our design, only two point signals are needed, and the phaseless scattered data are collected on a surface containing the point sources. I shall talk about the mathematical principles as well as the numerical implementations.

16:30-17:00 Shuqin Zhang Drug-target Interaction Prediction by Integrating Multiview Network Data

Drug-Target Interaction (DTI) prediction is a key step in drug repositioning, drug discovery and drug design. Many mathematical models and statistical computation algorithms have been explored to identify potential drug-target pairs. In this talk, we propose an optimization model that integrates different types of data from multiple views to predict DTIs. We apply our algorithm to the LINCS L1000 database and DrugBank 3.0. Comparison with other existing methods shows the better performance of our model and algorithm. Finally, we predict 54 possible interactions and divide them into two classes.

18:00-20:00 Dinner (Yanyuan, Zhengtong Road)

Campus Map


Pudong Airport (PVG) to Fudan University

  • Take taxi (about 50 minutes, about 170 RMB).
  • Take Maglev Speed Train to Long Yang Road Station (50 RMB), then take taxi (about 60 RMB).
  • Take Airport Bus Line 4, get off at Wu Jiao Chang Station (20 RMB). Then walk about 10 minutes.

Hongqiao Airport (SHA) to Fudan University

  • Take taxi (about 95 RMB).
  • Take Shanghai metro line No. 10 to Jiangwan Stadium Station (5 RMB). Then walk about 10 minutes.