Organizers: Qi Tian, Kay Robbins, Weining Zhang, Tom Bylander, Carola Wenk, Yufei Huang (EE)
Time: 10:00-11:00 am, Every Friday
Place: SB 4.01.20, CS Conference Room
Schedule for Fall 2009
9/18 Angela Dean (Ruan lab)
9/25 Chengwei Lei (Ruan lab)
10/2 Xia Li (Tian lab)
10/9 Jie Xiao (Tian lab)
10/16 Hongwei Tian (Zhang lab)
10/23 Chifeng Ma (Huang lab)
10/30 Lijie Zhang (Zhang lab)
11/6 Mark Doderer (Robbins lab)
11/13 Jian Cui (Huang lab)
11/20 Jessica Sherette (Wenk lab)
12/4 Jia Meng (Huang lab)
09/18/09 A novel meta-analysis method exploiting consistency of high-throughput experiments
Speaker: Angela Dean
Motivation: Large-scale biological experiments provide snapshots into the huge number of processes running in parallel within the organism. These processes depend on a large number of (hidden) (epi)genetic, social, environmental and other factors that are out of experimentalists' control. This makes it extremely difficult to identify the dominant processes and the elements involved in them based on a single experiment. It is therefore desirable to use multiple sets of experiments targeting the same phenomena while differing in some experimental parameters (hidden or controllable). Although such datasets are becoming increasingly common, their analysis is complicated by the fact that the various biological elements could be influenced by different sets of factors.
Results: The central hypothesis of this article is that biologically related elements and processes are affected by changes in similar ways while unrelated ones are affected differently. Thus, the relations between related elements are more consistent across experiments. The method outlined here looks for groups of elements with robust intra-group relationships in the expectation that they are related. The major groups of elements may be identified in this way. The strengths of relationships per se are not valued, just their consistency. This represents a completely novel and unutilized source of information. In the analysis of time course microarray experiments, I found cell cycle- and ribosome-related genes to be the major groups. Despite not looking for these groups in particular, the identification of these genes rivals that of methods designed specifically for this purpose.
- Satwik Rajaram. A novel meta-analysis method exploiting consistency of high-throughput experiments. Bioinformatics 2009, 25(5):636-642
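The core idea, scoring gene pairs by how stable their correlation is across experiments rather than by how strong it is, can be sketched in a few lines. This is an illustrative toy, not Rajaram's actual method; the `max_sd` threshold and the dict-of-lists input format are assumptions made for the sketch:

```python
from itertools import combinations
from statistics import mean, pstdev

def corr(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def consistent_pairs(experiments, max_sd=0.1):
    """Return gene pairs whose correlation varies little across experiments.

    experiments: list of dicts mapping gene -> list of measurements.
    Only the spread (pstdev) of the correlation matters, not its strength,
    mirroring the paper's point that consistency, not magnitude, is scored.
    """
    genes = sorted(experiments[0])
    result = []
    for g1, g2 in combinations(genes, 2):
        rs = [corr(e[g1], e[g2]) for e in experiments]
        if pstdev(rs) <= max_sd:
            result.append((g1, g2, pstdev(rs)))
    return result
```

Pairs that stay perfectly correlated in every experiment pass the filter even if another pair happens to be strongly (but inconsistently) correlated in a single experiment.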
09/25/09 Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation
Speaker: Chengwei Lei
In this presentation, I will discuss a de novo motif finding tool called Trawler, the fastest computational pipeline to date for efficiently discovering over-represented motifs in chromatin immunoprecipitation (ChIP) experiments and predicting their functional instances. When applied to data from yeast and mammals, Trawler accurately discovered 83% of the known binding sites, often together with additional binding sites, providing hints of combinatorial input. Newly discovered motifs and their features (identity, conservation, position in sequence) are displayed on a web interface.
- Laurence Ettwiller, Benedict Paten, Mirana Ramialison, Ewan Birney and Joachim Wittbrodt. Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation. Nature Methods 4, 563-565 (2007)
- Supplementary info for the above paper
- Laurence Ettwiller, Benedict Paten, Marcel Souren, Felix Loosli, Jochen Wittbrodt and Ewan Birney, The discovery, positioning and verification of a set of transcription-associated motifs in vertebrates. Genome Biology 2005, 6:R104
- Marsan L, Sagot MF: Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J Comput Biol 2000, 7:345-362.
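The over-representation statistic at the heart of such pipelines can be illustrated with a toy foreground-versus-background k-mer count. The real Trawler pipeline uses suffix-tree counting and z-scores; the raw fold change with a pseudocount below is a simplified stand-in:

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every k-mer occurrence across a list of DNA sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def enriched_motifs(chip_seqs, background_seqs, k=4, min_fold=2.0):
    """Rank k-mers over-represented in ChIP sequences vs background.

    A pseudocount of 1 avoids division by zero; real pipelines such as
    Trawler use suffix trees and z-scores instead of raw fold change.
    """
    fg, bg = kmer_counts(chip_seqs, k), kmer_counts(background_seqs, k)
    fg_total = sum(fg.values()) or 1
    bg_total = sum(bg.values()) or 1
    hits = []
    for motif, n in fg.items():
        fold = (n / fg_total) / ((bg.get(motif, 0) + 1) / bg_total)
        if fold >= min_fold:
            hits.append((motif, fold))
    return sorted(hits, key=lambda t: -t[1])
```

A motif like a transcription factor binding site that recurs in the ChIP pulldown but not in background sequence rises to the top of the ranking.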
10/02/09 Label to Region by Bi-Layer Sparsity Priors
Speaker: Xia Li
To achieve reliable content-based image retrieval, it is critical to obtain the correspondence between image labels and their precise regions within an image. In practice, however, manually annotating labels at the region level is very tedious, and a more feasible alternative is to annotate at the image level. This paper therefore investigates how to automatically reassign labels annotated at the image level to contextually derived semantic regions. The authors first propose a bi-layer sparse coding formulation to reconstruct an image or semantic region from over-segmented patches of an image set. Each layer of sparse coding assigns image labels to the selected atomic patches and to candidate regions merged on the basis of shared labels. The two layers are then fused to obtain the entire label-to-region assignment. Extensive experiments on three public image datasets demonstrate the effectiveness of the proposed framework in both label-to-region assignment and image annotation tasks.
- Label to Region by Bi-Layer Sparsity Priors. X. Liu, B. Cheng, S. Yan, J. Tang, T. Chua, and H. Jin. In ACM MM'09.
- Efficient graph-based image segmentation. P. Felzenszwalb and D. Huttenlocher. International Journal of Computer Vision, 2004.
- l1-Magic package, http://www.acm.caltech.edu/l1magic
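The sparse-reconstruction step can be conveyed with a greedy matching pursuit over a small dictionary. Note that the paper solves a proper l1-regularized problem with the l1-magic package; the greedy stand-in below only illustrates the flavor of selecting a few atoms to reconstruct a signal:

```python
def matching_pursuit(x, dictionary, n_atoms=2):
    """Greedy sparse reconstruction of vector x from a dictionary of
    atoms (lists of equal length): repeatedly pick the atom most
    correlated with the residual and subtract its contribution.
    Assumes unit-norm atoms; this is a simplified stand-in for the
    l1-regularized (lasso) formulation used in the paper.
    """
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    residual = list(x)
    coeffs = [0.0] * len(dictionary)
    for _ in range(n_atoms):
        # pick the atom with the largest absolute correlation
        k = max(range(len(dictionary)),
                key=lambda j: abs(dot(residual, dictionary[j])))
        c = dot(residual, dictionary[k])
        coeffs[k] += c
        residual = [r - c * a for r, a in zip(residual, dictionary[k])]
    return coeffs, residual
```

In the paper's setting the "atoms" are over-segmented image patches, and the sparsity of the coefficients is what lets labels propagate to only a few selected patches or regions.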
10/09/09 Recognition using Regions
Speaker: Jie Xiao
Region features are appealing because they naturally encode the shape and scale information of objects. In this work, a bag of overlaid regions is produced and represented by shape, color and texture. A max-margin method is used to learn the region weights. A generalized Hough voting scheme then generates hypotheses of object locations, scales and support, followed by a verification classifier and a constrained segmenter on each hypothesis. The experimental results show that this approach outperforms the state of the art on the ETHZ shape database.
- Chunhui Gu, Joseph J. Lim, Pablo Arbelaez, and Jitendra Malik, Recognition using Regions , CVPR 2009
- Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik, From contours to regions: An empirical evaluation, CVPR 2009
- Andrea Frome, Yoram Singer and Jitendra Malik, Image retrieval and classification using local distance functions , NIPS 2006
10/16/09 Extending l-Diversity for Better Data Anonymization
Speaker: Hongwei Tian
The notion of l-diversity provides a strong privacy guarantee for generalization. However, existing l-diversity algorithms may force users to choose between publishing no data and sacrificing privacy if the data have a skewed distribution of sensitive attribute (SA) values. In this paper, we solve this problem by extending l-diversity in two ways: first, we allow the generalization of SA values, and second, we use a simple function to constrain the frequencies of SA values. The resulting (tau, l)-diversity is more flexible and elaborate. We present an efficient heuristic algorithm that uses a novel ordering of quasi-identifier values to achieve (tau, l)-diversity. We compare our algorithm with two state-of-the-art algorithms based on existing l-diversity measures. Our preliminary experimental results indicate that our algorithm can not only effectively handle data with skewed SA distributions but also yield better utility of the anonymized data in general.
Hongwei Tian and Weining Zhang. Extending l-diversity for better data anonymization. In Sixth International Conference on Information Technology: New Generations, pages 461-466, 2009.
Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, and Muthuramakrishnan Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity . In IEEE International Conference on Data Engineering, page 24, 2006.
Gabriel Ghinita, Panagiotis Karras, Panos Kalnis, and Nikos Mamoulis. Fast data anonymization with low information loss . In International Conference on Very Large Data Bases, pages 758-769, 2007.
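For readers unfamiliar with the baseline notion, distinct l-diversity (the simplest variant, weaker than the entropy and recursive versions in the Machanavajjhala et al. paper) is easy to check. The dict-based record format here is an assumption for the sketch:

```python
from collections import defaultdict

def is_l_diverse(records, quasi_attrs, sensitive_attr, l):
    """Check distinct l-diversity: every equivalence class (group of
    records sharing the same quasi-identifier values) must contain at
    least l distinct sensitive values.  If a class has fewer, an
    attacker who locates a target in that class learns the sensitive
    value with too much certainty.
    """
    groups = defaultdict(set)
    for rec in records:
        key = tuple(rec[a] for a in quasi_attrs)
        groups[key].add(rec[sensitive_attr])
    return all(len(vals) >= l for vals in groups.values())
```

A skewed SA distribution makes this condition hard to satisfy without heavy generalization, which is exactly the problem (tau, l)-diversity addresses.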
10/23/09 A strategy for predicting the chemosensitivity of human cancers and its application to drug discovery
Speaker: Chifeng Ma
The U.S. National Cancer Institute has used a panel of 60 diverse human cancer cell lines (the NCI-60) to screen >100,000 chemical compounds for anticancer activity. However, not all important cancer types are included in the panel, nor are drug responses of the panel predictive of clinical efficacy in patients. We asked, therefore, whether it would be possible to extrapolate from that rich database (or analogous ones from other drug screens) to predict activity in cell types not included or, for that matter, clinical responses in patients with tumors. We address that challenge by developing and applying an algorithm we term "coexpression extrapolation" (COXEN). COXEN uses expression microarray data as a Rosetta Stone for translating from drug activities in the NCI-60 to drug activities in any other cell panel or set of clinical tumors. Here, we show that COXEN can accurately predict drug sensitivity of bladder cancer cell lines and clinical responses of breast cancer patients treated with commonly used chemotherapeutic drugs. Furthermore, we used COXEN for in silico screening of 45,545 compounds and identified an agent with activity against human bladder cancer.
- Lee et al. (2007) A strategy for predicting the chemosensitivity of human cancers and its application to drug discovery. Proc Nat Acad Sci. USA 104: 13086-13091
- Goodman MT, Hernandez BY, Hewitt S, Lynch CF, Cote TR, Frierson HF, Jr, Moskaluk CA, Killeen JL, Cozen W, Key CR, et al. (2005) Tissues from population-based cancer registries: a novel approach to increasing research potential. Hum Pathol 36:812-820.
- Su AI, Welsh JB, Sapinoso LM, Kern SG, Dimitrov P, Lapp H, Schultz PG, Powell SM, Moskaluk CA, Frierson HF, Jr, et al. (2001) Cancer Res 61:7388-7393.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, LohML, Downing JR, Caligiuri MA, et al. (1999) Science 286:531-537.
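The "Rosetta Stone" idea, keeping genes whose coexpression pattern is preserved across panels, can be sketched as follows. This is a simplified reading of the COXEN coefficient with a made-up dict-based input format; the published method involves additional normalization and prediction-model steps not shown here:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def coxen_scores(panel1, panel2, genes):
    """Score how well each gene's coexpression pattern is preserved
    across two expression panels (dicts gene -> expression list).

    For each gene, correlate its vector of correlations with the other
    genes in panel1 against the same vector in panel2; genes with high
    scores are candidates for cross-panel prediction, in the spirit of
    the COXEN coefficient.
    """
    scores = {}
    for g in genes:
        others = [h for h in genes if h != g]
        v1 = [pearson(panel1[g], panel1[h]) for h in others]
        v2 = [pearson(panel2[g], panel2[h]) for h in others]
        scores[g] = pearson(v1, v2)
    return scores
```

Genes with low scores behave differently in the two systems, so a predictor built on them in the NCI-60 would not be expected to transfer to the new panel.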
10/30/09 K-Automorphism: A General Framework for Privacy Preserving Network Publication
Speaker: Lijie Zhang
The growing popularity of social networks has generated interesting data management and data mining problems. An important concern in releasing such data for study is privacy, since social networks usually contain personal information. Simply removing all identifiable personal information (such as names and social security numbers) before release is insufficient: an attacker can often identify a target by issuing structural queries. In this paper the authors propose k-automorphism to protect against multiple structural attacks and develop an algorithm (called KM) that ensures k-automorphism. They also discuss an extension of KM to handle dynamic releases of the data. Extensive experiments show that the algorithm performs well in terms of the protection it provides.
Lei Zou, Lei Chen, and M. Tamer Özsu. "K-automorphism: A general framework for privacy preserving network publication." In Proc. 35th Int. Conf. on Very Large Data Bases, August 2009, pages 946-957.
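The k-automorphism condition itself is simple to verify on a toy graph, assuming the automorphisms are supplied explicitly as vertex maps. The hard part, actually constructing such automorphisms by minimally editing the graph (what the KM algorithm does), is not shown:

```python
def is_automorphism(adj, f):
    """Check that the permutation f (dict v -> f(v)) preserves the
    edges of the graph given as adjacency sets."""
    return all({f[u] for u in adj[v]} == adj[f[v]] for v in adj)

def is_k_automorphic(adj, maps, k):
    """A graph is k-automorphic if there are k - 1 non-trivial
    automorphisms such that every vertex has k distinct images
    (counting the identity).  Under this condition a structural
    attacker cannot narrow a target down to fewer than k candidate
    vertices.
    """
    if not all(is_automorphism(adj, f) for f in maps):
        return False
    return all(len({v} | {f[v] for f in maps}) >= k for v in adj)
```

A 4-cycle with the rotation map is 2-automorphic: every vertex has a structurally indistinguishable twin.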
11/06/09 Data Integration in Genetics and Genomics: Methods, Challenges and a Case Study
Speaker: Mark Doderer
Due to rapid technological advances, various types of genomic and proteomic data with different sizes, formats, and structures have become available.
Among them are gene expression, single nucleotide polymorphism, copy number variation, and protein-protein/gene-gene interactions. Each of these distinct data types provides a different, partly independent and complementary, view of the whole genome. However, understanding the functions of genes, proteins, and other aspects of the genome requires more information than any one of these datasets provides. Integrating data from different sources is, therefore, an important part of current research in genomics and proteomics. Several approaches to data integration will be reviewed in general and a case study will be presented in depth.
Jemila S. Hamid, Pingzhao Hu, Nicole M. Roslin, Vicki Ling, Celia M. T. Greenwood, and Joseph Beyene. "Data Integration in Genetics and Genomics: Methods and Challenges", Human Genomics and Proteomics 2009:869093, doi:10.4061/2009/869093
Yong Wang, Xiang-Sun Zhang and Yu Xia, "Predicting eukaryotic transcriptional cooperativity by Bayesian network integration of genome-wide data", Nucleic Acids Research, 2009, 37(18):5943-5958; doi:10.1093/nar/gkp625
11/13/09 Alignment of LC-MS images, with applications to biomarker discovery and protein identification
Speaker: Jian Cui
Abstract: LC-MS-based approaches have gained considerable interest for the analysis of complex peptide or protein mixtures, due to their potential for full automation and high sampling rates. Advances in resolution and accuracy of modern mass spectrometers allow new analytical LC-MS-based applications, such as biomarker discovery and cross-sample protein identification. Many of these applications compare multiple LC-MS experiments, each of which can be represented as a 2-D image. In this article, we survey current approaches to LC-MS image alignment. LC-MS image alignment corrects for experimental variations in the chromatography and represents a computational key technology for the comparison of LC-MS experiments. It is a required processing step for its two major applications: biomarker discovery and protein identification. Along with descriptions of the computational analysis approaches, we discuss their relative merits and potential pitfalls.
- Mathias Vandenbogaert, Sébastien Li-Thiao-Té, Hans-Michael Kaltenbach, Runxuan Zhang, Tero Aittokallio and Benno Schwikowski. Alignment of LC-MS images, with applications to biomarker discovery and protein identification. Proteomics 2008, 8, 650–67
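Retention-time correction is essentially a warping problem. A minimal dynamic-time-warping alignment of two 1-D chromatogram traces illustrates the idea; the aligners surveyed in the paper work on full 2-D (retention time x m/z) feature maps and use more robust scoring than the absolute intensity difference assumed here:

```python
def dtw_align(trace_a, trace_b):
    """Dynamic-time-warping alignment of two 1-D chromatogram traces
    (lists of intensities).  Returns the total alignment cost and the
    warping path as (i, j) index pairs.  The warping path maps each
    scan of one run onto the scans of the other, correcting for
    retention-time shifts between experiments.
    """
    n, m = len(trace_a), len(trace_b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(trace_a[i - 1] - trace_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1],
                                 cost[i - 1][j - 1])
    # backtrack the optimal warping path from (n, m) to (0, 0)
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        moves = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(moves)
    return cost[n][m], path[::-1]
```

Two identical runs align along the diagonal with zero cost; a shifted peak produces off-diagonal steps that quantify the chromatographic drift.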
11/20/09 Computing the Fréchet Distance Between Surfaces
Speaker: Jessica Sherette
Abstract: Similarity metrics between surfaces are used in graphics and computer-aided manufacturing. For example, in computer-aided manufacturing, their use helps to ensure quality control. The Fréchet distance in particular is a useful similarity metric because it takes the continuity of the given surfaces into account. Unfortunately, computing the Fréchet distance between arbitrary surfaces has been shown to be NP-hard. However, an algorithm has been found to compute the Fréchet distance between simple polygons in polynomial time. Our work extends this algorithm to a more general class of surfaces. Specifically, we developed a fixed parameter tractable algorithm to compute the Fréchet distance between two triangulated surfaces with acyclic dual graphs.
M. Godau. On the complexity of measuring the similarity between geometric objects in higher dimensions. PhD thesis, Freie Universität Berlin, Germany, 1998.
K. Buchin, M. Buchin, and C. Wenk. Computing the Fréchet distance between simple polygons in polynomial time. 22nd Symposium on Computational Geometry (SoCG), pages 80-87, 2006. http://portal.acm.org/citation.cfm?id=1137856.1137870
A. F. Cook IV, J. Sherette, and C. Wenk. Computing the Fréchet distance between polyhedral surfaces with acyclic dual graphs. 19th Fall Workshop on Computational Geometry, 2009.
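The discrete Fréchet distance between two polygonal curves (Eiter and Mannila's coupling recurrence) is the standard warm-up for the continuous and surface variants discussed in the talk; it captures the "person walking a dog" intuition with a small memoized recursion:

```python
from functools import lru_cache

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between polygonal curves P and Q,
    given as lists of (x, y) points.  Both walkers advance monotonically
    along their curves; the distance is the smallest leash length that
    admits a full traversal.  This discrete version upper-bounds the
    continuous Fréchet distance treated in the cited papers.
    """
    def d(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

    @lru_cache(maxsize=None)
    def c(i, j):
        if i == 0 and j == 0:
            return d(P[0], Q[0])
        if i == 0:
            return max(c(0, j - 1), d(P[0], Q[j]))
        if j == 0:
            return max(c(i - 1, 0), d(P[i], Q[0]))
        return max(min(c(i - 1, j), c(i, j - 1), c(i - 1, j - 1)),
                   d(P[i], Q[j]))

    return c(len(P) - 1, len(Q) - 1)
```

For two parallel horizontal segments one unit apart, the leash never needs to exceed 1, which is exactly what the recurrence returns.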
12/04/09 Bayesian Sparse Correlated Factor Analysis
Speaker: Jia Meng
Abstract: In this paper, we propose a new sparse correlated factor model under a Bayesian framework that is intended to model transcription factor regulation in a cell. Unlike the conventional factor model, the factors are assumed to be non-negative and correlated. The correlation is due to prior knowledge on the structure of the factors. To model the factors, a rectified function and a Dirichlet process mixture (DPM) prior are introduced. The loading matrix is sparse, and since prior knowledge of its non-zero elements is assumed available, the sparsity pattern of the loading matrix is significantly constrained, resulting in an unambiguous factor order. A Gibbs sampler is proposed to uncover the unknown non-negative factors and the loading matrix from data. The model and the Gibbs sampler are validated on simulated systems.
Markus Harva and Ata Kaban. Variational Learning for Rectified Factor Analysis.
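A toy generative sketch clarifies the rectification mechanism: factors are Gaussian draws passed through max(0, ·) before mixing through the loading matrix. The DPM prior that correlates the factors, the constrained sparsity pattern on the loadings, and the Gibbs sampler itself are not reproduced here; all names below are illustrative:

```python
import random

def rectify(x):
    """The rectified (non-negative) transform applied to each factor."""
    return max(0.0, x)

def simulate(loading, n_samples, noise_sd=0.1, seed=0):
    """Generate data from a toy rectified factor model
    x = W * max(0, f) + e, with Gaussian factors f and noise e.

    loading: list of rows (one per observed variable), each a list of
    per-factor weights.  This only illustrates the rectification step
    of the model discussed in the talk.
    """
    rng = random.Random(seed)
    n_vars, n_factors = len(loading), len(loading[0])
    data = []
    for _ in range(n_samples):
        f = [rectify(rng.gauss(0, 1)) for _ in range(n_factors)]
        x = [sum(loading[g][k] * f[k] for k in range(n_factors))
             + rng.gauss(0, noise_sd) for g in range(n_vars)]
        data.append(x)
    return data
```

Because the rectified factors are non-negative, observed variables loaded on the same factor co-vary in one direction only, which is the biological motivation for using rectification to model transcription factor activity.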
Questions and Comments?