Published in: Proc. Natl. Acad. Sci. USA, vol. 97, no. 18, pp. 10096-10100 (August 29, 2000):
Harmen J. Bussemaker 1, Hao Li 2, and Eric D. Siggia
Center for Studies in Physics and Biology, The Rockefeller University,
Box 25, 1230 York Avenue, New York, NY 10021,
E-mail: siggia@eds1.rockefeller.edu.
1 Present address: Swammerdam Institute for Life Sciences
and Amsterdam Center for Computational Science, University of Amsterdam,
Kruislaan 318, 1098 SM Amsterdam, The Netherlands.
E-mail: bussemaker@bio.uva.nl.
2 Present address: Departments of Biochemistry and Biophysics,
University of California, San Francisco, CA 94143.
E-mail: haoli@haoli1.ucsf.edu.
The availability of complete genome sequences and mRNA expression
data for all genes creates new opportunities and challenges for
identifying DNA sequence motifs that control gene expression. An
algorithm, "MobyDick," is presented that decomposes a set of
DNA sequences into the most probable dictionary of motifs or words.
This method is applicable to any set of DNA sequences: for example,
all upstream regions in a genome or all genes expressed under
certain conditions. Identification of words is based on a probabilistic
segmentation model in which the significance of longer words is
deduced from the frequency of shorter ones of various lengths,
eliminating the need for a separate set of reference data to define
probabilities. We have built a dictionary with 1,200 words for
the 6,000 upstream regulatory regions in the yeast genome; the
500 most significant words (some with as few as 10 copies
in all of the upstream regions) match 114 of 443 experimentally
determined sites (a significance level of 18 standard deviations).
When analyzing all of the genes up-regulated during sporulation
as a group, we find many motifs in addition to the few previously
identified by analyzing the expression subclusters individually.
Applying MobyDick to the genes derepressed
when the general repressor Tup1 is deleted, we find known as
well as putative binding sites for its regulatory partners.
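
The probabilistic segmentation model described above can be illustrated
with a small sketch. It assumes that a sequence is generated by
concatenating words drawn independently from a dictionary, so that the
likelihood of a sequence is the sum, over every way of cutting it into
dictionary words, of the product of the word probabilities. The function
name, the toy dictionary, and its probabilities are hypothetical and
chosen only for illustration; this is not the authors' implementation,
and it omits the fitting of word probabilities and the statistical test
used to add new words.

```python
def segmentation_likelihood(seq, word_probs):
    """Likelihood of seq summed over all segmentations into dictionary words.

    word_probs maps each word (a string) to its probability. z[i] holds the
    total weight of all segmentations of the first i letters of seq.
    """
    max_len = max(len(word) for word in word_probs)
    z = [0.0] * (len(seq) + 1)
    z[0] = 1.0  # the empty prefix has exactly one (empty) segmentation
    for i in range(1, len(seq) + 1):
        # Try every dictionary word that could end at position i.
        for length in range(1, min(max_len, i) + 1):
            p = word_probs.get(seq[i - length:i])
            if p:
                z[i] += p * z[i - length]
    return z[len(seq)]


# Hypothetical toy dictionaries: single letters only, and letters plus one motif.
letters_only = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
with_motif = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.2, "TATA": 0.2}

likelihood_letters = segmentation_likelihood("GTATAC", letters_only)
likelihood_motif = segmentation_likelihood("GTATAC", with_motif)
```

In this toy case the dictionary containing the longer word assigns the
sequence a higher likelihood than the single-letter dictionary does,
which is the kind of comparison that lets shorter words serve as the
background for judging longer ones.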