I am a research associate with Philipp Koehn at the University of Edinburgh. Currently, I am working on hypergraph search as part of my Carnegie Mellon PhD thesis advised by Alon Lavie. My interests are machine translation, language models, machine learning, distributed systems, and theoretical computer science.
Before Carnegie Mellon, I worked at Google on Book Search and Picasa, at Caltech in Netlab and GALEX while earning a BSc in Mathematics and Computer Science, and in Bangalore at Infosys as a research intern. My Curriculum Vitæ is available in HTML and PDF.
Recent Projects
Each project has accompanying open source (LGPL) code in C++.
- Fast and accurate hypergraph search
- in the presence of language models. I have focused on applying it to syntactic machine translation while others have found it useful for phrase-based translation, dependency-to-string translation, and spell checking.
- Language model estimation and querying (KenLM)
- that is simultaneously faster, smaller, and at least as accurate compared to other packages in common cases.
- System combination (MEMT)
- builds on top of other machine translation systems to produce one improved translation. Several research groups submitted system combinations to the 2011 Workshop on Machine Translation; my submission ranked best in 8 of 10 scenarios.
Publications
All papers in BibTeX format
Thesis Proposal
- Efficient Statistical Machine Translation Decoding via Improved Language Modeling
. Committee: Alon Lavie, Chris Dyer, Bhiksha Raj, and Philipp Koehn. 15 August, 2012.
[BibTeX]
Estimating Language Models
- Scalable Modified Kneser-Ney Language Model Estimation
, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. ACL, Sofia, Bulgaria, 4—9 August, 2013.
[Paper] [Code] [BibTeX]
Decoding with Language Models
- Grouping Language Model Boundary Words to Speed K-Best Extraction from Hypergraphs
, Philipp Koehn, and Alon Lavie. NAACL HLT, Atlanta, Georgia, USA, 10—12 June, 2013.
[Paper] [Code] [BibTeX]
- Language Model Rest Costs and Space-Efficient Storage
, Philipp Koehn, and Alon Lavie. EMNLP, Jeju Island, Korea, 12—14 July, 2012.
[Paper] [Slides] [BibTeX]
- Left Language Model State for Syntactic Machine Translation
, Hieu Hoang, Philipp Koehn, Tetsuo Kiso, and Marcello Federico. IWSLT, San Francisco, California, USA, 8—9 December, 2011.
[Paper] [Poster] [BibTeX]
Querying Language Models
- KenLM: Faster and Smaller Language Model Queries
. WMT at EMNLP, Edinburgh, Scotland, United Kingdom, 30—31 July, 2011.
[Paper] [Slides] [Code] [BibTeX]
System Combination
- CMU System Combination in WMT 2011
and Alon Lavie. WMT at EMNLP, Edinburgh, Scotland, United Kingdom, 30—31 July, 2011.
[Paper] [Slides] [BibTeX]
- Voting on N-grams for Machine Translation System Combination
and Alon Lavie. AMTA, Denver, Colorado, USA, November, 2010.
[Paper] [BibTeX]
- CMU Multi-Engine Machine Translation for WMT 2010
and Alon Lavie. WMT at ACL, Uppsala, Sweden, July, 2010.
[Paper] [Poster] [BibTeX]
- Combining Machine Translation Output with Open Source: The Carnegie Mellon Multi-Engine Machine Translation Scheme
and Alon Lavie. The Prague Bulletin of Mathematical Linguistics 93. 25—30 January, 2010.
[Paper] [Slides] [BibTeX]
- The Machine Translation Toolpack for LoonyBin: Automated Management of Experimental Machine Translation HyperWorkflows
Jonathan H. Clark, Jonathan Weese, Byung Gyu Ahn, Andreas Zollmann, Qin Gao, , and Alon Lavie. The Prague Bulletin of Mathematical Linguistics 93. 25—30 January, 2010.
[Paper] [BibTeX]
- CMU-StatXfer Group System Combination
. NIST Open MT Workshop at MT Summit XII, Ottawa, Canada, 1 September, 2009.
[Description] [Slides] [BibTeX]1
- Machine Translation System Combination with Flexible Word Ordering
, Greg Hanneman, and Alon Lavie. WMT at EACL, Athens, Greece, 30—31 March, 2009.
[Paper] [Slides] [BibTeX]
Topic Modeling for Source Code
- Identification of Topics in Source Code
Girish Maskeri Rama, , and Santonu Sarkar. US Patent 8209665 filed in 2009 and issued 26 June, 2012.
[Patent] [BibTeX]
- Mining Business Topics in Source Code using Latent Dirichlet Allocation
Girish Maskeri, Santonu Sarkar, and . 1st India Software Engineering Conference, Hyderabad, India, 19—22 February, 2008.
[Paper] [BibTeX]2
Image Recommendation
- Systems and Methods for Identifying Similar Documents
Taylor Curtis and . US Patent 7958136 filed in 2008 and issued 7 June, 2011.
[Patent] [BibTeX]
Variable Stars
- RR Lyrae Stars in the Far Ultraviolet: GALEX Observations Compared with Theoretical Predictions
Stanley Browne, Jonathan Wheatley, Barry Welsh, Mark Seibert, , R. Michael Rich, and the GALEX Science Team. American Astronomical Society 207th Meeting, Washington, DC, USA, 8–12 January, 2006.
[Poster] [BibTeX]
- The GALEX Ultraviolet Variability Catalog
Barry Welsh, Jonathan Wheatley, , Mark Seibert, and the GALEX Science Team. The Astronomical Journal 130. 2005.
[Paper] [BibTeX]
- The Flaring UV Sky
Barry Welsh, Jonathan Wheatley, , Mark Seibert, Stanley Browne, and the GALEX Science Team. American Astronomical Society 205th Meeting, San Diego, California, USA, 9—13 January, 2005.
[Poster] [BibTeX]
Reports
National Science Foundation Graduate Research Fellowship

In 2008, I was awarded a National Science Foundation Graduate Research Fellowship. The application required three essays: a summary of past work, motivation, and a potential research plan.
Google


From March 2007 to August 2008, I worked at Google as a Software Engineer on Picasa Web Albums and Google Book Search. To share Google's approach to distributed systems, I lectured on the Hadoop MapReduce framework as part of a 3-day class at MIT. I wrote and delivered the introduction, basic join, and entropy lectures.4 Involved employees received a Site Award and a Peer Bonus.
- Intro
- Intended to follow a lecture on MapReduce theory, this introduces basic Hadoop programming
- Diff
- A few slides to explain reducers as joining data from separate sources
- k-Means
- Run-through of the Hadoop API followed by k-means clustering
- Entropy
- Introduces an entropy-based word weighting scheme and uses it to motivate performance strategies
Netlab


In 2005, I worked for Netlab at Caltech as a Richard and Dena Krown Summer Undergraduate Research Fellow. Professor Low hired me after the summer and I continued until my Infosys internship in June 2006. These reports were prepared for the fellowship.
- Detecting Network Anomalies With Kernel Principal Component Analysis
and Steven Low. 2005.
[Paper] [Slides]
Galaxy Evolution Explorer
Galaxy Evolution Explorer (GALEX) is a NASA satellite observatory with science operations at Caltech. Starting in 2004 as a Summer Undergraduate Research Fellow, I found about 90 variable stars and asteroids in their 193 million measurements. They hired me to continue working with their data until I graduated in March 2007. Results are published and therefore listed under Publications, above.
- Transiting and Variable Objects: A Search Through Galaxy Evolution Explorer Observations
and Mark Seibert. 2004.
[Slides]
Information Management Systems and Services
I worked for Caltech's IT department as a student representative and later as a security tester. They hired me as a security tester after I sent them this video of an exploit in their production course registration system. The video shows how to use my roommate's login to read my grades. The vulnerability has since been patched.
- 1
- NIST serves to coordinate the NIST Open MT evaluations in order to support machine translation research and to help advance the state-of-the-art in machine translation technologies. NIST Open MT evaluations are not viewed as a competition, as such results reported by NIST are not to be construed, or represented, as endorsements of any participant's system, or as official findings on the part of NIST or the U.S. Government.
- 2
- © ACM, 2008. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in the Proceedings of the 1st India Software Engineering Conference, Hyderabad, India, February 19-22, 2008.
- 3
- This material is based upon work supported under a National Science Foundation Graduate Research Fellowship. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
- 4
- © Google, 2008. Except as otherwise noted, this presentation is released under the Creative Commons Attribution 2.5 license.
