In Iceland after a geothermal swim

Currently, I am visiting the University of Edinburgh working with Philipp Koehn. My home department is the Language Technologies Institute at Carnegie Mellon where I am a PhD student advised by Alon Lavie. My interests are machine translation, machine learning, distributed systems, and theoretical computer science.

I work on efficient language model intersection, particularly for machine translation. Language models are widely applied in natural language processing and make output more fluent. Language model performance (speed, memory, and accuracy) substantially impacts overall system performance. My open-source code, dubbed KenLM, is simultaneously faster and smaller than other packages, and at least as accurate, in common cases.

Previously, I worked on system combination for machine translation. System combination builds on top of other translation systems (e.g. Babelfish and Google Translate) to produce one improved translation. The 2011 Workshop on Machine Translation invited system combination teams at various universities to submit translations and asked human judges to rank their quality. The workshop found that human judges preferred my submission in six of eight language pairs. The code is open-source.

Before Carnegie Mellon, I worked at Google on Book Search and Picasa, at Caltech in Netlab and GALEX while earning a BSc in Mathematics and Computer Science, and in Bangalore at Infosys as a research intern. My Curriculum Vitæ is available in HTML and PDF.

Publications

The University of Edinburgh
Paper, Poster
Heafield, Hoang, Koehn, Kiso, and Federico. Left Language Model State for Syntactic Machine Translation. Proc. International Workshop on Spoken Language Translation, San Francisco, CA, December 8-9, 2011.
Carnegie Mellon University
Paper, Talk, and Code
Heafield. KenLM: Faster and Smaller Language Model Queries. Proc. EMNLP 2011 Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, July 30-31, 2011.
Paper, Poster
Heafield and Lavie. CMU System Combination in WMT 2011. Proc. EMNLP 2011 Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, July 30-31, 2011.
Paper
Heafield and Lavie. Voting on N-grams for Machine Translation System Combination. Proc. Ninth Conference of the Association for Machine Translation in the Americas, Denver, Colorado, October 31–November 5, 2010.
Paper, Poster, Boaster, and Evaluation
Heafield and Lavie. CMU Multi-Engine Machine Translation for WMT 2010. Proc. ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, Uppsala, Sweden, July 15–16, 2010. In the evaluation, my submission (cmu-heafield-combo) received 6 wins, more than any other submission received.
Paper, Presentation, and Code
Heafield and Lavie. Combining Machine Translation Output with Open Source: The Carnegie Mellon Multi-Engine Machine Translation Scheme. The Prague Bulletin of Mathematical Linguistics 93, pages 27–36, 2010. ISBN 978-80-904175-4-0. doi: 10.2478/v10108-010-0008-4.
Description, Presentation, and Evaluation
Heafield. CMU-StatXfer Group System Combination. Proc. NIST Open MT Workshop 2009 at MT Summit XII, Ottawa, Canada, August 31–September 1, 2009. I also did Arabic and formal system combination; the system descriptions for these are similar.¹
Paper, Poster, and Evaluation
Heafield, Hanneman, and Lavie. Machine Translation System Combination with Flexible Word Ordering. Proc. EACL 2009 Fourth Workshop on Statistical Machine Translation, Athens, Greece, March 30–31, 2009.
Google
Patent
Curtis and Heafield, 2008. Systems and Methods for Identifying Similar Documents. US Patent 7,958,136.
Infosys
Paper and Patent Application
Rama, Sarkar, and Heafield. Mining Business Topics in Source Code using Latent Dirichlet Allocation. Proc. 1st India Software Engineering Conference, pages 113–120, Hyderabad, India, February 19–22, 2008.²
Caltech
Poster
Browne, Wheatley, Welsh, Seibert, Heafield, Rich, and the GALEX Science Team. RR Lyrae Stars in the Far Ultraviolet: GALEX Observations Compared with Theoretical Predictions. Bulletin of the American Astronomical Society, January, 2006.
Journal Paper
Welsh, Wheatley, Heafield, Seibert, et al. The GALEX Ultraviolet Variability Catalog. The Astronomical Journal 130, pages 825–831. 2005.
Poster
Welsh, Wheatley, Heafield, Seibert, Browne, and the GALEX Science Team. The Flaring UV Sky. Bulletin of the American Astronomical Society, January, 2005.

Reports

National Science Foundation Graduate Research Fellowship

Since August 2008, I have been a National Science Foundation Graduate Research Fellow.³
Past Research
Application essay about my past research
Desire
Application essay about wanting to be a graduate student
Plan
A viable research plan in natural language processing

Google

From March 2007 to August 2008, I worked at Google as a Software Engineer on Picasa Web Albums and Google Book Search. To share Google's approach to distributed systems, I lectured on the Hadoop MapReduce framework as part of a 3-day class at MIT. I wrote and delivered the introduction, basic join, and entropy lectures.⁴ The employees involved received a Site Award and a Peer Bonus.
Intro
Intended to follow a lecture on MapReduce theory, this introduces basic Hadoop programming
Diff
A few slides to explain reducers as joining data from separate sources
k-Means
Run-through of the Hadoop API followed by k-means clustering
Entropy
Introduces an entropy-based word weighting scheme and uses it to motivate performance strategies
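
As a rough illustration of the idea in the Diff slides, a reducer can join records from two sources once the mapper has tagged each record with where it came from. The sketch below simulates this in plain Python. It is not code from the lectures and does not use the Hadoop API; the function names and the sample data are hypothetical.

```python
from collections import defaultdict

def map_phase(records, source):
    """Tag each (key, value) record with the name of its source."""
    for key, value in records:
        yield key, (source, value)

def shuffle(pairs):
    """Group tagged values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups

def reduce_join(key, tagged_values):
    """Pair every 'left' value with every 'right' value sharing this key."""
    left = [value for source, value in tagged_values if source == "left"]
    right = [value for source, value in tagged_values if source == "right"]
    return [(key, l, r) for l in left for r in right]

def join(left_records, right_records):
    """Inner join of two record sets, expressed as map, shuffle, and reduce."""
    pairs = list(map_phase(left_records, "left"))
    pairs += list(map_phase(right_records, "right"))
    joined = []
    for key, tagged in sorted(shuffle(pairs).items()):
        joined.extend(reduce_join(key, tagged))
    return joined

# Hypothetical sample data: photo owners and photo captions keyed by photo id.
owners = [(1, "alice"), (2, "bob"), (3, "carol")]
captions = [(1, "sunset"), (3, "harbor"), (3, "bridge")]
result = join(owners, captions)
print(result)
```

Keys that appear in only one source (photo 2 here) produce no output, so this behaves like an inner join; emitting unmatched left values from `reduce_join` would turn it into an outer join.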

Netlab

In 2005, I worked for Netlab at Caltech as a Richard and Dena Krown Summer Undergraduate Research Fellow. Professor Low hired me after the summer and I continued until my Infosys internship in June 2006. These reports were prepared for the fellowship.
Paper and Presentation
Heafield, 2005. Detecting Network Anomalies With Kernel Principal Component Analysis.
Proposal
Heafield and Low, 2005. Locality Preservation in Manifolds to Reduce Dimensionality. Accepted for Summer Undergraduate Research Fellowship 2005.

Galaxy Evolution Explorer

Galaxy Evolution Explorer (GALEX) is a NASA satellite observatory with science operations at Caltech. Starting in 2004 as a Summer Undergraduate Research Fellow, I found about 90 variable stars and asteroids in their 193 million measurements. They hired me to continue working with their data until I graduated in March 2007. Results are published and therefore listed under Publications, above.
Presentation
Heafield and Seibert, 2004. Transiting and Variable Objects: A Search Through Galaxy Evolution Explorer Observations.

Information Management Systems and Services

I worked for Caltech's IT department as a student representative and later as a security tester. They hired me as a security tester after I sent them this video:
Exploit
As part of a class project to make a course registration system, I found a simple hole in Caltech's production system. This shows how to use my roommate's login to read my grades. It has been patched.