University of Edinburgh
kheafield.com
10 Crichton Street
Edinburgh EH8 9AB
United Kingdom
-
Interests
- Machine translation, language modeling, distributed systems, theoretical computer science
-
Current Positions
- Research Associate, University of Edinburgh
- August–December 2011; August 2012–
- PhD Student, Carnegie Mellon
- August 2008–August 2013
- I am working on my PhD thesis
as staff at the University of Edinburgh with Philipp Koehn and as a PhD student with Carnegie
Mellon advised by Alon Lavie. My thesis focuses on a new hypergraph search algorithm for
efficient syntactic machine translation, building on my efficient language model storage library.
-
Projects
-
-
Hypergraph Search
- Syntactic machine translation decoding consists of two steps: parse the
input sentence into a hypergraph and search the hypergraph for good translations. Search is
commonly done with cube pruning. As part of my thesis, I designed a new search algorithm.
My implementation is currently 1.5–3.5 times as fast as cube pruning. It is available standalone
or with command-line options in Moses and cdec.
-
KenLM
- An efficient language model library. Compared with the widely used SRILM, the
default is 2.4 times as fast while using 57% of the memory. Additional options save more
memory. It is used in many machine translation systems (including Moses, cdec, Joshua,
Phrasal, and Ncode) and in speech recognition.
-
System Combination
- My system combination software won the Workshop on Machine
Translation (WMT) 2011 system combination task in eight of ten language pairs. In WMT
2010, it won six of eight language pairs.
My code is open source (LGPL).
-
Software Familiarity
- Contributed to the Moses, cdec, and Joshua translation systems.
Extensive C++ with Boost, C, Ruby, SQL, Bash, and LaTeX; some Java.
Taught Hadoop; Administered Linux and PostgreSQL; Used MySQL, Octave, and PBS.
-
Awards
- National Science Foundation Graduate Research Fellowship
- 2008–11
- $121,500 in stipend
and tuition over three years
- Google Peer Bonus and Site Award
- 2008
- For lecturing at MIT
on Hadoop while a Software Engineer at Google
- International Collegiate Programming Contest Regional
- 2006–07
- Ranked third of fifty in a team of two instead of three
- Carnation Scholarship
- 2005–06
- Full Caltech tuition academic merit scholarship, 38 awarded per year
- Richard and Dena Krown Summer Undergraduate Research Fellowship
- 2005
- $5,000 for ten weeks of summer research
in networking
- Summer Undergraduate Research Fellowship
- 2004
- $5,000 for ten weeks of summer research
in astronomy
-
Background
- Bachelor of Science, Caltech
- September 2003–March 2007
- Double major in Mathematics and
Computer Science; 3.8/4.0 GPA, with honors. Courses focused on formal language theory, distributed
systems, information theory, and combinatorics. I did three internships: two with Caltech research labs
and one with Infosys in Bangalore. The IT department hired me as a dormitory technician and security
tester. Student government appointed me to the university-wide Computing Advisory
Committee. Lastly, I finished a quarter early and went to work for Google.
- Google
- March 2007–August 2008
- As a Software Engineer with Google Book Search, I worked on a team
that uses machine learning to compile card catalogs from multiple sources into a single
coherent catalog of books. Previously, I created the scoring system behind a search function
in Picasa Web Albums. To share Google’s approach to distributed systems, I lectured
at MIT on the Hadoop MapReduce framework.
- Infosys Technologies
- July–September 2006
- I traveled to Bangalore, India, to intern with the research division of Infosys, India’s
second-largest software outsourcing company. We investigated automatic reorganization of
legacy source code. Specifically, I applied and customized Latent Dirichlet Allocation to
derive topics from names of functions and local variables. For example, it found SSL and
logging topics in Apache source code while correctly tagging files belonging to both topics.
- Netlab
- June 2005–June 2006
- As a Richard and Dena Krown Summer Undergraduate Research
Fellow, I developed an error model for kernel Principal Component Analysis (kPCA).
Professor Low hired me to continue with implementation during the school year. I applied
it to identify possible attacks in network traffic, which appear as points with unusually
high distance from the manifold learned by kPCA.
- Fastsoft
- January–April 2006
- Netlab
spun off a startup and I worked for them as a contractor. Using FAST TCP, the Netlab
algorithm responsible for breaking Internet speed records, their Aria product accelerates
connections passing through it. This allows senders to use high performance networks
more efficiently without custom operating systems. I set up experiments and worked on the
performance monitoring and configuration interface.
- Galaxy Evolution Explorer
- June 2004–March 2007
- I started working for the Galaxy Evolution Explorer (GALEX) project as a
Summer Undergraduate Research Fellow. My goal was finding variable stars and asteroids
in observations made by their satellite. To do so, I created a database of all 193 million
source measurements and used it to find and analyze over ninety variable objects. The
findings were reported in two posters and one journal article. After the summer, they hired
me to continue working on the database and to help scientists find interesting data.
-
Publications
-
- Kenneth Heafield. Efficient Statistical Machine Translation Decoding via Improved Language
Modeling. PhD thesis proposal. August, 2012.
- Kenneth Heafield, Philipp Koehn, and Alon Lavie. Grouping Language Model Boundary
Words to Speed K-Best Extraction from Hypergraphs. 2013 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies,
Atlanta, Georgia, USA, June, 2013.
- Kenneth Heafield, Philipp Koehn, and Alon Lavie. Language Model Rest Costs and
Space-Efficient Storage. Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning, Jeju Island, Korea, July, 2012.
- Kenneth Heafield, Hieu Hoang, Philipp Koehn, Tetsuo Kiso, and Marcello Federico. Left
Language Model State for Syntactic Machine Translation. International Workshop on Spoken
Language Translation, San Francisco, California, USA, December, 2011.
- Kenneth Heafield. KenLM: Faster and Smaller Language Model Queries. Sixth Workshop on
Statistical Machine Translation, Edinburgh, Scotland, United Kingdom, July, 2011.
- Kenneth Heafield and Alon Lavie. CMU System Combination in WMT 2011. Sixth Workshop
on Statistical Machine Translation, Edinburgh, Scotland, United Kingdom, July, 2011.
- Kenneth Heafield and Alon Lavie. Voting on N-grams for Machine Translation System
Combination. Ninth Conference of the Association for Machine Translation in the Americas,
Denver, Colorado, USA, November, 2010.
- Kenneth Heafield and Alon Lavie. CMU Multi-Engine Machine Translation for WMT
2010. Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, Uppsala,
Sweden, July, 2010.
- Kenneth Heafield and Alon Lavie. Combining Machine Translation Output with Open Source:
The Carnegie Mellon Multi-Engine Machine Translation Scheme. The Prague Bulletin of
Mathematical Linguistics 93. January, 2010.
- Jon Clark, Jonathan Weese, Byung Gyu Ahn, Andreas Zollmann, Qin Gao, Kenneth Heafield,
and Alon Lavie. The Machine Translation Toolpack for LoonyBin: Automated Management
of Experimental Machine Translation HyperWorkflows. The Prague Bulletin of Mathematical
Linguistics 93. January, 2010.
- Kenneth Heafield. CMU-StatXfer Group System Combination. NIST Open MT Workshop at
MT Summit XII, Ottawa, Canada, September, 2009.
- Kenneth Heafield, Greg Hanneman, and Alon Lavie. Machine Translation System
Combination with Flexible Word Ordering. Fourth Workshop on Statistical Machine
Translation, Athens, Greece, March, 2009.
- Girish Maskeri Rama, Kenneth Heafield, and Santonu Sarkar. Identification of Topics in Source
Code. US Patent 8209665 filed in 2009 and issued June, 2012.
- Girish Maskeri, Santonu Sarkar, and Kenneth Heafield. Mining Business Topics in Source Code
using Latent Dirichlet Allocation. 1st India Software Engineering Conference, Hyderabad,
India, February, 2008.
- Taylor Curtis and Kenneth Heafield. Systems and Methods for Identifying Similar Documents.
US Patent 7958136 filed in 2008 and issued June, 2011.
- Stanley Browne, Jonathan Wheatley, Barry Welsh, Mark Seibert, Kenneth Heafield, R.
Michael Rich, and the GALEX Science Team. RR Lyrae Stars in the Far Ultraviolet: GALEX
Observations Compared with Theoretical Predictions. American Astronomical Society 207th
Meeting, Washington, DC, USA, June, 2006.
- Barry Welsh, Jonathan Wheatley, Kenneth Heafield, Mark Seibert, and the GALEX Science
Team. The GALEX Ultraviolet Variability Catalog. The Astronomical Journal 130. 2005.
- Barry Welsh, Jonathan Wheatley, Kenneth Heafield, Mark Seibert, Stanley Browne, and the
GALEX Science Team. The Flaring UV Sky. American Astronomical Society 205th Meeting,
San Diego, California, USA, January, 2005.
-
Teaching
- March 2013 Tutorial: Language Modeling with KenLM, Qatar Computing Research Institute
- March 2013 Guest Course Lecture: Machine Translation, Carnegie Mellon
- October 2012 Guest Course Lecture: Advanced NLP, University of Edinburgh
- September 2012 Tutorial: Chart Based Decoding, MT Marathon
- Spring 2012 Teaching Assistant: Language and Statistics, Carnegie Mellon
- September 2011 Tutorial: Language Modeling, MT Marathon
- Fall 2010 Teaching Assistant: Algorithms for NLP, Carnegie Mellon
-
Program Committees
- 2013 NAACL
- 2012 Coling, EMNLP, EACL
- 2011–12 Workshop on Machine Translation
- 2011 Transactions on Asian Language Information Processing, MT Journal