Prerequisites
Cunei requires Java 6. In
addition, your system must have Apache Ant installed to use
the automated build scripts. Please make sure both of these
components are installed before continuing. These are the only
core requirements to run Cunei, but in order to build a new system
we will use a few other tools as specified in later sections.
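If you want to sanity-check your environment first, a small shell snippet will do (a convenience sketch, not part of Cunei; the check_tool function name is ours):

```shell
# Report whether each required tool is on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found -- please install it before continuing"
    fi
}
check_tool java
check_tool ant
```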
Install Cunei
The source code for Cunei is hosted in a Subversion repository. If
you have Subversion installed, run the following command to check out
revision 380 of Cunei into a directory of your choice (shown here as
/path/to/cunei).
svn co -r380 https://svn.cunei.org/svnroot/cunei /path/to/cunei
Throughout this tutorial we will run commands and reference files
relative to Cunei's base directory. Please change your working
directory now.
cd /path/to/cunei/
The Subversion repository only contains a copy of the source code.
(If you are inclined to peek under the covers, feel free to browse
the src directory.) Type the command
below to run Apache Ant and compile the source code.
ant
Assuming all went well, a few seconds later you should be greeted
with a BUILD SUCCESSFUL message.
Prepare the Data
Download and extract the French-English Europarl corpus with the
commands below. This will create a new directory, work/europarl,
containing the corpus.
mkdir -p work/europarl
wget -P work http://statmt.org/europarl/v5/fr-en.tgz
tar -xzf work/fr-en.tgz -C work/europarl
Next, use the following code to create a new Perl script europarl.pl.
#!/usr/bin/perl
use strict;
use warnings;

die "$0 europarl-dir output-prefix src-lang tgt-lang\n" if @ARGV != 4;

my $work_dir   = $ARGV[0];
my $out_prefix = $ARGV[1];
my $src_lang   = $ARGV[2];
my $tgt_lang   = $ARGV[3];

my $src_dir = "$work_dir/$src_lang-$tgt_lang/$src_lang";
my $tgt_dir = "$work_dir/$src_lang-$tgt_lang/$tgt_lang";

# Count each basename; files present in both languages get a count of 2.
my %files;
foreach my $file (<$src_dir/*.txt>) {
    $file =~ s/^.*\///;
    $file =~ s/\.txt$//;
    $files{$file}++;
}
foreach my $file (<$tgt_dir/*.txt>) {
    $file =~ s/^.*\///;
    $file =~ s/\.txt$//;
    $files{$file}++;
}

open TRAIN_SRC_LEX, "|gzip >${out_prefix}train.$src_lang.lex.gz" or die $!;
open TRAIN_TGT_LEX, "|gzip >${out_prefix}train.$tgt_lang.lex.gz" or die $!;
open TRAIN_DOC,     "|gzip >${out_prefix}train.doc.gz"           or die $!;
open DEV_SRC_LEX,   ">${out_prefix}dev.$src_lang.lex"            or die $!;
open DEV_TGT_LEX,   ">${out_prefix}dev.$tgt_lang.lex.0"          or die $!;
open DEV_DOC,       ">${out_prefix}dev.doc"                      or die $!;
open TEST_SRC_LEX,  ">${out_prefix}test.$src_lang.lex"           or die $!;
open TEST_TGT_LEX,  ">${out_prefix}test.$tgt_lang.lex.0"         or die $!;
open TEST_DOC,      ">${out_prefix}test.doc"                     or die $!;

while(my ($file, $count) = each(%files)) {
    next if $count != 2;    # skip files missing from either language
    open SRC_TXT, "$src_dir/$file.txt" or die $!;
    open TGT_TXT, "$tgt_dir/$file.txt" or die $!;
    my $doc = $file;
    while(my $src_line = <SRC_TXT>) {
        my $tgt_line = <TGT_TXT>;
        chomp $src_line;
        chomp $tgt_line;
        # Track the current chapter so each sentence carries a document id.
        $doc = "$file:$1" if $src_line =~ /^<chapter\s+id=\"?([0-9]+)\"?>/i;
        $doc = "$file:$1" if $tgt_line =~ /^<chapter\s+id=\"?([0-9]+)\"?>/i;
        # Drop lines that are nothing but markup, then strip inline tags.
        next if $src_line =~ /^<.+>$/ or $tgt_line =~ /^<.+>$/;
        $src_line =~ s/\s*<[^>]+>\s*/ /g;
        $src_line =~ s/^\s+//;
        $src_line =~ s/\s+$//;
        $tgt_line =~ s/\s*<[^>]+>\s*/ /g;
        $tgt_line =~ s/^\s+//;
        $tgt_line =~ s/\s+$//;
        # Hold out the October-December 2000 sessions for tuning and testing.
        if($file !~ /ep-00-1[012]-/) {
            print TRAIN_SRC_LEX "$src_line\n";
            print TRAIN_TGT_LEX "$tgt_line\n";
            print TRAIN_DOC "$doc\n";
        } elsif($file =~ /ep-00-(10-02|11-16)/) {
            print DEV_SRC_LEX "$src_line\n";
            print DEV_TGT_LEX "$tgt_line\n";
            print DEV_DOC "$doc\n";
        } elsif($file =~ /ep-00-(10-23|11-29)/) {
            print TEST_SRC_LEX "$src_line\n";
            print TEST_TGT_LEX "$tgt_line\n";
            print TEST_DOC "$doc\n";
        }
    }
    close SRC_TXT;
    close TGT_TXT;
}

close TRAIN_SRC_LEX;
close TRAIN_TGT_LEX;
close TRAIN_DOC;
close DEV_SRC_LEX;
close DEV_TGT_LEX;
close DEV_DOC;
close TEST_SRC_LEX;
close TEST_TGT_LEX;
close TEST_DOC;
This script massages the Europarl corpus into a format suitable for
our use. While this example uses the French-English portion, Europarl
is available for many language pairs and the script works with any of
them. Additionally, after completing this procedure once, you will see
that the data format we use is very simple. Feel free to use any other
data, but be sure to set aside some of the text for development and
testing.
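The heart of the script is the pair of substitutions that strip inline SGML-style markup and trim surrounding whitespace. The same idea can be sketched as a one-line sed pipe (illustrative only; the real work stays in the Perl script above):

```shell
# Replace each inline <...> tag with a single space, then trim the ends.
echo '<P> the session is resumed . ' | \
    sed -e 's/ *<[^>]*> */ /g' -e 's/^ *//' -e 's/ *$//'
# → the session is resumed .
```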
The following commands will execute the script and create the
prepared Europarl corpus in data/corpora/fr-en/raw:
mkdir -p data/corpora/fr-en/raw
perl europarl.pl work/europarl data/corpora/fr-en/raw/europarl-v5- fr en
rm -fr work/europarl
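As a quick sanity check (our own convenience function, not part of Cunei), the two halves of a parallel corpus must contain the same number of lines:

```shell
# Compare line counts of two (possibly gzipped) corpus halves.
# zcat -f passes plain files through unchanged.
same_length() {
    [ "$(zcat -f "$1" | wc -l)" -eq "$(zcat -f "$2" | wc -l)" ]
}
same_length data/corpora/fr-en/raw/europarl-v5-train.fr.lex.gz \
            data/corpora/fr-en/raw/europarl-v5-train.en.lex.gz \
    && echo "OK: training halves are parallel"
```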
The Cunei distribution includes a configuration file in the
systems/fr-en/default/ directory.
Using the provided configuration, we first calculate
the expected sentence ratios and then process the raw text into
clean, tokenized text:
bin/cunei.sh EstimateSentenceRatios systems/fr-en/default/config \
-debug info -dir data/corpora/fr-en/raw/ \
-sequence-file europarl-v5-train.fr.lex.gz \
-sequence-lang source -sequence-type lexical \
-sequence-file europarl-v5-train.en.lex.gz \
-sequence-lang target -sequence-type lexical
bin/cunei.sh ProcessCorpus systems/fr-en/default/config -debug info \
-input-dir data/corpora/fr-en/raw/ \
-output-dir data/corpora/fr-en/clean \
-sequence-file europarl-v5-train.fr.lex.gz \
-sequence-lang source -sequence-type lexical \
-sequence-file europarl-v5-train.en.lex.gz \
-sequence-lang target -sequence-type lexical \
-docs europarl-v5-train.doc.gz
Estimate Word Alignments
We will use the GIZA++ toolkit to induce word alignments from the
corpus. Download the latest version of GIZA++ from http://code.google.com/p/giza-pp/
and install it. If the GIZA++ tools are not in your $PATH, set the
environment variable $GIZA_HOME to point to the location of your
GIZA++ installation. To simplify the process, a convenience script
is provided to build the appropriate alignments. Note that GIZA++
will take several hours to complete.
export GIZA_HOME=/path/to/GIZA++-v2/
bin/align-giza.sh data/corpora/fr-en/clean/europarl-v5-train.fr.lex.gz \
data/corpora/fr-en/clean/europarl-v5-train.en.lex.gz \
data/corpora/fr-en/clean/europarl-v5-train.giza
When the script completes, data/corpora/fr-en/clean/europarl-v5-train.giza
will be populated with several files. The files s.A3.final.gz and t.A3.final.gz are the Viterbi word alignments
from GIZA++ for P(s|t) and P(t|s) respectively. We will use these
files in the next step.
Index the Corpus
In this step we will index the text files we processed earlier. In
order to avoid re-processing these files, edit the configuration
systems/fr-en/default/config and comment
out the lines that begin with 'Processors' using the '#' character
as shown below.
Processors.Source.Input.Text: #Canonicalizer Lowercaser [...elided...]
Processors.SourceTarget.Input.Sentence: #SentenceEliminator
Processors.Target.Input.Text: #Canonicalizer Lowercaser [...elided...]
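If you prefer not to edit the file by hand, a sed substitution can insert the comment character for you (a sketch; it assumes the 'Key: value' layout shown above, and the expr variable is ours):

```shell
# expr inserts '#' after the colon on every Processors.*.Input.* line.
# Apply it to the real file (with a backup) via:
#   sed -i.bak -E "$expr" systems/fr-en/default/config
expr='s/^(Processors\.[A-Za-z]+\.Input\.[A-Za-z]+:)[[:space:]]*/\1 #/'
echo 'Processors.Source.Input.Text: Canonicalizer Lowercaser' | sed -E "$expr"
# → Processors.Source.Input.Text: #Canonicalizer Lowercaser
```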
Now run the following command to index the corpus. Expect this to
take about an hour.
bin/cunei.sh IndexCorpus systems/fr-en/default/config \
-dir data/corpora/fr-en/clean/ \
-alignment-file europarl-v5-train.giza/s.A3.final.gz \
-alignment-lang source \
-alignment-file europarl-v5-train.giza/t.A3.final.gz \
-alignment-lang target \
-sequence-file europarl-v5-train.fr.lex.gz \
-sequence-lang source -sequence-type lexical \
-sequence-file europarl-v5-train.en.lex.gz \
-sequence-lang target -sequence-type lexical \
-docs europarl-v5-train.doc.gz -debug info
Once indexing is complete, uncomment the
'Processors.Source.Input.Text' and 'Processors.Target.Input.Text'
lines (as shown previously). This will ensure future text files
are processed appropriately.
(Re-)Estimate Lexicons and Alignments
The word-alignments from GIZA++ are not weighted. Run the
following command to estimate lexicons for P(s|t) and
P(t|s). These probabilities are incorporated into the alignments
and are also used at run-time to calculate lexical features.
bin/cunei.sh EstimateLexiconAlignment systems/fr-en/default/config \
-debug info
Build a Language Model
We use SRILM to estimate n-gram probabilities. Download the latest
version of SRILM from
http://www.speech.sri.com/projects/srilm/download.html and
install it. If the SRILM tools are not in your $PATH then set the environment variable $SRILM_HOME to point to the location of your
SRILM installation. Usually it's best to build a language model
from a larger monolingual corpus, but for this tutorial we'll just
use the English half of our bilingual corpus to build a 5-gram
model. The output of SRILM is then fed to Cunei which indexes and
stores it to disk in a format that is faster to load. Once Cunei
has indexed the language model, the SRILM file is no longer
necessary and may be erased.
export SRILM_HOME=/path/to/srilm/
mkdir -p data/lm/en/
zcat data/corpora/fr-en/clean/europarl-v5-train.en.lex.gz | \
bin/build-srilm.sh data/lm/en/europarl-v5-train.5-gram.srilm.gz 5
bin/cunei.sh IndexLanguageModel systems/fr-en/default/config -debug info \
-input data/lm/en/europarl-v5-train.5-gram.srilm.gz -model Default
Translate
If you successfully completed all the steps above you should have
a working system that can produce translations. Let's give it a
whirl!
To check that everything is working, generate the translation
lattice with the command below. This will output the many possible
translations found for each word or phrase in the test sentence.
echo "Je voudrais une bouteille d'eau" | \
bin/cunei.sh Translate systems/fr-en/default/config -debug info
Now try generating full-sentence translations with the command
below. This will output the top four translations of the sentence.
echo "Je voudrais une bouteille d'eau" | \
bin/cunei.sh Decode systems/fr-en/default/config -nbest 4 -debug info
Optimize
The default parameters are not adjusted for any particular
language pair and, thus, are quite poor. Now it's time to remedy
that. The following command will optimize Cunei's settings and
significantly improve the quality of translations. The test
document will be translated over and over, each time adjusting
Cunei's parameters in order to produce translations that match
the human-translated document(s) as closely as possible. This
process will use a lot of memory and may take a few days to
complete.
bin/cunei.sh Optimize systems/fr-en/default/config \
-input data/corpora/fr-en/raw/europarl-v5-dev.fr.lex \
-ref data/corpora/fr-en/raw/europarl-v5-dev.en.lex.0 \
-output opt.log -debug info
Info messages will appear on the console, but the actual output
will be logged in the file opt.log. When
optimization is complete a new configuration file systems/fr-en/default/config.opt will be
created for you with the optimized parameters. From now on, use
this new configuration file for translation.
Congratulations
Congratulations on completing the Cunei tutorial. Feedback is
welcome. If something didn't quite work right or you have
suggestions for improvement please send mail to