Prerequisites
Cunei requires Java 6. In
addition, your system must have Apache Ant installed to use
the automated build scripts. Please make sure both of these
components are installed before continuing. These are the only
core requirements to run Cunei, but in order to build a new system
we will use a few other tools as specified in later sections.
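If you want to sanity-check your environment first, a small shell snippet will do (a convenience sketch, not part of Cunei; the check_tool function name is ours):

```shell
# Report whether each required tool is on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found -- please install it before continuing"
    fi
}
check_tool java
check_tool ant
```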
Install Cunei
The source code for Cunei is hosted in a Subversion repository. If
you have Subversion installed, run the following command to check out
revision 380 of Cunei into a directory of your choice (shown here as
/path/to/cunei).
svn co -r380 https://svn.cunei.org/svnroot/cunei /path/to/cunei
Throughout this tutorial we will run commands and reference files
relative to Cunei's base directory. Please change your working
directory now.
cd /path/to/cunei/
The Subversion repository only contains a copy of the source code.
(If you are inclined to peek under the covers, feel free to browse
the src directory.) Type the command
below to run Apache Ant and compile the source code.
ant
Assuming all went well, a few seconds later you should be greeted
with a BUILD SUCCESSFUL message.
Prepare the Data
Download and extract the French-English Europarl corpus with the
commands below. This will create a new directory, work/europarl,
containing the corpus.
mkdir -p work/europarl
wget -P work http://statmt.org/europarl/v5/fr-en.tgz
tar -xzf work/fr-en.tgz -C work/europarl
Next, use the following code to create a new Perl script europarl.pl.
#!/usr/bin/perl
use strict;
use warnings;

die "$0 europarl-dir output-prefix src-lang tgt-lang\n" if @ARGV != 4;

my $work_dir   = $ARGV[0];
my $out_prefix = $ARGV[1];
my $src_lang   = $ARGV[2];
my $tgt_lang   = $ARGV[3];

my $src_dir = "$work_dir/$src_lang-$tgt_lang/$src_lang";
my $tgt_dir = "$work_dir/$src_lang-$tgt_lang/$tgt_lang";

# Count each basename; files present in both languages get a count of 2.
my %files;
foreach my $file (<$src_dir/*.txt>) {
    $file =~ s/^.*\///;
    $file =~ s/\.txt$//;
    $files{$file}++;
}
foreach my $file (<$tgt_dir/*.txt>) {
    $file =~ s/^.*\///;
    $file =~ s/\.txt$//;
    $files{$file}++;
}

open TRAIN_SRC_LEX, "|gzip >${out_prefix}train.$src_lang.lex.gz" or die $!;
open TRAIN_TGT_LEX, "|gzip >${out_prefix}train.$tgt_lang.lex.gz" or die $!;
open TRAIN_DOC,     "|gzip >${out_prefix}train.doc.gz"           or die $!;
open DEV_SRC_LEX,   ">${out_prefix}dev.$src_lang.lex"            or die $!;
open DEV_TGT_LEX,   ">${out_prefix}dev.$tgt_lang.lex.0"          or die $!;
open DEV_DOC,       ">${out_prefix}dev.doc"                      or die $!;
open TEST_SRC_LEX,  ">${out_prefix}test.$src_lang.lex"           or die $!;
open TEST_TGT_LEX,  ">${out_prefix}test.$tgt_lang.lex.0"         or die $!;
open TEST_DOC,      ">${out_prefix}test.doc"                     or die $!;

while(my ($file, $count) = each(%files)) {
    next if $count != 2;    # skip files missing from either language
    open SRC_TXT, "$src_dir/$file.txt" or die $!;
    open TGT_TXT, "$tgt_dir/$file.txt" or die $!;
    my $doc = $file;
    while(my $src_line = <SRC_TXT>) {
        my $tgt_line = <TGT_TXT>;
        chomp $src_line;
        chomp $tgt_line;
        # Track the current chapter so each sentence carries a document id.
        $doc = "$file:$1" if $src_line =~ /^<chapter\s+id=\"?([0-9]+)\"?>/i;
        $doc = "$file:$1" if $tgt_line =~ /^<chapter\s+id=\"?([0-9]+)\"?>/i;
        # Drop lines that are nothing but markup, then strip inline tags.
        next if $src_line =~ /^<.+>$/ or $tgt_line =~ /^<.+>$/;
        $src_line =~ s/\s*<[^>]+>\s*/ /g;
        $src_line =~ s/^\s+//;
        $src_line =~ s/\s+$//;
        $tgt_line =~ s/\s*<[^>]+>\s*/ /g;
        $tgt_line =~ s/^\s+//;
        $tgt_line =~ s/\s+$//;
        # Hold out the October-December 2000 sessions for tuning and testing.
        if($file !~ /ep-00-1[012]-/) {
            print TRAIN_SRC_LEX "$src_line\n";
            print TRAIN_TGT_LEX "$tgt_line\n";
            print TRAIN_DOC "$doc\n";
        } elsif($file =~ /ep-00-(10-02|11-16)/) {
            print DEV_SRC_LEX "$src_line\n";
            print DEV_TGT_LEX "$tgt_line\n";
            print DEV_DOC "$doc\n";
        } elsif($file =~ /ep-00-(10-23|11-29)/) {
            print TEST_SRC_LEX "$src_line\n";
            print TEST_TGT_LEX "$tgt_line\n";
            print TEST_DOC "$doc\n";
        }
    }
    close SRC_TXT;
    close TGT_TXT;
}

close TRAIN_SRC_LEX;
close TRAIN_TGT_LEX;
close TRAIN_DOC;
close DEV_SRC_LEX;
close DEV_TGT_LEX;
close DEV_DOC;
close TEST_SRC_LEX;
close TEST_TGT_LEX;
close TEST_DOC;
This script massages the Europarl corpus into a format suitable for
our use. While this example uses the French-English portion, Europarl
is available for many language pairs and the script works with any of
them. Additionally, after completing this procedure once, you will see
that the data format we use is very simple. Feel free to use any other
data, but be sure to set aside some of the text for development and
testing.
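The heart of the script is the pair of substitutions that strip inline SGML-style markup and trim surrounding whitespace. The same idea can be sketched as a one-line sed pipe (illustrative only; the real work stays in the Perl script above):

```shell
# Replace each inline <...> tag with a single space, then trim the ends.
echo '<P> the session is resumed . ' | \
    sed -e 's/ *<[^>]*> */ /g' -e 's/^ *//' -e 's/ *$//'
# → the session is resumed .
```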
The following commands will execute the script and create the
prepared Europarl corpus in data/corpora/fr-en/raw:
mkdir -p data/corpora/fr-en/raw
perl europarl.pl work/europarl data/corpora/fr-en/raw/europarl-v5- fr en
rm -fr work/europarl
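As a quick sanity check (our own convenience function, not part of Cunei), the two halves of a parallel corpus must contain the same number of lines:

```shell
# Compare line counts of two (possibly gzipped) corpus halves.
# zcat -f passes plain files through unchanged.
same_length() {
    [ "$(zcat -f "$1" | wc -l)" -eq "$(zcat -f "$2" | wc -l)" ]
}
same_length data/corpora/fr-en/raw/europarl-v5-train.fr.lex.gz \
            data/corpora/fr-en/raw/europarl-v5-train.en.lex.gz \
    && echo "OK: training halves are parallel"
```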
The Cunei distribution includes a configuration file in the
systems/fr-en/default/ directory.
Using the provided configuration, we first calculate
the expected sentence ratios and then process the raw text into
clean, tokenized text:
bin/cunei.sh EstimateSentenceRatios systems/fr-en/default/config \
-debug info -dir data/corpora/fr-en/raw/ \
-sequence-file europarl-v5-train.fr.lex.gz \
-sequence-lang source -sequence-type lexical \
-sequence-file europarl-v5-train.en.lex.gz \
-sequence-lang target -sequence-type lexical
bin/cunei.sh ProcessCorpus systems/fr-en/default/config -debug info \
-input-dir data/corpora/fr-en/raw/ \
-output-dir data/corpora/fr-en/clean \
-sequence-file europarl-v5-train.fr.lex.gz \
-sequence-lang source -sequence-type lexical \
-sequence-file europarl-v5-train.en.lex.gz \
-sequence-lang target -sequence-type lexical \
-docs europarl-v5-train.doc.gz
Estimate Word Alignments
We will use the GIZA++ toolkit to induce word alignments from the
corpus. Download the latest version of GIZA++ from http://code.google.com/p/giza-pp/
and install it. If the GIZA++ tools are not in your $PATH, set the
environment variable $GIZA_HOME to point to the location of your
GIZA++ installation. To simplify the process, a convenience script
is provided to build the appropriate alignments. Note that GIZA++
will take several hours to complete.
export GIZA_HOME=/path/to/GIZA++-v2/
bin/align-giza.sh data/corpora/fr-en/clean/europarl-v5-train.fr.lex.gz \
data/corpora/fr-en/clean/europarl-v5-train.en.lex.gz \
data/corpora/fr-en/clean/europarl-v5-train.giza
When the script completes, data/corpora/fr-en/clean/europarl-v5-train.giza
will be populated with several files. The files s.A3.final.gz and t.A3.final.gz are the Viterbi word alignments
from GIZA++ for P(s|t) and P(t|s) respectively. We will use these
files in the next step.
Index the Corpus
In this step we will index the text files we processed earlier. In
order to avoid re-processing these files, edit the configuration
systems/fr-en/default/config and comment
out the lines that begin with 'Processors' using the '#' character
as shown below.
Processors.Source.Input.Text: #Canonicalizer Lowercaser [...elided...]
Processors.SourceTarget.Input.Sentence: #SentenceEliminator
Processors.Target.Input.Text: #Canonicalizer Lowercaser [...elided...]
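If you prefer not to edit the file by hand, a sed substitution can insert the comment character for you (a sketch; it assumes the 'Key: value' layout shown above, and the expr variable is ours):

```shell
# expr inserts '#' after the colon on every Processors.*.Input.* line.
# Apply it to the real file (with a backup) via:
#   sed -i.bak -E "$expr" systems/fr-en/default/config
expr='s/^(Processors\.[A-Za-z]+\.Input\.[A-Za-z]+:)[[:space:]]*/\1 #/'
echo 'Processors.Source.Input.Text: Canonicalizer Lowercaser' | sed -E "$expr"
# → Processors.Source.Input.Text: #Canonicalizer Lowercaser
```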
Now run the following command to index the corpus. Expect this to
take about an hour.
bin/cunei.sh IndexCorpus systems/fr-en/default/config \
-dir data/corpora/fr-en/clean/ \
-alignment-file europarl-v5-train.giza/s.A3.final.gz \
-alignment-lang source \
-alignment-file europarl-v5-train.giza/t.A3.final.gz \
-alignment-lang target \
-sequence-file europarl-v5-train.fr.lex.gz \
-sequence-lang source -sequence-type lexical \
-sequence-file europarl-v5-train.en.lex.gz \
-sequence-lang target -sequence-type lexical \
-docs europarl-v5-train.doc.gz -debug info
Once indexing is complete, uncomment the
'Processors.Source.Input.Text' and 'Processors.Target.Input.Text'
lines (as shown previously). This will ensure future text files
are processed appropriately.
(Re-)Estimate Lexicons and Alignments
The word-alignments from GIZA++ are not weighted. Run the
following command to estimate lexicons for P(s|t) and
P(t|s). These probabilities are incorporated into the alignments
and are also used at run-time to calculate lexical features.
bin/cunei.sh EstimateLexiconAlignment systems/fr-en/default/config \
-debug info
Build a Language Model
We use SRILM to estimate n-gram probabilities. Download the latest
version of SRILM from
http://www.speech.sri.com/projects/srilm/download.html and
install it. If the SRILM tools are not in your $PATH then set the environment variable $SRILM_HOME to point to the location of your
SRILM installation. Usually it's best to build a language model
from a larger monolingual corpus, but for this tutorial we'll just
use the English half of our bilingual corpus to build a 5-gram
model. The output of SRILM is then fed to Cunei which indexes and
stores it to disk in a format that is faster to load. Once Cunei
has indexed the language model, the SRILM file is no longer
necessary and may be erased.
export SRILM_HOME=/path/to/srilm/
mkdir -p data/lm/en/
zcat data/corpora/fr-en/clean/europarl-v5-train.en.lex.gz | \
bin/build-srilm.sh data/lm/en/europarl-v5-train.5-gram.srilm.gz 5
bin/cunei.sh IndexLanguageModel systems/fr-en/default/config -debug info \
-input data/lm/en/europarl-v5-train.5-gram.srilm.gz -model Default
Translate
If you successfully completed all the steps above you should have
a working system that can produce translations. Let's give it a
whirl!
To check that everything is working, generate the translation
lattice with the command below. This will output the many possible
translations found for each word or phrase in the test sentence.
echo "Je voudrais une bouteille d'eau" | \
bin/cunei.sh Translate systems/fr-en/default/config -debug info
Now try generating full-sentence translations with the command
below. This will output the top four translations of the sentence.
echo "Je voudrais une bouteille d'eau" | \
bin/cunei.sh Decode systems/fr-en/default/config -nbest 4 -debug info
Optimize
The default parameters are not adjusted for any particular
language pair and, thus, are quite poor. Now it's time to remedy
that. The following command will optimize Cunei's settings and
significantly improve the quality of translations. The test
document will be translated over and over, each time adjusting
Cunei's parameters in order to produce translations that match
the human-translated document(s) as closely as possible. This
process will use a lot of memory and may take a few days to
complete.
bin/cunei.sh Optimize systems/fr-en/default/config \
-input data/corpora/fr-en/raw/europarl-v5-dev.fr.lex \
-ref data/corpora/fr-en/raw/europarl-v5-dev.en.lex.0 \
-output opt.log -debug info
Info messages will appear on the console, but the actual output
will be logged in the file opt.log. When
optimization is complete a new configuration file systems/fr-en/default/config.opt will be
created for you with the optimized parameters. From now on, use
this new configuration file for translation.
Congratulations
Congratulations on completing the Cunei tutorial. Feedback is
welcome. If something didn't quite work right or you have
suggestions for improvement please send mail to