r21 - 28 Oct 2016 - 14:42:50 - RomanYangarber

Etymon: Description of Java-implementation

Look here to see how it works

How to start

Revisions:

  1. Check out the main Java project, EtyMalign. Do not use the main branch; use Lv's branch:
    • svn co svn+ssh://melkinpaasi.cs.helsinki.fi/home/group/langtech/svnroot/etymon/java/EtyMalign_Lv/EtyMalign
  2. Check out the Java project for data extraction, EtymologyDataHandler:
    • svn co svn+ssh://melkinpaasi.cs.helsinki.fi/home/group/langtech/svnroot/etymon/java/EtymologyDataHandler

The program needs three input files. Check that you have access to all of them:

  1. /group/home/langtech/Etymology-Project/StarLing/starling-top-dialects.utf8
    • This is the Starling data input file; start with this file.
    • If needed, new input can be produced using the EtymologyDataHandler package
    • For Starling data, use etymologydata.starling.StarlingMain.java
    • For SSA data, use etymologydata.ssa.SSAMain.java
    • SSA data for some languages can be found at /group/home/langtech/Etymology-Project/SSA/ssa-all.utf8
    • If you want to prepare your own data, see the Input File section below for the required format.
  2. /home/group/langtech/Etymology-Project/etymon-logs/languageSpecificRules
    • This file contains rules that convert some input data glyphs to a different form; see this page for info.
    • The purpose of this file is to ensure that every glyph of a language has a unique feature-vector representation.
    • Another copy of this file is in svn; see ~/NetBeansProjects/EtyMalign/ (modify that copy, then copy the updated version to FS disk space).
  3. /home/group/langtech/Etymology-Project/etymon-logs/logRegretMatrix
    • A precomputed log-regret matrix for the normalized maximum likelihood (NML) computation.
    • You can create a new version using etymology.util.LogRegretMatrixPrinter.java

Then try to run using NetBeans:

  1. Change parameter values, as needed
  2. Clean and Build
  3. Run
  4. After the run is finished, check the log files in ~/NetBeansProjects/EtyMalign/log/

See the Cluster Runs sections at the end of this page for how to do runs on the computer cluster.

Input File

The input file contains the sets of cognate words to be aligned. You can see the sample input file. It consists of a header line giving the language names, followed by one row per cognate set. The first column is always the ID, an integer identifying the cognate set. If an entry is missing or not available for some language, this is indicated with a dash "-". The file must be encoded in UTF-8.
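The format described above can be read with a few lines of Python. This is only a sketch: the page does not state the column delimiter, so tab-separated columns are an assumption, as is the header label of the ID column; the function name and the two sample rows are made up for illustration.

```python
import io

MISSING = "-"  # marker for a missing or unavailable entry


def read_cognate_sets(stream, delimiter="\t"):
    """Parse a cognate-set file into (languages, {id: {language: word or None}}).

    Assumption: columns are tab-separated; change `delimiter` if the real
    files use something else.
    """
    header = stream.readline().rstrip("\n").split(delimiter)
    languages = header[1:]  # the first column is the ID column
    cognate_sets = {}
    for line in stream:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(delimiter)
        set_id = int(fields[0])  # the ID is always an integer
        cognate_sets[set_id] = {
            lang: (None if word == MISSING else word)
            for lang, word in zip(languages, fields[1:])
        }
    return languages, cognate_sets


# Made-up sample data, for illustration only.
sample = "ID\tFIN\tEST\n1\tkala\tkala\n2\tkäsi\t-\n"
languages, sets = read_cognate_sets(io.StringIO(sample))
print(languages)
print(sets[2])
```

For a real file, pass `open(path, encoding="utf-8")` instead of the `io.StringIO` object.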

Required Libraries

The following libraries are required for running the program. Copy them into the directory "lib" next to the jar file.

File name | Library name | URL
bzip2.jar | bzip2 library from Apache Ant | http://www.kohsuke.org/bzip2/
commons-cli-1.2.jar | Apache Commons CLI library | http://commons.apache.org/cli/
commons-cli-1.2-javadoc.jar | |
commons-cli-1.2-sources.jar | |
commons-math-2.1.jar | Apache Commons Math Library | http://commons.apache.org/math/
commons-math-2.1-javadoc.jar | |
commons-math-2.1-sources.jar | |

Config File

Settings for running the program can be defined either with command-line switches or in a config file. Note that command-line switches OVERRIDE settings in the config file. You can specify the config file by adding the switch -config PATH-TO-CONFIG-FILE when running the jar file. For more information about command-line parameters, see the Parameters section below.

A config file is a text file that gives the value of one parameter per line. Lines starting with # are interpreted as comments and are ignored. Inline comments are also allowed: if there is a # anywhere in a line, the rest of that line is ignored. Currently there is no way to escape the comment character, but this can be added later.
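The comment rules and the override rule can be sketched as follows. This is not the program's own parser: the page does not say how a parameter name is separated from its value, so splitting on the first run of whitespace is an assumption, and the function names and example lines are hypothetical.

```python
def strip_comment(line):
    """Drop everything from the first '#' on; there is no escape mechanism."""
    return line.split("#", 1)[0].strip()


def read_config(lines):
    """Return {parameter: value} for the given config lines.

    Assumption: name and value are separated by the first run of
    whitespace; the actual separator is not documented on this page.
    """
    settings = {}
    for raw in lines:
        line = strip_comment(raw)
        if not line:
            continue  # blank line or comment-only line
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1] if len(parts) > 1 else ""
    return settings


def effective_settings(config, command_line):
    """Command-line switches override values from the config file."""
    merged = dict(config)
    merged.update(command_line)
    return merged


# Hypothetical config content, for illustration only.
example = [
    "# cluster run settings",
    "languages FIN EST   # inline comment",
    "RepetitionCount 50",
]
config = read_config(example)
print(effective_settings(config, {"RepetitionCount": "1"}))
```

Because the comment character cannot be escaped, a value that itself contains # would be cut short by this logic.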

The following table explains the values for different parameters:

Parameter | Description
UTF8InputFile/ | Path to the UTF8 input file containing the cognate set data
ConversionRules/ | Path to the conversion rules (the language-specific rule file)
LogRegretMatrixFile/ | Path to the log-regret matrix
languages[] | Whitespace-separated list of languages, e.g., FIN EST
LogDirectory/ | Log directory for output logging
PrintOnlyFinalLogs? | Boolean setting
UseSimulatedAnnealing? | Boolean setting
SimulatedAnnealingInitialTemp | Initial temperature for simulated annealing
SimulatedAnnealingCoolingSchedule | Cooling factor for simulated annealing
version"" | Alignment version (e.g., context_2d)
CostFunction"" | Cost function identifier: Prequential or NML
TreeBuildingFrequency | An integer indicating the tree building frequency
BaselineContextCombination? | Boolean setting. If TRUE, performs baseline alignment first, then context model alignment
UseZeroLevel? | Boolean setting, like -zero on the command line
InfDepthRestricted? | Boolean setting, like -inf on the command line
UseJointCoding? | Boolean setting, like -joint on the command line
UseBinaryCandidates? | Boolean setting, like -binary on the command line
NoMultipleValueTrees? | Boolean setting, like -nomulti on the command line
UsePreviousContextModel? | Boolean setting
UseWordBoundaries? | Boolean setting, like -boundaries on the command line
ImputeWords? | Boolean setting, like -impute on the command line
RemoveSuffixes? | Boolean setting, like -suffixes on the command line
AlignTwoGlyphs? | Boolean setting, like -g2 on the command line
RepetitionCount | Number of repetitions, like -rep on the command line
Iteration | Integer identifying the iteration, when run with one repetition only
RandomSeed | Random seed to be used, like -seed on the command line

  • Legend:
    • a trailing "?" indicates a boolean setting,
    • a trailing slash "/" indicates a path to a file or directory,
    • [] indicates a list,
    • "" indicates a string value.

A sample config file can be found in group directory here.

Parameters

Parameter | Description | Default value | Note
-config | config file path | - | See previous section.
-a | use simulated annealing | FALSE | use always
-alpha arg | the temperature multiplier in simulated annealing | 0.99 | use 0.995 for context models
-bc | do first baseline alignment, then context alignment | FALSE | fix naming of log files before use. (Not interesting model now.)
-binary | compute also binary candidates of trees | FALSE | for context models only, use always
-boundaries | Use word initial and end symbols. | FALSE | use with 2x2 model (option -g2)
-costfunction arg | COST-FUNCTION-CODE (see table below) | codebook-no-kinds | prequential "default" for context
-cr arg | file containing conversion rules | | copy this to relevant place
-f,--file arg | input file | | starling-top-dialects.utf8
-g arg | path to gold standard file | | not used
-g2 | use also glyph-pairs for alignment | FALSE | 2x2 alignment
-impute | compute the imputation distance | FALSE | use
-inf | restrictions in normal simann | FALSE | choose this or -zero or none, not both
-iteration arg | add the number of iteration, this terminates (final-)logging, if non-zero | ZERO (0) |
-joint | use joint coding instead of separate | FALSE | use -joint -zero or -joint
-l,--languages arg | languages to utilize | | -l FIN -l EST, cluster runs for all language pairs
-lc | log to console | FALSE |
-lp arg | path to put logs into | |
-monitor,--monitor arg | words to monitor: print their complete Viterbi matrix on each iteration | |
-nomulti | Use multi-value candidates of trees | TRUE | do not turn off
-of | print only final logs | FALSE | for cluster runs, turn off
-old | Use the old version of the zero- or infinite-depth model | FALSE | do not use
-rep arg | repetition count | | use 50
-rev | flip the words around | FALSE | run the word backwards
-suffixes | try to remove the suffixes of word | FALSE | not implemented yet
-t arg | initial annealing temperature | 50 | ok
-tf arg | tree rebuilding frequency | | ok
-v arg | 3D non-feature, symbol alignment: version [joint, marginal]; for 2D: context for context-based alignment | |
-zero | depth 0 trees have some different restrictions, defined in Wiki | FALSE | use -v context -zero or -v context -inf or -v context
-dict | Dictionary file path (Not supported yet). | - | Not supported yet
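To get a feeling for how -t and -alpha interact, here is a small sketch of a geometric cooling schedule, in which the temperature is multiplied by alpha at every cooling step. That -alpha is applied this way follows from its description as a "temperature multiplier"; how often the program applies it (per word, per iteration, or otherwise) is not stated on this page, so a "step" below is an abstract unit. The function names and the stopping threshold are hypothetical.

```python
def cooling_schedule(initial_temp=50.0, alpha=0.99, steps=5):
    """Temperatures of the first `steps` cooling steps: T, T*alpha, T*alpha^2, ..."""
    temps = []
    temp = initial_temp
    for _ in range(steps):
        temps.append(temp)
        temp *= alpha
    return temps


def steps_until_below(initial_temp, alpha, threshold):
    """Number of multiplications by alpha until the temperature is below threshold."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    steps = 0
    temp = initial_temp
    while temp >= threshold:
        temp *= alpha
        steps += 1
    return steps


print(cooling_schedule(50.0, 0.99, 3))
# A multiplier closer to 1 cools more slowly:
print(steps_until_below(50.0, 0.99, 1.0), steps_until_below(50.0, 0.995, 1.0))
```

The comparison illustrates why a larger alpha such as the 0.995 suggested for context models means a longer annealing phase.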

Cost Function Codes

Code | Description
baseline | use only with 2D 1x1 non-context model
codebook-no-kinds | 2D and 3D 1x1 non-context models
codebook-and-kinds | 2D and 3D 1x1 non-context models; 2D 2x2 model
written-out-nxn | (not used; possibly out-of-date code)
code-book-and-kinds-separate | 2D codebook with kinds and conditional with kinds
codebook-and-kinds-nml | use NML (not separate kinds)
codebook-and-kinds-separate-nml | use NML (separate kinds)
prequential | prequential coding for context models
nml | NML coding for context separate model

Cluster Runs

What do we need:

  • 50 repetitions of each run: use parameter -rep 50 for every run
  • For context models, do 50 separate runs:
    • use parameter -iteration ITERNUMBER to distinguish one run from another
    • after finding the lowest-cost run, use parameter -seed SEEDNUMBER (found in the .cost file) to rerun the best option
    • NB: this is inefficient with respect to time, but it is necessary to save space: keeping around complete logs from 50 runs (of which we will select the lowest-cost run) may require a prohibitive amount of disk space.
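The select-then-rerun step can be sketched in Python. The layout of the .cost file is not described on this page, so the sketch starts from already-parsed records; the record fields, the function names, and all numbers are made up for illustration.

```python
def best_run(runs):
    """Return the record of the lowest-cost run."""
    return min(runs, key=lambda run: run["cost"])


def rerun_switches(runs):
    """Switches for rerunning the best run with full logging, via its seed."""
    return ["-seed", str(best_run(runs)["seed"])]


# Hypothetical records, one per -iteration run (values are invented).
runs = [
    {"iteration": 1, "seed": 111, "cost": 2050.5},
    {"iteration": 2, "seed": 222, "cost": 1987.25},
    {"iteration": 3, "seed": 333, "cost": 2011.0},
]
print(rerun_switches(runs))
```

The rerun then replaces -iteration with the -seed switch, so only the single best run writes complete logs.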

Runs should be done for:

  • all language pairs (e.g., for 10 languages, 10 * 9 = 90 runs)
  • forwards AND backwards (parameter -rev)
  • all models, or some selected ones (see the published papers, the MDL-writing page, and the main page)
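The pair count in the list above comes from taking ordered pairs, where (A, B) and (B, A) are different runs. A minimal sketch, with placeholder language codes and a hypothetical function name:

```python
from itertools import permutations


def ordered_pairs(languages):
    """All ordered (source, target) pairs of distinct languages."""
    return list(permutations(languages, 2))


# Placeholder codes; substitute the language names from the input file header.
languages = ["L%d" % i for i in range(10)]
pairs = ordered_pairs(languages)
print(len(pairs))  # n * (n - 1) ordered pairs for n languages
```

Running each pair both forwards and with -rev doubles the number of runs per model.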

The 1x1 models:

  • -costfunction baseline
  • -costfunction codebook-no-kinds
  • -costfunction codebook-and-kinds

The 2x2 models:

  • -costfunction codebook-and-kinds  -g2
  • -costfunction codebook-and-kinds  -g2 -boundaries (Preferred)

The 3D model (use -v marginal; there is also some implementation of a variant, -v joint):

  • -costfunction codebook-no-kinds -v marginal
  • -costfunction codebook-and-kinds -v marginal

And also the 5 context models (separate and joint; basic description: here, and the variants here):

Model | Program parameters
separate normal | -v context_2D -binary (NB: separate is default)
separate zero | -v context_2D -zero -binary
separate infinite | -v context_2D -inf -binary
joint normal | -v context_2D -binary -joint
joint zero | -v context_2D -zero -binary -joint

For context models with separate coding (i.e., source level coded separately from target level):

  • the default cost function: -costfunction prequential
  • for separate model, also NML implementation: -costfunction nml

Cluster runs -- How to do them

How to start:

  • Clean and build the project: this creates EtyMalign.jar in /EtyMalign/dist/
  • copy everything from /EtyMalign/dist/ to your space in the File System

Create task files:

  • task file: on each line there is one Java command to be executed
  • task creation scripts in /EtyMalign/shell/
    • createStarlingContextTasks.py
    • createStarlingNonContextTasks.py
  • An example task file is /EtyMalign/shell/best-context-models (you cannot use it directly: log file paths must be user-specific or in the group directory)
    • that file is produced using Java: go to etymology.util/CostsToTableReader.java and modify the method writeTaskFileByRandomSeed(null) to create the best-seed tasks
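A task file of this kind, with one Java command per line, could be generated roughly as follows. This is a sketch, not the content of the createStarling*Tasks.py scripts: the function name is hypothetical, the paths /fs/jarDir/ and /fs/logDir/ are placeholders, the language switch follows the -l form from the Parameters table, and starting -iteration at 1 is an assumption.

```python
import shlex

JAVA = ["nice", "java", "-Xmx5g", "-Xms2g", "-jar", "/fs/jarDir/EtyMalign.jar"]
COMMON = [
    "-f", "/group/home/langtech/Etymology-Project/StarLing/starling-top-dialects.utf8",
    "-of", "-a", "-t", "100", "-alpha", "0.995", "-tf", "0",
    "-lp", "/fs/logDir/",  # placeholder log directory
]
# One of the context models: separate coding, zero-depth restrictions.
MODEL = ["-v", "context_2d", "-impute", "-costfunction", "prequential",
         "-zero", "-binary"]


def task_lines(language_pairs, iterations=50):
    """One shell command per (language pair, iteration) combination."""
    lines = []
    for lang1, lang2 in language_pairs:
        for iteration in range(1, iterations + 1):
            args = (JAVA + COMMON + ["-l", lang1, "-l", lang2]
                    + MODEL + ["-iteration", str(iteration)])
            lines.append(" ".join(shlex.quote(arg) for arg in args))
    return lines


tasks = task_lines([("FIN", "EST")], iterations=3)
print("\n".join(tasks))
```

Writing the result to a file gives something that can be passed to a task distribution script; remember that the log path must be user-specific or in the group directory.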

The python script distribute.py

  • polls cluster machines and distributes the jobs in the taskFile
  • before using, change the user-specific or run-specific settings (e.g., user name, password, memory requirements)
  • before using, modify the ssh settings so that the cluster machines do not ask for a password
    • create ssh keys
  • location in svn: /EtyMalign/shell/
  • usage: python distribute.py taskFile

Collecting the results of cluster runs:

  • go to etymology.util/CostsToTableReader.java and modify it to find your log file directories
  • note that ModelType enum class has a parameter name that is the directory name under the log file path
  • see main() method
    • readInAllData(); collects the results
    • the other methods print the results in various ways

Another option for the cluster runs is to use the shell scripts, which still need to be cleaned up:

  • nice java -Xmx8g -Xms2g -jar $JAR_PATH -rep 50 -lp $LOG_PATH -costfunction $FUNCTION -f $INPUT_FILE -of -impute
  • examples of runs are found in
    • distributed-run-suvi.sh
    • distributed-run-suvi-boundaries.sh
    • distributed-run-all-funcs.sh
  • example runs located in directory /EtyMalign/shell/
  • NOTE: the scripts use user-specific files; this still needs to be fixed.

Example runs

Always start with the Java command:

  • nice java -Xmx5g -Xms2g -jar /fs/jarDir/EtyMalign.jar
  • Adjust -Xmx and -Xms parameters if needed (more memory for context models)

These parameters are common for most non-context models:

  • -f /group/home/langtech/Etymology-Project/StarLing/starling-top-dialects.utf8 -of -a -t 100 -rep 50 -impute -lp /fs/logDir/ -l1 LANG1 -l2 LANG2 [-l3 LANG3]
  • -f: location of dialect file
  • -of: don't write *.log files and .gp files
  • -a: use simulated annealing
  • -t 100: use initial temperature 100 instead of default 50
  • -rep 50: repeat the run 50 times
  • -impute: also do imputation
  • -lp: the log path, where the files logFile.log.final, logFile.log.costs and logFile.log.best are written
  • -l: the languages to use; at least 2, at most 3
  • check, e.g., /EtyMalign/shell/createStarlingNonContextTasks.py

  • Typical parameters for the context model:
    • -f /group/home/langtech/Etymology-Project/StarLing/starling-top-dialects.utf8 -of -a -t 100 -lp /fs/logDir/ -l1 LANG1 -l2 LANG2 [-l3 LANG3] -alpha 0.995 -tf 0
    • -tf 0 means rebuild the trees only after the entire iteration, that is, after all the words in the corpus have been aligned once
    • -tf 10 means rebuild the trees after re-aligning every 10 word pairs; likewise for -tf N where N > 0
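The meaning of -tf can be illustrated by listing the points within one pass over the corpus at which the trees would be rebuilt. This is a sketch of the rule as worded above, not the program's code; the function name is hypothetical, and whether the program also rebuilds at the end of a pass when N does not divide the corpus size is not stated here.

```python
def rebuild_points(num_word_pairs, tf):
    """1-based counts of aligned word pairs after which trees are rebuilt.

    tf == 0: rebuild once, after the whole corpus has been aligned.
    tf == N > 0: rebuild after every N re-aligned word pairs.
    """
    if tf < 0:
        raise ValueError("tf must be non-negative")
    if tf == 0:
        return [num_word_pairs]
    return list(range(tf, num_word_pairs + 1, tf))


print(rebuild_points(25, 0))
print(rebuild_points(25, 10))
```

Smaller positive values of N mean more frequent rebuilding and therefore more work per pass.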

Example Model | Additional parameters
Baseline | -costfunction baseline
1x1 codebook-and-kinds | -costfunction codebook-and-kinds
2x2 boundaries | -costfunction codebook-and-kinds -boundaries -g2
3D model | -costfunction codebook-no-kinds -boundaries -g2 -v marginal
context zero (a single random iteration) | -v context_2d -tf 0 -impute -costfunction prequential -zero -binary -iteration 15
context zero (iteration using random seed 1234567) | -v context_2d -tf 0 -impute -costfunction prequential -zero -binary -seed 1234567
context zero (no iteration specification) | -v context_2d -tf 0 -impute -costfunction prequential -zero -binary
context zero (joint model, no specification) | -v context_2d -tf 0 -impute -costfunction prequential -zero -binary -joint
context normal | -v context_2d -tf 0 -impute -costfunction prequential -binary
context normal (use NML) | -v context_2d -tf 0 -impute -costfunction nml -binary
context 3D | -v context_marginal_3d -tf 0 -zero -binary

  • In case of questions, please contact the developer team.
Main.PubEtymologyJavaCode moved from UraLink.EtymologyJavaCode on 14 Aug 2012 - 20:28 by RomanYangarber