ftodtf package

Submodules

ftodtf.cli module

This module parses the CLI flags and then calls the required function from the library.

ftodtf.cli.add_arguments_to_parser(arglist, parser, required, group=None)

Adds arguments (obtained from the settings class) to an argparse parser

Parameters:
  • arglist (list(str)) – A list of strings representing the names of the flags to add
  • parser (argparse.ArgumentParser) – The parser to add the arguments to
  • required (list(str)) – A list of argument-names that are required for the command
  • group (str) – If set place the arguments in an argument-group of the specified name
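
A minimal sketch of how flags could be added to an argparse parser, optionally inside a named argument group. The EXAMPLE_SETTINGS table, its default values, and the add_arguments helper are hypothetical stand-ins for illustration, not ftodtf's actual implementation:

```python
import argparse

# Hypothetical stand-in for the settings class: flag name -> (default, help text).
EXAMPLE_SETTINGS = {
    "corpus_path": ("", "Path to the training text."),
    "batch_size": (128, "Samples per batch."),
}


def add_arguments(arglist, parser, required, group=None):
    """Add one --flag per name in arglist, optionally inside a named group."""
    # Argument groups only affect how --help is laid out, not how flags are parsed.
    target = parser.add_argument_group(group) if group else parser
    for name in arglist:
        default, help_text = EXAMPLE_SETTINGS[name]
        target.add_argument(
            "--" + name,
            type=type(default),
            default=default,
            required=name in required,
            help=help_text,
        )


parser = argparse.ArgumentParser(prog="example")
add_arguments(["corpus_path", "batch_size"], parser, ["corpus_path"], group="Preprocessing")
args = parser.parse_args(["--corpus_path", "corpus.txt"])
```

Deriving the argument type from the default value keeps the flag table short, at the cost of requiring every default to have the intended type.
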
ftodtf.cli.cli_main()

Program entry point.

ftodtf.cli.show_prog(q)

Shows the progress bar. The bar converges toward the next maximum of progress_bar.n and finishes only when the function “write_batches_to_file” ends.

Parameters:q – Process which handles the progressbar.
ftodtf.cli.spawn_progress_bar()

This function will spawn a new process using multiprocessing module.

Returns:A child process.

ftodtf.export module

This module handles the exporting of trained models

ftodtf.export.export_embeddings(settings, outputdir)

Builds a model using the given settings, loads the last checkpoint and saves only the embedding variable to a new checkpoint inside outputdir, leaving out all the other weights. The new checkpoint is much smaller than the original. It can be used for inference but not to continue training.

Parameters:

ftodtf.inference module

This module contains functions that use trained word embeddings to do useful things. Currently the only implemented feature is computing the similarities between words.

class ftodtf.inference.PrintSimilarityHook(every_n_steps, similarityop, words)

Bases: tensorflow.python.training.basic_session_run_hooks.StepCounterHook

Implements a hook that computes and prints the similarity between the given words every x steps. To be used with tf.train.MonitoredTrainingSession

after_run(run_context, run_values)

Called after each call to run().

The run_values argument contains results of requested ops/tensors by before_run().

The run_context argument is the same one sent to the before_run call. run_context.request_stop() can be called to stop the iteration.

If session.run() raises any exceptions then after_run() is not called.

Args:
run_context: A SessionRunContext object.
run_values: A SessionRunValues object.
before_run(run_context)

Called before each call to run().

You can return from this call a SessionRunArgs object indicating ops or tensors to add to the upcoming run() call. These ops/tensors will be run together with the ops/tensors originally passed to the original run() call. The run args you return can also contain feeds to be added to the run() call.

The run_context argument is a SessionRunContext that provides information about the upcoming run() call: the originally requested op/tensors, the TensorFlow Session.

At this point graph is finalized and you can not add ops.

Args:
run_context: A SessionRunContext object.
Returns:
None or a SessionRunArgs object.
ftodtf.inference.compute_similarities(words, settings)

Uses trained embeddings to compute the similarity between the given input words

Parameters:
  • words (list(str)) – A list of words to compare to each other
  • settings (ftodtf.settings.FasttextSettings) – The settings for the fasttext-model
ftodtf.inference.print_similarity(similarity, words)

Prints the similarity between the given words

Parameters:
  • similarity – A matrix of format len(words)xlen(words) containing the similarity between words
  • words (list(str)) – Words to print the similarity for

ftodtf.input module

This module handles all the input-related tasks like loading, pre-processing and batching

class ftodtf.input.InputProcessor(settings)

Bases: object

Handles the creation of training-example batches from the raw training text

Constructor of InputProcessor

Parameters:settings (ftodtf.settings.FasttextSettings) – An object encapsulating all the settings for the fasttext-model
batches(passes=1)

Returns a generator that yields training batches ready to feed into the model. The generator is infinite only if passes is 0.

Parameters:passes (int) – How many passes over the input data should be done. Default: 1. 0 will repeat the input forever.
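
The pass-counting behaviour described above can be sketched as follows. The repeat_passes helper is a hypothetical illustration in plain Python, not the library's implementation, and it assumes the samples can be iterated more than once (for example a list):

```python
import itertools


def repeat_passes(samples, passes=1):
    """Yield every sample `passes` times over; passes=0 repeats forever."""
    # itertools.count() never ends, which gives the "repeat forever" case.
    rounds = itertools.count() if passes == 0 else range(passes)
    for _ in rounds:
        for sample in samples:
            yield sample


two_passes = list(repeat_passes(["a", "b"], passes=2))
# With passes=0 the generator is endless, so only a slice of it is consumed here.
first_five_of_endless = list(itertools.islice(repeat_passes(["a", "b"], passes=0), 5))
```
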
preprocess()

Does the needed preprocessing of the dataset: counts word frequencies and creates a mapping word->int

string_samples()

Returns a generator for samples (targetword->contextword).

Returns:A generator yielding 2-tuples consisting of a target word and a context word.

ftodtf.input.find_and_clean_sentences(corpus, language)

Uses NLTK to parse the corpus and find the sentences.

Parameters:
  • corpus (str) – The corpus where the sentences should be found.
  • language – The language of the corpus.
Returns:A list with sentences.

ftodtf.input.find_and_clean_sentences_helper(args)

Auxiliary function to unwrap the arguments for multiprocessing.

Parameters:args – Takes the corpus and the specified language of the corpus.
Returns:The result of the find_and_clean_sentences function.

ftodtf.input.generate_ngram_per_word(word, ngram_window=2)

Generates ngram strings of the specified size for a given word. Before processing beginning and end of the word will be marked with “*”. The ngrams will also include the full word (including the added *s). This is the same process as described in the fasttext paper.

Parameters:
  • word (str) – The token string which represents a word.
  • ngram_window (int) – The size of the ngrams
Returns:

A generator which yields ngrams.
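
A minimal sketch of the n-gram generation described above, written as an assumption about the behaviour rather than a copy of the library code. The name ngrams_of_word is hypothetical, and the order in which the n-grams and the full word are yielded is a guess:

```python
def ngrams_of_word(word, ngram_window=2):
    """Yield character n-grams of a word wrapped in '*' markers, then the marked word."""
    marked = "*" + word + "*"
    # Slide a window of ngram_window characters over the marked word.
    for start in range(len(marked) - ngram_window + 1):
        yield marked[start:start + ngram_window]
    # The full marked word is included as well, as the description states.
    yield marked


result = list(ngrams_of_word("cat", ngram_window=3))
# result is ["*ca", "cat", "at*", "*cat*"]
```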

ftodtf.input.hash_string_list(strings, buckets, offset=0)

Hashes each element in a list of strings using the FNV-1a algorithm.

Parameters:
  • strings (list(str)) – A list of strings to hash.
  • buckets (int) – How many different hash-values to produce maximally. (all Hashes are mod buckets)
  • offset (int) – The smallest possible hash value. Can be used to make hash values start at a number other than 0
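
For reference, the FNV-1a scheme with bucketing can be sketched as below. The 32-bit variant, the UTF-8 encoding, and the way offset is added after the modulo are assumptions; the helper names are hypothetical and the library may differ in these details:

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(text):
    """Return the 32-bit FNV-1a hash of a string's UTF-8 bytes."""
    value = FNV_OFFSET_BASIS_32
    for byte in text.encode("utf-8"):
        value ^= byte                                # FNV-1a: XOR the byte first ...
        value = (value * FNV_PRIME_32) & 0xFFFFFFFF  # ... then multiply, modulo 2**32
    return value


def hash_strings(strings, buckets, offset=0):
    """Map each string to an integer in the range [offset, offset + buckets)."""
    return [fnv1a_32(s) % buckets + offset for s in strings]


hashes = hash_strings(["*c", "ca", "at", "t*"], buckets=10, offset=5)
```

Taking the hash modulo buckets means different n-grams can collide in the same bucket; a larger bucket count reduces collisions at the cost of a larger embedding table.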
ftodtf.input.inform_progressbar(func)

Decorator used to put the function names into the QUEUE for showing the progress in the progress bar.

Parameters:func – The function which should be decorated.
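
A minimal sketch of such a decorator, under stated assumptions: PROGRESS_QUEUE is a hypothetical stand-in for the module's QUEUE (a plain queue.Queue is used here, whereas a multi-process setup would need a multiprocessing queue), and announce is an illustrative name:

```python
import functools
import queue

PROGRESS_QUEUE = queue.Queue()  # hypothetical stand-in for the module's QUEUE


def announce(func):
    """Put the wrapped function's name on PROGRESS_QUEUE before each call."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        PROGRESS_QUEUE.put(func.__name__)
        return func(*args, **kwargs)
    return wrapper


@announce
def preprocess_step():
    return "done"


result = preprocess_step()
announced = PROGRESS_QUEUE.get_nowait()
```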

ftodtf.input.pad_to_length(li, length, pad='')

Pads a given list to a given length with a given padding-element

Parameters:
  • li (list()) – The list to be padded
  • length (int) – The length to pad the list to
  • pad (object) – The element to add to the list until the desired length is reached
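
The padding step can be sketched in a few lines. The pad_list name is hypothetical, and leaving lists that are already long enough unchanged is an assumption, since the description does not say what happens in that case:

```python
def pad_list(li, length, pad=""):
    """Return a copy of li extended with pad until it has `length` elements."""
    # Assumption: lists already at or beyond the target length are returned unchanged.
    return list(li) + [pad] * max(0, length - len(li))


padded = pad_list(["*c", "ca"], 4)
```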
ftodtf.input.parse_files_sequential(file_folder, language, sentences)

Parses the raw data files from the training folder sequentially.

Parameters:
  • file_folder – The folder which contains the raw text files.
  • language – The language of the text files.
  • sentences – A reference to the sentence list.

ftodtf.input.words_to_ngramhashes(words, num_buckets)

Converts a list of words into a list of padded lists of ngram hashes. The resulting matrix can then be used to compute the word vectors for the original words.

Parameters:
  • words (list(str)) – The words to convert
  • num_buckets (int) – The number of hash-buckets to use when hashing the ngrams
Returns:list(list(int))

ftodtf.input.write_batches_to_file(*args, **kwargs)

ftodtf.model module

This module handles the building of the tf execution graph

class ftodtf.model.InferenceModel(settings)

Bases: object

Builds and represents the tensorflow computation graph for using the trained embeddings. Exports all important operations via fields. An existing checkpoint must be loaded via load() before this model can be used to compute anything.

Constructor for Model

Parameters:settings (ftodtf.settings.FasttextSettings) – An object encapsulating all the settings for the fasttext-model
load(logdir, session)

Loads pre-trained embeddings from the filesystem

Parameters:
  • logdir (str) – The path of the folder where the checkpoints created by the training were saved
  • session (tf.Session) – The session to restore the variables into
class ftodtf.model.TrainingModel(settings, cluster=None)

Bases: object

Builds and represents the tensorflow computation graph for the training of the embeddings. Exports all important operations via fields

Constructor for Model

Parameters:
  • settings (ftodtf.settings.FasttextSettings) – An object encapsulating all the settings for the fasttext-model
  • cluster – A tf.train.ClusterSpec object describing the tf-cluster. Needed for variable and ops-placement
get_scaffold()

Returns a tf.train.Scaffold object describing this graph

Returns:tf.train.Scaffold
ftodtf.model.compute_word_similarities(ngramhashmatrix, embeddings)

Returns a tensorflow-operation that computes the similarities between all input-words using the given embeddings

Parameters:
  • ngramhashmatrix (tf.Tensor) – A list of lists of ngram-hashes, each list represents the ngrams for one word. (In principle a training batch without labels)
  • embeddings (tf.Tensor) – The embeddings to use for converting words to vectors. (Can be a list of tensors)
  • num_buckets (int) – The number of hash-buckets used when hashing ngrams
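
To illustrate what an all-pairs similarity computation produces, here is a plain-Python sketch. Cosine similarity is an assumption (the description only says "similarities"), the helper name is hypothetical, and the real function builds a TensorFlow operation rather than computing values directly:

```python
import math


def cosine_similarity_matrix(vectors):
    """Return a len(vectors) x len(vectors) matrix of pairwise cosine similarities."""
    # Assumption: no vector is all zeros, otherwise the division below would fail.
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [
            sum(a * b for a, b in zip(u, v)) / (norms[i] * norms[j])
            for j, v in enumerate(vectors)
        ]
        for i, u in enumerate(vectors)
    ]


matrix = cosine_similarity_matrix([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
```

Entry [i][j] of the result compares word i with word j, so the matrix is symmetric and its diagonal compares each word with itself.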
ftodtf.model.create_embedding_weights(settings)

Creates a (partitioned) tensorflow variable for the word-embeddings. Exists as a separate function to minimize code duplication between training and inference models

ftodtf.model.ngrams_to_vectors(ngrams, embeddings)

Create a tensorflow operation converting a batch consisting of lists of ngrams for a word to a list of vectors. One vector for each word

Parameters:
  • ngrams – A batch of lists of ngrams
  • embeddings – The embeddings to use as tensorflow variable. Can also be a list of variables.
Returns:

a batch of vectors
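
The idea of turning each word's n-gram list into a single vector can be sketched without TensorFlow. Summing the n-gram rows is an assumption borrowed from the usual fastText formulation (the library might average instead or treat padding entries specially), and the helper name and toy embedding table are hypothetical:

```python
def ngram_rows_to_word_vectors(ngram_batch, embeddings):
    """Sum the embedding rows of each word's n-gram ids into one vector per word."""
    dim = len(embeddings[0])
    word_vectors = []
    for ngram_ids in ngram_batch:
        vector = [0.0] * dim
        for ngram_id in ngram_ids:
            # Look up the embedding row for this n-gram hash and accumulate it.
            for k, value in enumerate(embeddings[ngram_id]):
                vector[k] += value
        word_vectors.append(vector)
    return word_vectors


embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # toy table: 3 buckets, 2 dimensions
vectors = ngram_rows_to_word_vectors([[0, 1], [2, 2]], embeddings)
```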

ftodtf.model.parse_batch_func(batch_size)

Returns a function that can parse a batch from a tfrecord-entry

Parameters:batch_size (int) – How many samples are in a batch

ftodtf.settings module

This module contains the FasttextSettings class

class ftodtf.settings.FasttextSettings

Bases: object

This class contains all the settings for the fasttext-training and also handles things like validation. Use the attributes/variables of this class to set hyperparameters for the model.

Variables:
  • corpus_path (str) – Path to the file containing text for training the model.
  • batches_file (str) – The Filename for the file containing the training-batches. The file is written by the preprocess command and read by the train command.
  • log_dir (str) – Directory to write the generated files (e.g. the computed word-vectors) to and read/write checkpoints from.
  • steps (int) – How many training steps to perform.
  • vocabulary_size (int) – How many words the vocabulary will have. Only the vocabulary_size most frequent words will be processed.
  • batch_size (int) – How many training samples to process per batch.
  • embedding_size (int) – Dimension of the computed embedding vectors.
  • skip_window (int) – How many words to consider left and right of the target-word maximally. The actual window is randomly sampled for each word between 1 and this value
  • num_sampled (int) – Number of negative examples to sample when computing the nce_loss.
  • ngram_size (int) – How large the ngrams (in which the target words are split) should be.
  • num_buckets (int) – How many hash-buckets to use when hashing the ngrams to numbers.
  • validation_words (str) – A string of comma-separated words. The similarity of these words to each other will be regularly computed and printed to indicate the progress of the training.
  • profile (boolean) – If set to True tensorflow will profile the graph-execution and write results to ./profile.json.
  • learnrate (float) – The starting learnrate for the training. The actual learnrate will decrease linearly, reaching 0 when the specified number of training steps is reached.
  • rejection_threshold (float) – The threshold used to subsample the most frequent words.
  • job (string) – The role of this node in a distributed setup. Can be ‘worker’ or ‘ps’.
  • workers (str) – A comma-separated list of host:port combinations representing the workers in the distributed setup.
  • ps (str) – A comma-separated list of host:port combinations representing the parameter servers in the distributed setup. If empty a non-distributed setup is assumed.
  • num_batch_files (int) – Number of batch files which should be created.
  • index (int) – The index of the node itself in the list of --workers (or --ps, depending on --job).
  • language (str) – The language of the corpus.
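
As an illustration of the learnrate setting above, a linear decay schedule can be sketched as follows. This is a plain-Python assumption about the schedule's shape; the function name is hypothetical and the library computes the schedule inside the TensorFlow graph:

```python
def decayed_learnrate(start_learnrate, step, total_steps):
    """Linearly interpolate from start_learnrate at step 0 to 0.0 at total_steps."""
    # Clamp at zero so steps beyond total_steps never give a negative rate.
    remaining = max(0.0, 1.0 - step / total_steps)
    return start_learnrate * remaining


halfway = decayed_learnrate(0.1, step=500, total_steps=1000)
```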
attribute_docstring(attribute, include_defaults=True)

Given the name of an attribute of this class, this function will return the docstring for the attribute.

Parameters:attribute (str) – The name of the attribute
Returns:The docstring for the attribute
static distribution_settings()

Returns the names of the settings that are used for configuring the tensorflow cluster

static inference_settings()

Returns the names of the settings that are used for the infer command

static preprocessing_settings()

Returns the names of the settings that are used for the preprocessing command

ps_list

Returns ps as a list of strings instead of the comma-separated string that the attribute itself holds.

Returns:A list of strings if ps is set, otherwise None

static training_settings()

Returns the names of the settings that are used for the training command

validate_preprocess()

Checks whether the current settings are valid for preprocessing.

Raises:ValueError if the validation fails

validate_train()

Checks whether the current settings are valid for training.

Raises:ValueError if the validation fails

validation_words_list

Returns the validation_words as a list of strings instead of the comma-separated string that the attribute itself holds.

Returns:A list of strings if validation_words is set, otherwise None

workers_list

Returns workers as a list of strings instead of the comma-separated string that the attribute itself holds.

Returns:A list of strings if workers is set, otherwise None
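
A minimal sketch of how such list-valued properties could be derived from comma-separated attributes. ExampleSettings is a hypothetical stand-in for FasttextSettings that holds only two fields, and treating an empty string as "not set" is an assumption:

```python
class ExampleSettings:
    """Hypothetical stand-in for FasttextSettings with only the fields used here."""

    def __init__(self):
        self.workers = ""
        self.ps = ""

    @staticmethod
    def _split(value):
        # Assumption: an empty string counts as "not set" and yields None.
        return value.split(",") if value else None

    @property
    def workers_list(self):
        return self._split(self.workers)

    @property
    def ps_list(self):
        return self._split(self.ps)


settings = ExampleSettings()
settings.workers = "localhost:7777,localhost:7778"
```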

ftodtf.settings.check_batch_size(batch_size)
ftodtf.settings.check_batches_file(batches_file)
ftodtf.settings.check_corpus_path(corpus_path)
ftodtf.settings.check_embedding_size(embedding_size)
ftodtf.settings.check_index(job, workers, ps, index)
ftodtf.settings.check_job(job)
ftodtf.settings.check_learn_rate(learnrate)
ftodtf.settings.check_log_dir(log_dir)
ftodtf.settings.check_ngram_size(ngram_size)
ftodtf.settings.check_nodelist(noli, allow_empty=False)

Checks if the given argument is a comma-separated list of host:port strings.

Raises:ValueError if it is not
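
A hedged sketch of this kind of validation. The function name is hypothetical, and the exact rules (non-empty host, purely numeric port) are assumptions, since the description only states that entries must be host:port strings:

```python
def check_host_port_list(noli, allow_empty=False):
    """Raise ValueError unless noli is a comma-separated list of host:port strings."""
    if noli == "":
        if allow_empty:
            return
        raise ValueError("The node list must not be empty")
    for entry in noli.split(","):
        # rpartition splits on the last colon, so the port is whatever follows it.
        host, sep, port = entry.rpartition(":")
        # Assumption: a valid entry has a non-empty host and a numeric port.
        if not sep or not host or not port.isdigit():
            raise ValueError("Not a host:port entry: %r" % entry)


check_host_port_list("localhost:7777,10.0.0.2:7778")  # passes silently
```
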
ftodtf.settings.check_num_buckets(number_buckets)
ftodtf.settings.check_num_sampled(num_sampled)
ftodtf.settings.check_rejection_threshold(rejection_threshold)
ftodtf.settings.check_skip_window(skip_window)
ftodtf.settings.check_steps(steps)
ftodtf.settings.check_vocabulary_size(vocabulary_size)

ftodtf.training module

This module handles the training of the word-vectors

class ftodtf.training.PrintLossHook(every_n_steps, lossop, steptensor)

Bases: tensorflow.python.training.basic_session_run_hooks.StepCounterHook

Implements a Hook that prints the current step and current average loss every x steps

after_run(run_context, run_values)

Called after each call to run().

The run_values argument contains results of requested ops/tensors by before_run().

The run_context argument is the same one sent to the before_run call. run_context.request_stop() can be called to stop the iteration.

If session.run() raises any exceptions then after_run() is not called.

Args:
run_context: A SessionRunContext object.
run_values: A SessionRunValues object.
before_run(run_context)

Called before each call to run().

You can return from this call a SessionRunArgs object indicating ops or tensors to add to the upcoming run() call. These ops/tensors will be run together with the ops/tensors originally passed to the original run() call. The run args you return can also contain feeds to be added to the run() call.

The run_context argument is a SessionRunContext that provides information about the upcoming run() call: the originally requested op/tensors, the TensorFlow Session.

At this point graph is finalized and you can not add ops.

Args:
run_context: A SessionRunContext object.
Returns:
None or a SessionRunArgs object.
ftodtf.training.train(settings)

Run the fasttext training.

Parameters:settings (ftodtf.settings.FasttextSettings) – An object encapsulating all the settings for the fasttext-model

Module contents