ftodtf package¶
Submodules¶
ftodtf.cli module¶
This module parses the CLI flags and then calls the required function from the library.
-
ftodtf.cli.
add_arguments_to_parser
(arglist, parser, required, group=None)¶ Adds arguments (obtained from the settings class) to an argparse parser.
Parameters: - arglist (list(str)) – A list of strings representing the names of the flags to add
- parser (argparse.ArgumentParser) – The parser to add the arguments to
- required (list(str)) – A list of argument-names that are required for the command
- group (str) – If set, place the arguments in an argument group of the specified name
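A minimal sketch of how flags could be added to an argparse parser from a list of names, optionally inside a named group. The `DEFAULTS` dictionary and the function `add_arguments_sketch` are hypothetical stand-ins for the settings class and are not the library's actual implementation.

```python
import argparse

# Hypothetical stand-in for the settings class: flag name -> (default, help text).
DEFAULTS = {
    "batch_size": (128, "How many training samples to process per batch."),
    "steps": (1000, "How many training steps to perform."),
}


def add_arguments_sketch(arglist, parser, required, group=None):
    """Add one --flag per name in arglist, optionally inside a named group."""
    # An argument group only affects how the help output is organised.
    target = parser.add_argument_group(group) if group else parser
    for name in arglist:
        default, helptext = DEFAULTS[name]
        target.add_argument(
            "--" + name,
            type=type(default),
            default=default,
            required=name in required,
            help=helptext,
        )


parser = argparse.ArgumentParser(prog="sketch")
add_arguments_sketch(["batch_size", "steps"], parser, required=["steps"], group="training")
args = parser.parse_args(["--steps", "50"])
```

Here `args.steps` comes from the command line, while `args.batch_size` falls back to the assumed default.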
-
ftodtf.cli.
cli_main
()¶ Program entry point.
-
ftodtf.cli.
show_prog
(q)¶ Shows the progress bar. It converges toward the next maximum of progress_bar.n and finishes only when the function “write_batches_to_file” ends.
Parameters: q – Process which handles the progressbar.
-
ftodtf.cli.
spawn_progress_bar
()¶ Spawns a new process using the multiprocessing module.
Returns: A child process.
ftodtf.export module¶
This module handles the exporting of trained models.
-
ftodtf.export.
export_embeddings
(settings, outputdir)¶ Builds a model using the given settings, loads the last checkpoint and saves only the embedding variable to a new checkpoint inside outputdir, leaving out all the other weights. The new checkpoint is much smaller than the original. It can be used for inference but not to continue training.
Parameters: - settings (ftodtf.settings.FasttextSettings) – The settings for the model
- outputdir (str) – The directory to store the new checkpoint to.
ftodtf.inference module¶
This module contains functions that use trained word embeddings to do useful things. Currently the only implemented feature is computing the similarities between words.
-
class
ftodtf.inference.
PrintSimilarityHook
(every_n_steps, similarityop, words)¶ Bases:
tensorflow.python.training.basic_session_run_hooks.StepCounterHook
Implements a hook that computes and prints the similarity between given words every every_n_steps steps. To be used with tf.train.MonitoredTrainingSession.
-
after_run
(run_context, run_values)¶ Called after each call to run().
The run_values argument contains results of requested ops/tensors by before_run().
The run_context argument is the same one sent to the before_run call. run_context.request_stop() can be called to stop the iteration.
If session.run() raises any exceptions then after_run() is not called.
- Args:
- run_context: A SessionRunContext object. run_values: A SessionRunValues object.
-
before_run
(run_context)¶ Called before each call to run().
You can return from this call a SessionRunArgs object indicating ops or tensors to add to the upcoming run() call. These ops/tensors will be run together with the ops/tensors originally passed to the original run() call. The run args you return can also contain feeds to be added to the run() call.
The run_context argument is a SessionRunContext that provides information about the upcoming run() call: the originally requested op/tensors, the TensorFlow Session.
At this point graph is finalized and you can not add ops.
- Args:
- run_context: A SessionRunContext object.
- Returns:
- None or a SessionRunArgs object.
-
-
ftodtf.inference.
compute_similarities
(words, settings)¶ Use trained embeddings to compute the similarity between the given input words.
Parameters: - words (list(str)) – A list of words to compare to each other
- settings (ftodtf.settings.FasttextSettings) – The settings for the fasttext-model
-
ftodtf.inference.
print_similarity
(similarity, words)¶ Print the similarity between the given words.
Parameters: - similarity – A matrix of format len(words)xlen(words) containing the similarity between words
- words (list(str)) – Words to print the similarity for
ftodtf.input module¶
This module handles all the input-related tasks like loading, pre-processing and batching.
-
class
ftodtf.input.
InputProcessor
(settings)¶ Bases:
object
Handles the creation of training-example batches from the raw training text.
Constructor of InputProcessor
Parameters: settings (ftodtf.settings.FasttextSettings) – An object encapsulating all the settings for the fasttext-model -
batches
(passes=1)¶ Returns a generator that yields training batches ready to feed into the model.
Parameters: passes (int) – How many passes over the input data should be done. Default: 1. 0 will repeat the input forever, yielding an infinite number of batches.
-
preprocess
()¶ Do the needed preprocessing of the dataset: count word frequencies and create a mapping word->int.
-
string_samples
()¶ Returns a generator for samples (targetword->contextword).
Returns: A generator yielding 2-tuples consisting of a target word and a context word.
-
-
ftodtf.input.
find_and_clean_sentences
(corpus, language)¶ Uses NLTK to parse the corpus and find the sentences.
Parameters: corpus (str) – The corpus in which the sentences should be found.
Returns: A list of sentences.
-
ftodtf.input.
find_and_clean_sentences_helper
(args)¶ Auxiliary function to unwrap the arguments for multiprocessing.
Parameters: args – Takes the corpus and the specified language of the corpus.
Returns: The result of the find_and_clean_sentences function.
-
ftodtf.input.
generate_ngram_per_word
(word, ngram_window=2)¶ Generates ngram strings of the specified size for a given word. Before processing, the beginning and end of the word are marked with “*”. The ngrams also include the full word (including the added *s). This is the same process as described in the fasttext paper.
Parameters: - word (str) – The token string which represents a word.
- ngram_window (int) – The size of the ngrams
Returns: A generator which yields ngrams.
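The description above can be sketched as follows. The function name `ngrams_sketch` is hypothetical, and the position of the full word in the output (here: last) is an assumption; the library may order or deduplicate the ngrams differently.

```python
def ngrams_sketch(word, ngram_window=2):
    """Yield character ngrams of a word marked with '*' at both ends."""
    marked = "*" + word + "*"
    # Slide a window of ngram_window characters over the marked word.
    for start in range(len(marked) - ngram_window + 1):
        yield marked[start:start + ngram_window]
    # The full marked word is included as well (assumed here to come last).
    yield marked


example_ngrams = list(ngrams_sketch("cat", ngram_window=2))
```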
-
ftodtf.input.
hash_string_list
(strings, buckets, offset=0)¶ Hashes each element in a list of strings using the FNV-1a algorithm.
Parameters: - strings (list(str)) – A list of strings to hash.
- buckets (int) – The maximum number of different hash values to produce. (All hashes are taken mod buckets.)
- offset (int) – The smallest possible hash value. Can be used to make hash values start at a number other than 0
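A minimal sketch of bucketed FNV-1a hashing. The 32-bit variant, the UTF-8 encoding and the names `fnv1a_32` and `hash_strings_sketch` are assumptions made for illustration; the library may use a different width or encoding.

```python
FNV_OFFSET_BASIS_32 = 2166136261
FNV_PRIME_32 = 16777619


def fnv1a_32(text):
    """Return the 32-bit FNV-1a hash of a string's UTF-8 bytes."""
    h = FNV_OFFSET_BASIS_32
    for byte in text.encode("utf-8"):
        h ^= byte                         # FNV-1a: XOR first ...
        h = (h * FNV_PRIME_32) % 2**32    # ... then multiply, keeping 32 bits
    return h


def hash_strings_sketch(strings, buckets, offset=0):
    """Map each string to a value in [offset, offset + buckets)."""
    return [fnv1a_32(s) % buckets + offset for s in strings]


hashed = hash_strings_sketch(["*c", "ca", "at", "t*"], buckets=10, offset=5)
```

Shifting by `offset` leaves the values below it free for other purposes.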
-
ftodtf.input.
inform_progressbar
(func)¶ Decorator used to put the function names into the QUEUE for showing the progress in the progress bar.
Parameters: func – The function which should be decorated.
-
ftodtf.input.
pad_to_length
(li, length, pad='')¶ Pads a given list to a given length with a given padding-element
Parameters: - li (list()) – The list to be padded
- length (int) – The length to pad the list to
- pad (object) – The element to add to the list until the desired length is reached
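A minimal sketch of the padding step. The name `pad_sketch` is hypothetical, and returning a new list while leaving longer lists untouched are assumptions; the library function may modify the list in place or handle over-long lists differently.

```python
def pad_sketch(li, length, pad=""):
    """Return a copy of li extended with pad until it has `length` items."""
    missing = max(0, length - len(li))
    return li + [pad] * missing


padded = pad_sketch(["*c", "ca"], 4)
```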
-
ftodtf.input.
parse_files_sequential
(file_folder, language, sentences)¶ Parse the raw data files from the training folder sequentially.
Parameters: - file_folder – The folder which contains the raw text files.
- language – The language of the text files.
- sentences – A reference to the sentence list.
-
ftodtf.input.
words_to_ngramhashes
(words, num_buckets)¶ Converts a list of words into a list of padded lists of ngram hashes. The resulting matrix can then be used to compute the word vectors for the original words.
Parameters: - words (list(str)) – The words to convert
- num_buckets (int) – The number of hash-buckets to use when hashing the ngrams
Returns: list(list(int))
-
ftodtf.input.
write_batches_to_file
(*args, **kwargs)¶
ftodtf.model module¶
This module handles the building of the tf execution graph
-
class
ftodtf.model.
InferenceModel
(settings)¶ Bases:
object
Builds and represents the tensorflow computation graph for using the trained embeddings. Exports all important operations via fields. An existing checkpoint must be loaded via load() before this model can be used to compute anything.
Constructor for Model
Parameters: settings (ftodtf.settings.FasttextSettings) – An object encapsulating all the settings for the fasttext-model -
load
(logdir, session)¶ Loads pre-trained embeddings from the filesystem
Parameters: - logdir (str) – The path of the folder where the checkpoints created by the training were saved
- session (tf.Session) – The session to restore the variables into
-
-
class
ftodtf.model.
TrainingModel
(settings, cluster=None)¶ Bases:
object
Builds and represents the tensorflow computation graph for the training of the embeddings. Exports all important operations via fields
Constructor for Model
Parameters: - settings (ftodtf.settings.FasttextSettings) – An object encapsulating all the settings for the fasttext-model
- cluster – A tf.train.ClusterSpec object describing the tf-cluster. Needed for variable and ops placement
-
get_scaffold
()¶ Returns a tf.train.Scaffold object describing this graph
Returns: tf.train.Scaffold
-
ftodtf.model.
compute_word_similarities
(ngramhashmatrix, embeddings)¶ Returns a tensorflow-operation that computes the similarities between all input-words using the given embeddings
Parameters: - ngramhashmatrix (tf.Tensor) – A list of lists of ngram-hashes, each list represents the ngrams for one word. (In principle a training batch without labels)
- embeddings (tf.Tensor) – The embeddings to use for converting words to vectors. (Can be a list of tensors)
- num_buckets (int) – The number of hash-buckets used when hashing ngrams
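The documentation does not state which similarity measure is used. As an illustration only, the sketch below assumes cosine similarity and computes a len(words) x len(words) matrix in plain Python rather than as a tensorflow operation. The name `cosine_similarity_matrix` is hypothetical.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarities of a list of vectors.

    Assumes every vector is non-zero; a zero vector would cause a
    division by zero.
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [
            sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
            for v, norm_v in zip(vectors, norms)
        ]
        for u, norm_u in zip(vectors, norms)
    ]


similarity = cosine_similarity_matrix([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
```

Entry `similarity[i][j]` compares word i with word j, so the matrix is symmetric with ones on the diagonal.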
-
ftodtf.model.
create_embedding_weights
(settings)¶ Creates a (partitioned) tensorflow variable for the word embeddings. Exists as a separate function to minimize code duplication between the training and inference models.
-
ftodtf.model.
ngrams_to_vectors
(ngrams, embeddings)¶ Create a tensorflow operation converting a batch consisting of lists of ngrams for a word to a list of vectors. One vector for each word
Parameters: - ngrams – A batch of lists of ngrams
- embeddings – The embeddings to use as tensorflow variable. Can also be a list of variables.
Returns: a batch of vectors
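A plain-Python sketch of turning lists of ngram ids into one vector per word. It assumes the ngram vectors are summed, as in the fastText paper's formulation; the library may instead average them or mask padding entries, and it operates on tensorflow variables rather than lists. The name `ngram_ids_to_vectors` is hypothetical.

```python
def ngram_ids_to_vectors(batch, embeddings):
    """Sum the embedding rows of each word's ngram ids (assumed combination)."""
    dim = len(embeddings[0])
    vectors = []
    for ngram_ids in batch:
        vec = [0.0] * dim
        for idx in ngram_ids:
            row = embeddings[idx]
            for d in range(dim):
                vec[d] += row[d]
        vectors.append(vec)
    return vectors


# Three ngram embeddings of dimension 2, and a batch of two words.
toy_embeddings = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
word_vectors = ngram_ids_to_vectors([[0, 1], [2]], toy_embeddings)
```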
-
ftodtf.model.
parse_batch_func
(batch_size)¶ Returns a function that can parse a batch from a tfrecord-entry
Parameters: batch_size (int) – How many samples are in a batch
ftodtf.settings module¶
This module contains the FasttextSettings class
-
class
ftodtf.settings.
FasttextSettings
¶ Bases:
object
This class contains all the settings for the fasttext-training and also handles things like validation. Use the attributes/variables of this class to set hyperparameters for the model.
Variables: - corpus_path (str) – Path to the file containing text for training the model.
- batches_file (str) – The Filename for the file containing the training-batches. The file is written by the preprocess command and read by the train command.
- log_dir (str) – Directory to write the generated files (e.g. the computed word-vectors) to and read/write checkpoints from.
- steps (int) – How many training steps to perform.
- vocabulary_size (int) – How many words the vocabulary will have. Only the vocabulary_size most frequent words will be processed.
- batch_size (int) – How many training samples to process per batch.
- embedding_size (int) – Dimension of the computed embedding vectors.
- skip_window (int) – How many words to consider left and right of the target-word maximally. The actual window is randomly sampled for each word between 1 and this value
- num_sampled (int) – Number of negative examples to sample when computing the nce_loss.
- ngram_size (int) – How large the ngrams (in which the target words are split) should be.
- num_buckets (int) – How many hash-buckets to use when hashing the ngrams to numbers.
- validation_words (str) – A string of comma-separated words. The similarity of these words to each other will be regularly computed and printed to indicate the progress of the training.
- profile (boolean) – If set to True, tensorflow will profile the graph execution and write results to ./profile.json.
- learnrate (float) – The starting learnrate for the training. The actual learnrate will linearly decrease toward 0 as the specified number of training steps is reached.
- rejection_threshold (float) – The threshold used to subsample the most frequent words.
- job (string) – The role of this node in a distributed setup. Can be ‘worker’ or ‘ps’.
- workers (str) – A comma-separated list of host:port combinations representing the workers in the distributed setup.
- ps (str) – A comma-separated list of host:port combinations representing the parameter servers in the distributed setup. If empty, a non-distributed setup is assumed.
- num_batch_files (int) – Number of batch files which should be created.
- index (int) – The index of the node itself in the list of --workers (or --ps, depending on --job).
- language (str) – The language of the corpus.
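Two of the settings above describe derived quantities: the linearly decaying learnrate and the randomly sampled context window. A minimal sketch of both follows. The function names, the clamping at 0 and the use of `random.randint` are assumptions for illustration, not the library's code.

```python
import random


def linear_learnrate(start_rate, step, total_steps):
    """Decay start_rate linearly so that it reaches 0 at total_steps."""
    remaining = 1.0 - step / total_steps
    # Clamp at 0 in case training runs past total_steps (assumption).
    return max(0.0, start_rate * remaining)


def sample_window(skip_window, rng=random):
    """Pick the actual window size uniformly between 1 and skip_window."""
    return rng.randint(1, skip_window)


rate_at_start = linear_learnrate(0.1, step=0, total_steps=100)
rate_halfway = linear_learnrate(0.1, step=50, total_steps=100)
rate_at_end = linear_learnrate(0.1, step=100, total_steps=100)
window = sample_window(5)
```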
-
attribute_docstring
(attribute, include_defaults=True)¶ Given the name of an attribute of this class, this function will return the docstring for the attribute.
Parameters: attribute (str) – The name of the attribute
Returns: The docstring for the attribute
-
static
distribution_settings
()¶ Returns the names of the settings that are used for configuring the tensorflow cluster
-
static
inference_settings
()¶ Returns the names of the settings that are used for the infer command
-
static
preprocessing_settings
()¶ Returns the names of the settings that are used for the preprocessing command
-
ps_list
¶ Returns ps as a list of strings instead of a comma-separated string like the attribute would.
Returns: A list of strings if ps is set, else None
-
static
training_settings
()¶ Returns the names of the settings that are used for the training command
-
validate_preprocess
()¶ Check if the current settings are valid for preprocessing.
Raises: ValueError if the validation fails
-
validate_train
()¶ Check if the current settings are valid for training.
Raises: ValueError if the validation fails
-
validation_words_list
¶ Returns the validation_words as a list of strings instead of a comma-separated string like the attribute would.
Returns: A list of strings if validation_words is set, else None
-
workers_list
¶ Returns workers as a list of strings instead of a comma-separated string like the attribute would.
Returns: A list of strings if workers is set, else None
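The list-valued properties (ps_list, validation_words_list, workers_list) share one pattern: split a comma-separated string, or return None when it is unset. A minimal sketch using a hypothetical `SettingsSketch` class follows. Treating the empty string as "unset" is an assumption.

```python
class SettingsSketch:
    """Hypothetical stand-in holding a single comma-separated setting."""

    def __init__(self):
        self.workers = ""

    @property
    def workers_list(self):
        # An empty or missing value is treated as "not set" (assumption).
        if not self.workers:
            return None
        return self.workers.split(",")


sketch = SettingsSketch()
unset_value = sketch.workers_list
sketch.workers = "localhost:2222,localhost:2223"
worker_hosts = sketch.workers_list
```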
-
ftodtf.settings.
check_batch_size
(batch_size)¶
-
ftodtf.settings.
check_batches_file
(batches_file)¶
-
ftodtf.settings.
check_corpus_path
(corpus_path)¶
-
ftodtf.settings.
check_embedding_size
(embedding_size)¶
-
ftodtf.settings.
check_index
(job, workers, ps, index)¶
-
ftodtf.settings.
check_job
(job)¶
-
ftodtf.settings.
check_learn_rate
(learnrate)¶
-
ftodtf.settings.
check_log_dir
(log_dir)¶
-
ftodtf.settings.
check_ngram_size
(ngram_size)¶
-
ftodtf.settings.
check_nodelist
(noli, allow_empty=False)¶ Checks if the given argument is a comma-separated list of host:port strings.
Raises: ValueError if it is not
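A minimal sketch of such a check. The name `check_nodelist_sketch`, the exact validation rules (non-empty host, numeric port) and the error messages are assumptions; the library's check may be stricter or more lenient.

```python
def check_nodelist_sketch(noli, allow_empty=False):
    """Raise ValueError unless noli is a comma-separated list of host:port."""
    if noli == "":
        if allow_empty:
            return
        raise ValueError("node list must not be empty")
    for entry in noli.split(","):
        # Split at the last colon so the port is always the final component.
        host, sep, port = entry.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError("expected host:port, got %r" % entry)


def is_valid_nodelist(noli, allow_empty=False):
    """Convenience wrapper returning True/False instead of raising."""
    try:
        check_nodelist_sketch(noli, allow_empty)
    except ValueError:
        return False
    return True
```

With `allow_empty=True` an empty string passes, which matches the description of ps, where an empty value means a non-distributed setup.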
-
ftodtf.settings.
check_num_buckets
(number_buckets)¶
-
ftodtf.settings.
check_num_sampled
(num_sampled)¶
-
ftodtf.settings.
check_rejection_threshold
(rejection_threshold)¶
-
ftodtf.settings.
check_skip_window
(skip_window)¶
-
ftodtf.settings.
check_steps
(steps)¶
-
ftodtf.settings.
check_vocabulary_size
(vocabulary_size)¶
ftodtf.training module¶
This module handles the training of the word-vectors
-
class
ftodtf.training.
PrintLossHook
(every_n_steps, lossop, steptensor)¶ Bases:
tensorflow.python.training.basic_session_run_hooks.StepCounterHook
Implements a Hook that prints the current step and current average loss every x steps
-
after_run
(run_context, run_values)¶ Called after each call to run().
The run_values argument contains results of requested ops/tensors by before_run().
The run_context argument is the same one sent to the before_run call. run_context.request_stop() can be called to stop the iteration.
If session.run() raises any exceptions then after_run() is not called.
- Args:
- run_context: A SessionRunContext object. run_values: A SessionRunValues object.
-
before_run
(run_context)¶ Called before each call to run().
You can return from this call a SessionRunArgs object indicating ops or tensors to add to the upcoming run() call. These ops/tensors will be run together with the ops/tensors originally passed to the original run() call. The run args you return can also contain feeds to be added to the run() call.
The run_context argument is a SessionRunContext that provides information about the upcoming run() call: the originally requested op/tensors, the TensorFlow Session.
At this point graph is finalized and you can not add ops.
- Args:
- run_context: A SessionRunContext object.
- Returns:
- None or a SessionRunArgs object.
-
-
ftodtf.training.
train
(settings)¶ Run the fasttext training.
Parameters: settings (ftodtf.settings.FasttextSettings) – An object encapsulating all the settings for the fasttext-model