Sample code showing how to use the SpeakLeash Python package:


example_1_basic_read <-- Shows how to iterate over the datasets and documents in a given repository, printing the name of each dataset and the metadata and text of each document.

example_2_inventory_check <-- Shows how to extract information about datasets from the SpeakLeash repository and print a summary of their metadata as a table, with colorized fields for easier reading.
(Additional libraries can be installed from the requirements.txt in this folder.)

example_3_quality_metrics <-- Shows how to check the distribution of quality metrics in a given dataset and extract quality information for each document.

example_4_extraction_to_files <-- Shows how to extract high-quality documents from a selected dataset: it creates the necessary directories and saves documents of a specified quality from the SpeakLeash dataset to a chosen folder.

example_5_word_cloud <-- Shows two ways to generate a word cloud. The first uses the spaCy library with its lemmatizer; the second uses the NLTK library.

example_6_pandas <-- Shows how to load a SpeakLeash dataset into a pandas DataFrame.

example_7_pandas_polars <-- Shows how to import data from SpeakLeash datasets into pandas and Polars DataFrames.

example_8_dataset_vis <-- Shows how to load data from SpeakLeash datasets and visualize selected metrics, taking document quality (among other things) into account.
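
To give a feel for the quality-based steps described for example_3 and example_4, here is a minimal, library-free sketch. It is not the code from those examples and does not call the SpeakLeash package. It assumes documents arrive as (text, metadata) pairs and that the metadata carries a "quality" key with labels such as HIGH or LOW. The sample data, the function names `quality_distribution` and `save_by_quality`, and the output file naming are all hypothetical.

```python
from collections import Counter
from pathlib import Path
import tempfile

# Stand-in for documents read from a dataset: (text, metadata) pairs.
# Assumption: metadata has a "quality" key with labels like HIGH / LOW.
SAMPLE_DOCS = [
    ("Pierwszy dokument.", {"quality": "HIGH"}),
    ("Drugi dokument.", {"quality": "LOW"}),
    ("Trzeci dokument.", {"quality": "HIGH"}),
    ("Czwarty dokument.", {}),  # no quality label available
]


def quality_distribution(docs):
    """Count documents per quality label; missing labels count as UNKNOWN."""
    return Counter(meta.get("quality", "UNKNOWN") for _, meta in docs)


def save_by_quality(docs, out_dir, wanted="HIGH"):
    """Write every document with the wanted label to its own .txt file."""
    out_path = Path(out_dir)
    # Create the target directory (and any parents) if it does not exist yet.
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (text, meta) in enumerate(docs):
        if meta.get("quality") != wanted:
            continue
        # Hypothetical naming scheme: position of the document in the input.
        target = out_path / f"doc_{index:06d}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    print(quality_distribution(SAMPLE_DOCS))
    with tempfile.TemporaryDirectory() as tmp:
        files = save_by_quality(SAMPLE_DOCS, tmp)
        print([f.name for f in files])
```

To try this on real data, replace `SAMPLE_DOCS` with the document iterator used in the example scripts, provided it yields the same (text, metadata) shape.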

