This page documents the official and community datasets, models and research papers featuring TLDR-pages.
Datasets
Official Datasets
We generate datasets in CSV, XML, JSON, and TMX (Translation Memory eXchange) formats using the https://github.com/tldr-pages/tldr-translation-pairs-gen tool. They can be found under its latest release. These artifacts are also available from the sources below:
- OPUS TLDR-pages Dataset (TMX format) (2023 - present)
  - OPUS is a public dataset of translated resources on the web. All translations are derived from freely available and openly licensed sources, so the translations themselves are safe to use with minimal restrictions.
  - These datasets are helpful for a variety of applications such as research and machine learning.
  - A notable project that uses the OPUS corpora is LibreTranslate (which is powered by argos-translate).
- Kaggle Translation Pairs Dataset (CSV format) (2024 - present)
  - Kaggle is a data science competition platform and online community of data scientists and machine learning practitioners, owned by Google LLC.
  - It is popular among students, researchers, and data scientists.
  - This multilingual text dataset contains paired strings that map each English TLDR-pages string to its localized counterparts.
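As a rough illustration of how a TMX translation-pair file can be consumed, the sketch below reads `<tu>` translation units with Python's standard library and yields source/target pairs. The function name `read_tmx_pairs`, the language codes, and the inline sample are assumptions made for this example; the sample text is invented and is not taken from the published dataset, whose exact layout may differ.

```python
import io
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample in TMX 1.4 layout (not real dataset content).
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>List files one per line:</seg></tuv>
      <tuv xml:lang="fr"><seg>Lister les fichiers, un par ligne :</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) tuples from a TMX path or binary file object.

    Hypothetical helper: translation units that lack either language are skipped.
    """
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    tree = ET.parse(source)
    for tu in tree.getroot().iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]


# Usage: parse the in-memory sample; pass a file path for a downloaded TMX file.
for english, french in read_tmx_pairs(io.BytesIO(SAMPLE_TMX.encode("utf-8")), "en", "fr"):
    print(english, "->", french)
```

For large corpora, a streaming parser such as `xml.etree.ElementTree.iterparse` may be a better fit than loading the whole tree, at the cost of slightly more bookkeeping.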
Community Datasets
Warning
The links below point to community datasets provided for academic and research reference. Use them at your own discretion, as their contents are not vetted by our maintainers.
- https://www.kaggle.com/datasets/bppuneethpai/tldr-summary-for-man-pages (2020) - This dataset provides paired man pages and their concise tldr summaries, facilitating the development of text summarization models.
- https://huggingface.co/datasets/neulab/tldr (2022) (Research paper) - Natural language to Bash generation dataset based on tldr pages in English, used for evaluating code generation.
- https://huggingface.co/datasets/tmskss/linux-man-pages-tldr-summarized (2023) - This dataset provides a small CSV of Linux man pages in English paired with their concise tldr summaries for text summarization tasks.
- https://huggingface.co/datasets/Edoigtrd/tldr-pages (2024) - This dataset contains Linux Bash commands from tldr along with their descriptions.
Papers
- Explainable Natural Language to Bash Translation using Abstract Syntax Tree (2021)
- DocPrompting: Generating Code by Retrieving the Docs (2022, 2023)
- ShellFusion: An Answer Generator for Shell Programming Tasks via Knowledge Fusion (2023)
- LLM-Supported Natural Language to Bash Translation (2025)
Newer research papers featuring TLDR pages can be found on Google Scholar.