Getting Started with Scribe-Data

Welcome to Scribe-Data, an open-source initiative designed to systematically extract, structure, and manage multilingual linguistic data from Wikidata and Wikipedia. It supports the development of language tools such as intelligent keyboards, translation systems, and grammar assistants. Below is a structured overview of its purpose and workflow:

Project Overview

Scribe-Data is a command-line interface (CLI) tool that automates the retrieval and organization of language-specific data (e.g., verb conjugations, noun forms, emoji metadata) from open knowledge bases. Its core focus is on leveraging Wikidata’s lexeme dump—a structured repository of lexical entries—to generate datasets for downstream applications like Scribe-iOS.

Scribe-Data can obtain data from Wikidata in two ways: through the Wikidata Query Service (WDQS) or from a Wikidata lexeme dump.

The key differences between WDQS and Wikidata lexeme dumps are:

Feature              | Wikidata Query Service (WDQS)                 | Wikidata Lexeme Dump
---------------------|-----------------------------------------------|--------------------------------
Access Method        | Interactive web interface with SPARQL queries | Static file downloads
Data Freshness       | Real-time access to current data              | Snapshot of data at dump time
Query Flexibility    | Complex queries with filters and conditions   | Limited to downloaded content
Language Support     | Results available in any language             | Limited to dump contents
Data Scope           | Full Wikidata knowledge base                  | Lexeme-specific data only
Processing Power     | Server-side processing                        | Local processing by Scribe-Data
Storage Requirements | No storage needed                             | Large file downloads required

While Lexeme dumps are useful for offline processing or bulk data analysis, WDQS provides a more flexible and efficient solution for most querying needs, especially when working with dynamic or complex queries requiring real-time data access.
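
To make the WDQS side of the comparison concrete, here is a minimal sketch of how a lexeme query could be assembled in Python. It is not Scribe-Data's own query code. The helper names (build_lexeme_query, build_request_url) are hypothetical, and the example assumes the public WDQS endpoint and the Wikidata items Q1860 (English) and Q24905 (verb). The script only builds the query and the request URL; it does not contact the server.

```python
from urllib.parse import urlencode

# Public WDQS SPARQL endpoint (assumed here; not taken from Scribe-Data's code).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_lexeme_query(language_qid: str, category_qid: str, limit: int = 10) -> str:
    """Hypothetical helper: SPARQL for lexemes of one language and lexical category."""
    return f"""
SELECT ?lexeme ?lemma WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lexicalCategory wd:{category_qid} ;
          wikibase:lemma ?lemma .
}}
LIMIT {limit}
""".strip()


def build_request_url(query: str) -> str:
    """Hypothetical helper: URL that asks WDQS for JSON results."""
    params = urlencode({"query": query, "format": "json"})
    return f"{WDQS_ENDPOINT}?{params}"


# Q1860 = English, Q24905 = verb
query = build_lexeme_query("Q1860", "Q24905", limit=5)
url = build_request_url(query)
print(query)
print(url)
```

With a dump, the same filtering happens locally over the downloaded file, which is the trade-off between server-side and local processing described above.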

The autosuggestion process uses popular words from Wikipedia and their common successors as a baseline until NLP methods are applied. Autosuggestions are generated in gen_autosuggestions.ipynb. Emojis come from Unicode CLDR via the scribe-data get -lang LANGUAGE -dt emoji-keywords command.
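
The idea of "popular words and their common successors" can be illustrated with a small bigram count. This is a simplified sketch, not the code in gen_autosuggestions.ipynb: the function name build_autosuggestions, its parameters, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_autosuggestions(text, num_words=3, num_suggestions=2):
    """Map the most frequent words to the words that most often follow them."""
    tokens = text.lower().split()
    word_counts = Counter(tokens)

    # Count how often each word is followed by each other word.
    successors = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        successors[current][following] += 1

    suggestions = {}
    for word, _ in word_counts.most_common(num_words):
        suggestions[word] = [
            follower for follower, _ in successors[word].most_common(num_suggestions)
        ]
    return suggestions


# Toy corpus standing in for Wikipedia text (hypothetical data).
corpus = "the cat sat on the mat and the cat ran to the door"
suggestions = build_autosuggestions(corpus, num_words=2, num_suggestions=2)
print(suggestions)
```

Here "the" is the most frequent word and "cat" is its most common successor, so "cat" would be the first suggestion offered after typing "the".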

What is Scribe-Data?

Scribe-Data is a command-line interface (CLI) tool that simplifies extracting, formatting, and managing multilingual data (e.g., verbs, nouns, emojis) from open knowledge bases. Developers use this data to build apps like Scribe-iOS, which offers features like verb conjugation and translation.


Key Features

  1. CLI Commands:

    • list: Show available languages (e.g., scribe-data list --language).

    • get: Fetch data (e.g., scribe-data get -lang English -dt verbs).

    • total: Check data counts (e.g., scribe-data total -lang German).

    • convert: Transform data into CSV/TSV/JSON formats.

  2. Interactive Mode:
    Run scribe-data get -i for a guided interface to select languages, data types, and output formats.

  3. Data Sources:

    • Wikidata (lexemes, grammar rules).

    • Wikipedia (popular words for autosuggestions).

    • Unicode CLDR (emoji keywords).
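
As a rough illustration of what a conversion between the formats named under the convert command involves, the sketch below turns a JSON list of records into CSV or TSV with the Python standard library. It is not Scribe-Data's implementation, and the record layout (lemma, past) is a made-up example rather than the tool's real output schema.

```python
import csv
import io
import json


def json_to_delimited(json_text, delimiter=","):
    """Convert a JSON array of flat objects to delimited text (CSV by default)."""
    records = json.loads(json_text)

    # Collect column names in first-seen order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample data, not actual Scribe-Data output.
sample = json.dumps([
    {"lemma": "go", "past": "went"},
    {"lemma": "see", "past": "saw"},
])
csv_text = json_to_delimited(sample)
tsv_text = json_to_delimited(sample, delimiter="\t")
print(csv_text)
print(tsv_text)
```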


Quick Example: Fetching English Verbs

# Retrieve English verbs and save to a directory
scribe-data get --language English --data-type verbs --output-dir ./my_data

Adding the --wikidata-dump-path (-wdp) option makes Scribe-Data read from a Wikidata lexeme dump.
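
If you want to script these calls, one option is to assemble the command in Python and hand it to subprocess. The sketch below only builds and prints the command lines, so it runs without Scribe-Data installed. The helper build_get_command is hypothetical, the dump file name is a placeholder, and only the flags shown above are used.

```python
import shlex


def build_get_command(language, data_type, output_dir=None, dump_path=None):
    """Hypothetical helper: argument list for a `scribe-data get` call."""
    command = ["scribe-data", "get", "--language", language, "--data-type", data_type]
    if output_dir is not None:
        command += ["--output-dir", output_dir]
    if dump_path is not None:
        # Point the CLI at a local lexeme dump.
        command += ["--wikidata-dump-path", dump_path]
    return command


wdqs_command = build_get_command("English", "verbs", output_dir="./my_data")
dump_command = build_get_command(
    "English",
    "verbs",
    output_dir="./my_data",
    dump_path="./latest-lexemes.json.bz2",  # placeholder path
)
print(shlex.join(wdqs_command))
print(shlex.join(dump_command))

# To actually run a command (requires scribe-data to be installed):
#   import subprocess
#   subprocess.run(wdqs_command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.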

For more details, see the Scribe-Data CLI Usage guide and the scribe-data.readthedocs documentation.

Why Contribute?

Contributing to Scribe-Data offers meaningful impact and growth opportunities:

Professional Growth

  • Gain expertise in SPARQL queries, data processing, and multilingual computing

  • Build a public portfolio of meaningful open-source contributions

  • Collaborate with developers globally on language technology challenges

Impact on Language Accessibility

  • Help underserved language communities access better digital tools

  • Support language preservation through structured documentation

  • Enable innovative applications in education and communication

Technical Learning

  • Master practical skills in Python, CLI development, and data processing

  • Understand complex linguistic data structures and relationships

  • Learn best practices in open-source development and documentation


How to Contribute?

  • Fix Data Issues: Improve the underlying Wikidata entries (e.g., missing verb forms) instead of editing Scribe’s files directly. Known data issues are tracked in the Scribe-Data open issues.

  • Expand Languages: Add support for underrepresented languages by writing missing SPARQL queries or updating Wikidata.

  • Build Features: Work on CLI enhancements (e.g., a terminal UI for interactive mode), add missing tests, or propose a new feature that helps the Scribe community.


Get Started Today!

  1. Clone the repo and follow the Scribe-Data installation instructions.

  2. Join the Community: Chat with the team on Matrix and pick up tasks labeled good first issue.

Scribe-Data is more than code—it’s about making language tools accessible to everyone. Ready to dive in? 🚀