How to start Scribe

Welcome to Scribe-Data, an open-source initiative designed to systematically extract, structure, and manage multilingual linguistic data from Wikidata and Wikipedia. It supports the development of language tools such as intelligent keyboards, translation systems, and grammar assistants. Below is a structured overview of its purpose and workflow:

Project Overview

Scribe-Data is a command-line interface (CLI) tool that automates the retrieval and organization of language-specific data (e.g., verb conjugations, noun forms, emoji metadata) from open knowledge bases. Its core focus is on leveraging Wikidata’s lexeme dump—a structured repository of lexical entries—to generate datasets for downstream applications like Scribe-iOS.

Scribe-Data can retrieve data from Wikidata in two ways: through the Wikidata Query Service (WDQS) or from a Wikidata lexeme dump.

The key differences between WDQS and the lexeme dump are:

Feature	Wikidata Query Service (WDQS)	Wikidata Lexeme Dump
Access Method	Interactive web interface with SPARQL queries	Static file downloads
Data Freshness	Real-time access to current data	Snapshot of data at dump time
Query Flexibility	Complex queries with filters and conditions	Limited to downloaded content
Language Support	Results available in any language	Limited to dump contents
Data Scope	Full Wikidata knowledge base	Lexeme-specific data only
Processing Power	Server-side processing	Local processing by Scribe-Data
Storage Requirements	No storage needed	Large file downloads required

While Lexeme dumps are useful for offline processing or bulk data analysis, WDQS provides a more flexible and efficient solution for most querying needs, especially when working with dynamic or complex queries requiring real-time data access.
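To make the WDQS side of the comparison concrete, here is a minimal Python sketch of a live lexeme query. It is not Scribe-Data's own query code: the helper names build_lexeme_query and fetch_lemmas are hypothetical, and the example assumes the Wikidata items Q1860 (English) and Q24905 (verb) for the language and lexical category.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_lexeme_query(language_qid: str, category_qid: str, limit: int = 10) -> str:
    """Return a SPARQL query for lemmas of one language and lexical category."""
    return (
        "SELECT ?lexeme ?lemma WHERE {\n"
        f"  ?lexeme dct:language wd:{language_qid} ;\n"
        f"          wikibase:lexicalCategory wd:{category_qid} ;\n"
        "          wikibase:lemma ?lemma .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def fetch_lemmas(query: str) -> list[str]:
    """Send the query to WDQS and return the lemma strings (needs network access)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        headers={"User-Agent": "scribe-data-intro-example/0.1"},  # hypothetical agent string
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    return [row["lemma"]["value"] for row in payload["results"]["bindings"]]
```

Calling fetch_lemmas(build_lexeme_query("Q1860", "Q24905", limit=5)) asks the server for five English verb lemmas. Nothing is stored locally, which is the trade-off the table describes: the data is current, but every request depends on the network and on the query service.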

The autosuggestion process uses popular words from Wikipedia and their common successors as a baseline until NLP methods are applied. Autosuggestions are generated in gen_autosuggestions.ipynb. Emojis come from Unicode CLDR via the scribe-data get -lang LANGUAGE -dt emoji-keywords command.
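The baseline idea (popular words plus their most common successors) can be sketched in a few lines of Python. This is only an illustration of the technique, not the code in gen_autosuggestions.ipynb; the function name, its parameters, and the sample sentence are all made up for the example.

```python
from collections import Counter, defaultdict


def build_autosuggestions(tokens, num_words=3, num_suggestions=2):
    """Map each of the most frequent words to its most common successors."""
    word_counts = Counter(tokens)
    successors = defaultdict(Counter)
    # Count how often each word directly follows another one.
    for current_word, next_word in zip(tokens, tokens[1:]):
        successors[current_word][next_word] += 1

    suggestions = {}
    for word, _ in word_counts.most_common(num_words):
        top_successors = successors[word].most_common(num_suggestions)
        suggestions[word] = [successor for successor, _ in top_successors]
    return suggestions


# Made-up sample text standing in for a Wikipedia corpus.
sample_text = "the cat sat on the mat and the cat ran to the dog"
print(build_autosuggestions(sample_text.split(), num_words=2, num_suggestions=2))
```

In this sample, "the" is the most frequent word and "cat" is its most frequent successor, so "cat" would be the first suggestion offered after "the".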

What is Scribe-Data?

Scribe-Data is a command-line interface (CLI) tool that simplifies extracting, formatting, and managing multilingual data (e.g., verbs, nouns, emojis) from open knowledge bases. Developers use this data to build apps like Scribe-iOS, which offers features like verb conjugation and translation.

Key Features

CLI Commands:
- list: Show available languages (e.g., scribe-data list --language).
- get: Fetch data (e.g., scribe-data get -lang English -dt verbs).
- total: Check data counts (e.g., scribe-data total -lang German).
- convert: Transform data into CSV/TSV/JSON formats.
Interactive Mode:
Run scribe-data get -i for a guided interface to select languages, data types, and output formats.
Data Sources:
- Wikidata (lexemes, grammar rules).
- Wikipedia (popular words for autosuggestions).
- Unicode CLDR (emoji keywords).
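The CLI commands above can also be driven from a script. The following is a minimal sketch of a thin Python wrapper, assuming scribe-data is installed and on the PATH. The helper names build_get_command and run are hypothetical, and the long flags --language, --data-type, and --output-dir are the ones used in the quick example in this guide.

```python
import shlex
import subprocess


def build_get_command(language, data_type, output_dir=None):
    """Build the argument list for a `scribe-data get` call."""
    command = ["scribe-data", "get", "--language", language, "--data-type", data_type]
    if output_dir is not None:
        command += ["--output-dir", output_dir]
    return command


def run(command, dry_run=True):
    """Print the command; only execute it when dry_run is False."""
    print(shlex.join(command))
    if not dry_run:
        # Assumes the scribe-data executable is available on PATH.
        subprocess.run(command, check=True)


# Dry run: shows the command that would be executed without calling the CLI.
run(build_get_command("English", "verbs", output_dir="./my_data"))
```

Passing the arguments as a list avoids shell quoting problems with language names that contain spaces, and the dry-run default makes it easy to check a batch of commands before fetching anything.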

Quick Example: Fetching English Verbs

# Retrieve English verbs and save to a directory
scribe-data get --language English --data-type verbs --output-dir ./my_data

Adding the --wikidata-dump-path (-wdp) option to this command makes it use the Wikidata lexeme dump.
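To give a feel for what working with a dump involves, here is a hedged Python sketch that streams a Wikidata-style JSON dump line by line and filters lexemes by language and lexical category. It is not how Scribe-Data parses the dump internally. The function names are hypothetical, the two sample entries are made-up data, and the sketch assumes the usual dump layout of one JSON entity per line inside a single array.

```python
import bz2
import io
import json


def iter_entities(lines):
    """Yield one entity dict per line of a Wikidata-style JSON dump."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the enclosing array brackets and blank lines
        yield json.loads(line)


def lemmas_for(lines, language_qid, category_qid):
    """Collect lemma strings for one language and lexical category."""
    found = []
    for entity in iter_entities(lines):
        same_language = entity.get("language") == language_qid
        same_category = entity.get("lexicalCategory") == category_qid
        if same_language and same_category:
            found.extend(lemma["value"] for lemma in entity.get("lemmas", {}).values())
    return found


def lemmas_from_dump(path, language_qid, category_qid):
    """Stream a .json.bz2 dump from disk instead of loading it all at once."""
    with bz2.open(path, mode="rt", encoding="utf-8") as dump_file:
        return lemmas_for(dump_file, language_qid, category_qid)


# Tiny in-memory stand-in for a dump file (made-up sample data).
sample_dump = io.StringIO(
    "[\n"
    '{"type":"lexeme","id":"L1","language":"Q1860","lexicalCategory":"Q24905",'
    '"lemmas":{"en":{"language":"en","value":"run"}}},\n'
    '{"type":"lexeme","id":"L2","language":"Q188","lexicalCategory":"Q24905",'
    '"lemmas":{"de":{"language":"de","value":"laufen"}}}\n'
    "]\n"
)
print(lemmas_for(sample_dump, "Q1860", "Q24905"))
```

Reading one line at a time keeps memory use flat even for a very large file. The cost is a full pass over the dump for every question asked of it.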

For more details, see the Scribe-Data CLI Usage guide and the documentation at scribe-data.readthedocs.

Why Contribute?

Contributing to Scribe-Data offers meaningful impact and growth opportunities:

Professional Growth

- Gain expertise in SPARQL queries, data processing, and multilingual computing
- Build a public portfolio of meaningful open-source contributions
- Collaborate with developers globally on language technology challenges

Impact on Language Accessibility

- Help underserved language communities access better digital tools
- Support language preservation through structured documentation
- Enable innovative applications in education and communication

Technical Learning

- Master practical skills in Python, CLI development, and data processing
- Understand complex linguistic data structures and relationships
- Learn best practices in open-source development and documentation

How to Contribute?

- Fix Data Issues: Improve Wikidata entries (e.g., missing verb forms) rather than editing Scribe’s files directly. Known problems are tracked in the Scribe-Data open issues.
- Expand Languages: Add support for underrepresented languages by writing missing SPARQL queries or updating Wikidata.
- Build Features: Work on CLI enhancements (e.g., a terminal UI for interactive mode), add missing tests, or propose a new feature that helps the Scribe community.

Get Started Today!

Clone the Repo: See the Scribe-Data installation process for details.
Join the Community: Chat with the team on Matrix and pick up tasks labeled good first issue.

Scribe-Data is more than code—it’s about making language tools accessible to everyone. Ready to dive in? 🚀