sopel-unicode

Sopel plugin for information lookup on Unicode codepoints.

This plugin is designed as a drop-in replacement for the built-in unicode_info plugin. It provides:

The General Category of each codepoint
Optional support for Unicode Character Database (UCD) versions newer than the Python release used to run Sopel, if unicodedata2 is available
Optional support for reporting the Unicode version that introduced each codepoint, if unicode_age is available

Installing

Releases are hosted on PyPI, so after installing Sopel, all you need is pip:

$ pip install sopel-unicode

Disable built-in `unicode_info` plugin

You should edit your Sopel core config to add unicode_info to the exclude plugin list, otherwise you will get duplicated responses from both plugins.

Optional features

sopel-unicode is designed to act as a drop-in replacement for the original built-in plugin on a basic installation. To enable all the optional features:

$ pip install sopel-unicode[all]

Usage

Note that output given in this section corresponds to sopel-unicode[all] except where noted. Output layout may differ if some optional dependencies are missing.

Codepoint lookup

The unicode (short-form u) command provides lookup of codepoints in a provided string. Input characters defined by the configuration option ignore_chars are ignored.

Lookup uses unicodedata2 if it is available, and falls back on stdlib unicodedata otherwise.

<SnoopJ> .unicode 🫩
<terribot> [unicode] (🫩): U+1FAE9 v16.0 (So) FACE WITH BAGS UNDER EYES
<SnoopJ> .u 🏴‍☠ 
<terribot> [unicode] (🏴): U+1F3F4 v7.0 (So) WAVING BLACK FLAG
<terribot> [unicode] (‍): U+200D v1.1 (Cf) ZERO WIDTH JOINER
<terribot> [unicode] (☠): U+2620 v1.1 (So) SKULL AND CROSSBONES

It is sometimes convenient to discard all ASCII characters from lookup, which can be done with the unicode:noascii(u:noascii) command:

<SnoopJ> .u:noascii ça va?
<terribot> [unicode] (ç): U+00E7 v1.1 (Ll) LATIN SMALL LETTER C WITH CEDILLA

The unicode:raw (u:raw) command is provided to avoid discarding any codepoints when performing lookup.

<SnoopJ> .unicode:raw a b
<terribot> [unicode] (a): U+0061 v1.1 (Ll) LATIN SMALL LETTER A
<terribot> [unicode] ( ): U+0020 v1.1 (Zs) SPACE
<terribot> [unicode] (b): U+0062 v1.1 (Ll) LATIN SMALL LETTER B

Individual codepoints can also be looked up with hex notation, in either U+NNNN form, 0xNNNN form, or \uNNNN form.

<SnoopJ> .unicode U+037E
<terribot> [unicode] (;): U+037E v1.1 (Po) GREEK QUESTION MARK
<SnoopJ> .u 0xBEEF
<terribot> [unicode] (뻯): U+BEEF v2.0 (Lo) HANGUL SYLLABLE BBEGS
<SnoopJ> .u \u732b
<terribot> [unicode] (猫): U+732B v1.1 (Lo) CJK UNIFIED IDEOGRAPH-732B

Note that the \u notation is not restricted in the same way as the same notation for Python literals. You may use as many or as few hex digits as you like.

<SnoopJ> .u \u1
<terribot> [unicode] (): U+0001 v1.1 (Cc) START OF HEADING
<SnoopJ> .u \u12345
<terribot> [unicode] (𒍅): U+12345 v5.0 (Lo) CUNEIFORM SIGN URU TIMES KI

Normalization forms

The Unicode normalization forms are available to transform input strings.

Input characters defined by the configuration option ignore_chars are ignored.

<SnoopJ> .unicode:NFKD ça va
<terribot> [unicode] (c): U+0063 v1.1 (Ll) LATIN SMALL LETTER C
<terribot> [unicode] (◌̧): U+0327 v1.1 (Mn) COMBINING CEDILLA
<terribot> [unicode] (a): U+0061 v1.1 (Ll) LATIN SMALL LETTER A
<terribot> [unicode] (v): U+0076 v1.1 (Ll) LATIN SMALL LETTER V
<terribot> [unicode] (a): U+0061 v1.1 (Ll) LATIN SMALL LETTER A
<SnoopJ> .u:NFKC ça va
<terribot> [unicode] (ç): U+00E7 v1.1 (Ll) LATIN SMALL LETTER C WITH CEDILLA
<terribot> [unicode] (a): U+0061 v1.1 (Ll) LATIN SMALL LETTER A
<terribot> [unicode] (v): U+0076 v1.1 (Ll) LATIN SMALL LETTER V
<terribot> [unicode] (a): U+0061 v1.1 (Ll) LATIN SMALL LETTER A

Codepoint search

A rudimentary search functionality is available. The maximum number of matches reported can be configured, as many queries produce a large number of results.

<SnoopJ> .unicode:search apple
<terribot> [unicode] 3 results:
<terribot> [unicode] 🍍 U+1f34d PINEAPPLE
<terribot> [unicode] 🍎 U+1f34e RED APPLE
<terribot> [unicode] 🍏 U+1f34f GREEN APPLE

Configuring

The easiest way to configure sopel-unicode is via Sopel's configuration wizard—simply run sopel-plugins configure sopel-unicode and enter the values for which it prompts you.

Field	Description	Default (if any)
`max_length`	Maximum length of Unicode string input	`5`
`length_override_channels`	Channels where max_length does not apply	`[]`
`ignore_characters`	Characters ignored during lookup	`[' ']`
`search_max_matches`	Maximum number of matches for a codepoint search	`10`
`search_num_public_matches`	Number of matches publicly reported for a codepoint search	`2`

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
.github/workflows		.github/workflows
sopel_unicode		sopel_unicode
tools		tools
COPYING		COPYING
Makefile		Makefile
NEWS		NEWS
README.md		README.md
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

sopel-unicode

Installing

Disable built-in `unicode_info` plugin

Optional features

Usage

Codepoint lookup

Normalization forms

Codepoint search

Configuring

About

Uh oh!

Releases 1

Sponsor this project

Uh oh!

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

sopel-unicode

Installing

Disable built-in unicode_info plugin

Optional features

Usage

Codepoint lookup

Normalization forms

Codepoint search

Configuring

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Sponsor this project

Uh oh!

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Disable built-in `unicode_info` plugin

Packages