Commit 08ad60e

Merge pull request #42 from nipunsadvilkar/npn-pysbd-spacy-factory
✨ `pysbd` as a spacy component through entrypoints
2 parents a2bb451 + b9e0949 commit 08ad60e

5 files changed, +49 -2 lines changed
README.md

Lines changed: 24 additions & 1 deletion
@@ -1,11 +1,13 @@
 # pySBD: Python Sentence Boundary Disambiguation (SBD)
 
-[![Build Status](https://travis-ci.org/nipunsadvilkar/pySBD.svg?branch=master)](https://travis-ci.org/nipunsadvilkar/pySBD) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/nipunsadvilkar/pySBD/blob/master/LICENSE)
+[![Build Status](https://travis-ci.org/nipunsadvilkar/pySBD.svg?branch=master)](https://travis-ci.org/nipunsadvilkar/pySBD) [![License](https://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](https://github.com/nipunsadvilkar/pySBD/blob/master/LICENSE) [![PyPi](https://img.shields.io/pypi/v/pysbd?color=blue&logo=pypi&logoColor=white)](https://pypi.python.org/pypi/pysbd) [![GitHub](https://img.shields.io/github/v/release/nipunsadvilkar/pySBD.svg?include_prereleases&logo=github&style=flat)](https://github.com/nipunsadvilkar/pySBD)
 
 pySBD - python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detection module that works out-of-the-box.
 
 This project is a direct port of ruby gem - [Pragmatic Segmenter](https://github.com/diasks2/pragmatic_segmenter) which provides rule-based sentence boundary detection.
 
+![pysbd_code](artifacts/pysbd_code.png?raw=true "pysbd_code")
+
 ## Install
 
 **Python**
@@ -25,6 +27,27 @@ print(seg.segment(text))
 ```
 
 - Use `pysbd` as a [spaCy](https://spacy.io/usage/processing-pipelines) pipeline component. (recommended)</br>Please refer to example [pysbd\_as\_spacy\_component.py](https://github.com/nipunsadvilkar/pySBD/blob/master/examples/pysbd_as_spacy_component.py)
+- Use pysbd through [entrypoints](https://spacy.io/usage/saving-loading#entry-points-components)
+
+```python
+import spacy
+from pysbd.utils import PySBDFactory
+
+nlp = spacy.blank('en')
+
+# explicitly adding component to pipeline
+# (recommended - makes it more readable to tell what's going on)
+nlp.add_pipe(PySBDFactory(nlp))
+
+# or you can use it implicitly with keyword
+# pysbd = nlp.create_pipe('pysbd')
+# nlp.add_pipe(pysbd)
+
+doc = nlp('My name is Jonas E. Smith. Please turn to p. 55.')
+print(list(doc.sents))
+# [My name is Jonas E. Smith., Please turn to p. 55.]
+
+```
 
 ## Contributing
 
artifacts/pysbd_code.png

81.1 KB

pysbd/about.py

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 # https://python-packaging-user-guide.readthedocs.org/en/latest/single_source_version/
 
 __title__ = "pysbd"
-__version__ = "0.2.0"
+__version__ = "0.2.1"
 __summary__ = "pysbd (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box across many languages."
 __uri__ = "http://nipunsadvilkar.github.io/"
 __author__ = "Nipun Sadvilkar"

pysbd/utils.py

Lines changed: 21 additions & 0 deletions
@@ -1,6 +1,7 @@
 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
 import re
+import pysbd
 
 
 class Rule(object):
@@ -65,6 +66,26 @@ def __eq__(self, other):
         return False
 
 
+class PySBDFactory(object):
+    """pysbd as a spacy component through entrypoints"""
+
+    def __init__(self, nlp, language='en', clean=False, char_span=True):
+        self.nlp = nlp
+        self.seg = pysbd.Segmenter(language=language, clean=clean,
+                                   char_span=char_span)
+
+    def __call__(self, doc):
+        sents_char_spans = self.seg.segment(doc.text)
+        char_spans = [doc.char_span(sent_span.start, sent_span.end)
+                      for sent_span in sents_char_spans]
+        start_token_ids = [span[0].idx for span in char_spans if span
+                           is not None]
+        for token in doc:
+            token.is_sent_start = (True if token.idx
+                                   in start_token_ids else False)
+        return doc
+
+
 if __name__ == "__main__":
     SubstituteListPeriodRule = Rule('♨', '∯')
     StdRule = Rule(r'∯', r'∯♨')
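
The core of `PySBDFactory.__call__` is a mapping from sentence character spans to the tokens that open each sentence. The sketch below illustrates that idea in plain Python, without spaCy or pysbd. `Token`, `CharSpan`, `tokenize` and `sentence_start_flags` are hypothetical stand-ins, and the spans are written by hand rather than produced by `pysbd.Segmenter`. Unlike `doc.char_span`, this sketch only compares start offsets and does not check that a span ends on a token boundary.

```python
from collections import namedtuple

# Hypothetical stand-ins for spaCy tokens and pysbd character spans.
Token = namedtuple("Token", ["text", "idx"])
CharSpan = namedtuple("CharSpan", ["start", "end"])


def tokenize(text):
    """Whitespace tokenizer that records each token's character offset."""
    tokens, pos = [], 0
    for word in text.split():
        pos = text.index(word, pos)  # offset of this word in the text
        tokens.append(Token(word, pos))
        pos += len(word)
    return tokens


def sentence_start_flags(tokens, sent_spans):
    """Flag a token as a sentence start when its offset opens a span."""
    starts = {span.start for span in sent_spans}
    return [tok.idx in starts for tok in tokens]


text = "My name is Jonas E. Smith. Please turn to p. 55."
# Hand-written spans (assumed for illustration, end offset exclusive).
spans = [CharSpan(0, 26), CharSpan(27, 48)]

tokens = tokenize(text)
flags = sentence_start_flags(tokens, spans)
print([tok.text for tok, flag in zip(tokens, flags) if flag])
# ['My', 'Please']
```

Using a set for the start offsets keeps each membership test constant-time; the committed code uses a list, which behaves the same but scans it for every token.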

setup.py

Lines changed: 3 additions & 0 deletions
@@ -88,4 +88,7 @@ def run(self):
     cmdclass={
         'upload': UploadCommand,
     },
+    entry_points={
+        "spacy_factories": ["pysbd = pysbd.utils:PySBDFactory"]
+    }
 )
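
The `entry_points` declaration above registers the name `pysbd` under the `spacy_factories` group, which is what lets the README's `nlp.create_pipe('pysbd')` resolve the factory by keyword. As a rough sketch of how such a group can be inspected with the standard library (this is not spaCy's own lookup code, and `find_factories` is a hypothetical helper), one could list whatever factories the installed packages advertise:

```python
from importlib.metadata import entry_points


def find_factories(group="spacy_factories"):
    """Return {name: 'module:attr'} for entry points in the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry point collection
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


factories = find_factories()
print(factories)
```

With this release of pysbd installed, the returned mapping would be expected to contain a `pysbd` key pointing at `pysbd.utils:PySBDFactory`; in an environment without it the key is simply absent.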
