Commit 05edfdb

✨ ✅ Handle intermittent punctuations (#35)
2 parents 6006cdd + ddebf0f commit 05edfdb

7 files changed (+14 -10 lines)

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -17,3 +17,7 @@
 - 🐛 Fix & ♻️refactor `replace_multi_period_abbreviations` - \#30
 - 🐛 Fix `abbreviation_replacer` - \#31
 - ✅ Add regression tests for issues
+
+# v0.1.4
+
+- ✨ ✅ Handle intermittent punctuations - \#34

pysbd/about.py

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 # https://python-packaging-user-guide.readthedocs.org/en/latest/single_source_version/

 __title__ = "pysbd"
-__version__ = "0.1.3"
+__version__ = "0.1.4"
 __summary__ = "pysbd (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box across many languages."
 __uri__ = "http://nipunsadvilkar.github.io/"
 __author__ = "Nipun Sadvilkar"

pysbd/lang/common/numbers.py

Lines changed: 3 additions & 2 deletions
@@ -5,8 +5,9 @@

 class Common(object):

-    SENTENCE_BOUNDARY_REGEX = r"\\u{ff08}(?:[^\\u{ff09}])*\\u{ff09}(?=\s?[A-Z])|\\u{300c}(?:[^\\u{300d}])*\\u{300d}(?=\s[A-Z])|\((?:[^\)]){2,}\)(?=\s[A-Z])|\'(?:[^\'])*[^,]\'(?=\s[A-Z])|\"(?:[^\"])*[^,]\"(?=\s[A-Z])|\“(?:[^\”])*[^,]\”(?=\s[A-Z])|\S.*?[。..!!??ȸȹ☉☈☇☄]"
-
+    # added special case: r"[。..!!?].*" to handle intermittent dots, exclamation, etc.
+    # TODO: above special cases group can be updated as per developer needs
+    SENTENCE_BOUNDARY_REGEX = r"((?:[^)])*)(?=\s?[A-Z])|「(?:[^」])*」(?=\s[A-Z])|\((?:[^\)]){2,}\)(?=\s[A-Z])|\'(?:[^\'])*[^,]\'(?=\s[A-Z])|\"(?:[^\"])*[^,]\"(?=\s[A-Z])|\“(?:[^\”])*[^,]\”(?=\s[A-Z])|[。..!!?].*|\S.*?[。..!!??ȸȹ☉☈☇☄]"
     # # Rubular: http://rubular.com/r/NqCqv372Ix
     QUOTATION_AT_END_OF_SENTENCE_REGEX = r'[!?\.-][\"\'“”]\s{1}[A-Z]'
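
To see what the added `[。..!!?].*` alternative changes, here is a minimal sketch using only the last two alternatives of `SENTENCE_BOUNDARY_REGEX`. The names `OLD_TAIL`, `NEW_TAIL` and `find_spans` are made up for this illustration, and applying the pattern with `re.findall` is an assumption about how the library consumes it. The inputs are the four strings from the new `#34` regression cases.

```python
import re

# Reduced stand-ins for the tail of SENTENCE_BOUNDARY_REGEX (hypothetical names).
# The parenthesis and quotation alternatives are left out for brevity.
OLD_TAIL = r"\S.*?[。..!!??ȸȹ☉☈☇☄]"
NEW_TAIL = r"[。..!!?].*|" + OLD_TAIL


def find_spans(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)


for sample in [".", "..", ". . .", "! ! !"]:
    # Print the input, then the matches under the old and the new tail.
    print(repr(sample), find_spans(OLD_TAIL, sample), find_spans(NEW_TAIL, sample))
```

Under the old tail a lone `.` produces no match, and `. . .` is cut after the second dot because the lazy `\S.*?` stops at the first terminator it reaches. The new alternative starts at a punctuation character and greedily takes the rest of the line, so punctuation-only input comes back as a single span. Ordinary sentences such as `Hello world. Bye!` still match through the original alternative.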

pysbd/lists_item_replacer.py

Lines changed: 0 additions & 2 deletions
@@ -130,7 +130,6 @@ def replace_item(match, val=None, strip=False, repl='♨'):
 if strip:
     match = str(match).strip()
 chomped_match = match if len(match) == 1 else match.strip('.])')
-print(each, match, chomped_match)
 if str(each) == chomped_match:
     return "{}{}".format(each, replacement)
 else:
@@ -246,4 +245,3 @@ def iterate_alphabet_array(self, regex, parens=False, roman_numeral=False):
 li = ListItemReplacer(text)
 li.add_line_break()
 print(repr(li.text))
-# print(li.text)
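
The `chomped_match` expression in the first hunk relies on `str.strip` treating its argument as a set of characters rather than a suffix. A small sketch of that one line in isolation follows; the helper name `chomp_list_marker` is invented here and is not part of the module.

```python
def chomp_list_marker(match):
    """Mirror the chomped_match expression from replace_item (hypothetical helper).

    One-character matches are kept as they are; otherwise any run of
    '.', ']' or ')' is stripped from both ends of the string.
    """
    match = str(match)
    return match if len(match) == 1 else match.strip('.])')


for marker in ["1.", "a)", "iv.)", "."]:
    print(repr(marker), "->", repr(chomp_list_marker(marker)))
```

The length check matters for single-character input: without it a lone `.` would be stripped to an empty string.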

pysbd/rules.py

Lines changed: 0 additions & 3 deletions
@@ -37,15 +37,12 @@ class Text(str):
     def apply(self, *rules):
         for each_r in rules:
             self = re.sub(each_r.pattern, each_r.replacement, self)
-            # print(self, each_r)
         return self

 if __name__ == "__main__":
     SubstituteListPeriodRule = Rule('♨', '∯')
     StdRule = Rule(r'∯', r'∯♨')
     more_rules = [Rule(r'∯♨', r'∯∯∯∯'), Rule(r'∯∯∯∯', '♨♨')]
-    # Text("I. abcd ♨ acnjfe").apply(SubstituteListPeriodRule, StdRule)
     sample_text = Text("I. abcd ♨ acnjfe")
     output = sample_text.apply(SubstituteListPeriodRule, StdRule, *more_rules)
     print(output)
-    # I. abcd $ acnjfe
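
The rule chain in the `__main__` block can be traced with a self-contained sketch. The `Rule` definition is not part of this diff, so the `namedtuple` below is an assumption that only provides the `pattern` and `replacement` attributes used by `apply`. The loop also accumulates into a local variable instead of rebinding `self`, which is a stylistic substitution and not the committed code.

```python
import re
from collections import namedtuple

# Assumed stand-in for pysbd's Rule: only .pattern and .replacement are needed.
Rule = namedtuple("Rule", ["pattern", "replacement"])


class Text(str):
    def apply(self, *rules):
        """Apply each rule in order, feeding every result into the next rule."""
        result = str(self)
        for rule in rules:
            result = re.sub(rule.pattern, rule.replacement, result)
        return result


rules = [
    Rule('♨', '∯'),          # "I. abcd ∯ acnjfe"
    Rule(r'∯', r'∯♨'),       # "I. abcd ∯♨ acnjfe"
    Rule(r'∯♨', r'∯∯∯∯'),    # "I. abcd ∯∯∯∯ acnjfe"
    Rule(r'∯∯∯∯', '♨♨'),     # "I. abcd ♨♨ acnjfe"
]
output = Text("I. abcd ♨ acnjfe").apply(*rules)
print(output)
```

Because each substitution sees the output of the previous one, the order of the rules determines the result.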

pysbd/segmenter.py

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ def segment(self, text):
 # text = "Proof. First let v ∈ V be incident to at least three leaves and suppose there is a minimum power dominating set S of G that does not contain v. If S excludes two or more of the leaves of G incident to v, then those leaves cannot be dominated or forced at any step. Thus, S excludes at most one leaf incident to v, which means S contains at least two leaves ℓ 1 and ℓ 2 incident to v. Then, (S\{ℓ 1 , ℓ 2 }) ∪ {v} is a smaller power dominating set than S, which is a contradiction. Now consider the case in which v ∈ V is incident to exactly two leaves, ℓ 1 and ℓ 2 , and suppose there is a minimum power dominating set S of G such that {v, ℓ 1 , ℓ 2 } ∩ S = ∅. Then neither ℓ 1 nor ℓ 2 can be dominated or forced at any step, contradicting the assumption that S is a power dominating set. If S is a power dominating set that contains ℓ 1 or ℓ 2 , say ℓ 1 , then (S\{ℓ 1 }) ∪ {v} is also a power dominating set and has the same cardinality. Applying this to every vertex incident to exactly two leaves produces the minimum power dominating set required by (3). Definition 3.4. Given a graph G = (V, E) and a set X ⊆ V , define ℓ r (G, X) as the graph obtained by attaching r leaves to each vertex in X. If X = {v 1 , . . . , v k }, we denote the r leaves attached to vertex v i as ℓ"
 text = "Random walk models (Skellam, 1951;Turchin, 1998) received a lot of attention and were then extended to several more mathematically and statistically sophisticated approaches to interpret movement data such as State-Space Models (SSM) (Jonsen et al., 2003(Jonsen et al., , 2005 and Brownian Bridge Movement Model (BBMM) (Horne et al., 2007). Nevertheless, these models require heavy computational resources (Patterson et al., 2008) and unrealistic structural a priori hypotheses about movement, such as homogeneous movement behavior. A fundamental property of animal movements is behavioral heterogeneity (Gurarie et al., 2009) and these models poorly performed in highlighting behavioral changes in animal movements through space and time (Kranstauber et al., 2012)."
 print("Input String:\n{}".format(text))
-seg = Segmenter(language="en", clean=True)
+seg = Segmenter(language="en", clean=False)
 segments = seg.segment(text)
 print("\n################## Processing #######################\n")
 print("Number of sentences: {}\n".format(len(segments)))

Lines changed: 5 additions & 1 deletion
@@ -15,7 +15,11 @@
 "EMC 3 algorithm is implemented with the Java SE platform and is running on a Java HotSpot(TM) 64-Bit Server VM; and the implementation details are given in Appendix, available in the online supplemental material."
 ]),
 ('#31', r"Proof. First let v ∈ V be incident to at least three leaves and suppose there is a minimum power dominating set S of G that does not contain v. If S excludes two or more of the leaves of G incident to v, then those leaves cannot be dominated or forced at any step. Thus, S excludes at most one leaf incident to v, which means S contains at least two leaves ℓ 1 and ℓ 2 incident to v. Then, (S\{ℓ 1 , ℓ 2 }) ∪ {v} is a smaller power dominating set than S, which is a contradiction. Now consider the case in which v ∈ V is incident to exactly two leaves, ℓ 1 and ℓ 2 , and suppose there is a minimum power dominating set S of G such that {v, ℓ 1 , ℓ 2 } ∩ S = ∅. Then neither ℓ 1 nor ℓ 2 can be dominated or forced at any step, contradicting the assumption that S is a power dominating set. If S is a power dominating set that contains ℓ 1 or ℓ 2 , say ℓ 1 , then (S\{ℓ 1 }) ∪ {v} is also a power dominating set and has the same cardinality. Applying this to every vertex incident to exactly two leaves produces the minimum power dominating set required by (3). Definition 3.4. Given a graph G = (V, E) and a set X ⊆ V , define ℓ r (G, X) as the graph obtained by attaching r leaves to each vertex in X. If X = {v 1 , . . . , v k }, we denote the r leaves attached to vertex v i as ℓ",
-['Proof.', 'First let v ∈ V be incident to at least three leaves and suppose there is a minimum power dominating set S of G that does not contain v. If S excludes two or more of the leaves of G incident to v, then those leaves cannot be dominated or forced at any step.', 'Thus, S excludes at most one leaf incident to v, which means S contains at least two leaves ℓ 1 and ℓ 2 incident to v. Then, (S\\{ℓ 1 , ℓ 2 }) ∪ {v} is a smaller power dominating set than S, which is a contradiction.', 'Now consider the case in which v ∈ V is incident to exactly two leaves, ℓ 1 and ℓ 2 , and suppose there is a minimum power dominating set S of G such that {v, ℓ 1 , ℓ 2 } ∩ S = ∅.', 'Then neither ℓ 1 nor ℓ 2 can be dominated or forced at any step, contradicting the assumption that S is a power dominating set.', 'If S is a power dominating set that contains ℓ 1 or ℓ 2 , say ℓ 1 , then (S\\{ℓ 1 }) ∪ {v} is also a power dominating set and has the same cardinality.', 'Applying this to every vertex incident to exactly two leaves produces the minimum power dominating set required by (3).', 'Definition 3.4.', 'Given a graph G = (V, E) and a set X ⊆ V , define ℓ r (G, X) as the graph obtained by attaching r leaves to each vertex in X. If X = {v 1 , . . . , v k }, we denote the r leaves attached to vertex v i as ℓ'])
+['Proof.', 'First let v ∈ V be incident to at least three leaves and suppose there is a minimum power dominating set S of G that does not contain v. If S excludes two or more of the leaves of G incident to v, then those leaves cannot be dominated or forced at any step.', 'Thus, S excludes at most one leaf incident to v, which means S contains at least two leaves ℓ 1 and ℓ 2 incident to v. Then, (S\\{ℓ 1 , ℓ 2 }) ∪ {v} is a smaller power dominating set than S, which is a contradiction.', 'Now consider the case in which v ∈ V is incident to exactly two leaves, ℓ 1 and ℓ 2 , and suppose there is a minimum power dominating set S of G such that {v, ℓ 1 , ℓ 2 } ∩ S = ∅.', 'Then neither ℓ 1 nor ℓ 2 can be dominated or forced at any step, contradicting the assumption that S is a power dominating set.', 'If S is a power dominating set that contains ℓ 1 or ℓ 2 , say ℓ 1 , then (S\\{ℓ 1 }) ∪ {v} is also a power dominating set and has the same cardinality.', 'Applying this to every vertex incident to exactly two leaves produces the minimum power dominating set required by (3).', 'Definition 3.4.', 'Given a graph G = (V, E) and a set X ⊆ V , define ℓ r (G, X) as the graph obtained by attaching r leaves to each vertex in X. If X = {v 1 , . . . , v k }, we denote the r leaves attached to vertex v i as ℓ']),
+('#34', '.', ['.']),
+('#34', '..', ['..']),
+('#34', '. . .', ['. . .']),
+('#34', '! ! !', ['! ! !']),
 ]

 @pytest.mark.parametrize('issue_no,text,expected_sents', TEST_ISSUE_DATA)
