CRAN

tokenizers 0.2.1

Fast, Consistent Tokenization of Natural Language Text

Released Mar 29, 2018 by Lincoln Mullen

This package can be loaded by Renjin, but 26 out of 73 tests failed.

Dependencies

SnowballC 0.5.1
Rcpp
stringi

Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, tweets, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.

Installation

Maven

This package can be included as a dependency in a Java or Scala project by adding the following to your project's pom.xml file. Read more about embedding Renjin in JVM-based projects.

<dependencies>
  <dependency>
    <groupId>org.renjin.cran</groupId>
    <artifactId>tokenizers</artifactId>
    <version>0.2.1-b5</version>
  </dependency>
</dependencies>
<repositories>
  <repository>
    <id>bedatadriven</id>
    <name>bedatadriven public repo</name>
    <url>https://nexus.bedatadriven.com/content/groups/public/</url>
  </repository>
</repositories>

View build log

Renjin CLI

If you're using Renjin from the command line, you load this library by invoking:

library('org.renjin.cran:tokenizers')
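Once loaded, the package's tokenizer functions can be called as in GNU R. A minimal sketch using `tokenize_words()` and `tokenize_ngrams()` (both part of the package's documented interface), assuming these particular tokenizers are among those that pass under Renjin:

```r
library('org.renjin.cran:tokenizers')

# Split text into lowercase word tokens; punctuation is stripped by default
words <- tokenize_words("Fast, consistent tokenization.")
print(words[[1]])   # "fast" "consistent" "tokenization"

# Shingled bigrams over the same text
bigrams <- tokenize_ngrams("Fast, consistent tokenization.", n = 2)
print(bigrams[[1]])
```

Each function returns a list with one element per input document, so a single input string yields a one-element list of character vectors.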

Test Results

This package was last tested against Renjin 0.9.2687 on Aug 25, 2018.

Source

R
C++

View GitHub Mirror

Release History