tokenizers 0.2.1
Fast, Consistent Tokenization of Natural Language Text
Released Mar 29, 2018 by Lincoln Mullen
Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, tweets, Penn Treebank tokens, and regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
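As a quick illustration of that consistent interface, the sketch below calls a few of the package's exported tokenizers on a made-up sample text (the text itself is only an example; under Renjin the package is loaded by its qualified name, as shown in the Installation section below):

library(tokenizers)   # under Renjin: library('org.renjin.cran:tokenizers')

text <- "Tokenization is the first step. It splits text into units such as words or sentences."

# Word tokens (lowercased, punctuation stripped by default)
tokenize_words(text)

# Shingled bigrams over the same text
tokenize_ngrams(text, n = 2)

# Sentence tokens, then a word count for each sentence
sentences <- unlist(tokenize_sentences(text))
count_words(sentences)

Each tokenizer accepts a character vector (or a list of character vectors) and returns a list of token vectors, which is what the "consistent interface" in the description refers to.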
Installation
Maven
This package can be included as a dependency in a Java or Scala project by adding the following to your project's pom.xml file. Read more about embedding Renjin in JVM-based projects.
<dependencies>
  <dependency>
    <groupId>org.renjin.cran</groupId>
    <artifactId>tokenizers</artifactId>
    <version>0.2.1-b5</version>
  </dependency>
</dependencies>
<repositories>
  <repository>
    <id>bedatadriven</id>
    <name>bedatadriven public repo</name>
    <url>https://nexus.bedatadriven.com/content/groups/public/</url>
  </repository>
</repositories>
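For reference, a minimal sketch of what embedding looks like from Java is shown below. It assumes the Renjin script engine (the renjin-script-engine artifact) is also on the classpath alongside the dependency above, and the sample sentence is only illustrative:

import javax.script.ScriptEngine;
import javax.script.ScriptException;
import org.renjin.script.RenjinScriptEngineFactory;

public class TokenizersExample {
  public static void main(String[] args) throws ScriptException {
    // Create a Renjin R interpreter through the standard javax.script API
    ScriptEngine engine = new RenjinScriptEngineFactory().getScriptEngine();

    // Load the package by its Renjin (Maven-style) name
    engine.eval("library('org.renjin.cran:tokenizers')");

    // Tokenize a sample sentence and print the resulting token list
    Object words = engine.eval("tokenize_words('Fast, consistent tokenization of natural language text.')");
    System.out.println(words);
  }
}

The same engine.eval() calls work from Scala or Kotlin, since Renjin is exposed through the standard javax.script API.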
Renjin CLI
If you're using Renjin from the command line, you load this library by invoking:
library('org.renjin.cran:tokenizers')
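Once the library is loaded, the tokenizer functions can be called as in GNU R; for example, an illustrative session (output omitted):

library('org.renjin.cran:tokenizers')
tokenize_sentences("Renjin runs R on the JVM. This package tokenizes text.")
count_words("Renjin runs R on the JVM. This package tokenizes text.")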
Test Results
This package was last tested against Renjin 0.9.2687 on Aug 25, 2018.
- Basic_tokenizers.Character_tokenizer_produces_correct_output
- Basic_tokenizers.Paragraph_tokenizer_produces_correct_output
- Basic_tokenizers.Regex_tokenizer_produces_correct_output
- Basic_tokenizers.Sentence_tokenizer_produces_correct_output_E1
- Basic_tokenizers.Sentence_tokenizer_produces_correct_output_E2
- Basic_tokenizers.Sentence_tokenizer_produces_correct_output_E3
- Basic_tokenizers.Word_tokenizer_produces_correct_output
- Basic_tokenizers.Word_tokenizer_removes_stop_words
- Document_chunking.Document_chunking_work_on_lists_and_character_vectors_E1
- Document_chunking.Document_chunking_work_on_lists_and_character_vectors_E10
- Document_chunking.Document_chunking_work_on_lists_and_character_vectors_E2
- Document_chunking.Document_chunking_work_on_lists_and_character_vectors_E3
- Document_chunking.Document_chunking_work_on_lists_and_character_vectors_E4
- Document_chunking.Document_chunking_work_on_lists_and_character_vectors_E5
- Document_chunking.Document_chunking_work_on_lists_and_character_vectors_E6
- Document_chunking.Document_chunking_work_on_lists_and_character_vectors_E7
- Document_chunking.Document_chunking_work_on_lists_and_character_vectors_E8
- Document_chunking.Document_chunking_work_on_lists_and_character_vectors_E9
- Encodings.Encodings_work_on_Windows_E1
- Encodings.Encodings_work_on_Windows_E2
- Encodings.Encodings_work_on_Windows_E3
- Encodings.Encodings_work_on_Windows_E4
- Encodings.Encodings_work_on_Windows_E5
- Encodings.Encodings_work_on_Windows_E6
- N-gram_tokenizers.Combinations_for_skip_grams_are_correct_E1
- N-gram_tokenizers.Combinations_for_skip_grams_are_correct_E2
- N-gram_tokenizers.Combinations_for_skip_grams_are_correct_E3
- N-gram_tokenizers.Combinations_for_skip_grams_are_correct_E4
- N-gram_tokenizers.Shingled_n-gram_tokenizer_consistently_produces_NAs_where_appropriate
- N-gram_tokenizers.Skip_n-gram_tokenizer_consistently_produces_NAs_where_appropriate
- N-gram_tokenizers.Skip_n-gram_tokenizer_produces_correct_output_E1
- N-gram_tokenizers.Skip_n-gram_tokenizer_produces_correct_output_E2
- N-gram_tokenizers.Skips_with_values_greater_than_k_are_refused_E1
- N-gram_tokenizers.Skips_with_values_greater_than_k_are_refused_E2
- N-gram_tokenizers.Skips_with_values_greater_than_k_are_refused_E3
- N-gram_tokenizers.Skips_with_values_greater_than_k_are_refused_E4
- N-gram_tokenizers.Skips_with_values_greater_than_k_are_refused_E5
- N-gram_tokenizers.Skips_with_values_greater_than_k_are_refused_E6
- PTB_tokenizer.Word_tokenizer_produces_correct_output_E1
- PTB_tokenizer.Word_tokenizer_produces_correct_output_E2
- Stem_tokenizers.Stem_tokenizer_produces_correct_output
- Text_Interchange_Format.Can_coerce_a_TIF_compliant_data_frame_to_a_character_vector
- Text_Interchange_Format.Can_detect_a_TIF_compliant_data_frame_E1
- Text_Interchange_Format.Can_detect_a_TIF_compliant_data_frame_E2
- Text_Interchange_Format.Different_methods_produce_identical_output
- Tweet_tokenizer.names_are_preserved_with_tweet_tokenizer_E1
- Tweet_tokenizer.names_are_preserved_with_tweet_tokenizer_E2
- Tweet_tokenizer.punctuation_as_part_of_tweets_can_preserved_E1
- Tweet_tokenizer.punctuation_as_part_of_tweets_can_preserved_E2
- Tweet_tokenizer.tweet_tokenizer_works_correctly_with_case_E1
- Tweet_tokenizer.tweet_tokenizer_works_correctly_with_case_E2
- Tweet_tokenizer.tweet_tokenizer_works_correctly_with_case_E3
- Tweet_tokenizer.tweet_tokenizer_works_correctly_with_case_E4
- Tweet_tokenizer.tweet_tokenizer_works_correctly_with_strip_punctuation_E1
- Tweet_tokenizer.tweet_tokenizer_works_correctly_with_strip_punctuation_E2
- Tweet_tokenizer.tweet_tokenizer_works_correctly_with_strip_punctuation_E3
- Tweet_tokenizer.tweet_tokenizer_works_correctly_with_strip_url_E1
- Tweet_tokenizer.tweet_tokenizer_works_correctly_with_strip_url_E2
- Utils.Inputs_are_verified_correct_E1
- Utils.Inputs_are_verified_correct_E2
- Utils.Inputs_are_verified_correct_E3
- Utils.Inputs_are_verified_correct_E4
- Utils.Inputs_are_verified_correct_E5
- Utils.Stopwords_are_removed
- Word_counts.Word_counts_give_correct_results_E1
- Word_counts.Word_counts_give_correct_results_E2
- Word_counts.Word_counts_give_correct_results_E3
- basic-tokenizers-examples
- ngram-tokenizers-examples
- shingle-tokenizers-examples
- stem-tokenizers-examples
- testthat
- word-counting-examples