Skip to content
/ yore Public

rust library for handling single byte character encodings like CP850 etc

Notifications You must be signed in to change notification settings

bonega/yore

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Yore

A Rust library for decoding and encoding character sets based on OEM code pages.

yore at crates.io yore at docs.rs

Features

  • Fast performance *
  • Minimal memory usage with Cow and shrink_to_fit
  • Easy-to-use API
  • Broad range of supported code pages
  • Handles code pages with redefined ASCII characters (<0x80), such as '٪' in CP864

Usage

Add yore to your Cargo.toml file.

[dependencies]
yore = "1.1.0"

Examples

Using a specific code page

use yore::code_pages::{CP857, CP850};
use yore::{DecodeError, EncodeError};

// Vec contains ASCII "text"
let bytes = vec![116, 101, 120, 116];
// Vec contains ASCII "text " and codepoint 231
let bytes_undefined = vec![116, 101, 120, 116, 32, 231]; 

// Notice that decoding CP850 can't fail because it is completely defined
assert_eq!(CP850.decode(&bytes), "text");

// However, CP857 can fail
assert_eq!(CP857.decode(&bytes).unwrap(), "text");

// "text " + codepoint 231 
assert!(matches!(CP857.decode(&bytes_undefined), DecodeError));

// Lossy decoding won't fail due to fallback
assert_eq!(CP857.decode_lossy(&bytes_undefined), "text �");

// Encoding
assert_eq!(CP850.encode("text").unwrap(), bytes);
assert!(matches!(CP850.encode("text 🦀"), EncodeError));
assert_eq!(CP850.encode_lossy("text 🦀", 231), bytes_undefined);

Using a trait object

use yore::CodePage;
fn do_something(code_page: &dyn CodePage, bytes: &[u8]) {
    println!("{}", code_page.decode(bytes).unwrap());
}

Supported code pages

Identifier Name Description
437 ibm437 OEM United States
737 ibm737 OEM Greek (formerly 437G); Greek (DOS)
775 ibm775 OEM Baltic; Baltic (DOS)
850 ibm850 OEM Multilingual Latin 1; Western European (DOS)
852 ibm852 OEM Latin 2; Central European (DOS)
855 ibm855 OEM Cyrillic (primarily Russian)
857 ibm857 OEM Turkish; Turkish (DOS)
860 ibm860 OEM Portuguese; Portuguese (DOS)
861 ibm861 OEM Icelandic; Icelandic (DOS)
862 dos-862 OEM Hebrew; Hebrew (DOS)
863 ibm863 OEM French Canadian; French Canadian (DOS)
864 ibm864 OEM Arabic; Arabic (864)
865 ibm865 OEM Nordic; Nordic (DOS)
866 cp866 OEM Russian; Cyrillic (DOS)
869 ibm869 OEM Modern Greek; Greek, Modern (DOS)
874 windows-874 Thai (Windows)
910 ibm910 IBM-PC APL2
1250 windows-1250 ANSI Central European; Central European (Windows)
1251 windows-1251 ANSI Cyrillic; Cyrillic (Windows)
1252 windows-1252 ANSI Latin 1; Western European (Windows)
1253 windows-1253 ANSI Greek; Greek (Windows)
1254 windows-1254 ANSI Turkish; Turkish (Windows)
1255 windows-1255 ANSI Hebrew; Hebrew (Windows)
1256 windows-1256 ANSI Arabic; Arabic (Windows)
1257 windows-1257 ANSI Baltic; Baltic (Windows)
1258 windows-1258 ANSI/OEM Vietnamese; Vietnamese (Windows)

* Benchmarks

encoding_rs supports only a few of the encodings that oem_cp and yore support. Additionally, encoding_rs focuses on streaming use cases.

Refer to the bench crate for more details.

About

rust library for handling single byte character encodings like CP850 etc

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages