Copyright | (c) 2017 Zac Slade |
---|---|
License | BSD-style |
Maintainer | [email protected] |
Stability | experimental |
Portability | GHC |
Safe Haskell | Safe-Inferred |
Language | Haskell98 |
Data.Text.ICU.CharsetDetection
Description
Access to the Unicode Character Set Detection facilities, implemented in the International Components for Unicode (ICU) libraries.
For more information see the "Character Set Detection" chapter in the ICU User Guide http://userguide.icu-project.org/conversion/detection.
Synopsis
- setText :: ByteString -> CharsetDetector -> IO ()
- detect :: ByteString -> IO CharsetMatch
- mkCharsetDetector :: IO CharsetDetector
- withCharsetDetector :: CharsetDetector -> (Ptr UCharsetDetector -> IO a) -> IO a
- wrapUCharsetMatch :: CharsetDetector -> IO (Ptr UCharsetMatch) -> IO CharsetMatch
- data CharsetMatch
- data CharsetDetector
- getConfidence :: CharsetMatch -> IO Int
- getName :: CharsetMatch -> IO Text
- getLanguage :: CharsetMatch -> IO Text
Documentation
setText :: ByteString -> CharsetDetector -> IO ()
From the ICU C API documentation: "Character set detection is at best an imprecise operation. The detection process will attempt to identify the charset that best matches the characteristics of the byte data, but the process is partly statistical in nature, and the results can not be guaranteed to always be correct.
For best accuracy in charset detection, the input data should be primarily in a single language, and a minimum of a few hundred bytes worth of plain text in the language are needed. The detection process will attempt to ignore html or xml style markup that could otherwise obscure the content."
Use the first 512 bytes, if available, as the text in the CharsetDetector object. This function is low-level and used by the more high-level detect function.
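To make the 512-byte limit concrete, the following is a minimal sketch of what "the first 512 bytes, if available" means for inputs of different sizes. The helper detectionSample is hypothetical and is not part of this module; it only mirrors the limit described above and makes no claim about how setText is implemented internally.

```haskell
module Main (main) where

import qualified Data.ByteString as BS

-- Hypothetical helper, not part of Data.Text.ICU.CharsetDetection:
-- keep at most the first 512 bytes of the input.
detectionSample :: BS.ByteString -> BS.ByteString
detectionSample = BS.take 512

main :: IO ()
main = do
  let short = BS.replicate 100 0x61   -- 100 bytes of 'a'
      long  = BS.replicate 2000 0x61  -- 2000 bytes of 'a'
  print (BS.length (detectionSample short)) -- prints 100
  print (BS.length (detectionSample long))  -- prints 512
```

Inputs shorter than 512 bytes are used whole, so very short inputs give the detector less text than the "few hundred bytes" the ICU documentation recommends.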
detect :: ByteString -> IO CharsetMatch
Attempt to perform a detection without an input filter. The best match will be returned.
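A minimal usage sketch, assuming the text-icu, text, and bytestring packages are available. It uses only the signatures listed in the synopsis; the sample sentence and the describeBytes helper are made up for illustration, and the reported name and confidence depend on the installed ICU version.

```haskell
module Main (main) where

import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC
import Data.Text (Text)
import qualified Data.Text.IO as T
import Data.Text.ICU.CharsetDetection (detect, getConfidence, getName)

-- Hypothetical helper: run detection and read back the charset name
-- and confidence of the best match.
describeBytes :: ByteString -> IO (Text, Int)
describeBytes bytes = do
  match <- detect bytes
  name <- getName match
  confidence <- getConfidence match
  pure (name, confidence)

main :: IO ()
main = do
  -- Illustrative input only; real input should be a few hundred bytes.
  let sample = BC.pack (concat (replicate 10 "The quick brown fox jumps over the lazy dog. "))
  (name, confidence) <- describeBytes sample
  T.putStrLn name
  print confidence
```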
withCharsetDetector :: CharsetDetector -> (Ptr UCharsetDetector -> IO a) -> IO a
Temporarily unwraps a CharsetDetector to perform operations on its raw UCharsetDetector handle.
wrapUCharsetMatch :: CharsetDetector -> IO (Ptr UCharsetMatch) -> IO CharsetMatch
data CharsetMatch
Opaque character set match handle. The memory backing these objects is managed entirely by the ICU C library. TODO: the underlying UCharsetMatch is reset after a subsequent setText call on its detector; this case is not handled yet.
data CharsetDetector
Handy wrapper for the pointer to the UCharsetDetector. We must always call ucsdet_close on any UCharsetDetector when we are done. The withCharsetDetector and wrapUCharsetDetector functions simplify management of the pointers.
getConfidence :: CharsetMatch -> IO Int
Returns the confidence score, from 0 to 100, of the CharsetMatch object.
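Because detection is statistical, callers may want to ignore low-confidence results. The sketch below shows one way to do that. detectWithThreshold is a hypothetical helper, not part of this module, and the threshold values are an arbitrary choice by the caller.

```haskell
module Main (main) where

import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC
import Data.Text (Text)
import Data.Text.ICU.CharsetDetection (detect, getConfidence, getName)

-- Hypothetical helper: return the detected charset name only when the
-- confidence (0 to 100) reaches the caller-supplied threshold.
detectWithThreshold :: Int -> ByteString -> IO (Maybe Text)
detectWithThreshold threshold bytes = do
  match <- detect bytes
  confidence <- getConfidence match
  if confidence >= threshold
    then Just <$> getName match
    else pure Nothing

main :: IO ()
main = do
  let sample = BC.pack (concat (replicate 10 "The quick brown fox jumps over the lazy dog. "))
  -- A threshold of 0 accepts any match; 50 is an arbitrary example value.
  detectWithThreshold 0 sample >>= print
  detectWithThreshold 50 sample >>= print
```

Returning Maybe Text leaves the fallback policy to the caller, for example defaulting to UTF-8 when the result is Nothing.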
getName :: CharsetMatch -> IO Text
Extract the character set encoding name from the CharsetMatch object.
getLanguage :: CharsetMatch -> IO Text
Extracts the three-letter ISO code for the language encoded in the CharsetMatch.
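Since a CharsetMatch is backed by memory owned by ICU, one cautious pattern is to copy all three fields into ordinary Haskell values immediately after detection. The sketch below assumes that approach. The MatchInfo record and the snapshot function are hypothetical names introduced for illustration and are not part of this module.

```haskell
module Main (main) where

import Data.ByteString (ByteString)
import qualified Data.ByteString.Char8 as BC
import Data.Text (Text)
import Data.Text.ICU.CharsetDetection
  (detect, getConfidence, getLanguage, getName)

-- Hypothetical record holding a pure copy of the match fields.
data MatchInfo = MatchInfo
  { matchName       :: Text
  , matchLanguage   :: Text
  , matchConfidence :: Int
  } deriving (Show, Eq)

-- Detect, then read every field right away so that later code works
-- with plain values instead of the ICU-managed handle.
snapshot :: ByteString -> IO MatchInfo
snapshot bytes = do
  match <- detect bytes
  MatchInfo <$> getName match <*> getLanguage match <*> getConfidence match

main :: IO ()
main = do
  let sample = BC.pack (concat (replicate 10 "The quick brown fox jumps over the lazy dog. "))
  snapshot sample >>= print
```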