�
�d, � �l � d dl Z d dlZd dlmZmZ ddlmZmZ ej d� Z G d� d� Z
y)� N)�Optional�Union� )�LanguageFilter�ProbingStates% [a-zA-Z]*[�-�]+[a-zA-Z]*[^a-zA-Z�-�]?c � � e Zd ZdZej
fdeddfd�Zdd�Zede e
fd�� Zede e
fd�� Zd e
eef defd
�Zedefd�� Zdefd�Zed
e
eef defd�� Zed
e
eef defd�� Zed
e
eef defd�� Zy)�
CharSetProbergffffff�?�lang_filter�returnNc � � t j | _ d| _ || _ t j t � | _ y )NT) r � DETECTING�_state�activer
�logging� getLogger�__name__�logger)�selfr
s �7/usr/lib/python3/dist-packages/chardet/charsetprober.py�__init__zCharSetProber.__init__, s0 � �"�,�,������&����'�'��1��� c �. � t j | _ y �N)r r
r �r s r �resetzCharSetProber.reset2 s � �"�,�,��r c � � y r � r s r �charset_namezCharSetProber.charset_name5 s � �r c � � t �r ��NotImplementedErrorr s r �languagezCharSetProber.language9 s � �!�!r �byte_strc � � t �r r )r r# s r �feedzCharSetProber.feed= s � �!�!r c � � | j S r )r r s r �statezCharSetProber.state@ s � ��{�{�r c � � y)Ng r r s r �get_confidencezCharSetProber.get_confidenceD s � �r �bufc �4 � t j dd| � } | S )Ns ([ -])+� )�re�sub)r* s r �filter_high_byte_onlyz#CharSetProber.filter_high_byte_onlyG s � ��f�f�&��c�2���
r c �� � t � }t j | � }|D ]C }|j |dd � |dd }|j � s|dk rd}|j |� �E |S )u7
We define three types of bytes:
alphabet: English letters [a-zA-Z]
international: international characters [\x80-\xFF]
marker: everything else [^a-zA-Z\x80-\xFF]
The input buffer can be thought to contain a series of words delimited
by markers. This function works to filter all words that contain at
least one international character. All contiguous sequences of markers
are replaced by a single space ascii character.
This filter applies to all scripts which do not use English characters.
N���� �r, )� bytearray�INTERNATIONAL_WORDS_PATTERN�findall�extend�isalpha)r* �filtered�words�word� last_chars r �filter_international_wordsz(CharSetProber.filter_international_wordsL sv � � �;��
,�3�3�C�8���
'�D��O�O�D��"�I�&� �R�S� �I��$�$�&�9�w�+>� � ��O�O�I�&�
'� �r c �* � t � }d}d}t | � j d� } t | � D ]F \ }}|dk( r|dz }d}�|dk( s�||kD r'|s%|j | || � |j d� d}�H |s|j | |d � |S )
a[
Returns a copy of ``buf`` that retains only the sequences of English
alphabet and high byte characters that are not between <> characters.
This filter can be applied to all scripts which contain both English
characters and extended ASCII characters, but is currently only used by
``Latin1Prober``.
Fr �c� >r � <r, TN)r3 �
memoryview�cast� enumerater6 )r* r8 �in_tag�prev�curr�buf_chars r �remove_xml_tagszCharSetProber.remove_xml_tagsn s� � � �;��������o�"�"�3�'��'��n� �N�D�(� �4���a�x�����T�!��$�;�v� �O�O�C��T�N�3��O�O�D�)��� �$ �
�O�O�C���J�'��r )r N)r �
__module__�__qualname__�SHORTCUT_THRESHOLDr �NONEr r �propertyr �strr r" |