Tech & Traditions: An Introduction to ECFP

Sunday, August 25, 2024

An Introduction to ECFP

Introduction

Extended-Connectivity Fingerprints (ECFPs) are circular topological fingerprints designed for molecular characterization, similarity searching, and structure-activity modeling. They are among the most popular similarity search tools in drug discovery and they are effectively used in a wide variety of applications.

ChemAxon provides the GenerateMD program for producing various molecular descriptors, including ECFPs, that can be processed further. This program can also be applied to fine-tune fingerprint parameters for JChem.

Applications

Circular fingerprints are effective and popular search tools, which have been successfully applied to a wide range of applications.

The initial application of ECFPs was in the area of high-throughput screening (HTS). In the evaluation of HTS results, ECFPs are widely used to analyze false positive and/or false negative hits. Furthermore, ECFPs are frequently applied in ligand-based virtual screening studies to distinguish between actives and inactives. Comprehensive studies revealed that these circular fingerprints are typically among the best performing search tools.

Various other areas of drug research related to similarity searching, including chemical clustering and compound library analysis, successfully utilize the rich information encoded in these fingerprints.

Besides similarity searching, ECFPs are well suited to recognizing the presence or absence of particular substructures. They are therefore often used in QSAR and QSPR model building in the lead optimization phase, including the prediction of ADMET properties.

Properties of ECFPs

The main properties of ECFPs are the following:

They represent molecular structures by means of circular atom neighborhoods.
They can be very rapidly calculated.
Their features represent the presence of particular substructures.
They are not predefined and can represent a huge number of different molecular features (including stereochemical information).
They are designed to represent both the presence and the absence of functionality, since both are crucial for analyzing molecular activity.
Their generation method can be flexibly customized to produce various types of circular fingerprints for diverse applications.

Comparison to Path-Based Fingerprints

Path-based fingerprints, such as the ChemAxon Chemical Fingerprint, were designed specifically for pre-filtering in substructure searching and are widely used for that purpose. ECFPs, in contrast, are not suitable for substructure searching, but they provide a rapid and highly effective screening method for full-structure and similarity searching. Compared to path-based fingerprints, ECFPs typically give similarity search results that better match the expectations of a medicinal chemist.

Representation and Generation

ECFPs are not based on predefined substructural keys, but their features are generated in a molecule-directed manner. The ECFP generation process systematically records the neighborhood of each non-hydrogen atom into multiple circular layers up to a given diameter. These atom-centered substructural features are then mapped into integer codes using a hashing procedure. It is the set of the resulting identifiers that defines the extended-connectivity fingerprint.

Representations

ECFPs have two typical representations, both of which are supported by the ChemAxon implementation:

List of integer identifiers

The natural and accurate representation of ECFPs is by means of varying-length lists of integer identifiers. Each identifier represents a particular substructure, more precisely, a circular atom neighborhood, which is present in the molecule. The list of integer identifiers is sorted in ascending order.

These identifiers can also be interpreted as indexes of bits in a huge virtual bit string. Each position in this bit string accounts for the presence or absence of a specific substructural feature. Since this virtual bit string is extremely large and sparse, it is not stored explicitly, but the indexes of the 1 bits are recorded in a varying-length list.

In spite of this interpretation, the feature identifiers are stored as signed values due to technical reasons, that is, they can be either positive or negative.

By default, this integer list representation contains only one instance of each identifier. However, in particular applications, it could be beneficial to consider the frequency count of the ECFP features, that is, to record each identifier as many times as the represented feature occurs in the molecule. This variation of ECFP is often denoted as ECFC.

In our implementation, there is a configuration parameter that controls if the occurrence counts of the identifiers should be discarded (this is the default behavior) or kept (ECFC mode).
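
The two list variants and the bit-index reading of signed identifiers can be sketched in plain Java. This is only an illustration, not ChemAxon's code: the class and method names are hypothetical, the sample identifiers are made up, and the 2^32-bit size of the virtual bit string is an assumption based on 32-bit integers.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical sketch; not part of the ChemAxon API.
final class IdentifierListSketch {

    // ECFP mode: each identifier is kept once, in ascending order.
    static int[] asEcfp(int[] rawIdentifiers) {
        TreeSet<Integer> unique = new TreeSet<>();
        for (int id : rawIdentifiers) {
            unique.add(id);
        }
        return unique.stream().mapToInt(Integer::intValue).toArray();
    }

    // ECFC mode: every occurrence is kept, in ascending order.
    static int[] asEcfc(int[] rawIdentifiers) {
        int[] copy = rawIdentifiers.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Reads a signed identifier as an index into an assumed 2^32-bit virtual bit string.
    static long virtualBitIndex(int identifier) {
        return Integer.toUnsignedLong(identifier);
    }

    public static void main(String[] args) {
        // Made-up identifiers; one feature occurs twice in the molecule.
        int[] raw = {734603939, -1266712900, 734603939, 98513984};
        System.out.println("ECFP: " + Arrays.toString(asEcfp(raw)));
        System.out.println("ECFC: " + Arrays.toString(asEcfc(raw)));
        System.out.println("Bit index of -1266712900: " + virtualBitIndex(-1266712900));
    }
}
```

The signed storage and the bit-index interpretation describe the same 32 bits; only the reading differs.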

Fixed-length bit string

The traditional representation of binary molecular fingerprints is a fixed-length bit string. This representation can also be applied to ECFPs by "folding" the underlying virtual bit string into a much shorter bit string of a specified length (e.g., 1024).

Compared to the identifier lists, this representation simplifies the comparison and similarity calculation of ECFPs and it could reduce the required storage space, especially for large molecules. On the other hand, the applied folding operation increases the likelihood of collision, that is, two (or more) different substructural features could be represented by the same bit position. As a result, a certain amount of information is usually lost, which worsens both the quality and interpretability of this representation.
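
To illustrate why bit strings make comparison simple, here is a minimal sketch of a similarity calculation on two fixed-length fingerprints. The text above does not name a similarity metric; the Tanimoto coefficient is used here as an assumption because it is a common choice for binary fingerprints. The class and method names are hypothetical, and the bit positions are made up.

```java
import java.util.BitSet;

// Hypothetical sketch; the metric is an assumed, commonly used choice.
final class TanimotoSketch {

    // Tanimoto coefficient: shared 1 bits divided by 1 bits set in either fingerprint.
    // Two empty fingerprints are treated as identical (1.0) by convention here.
    static double tanimoto(BitSet a, BitSet b) {
        BitSet intersection = (BitSet) a.clone();
        intersection.and(b);
        BitSet union = (BitSet) a.clone();
        union.or(b);
        int unionCount = union.cardinality();
        return unionCount == 0 ? 1.0 : (double) intersection.cardinality() / unionCount;
    }

    // Convenience helper that builds a fingerprint from its 1-bit positions.
    static BitSet bitsOf(int... positions) {
        BitSet bits = new BitSet();
        for (int p : positions) {
            bits.set(p);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet first = bitsOf(3, 17, 675, 1023);
        BitSet second = bitsOf(3, 675, 800);
        // 2 shared bits out of 5 distinct bits
        System.out.println(tanimoto(first, second));
    }
}
```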

Note that the fixed-length binary representation can be derived from the identifier list representation, but the opposite transformation is not possible. In other words, the fixed-length bit string representation can be viewed as a "lossy" compression of the identifier list (or the underlying virtual bit string).
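
The one-way nature of folding can be sketched as follows. The folding rule used here (identifier modulo the fingerprint length) is an assumption chosen for simplicity; the text does not say which rule the ChemAxon implementation applies. The class name and the sample identifiers are hypothetical.

```java
import java.util.BitSet;

// Hypothetical sketch; modulo folding is an assumed scheme.
final class FoldingSketch {

    // Maps each identifier to a bit position in [0, length) and sets that bit.
    // Math.floorMod keeps the position non-negative for negative identifiers.
    static BitSet fold(int[] identifiers, int length) {
        BitSet bits = new BitSet(length);
        for (int id : identifiers) {
            bits.set(Math.floorMod(id, length));
        }
        return bits;
    }

    public static void main(String[] args) {
        // The last two made-up identifiers differ by exactly 1024, so they collide at length 1024.
        int[] ids = {-1266712900, 98513984, 734603939, 734604963};
        BitSet folded = fold(ids, 1024);
        System.out.println("1 bits: " + folded);
        System.out.println(ids.length + " identifiers -> " + folded.cardinality() + " bits");
    }
}
```

Once two identifiers share a bit position, nothing in the bit string records which of them set it, which is why the identifier list cannot be recovered.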

Generating ECFPs in different ways

The following tools are available for generating ECFPs:

GenerateMD command line tool

Using the GenerateMD command line application, the -D option selects the identifier list output, and the -2 option selects the fixed-length bit string output, e.g.

```
generatemd c input.smiles -k ECFP -c ecfp_config.xml -2
```

DescriptorGenerator API

Using the DescriptorGenerator class, the functions getAsString() and getAsIntArray() give back the identifier list representation, while getAsBitSet() gives back the fixed-length bit string representation.

ECFP API

Using the ECFP class, the functions toString(), toIntArray(), and toIdentiferSet() give back the identifier list representation, while toBinaryString() and toBitSet() give back the fixed-length bit string representation.

The Generation Process

The fingerprint generation process is as follows:

Initial assignment of atom identifiers

The ECFP generation process begins with the assignment of an initial integer identifier to each non-hydrogen atom of the input molecule. This identifier captures some local information about the corresponding atom in such a way that various atom properties (e.g., atomic number, connection count, etc.) are packed into a single integer value using a hash function.

The set of considered atom properties is an important configuration parameter of ECFPs, which can be fully customized (see later).
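
A minimal sketch of the initial assignment might look like this. It packs the five default atom properties listed later in this article into one integer. `Arrays.hashCode` is only a stand-in for the actual hash function, which the text does not specify, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch; the hash function is a stand-in, not ChemAxon's.
final class InitialIdentifierSketch {

    // Packs the atom properties into a single int value.
    static int initialIdentifier(int atomicNumber, int heavyNeighbors, int hydrogens,
                                 int formalCharge, boolean inRing) {
        int[] properties = {atomicNumber, heavyNeighbors, hydrogens, formalCharge, inRing ? 1 : 0};
        return Arrays.hashCode(properties);
    }

    public static void main(String[] args) {
        // Ethanol (CH3-CH2-OH) has three non-hydrogen atoms.
        System.out.println("CH3: " + initialIdentifier(6, 1, 3, 0, false));
        System.out.println("CH2: " + initialIdentifier(6, 2, 2, 0, false));
        System.out.println("OH:  " + initialIdentifier(8, 1, 1, 0, false));
    }
}
```

Atoms with the same property values receive the same identifier, so the two carbons of ethanol differ only because their neighbor and hydrogen counts differ.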

Iterative updating of identifiers

After that, a number of iterations are performed to combine the initial atom identifiers with the identifiers of neighboring atoms until a specified diameter is reached. Each iteration captures a progressively larger circular neighborhood around each atom. Each neighborhood is encoded into a single integer value using a suitable hashing method, and the resulting identifiers are collected into a list.

Fig. 1. Illustration of the effect of iterative updating for a selected atom in a sample molecule

This iterative updating process is based on the well-known Morgan algorithm.
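
The iteration can be sketched on a small molecular graph. This is a simplified illustration, not the published algorithm: it ignores bond orders, uses `Arrays.hashCode` as a stand-in hash, and assumes that each iteration widens the neighborhood diameter by 2, so a diameter of 4 means two iterations. The class name, the adjacency-array encoding, and the starting identifiers are hypothetical.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical, simplified sketch of Morgan-style updating.
final class IterativeUpdateSketch {

    // One iteration: hash each atom's identifier together with its neighbors' identifiers.
    static int[] iterate(int[] identifiers, int[][] neighbors, int iteration) {
        int[] updated = new int[identifiers.length];
        for (int atom = 0; atom < identifiers.length; atom++) {
            int[] neighborIds = new int[neighbors[atom].length];
            for (int k = 0; k < neighborIds.length; k++) {
                neighborIds[k] = identifiers[neighbors[atom][k]];
            }
            Arrays.sort(neighborIds); // result must not depend on neighbor order
            int[] input = new int[neighborIds.length + 2];
            input[0] = iteration;
            input[1] = identifiers[atom];
            System.arraycopy(neighborIds, 0, input, 2, neighborIds.length);
            updated[atom] = Arrays.hashCode(input);
        }
        return updated;
    }

    // Collects the initial identifiers and those of every iteration up to diameter / 2.
    static TreeSet<Integer> collect(int[] initial, int[][] neighbors, int diameter) {
        TreeSet<Integer> all = new TreeSet<>();
        int[] current = initial.clone();
        for (int id : current) {
            all.add(id);
        }
        for (int iteration = 1; iteration <= diameter / 2; iteration++) {
            current = iterate(current, neighbors, iteration);
            for (int id : current) {
                all.add(id);
            }
        }
        return all;
    }

    public static void main(String[] args) {
        // Three-atom chain 0-1-2 with made-up initial identifiers.
        int[] initial = {101, 102, 103};
        int[][] neighbors = {{1}, {0, 2}, {1}};
        System.out.println("Diameter 0: " + collect(initial, neighbors, 0));
        System.out.println("Diameter 4: " + collect(initial, neighbors, 4));
    }
}
```

Using a sorted set here already merges identical identifiers; the structural duplicate check described in the next step is not covered by this sketch.
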
Duplication removal

The final step of the generation process is the removal of multiple identifier representations of equivalent atom neighborhoods. Two neighborhoods are considered equivalent if they contain exactly the same set of bonds or if their hashed integer identifiers are the same. If the identifier counts are to be kept (ECFC mode), this step is modified to store each integer identifier as many times as the corresponding substructural feature occurs in the molecule.

The following figures illustrate the whole ECFP generation process, including the derivation of the fixed length bit string from the identifier list representation.

Fig. 2. ECFP generation process

Fig. 3. Generation of the fixed-length bit string ("folding")

ECFP configuration

ECFPs provide a rich set of configuration parameters. They can be specified in an XML file, just like the configuration of other molecular descriptors.

Main parameters

The three main parameters of ECFPs are maximum diameter, fingerprint length, and identifier counts:

Diameter

This parameter specifies the maximum diameter of the circular neighborhoods considered for each atom. The default diameter is 4. This is a dominant parameter of ECFPs: it controls the number and the maximum size of the atom neighborhoods considered, and therefore the length of the identifier list representation as well as the number of "1" bits in the fixed-length bit string representation. (More information about these two representations can be found above.)

ECFPs are usually distinguished by this parameter. For example, ECFP_4 denotes that the maximum diameter is set to 4, while ECFP_6 means diameter 6.

The appropriate value of the maximum diameter depends on the desired application. According to Rogers and Hahn, diameter 4 is typically sufficient for similarity searching and clustering, while activity learning methods often benefit from the greater structural detail available with a larger limit, for example 6 or 8.

Length

This parameter specifies the length of the bit string representation. The default length is 1024.

A larger length decreases the likelihood of bit collisions and therefore the information loss. However, handling larger fingerprints requires more computation time and storage space.
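
The trade-off can be made concrete with a small experiment. This sketch folds the same identifiers at several lengths and reports how many distinct features end up sharing a bit position. The modulo folding rule is an assumption, the identifiers are pseudo-random stand-ins rather than real ECFP features, and the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.Random;

// Hypothetical sketch; modulo folding and random identifiers are assumptions.
final class CollisionSketch {

    // Counts the 1 bits after folding the identifiers by modulo.
    static int setBits(int[] identifiers, int length) {
        BitSet bits = new BitSet(length);
        for (int id : identifiers) {
            bits.set(Math.floorMod(id, length));
        }
        return bits.cardinality();
    }

    // Distinct identifiers minus 1 bits: features that were merged into an occupied position.
    static long mergedFeatures(int[] identifiers, int length) {
        long distinct = Arrays.stream(identifiers).distinct().count();
        return distinct - setBits(identifiers, length);
    }

    public static void main(String[] args) {
        // 200 pseudo-random stand-ins for feature identifiers; fixed seed for repeatability.
        int[] ids = new Random(42).ints(200).toArray();
        for (int length : new int[] {64, 256, 1024, 4096}) {
            System.out.println(length + " bits: " + mergedFeatures(ids, length) + " features merged");
        }
    }
}
```

With modulo folding and lengths that divide one another, a shorter fingerprint can never set more bits than a longer one, so the merged count does not decrease as the length shrinks.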

Counts

This parameter controls whether the generated integer identifiers are stored with occurrence counts or each identifier is kept only once independently of the number of the corresponding substructural features in the input molecule. The default option is "No", that is, each identifier is stored only once.

The first two parameters play a similar role to the maximum pattern length and fingerprint length parameters of the ChemAxon Chemical Hashed Fingerprint. They have similar effects on the information content, the generation time, and the required storage space of the fingerprints.

These main parameters of ECFPs can be specified as attributes of the <Parameters/> tag in the XML configuration file, e.g.

```
<Parameters Length="512" Diameter="2" Counts="Yes"/>
```

Atom Properties

The set of atom properties encoded in the atom identifiers is an important configuration parameter of ECFPs, which determines their character. Different sets of properties result in different types of circular fingerprints targeting various applications.

The default ECFP configuration supports typical use cases, especially similarity searching based on highly specific substructural information. For each atom, the following properties are considered by default:

the atomic number;
the number of "heavy" (non-hydrogen) neighbor atoms;
the number of attached hydrogens (both implicit and explicit);
the formal charge;
an additional property that indicates whether the atom is part of at least one ring.

To use a different identifier configuration, we can select from a few built-in properties and a wide variety of custom properties defined by Chemical Terms.

Examples

The default identifier configuration can be described as follows:
```
<IdentifierConfiguration>
    <Property Name="AtomicNumber" Value="1"/>
    <Property Name="HeavyNeighborCount" Value="1"/>
    <Property Name="HCount" Value="1"/>
    <Property Name="FormalCharge" Value="1"/>
    <Property Name="IsRingAtom" Value="1"/>
</IdentifierConfiguration>
```

The built-in properties are identified by their names. The Value attribute specifies if the atom property is actually used (1) or not (0). It facilitates enabling and disabling atom properties without the permanent removal of the unused properties from the file. The default configuration file contains all built-in properties, but only the above five are enabled. The order of the properties in the XML file is arbitrary, and it has no effect on the generated fingerprints.

A configuration using other built-in properties:

```
<IdentifierConfiguration>
    <Property Name="MassNumber" Value="1"/>
    <Property Name="ConnectionCount" Value="1"/>
    <Property Name="HasAromaticBond" Value="1"/>
    <Property Name="IsTerminalAtom" Value="1"/>
    <Property Name="IsStereoAtom" Value="1"/>
</IdentifierConfiguration>
```

A configuration using custom properties defined by Chemical Terms:

```
<IdentifierConfiguration>
    <Property Name="MyAtomicNumber" Value="1">atno()</Property>
    <Property Name="MyConnectionCount" Value="1">connections()</Property>
    <Property Name="PosCharge" Value="1"></Property>
    <Property Name="NegCharge" Value="1"></Property>
</IdentifierConfiguration>
```

The first two properties are equivalent to the corresponding built-in properties. They are expressed using Chemical Terms only for illustration purposes. Note that the built-in properties are typically computed faster, so they should be preferred where possible.

On the other hand, the partial charge can only be expressed using Chemical Terms. The latter two properties are practical examples for defining custom properties based on the partial atomic charge.

Non-integer property values

For technical reasons, all property values are converted to integers. The true/false (boolean) values are converted to 1 and 0, respectively, while the floating point values are simply truncated to integers, that is, their fractional parts are discarded (not rounded). This rough truncation could eliminate differences that we would like to consider in the ECFP computation. In such cases, we can multiply the original floating point value by an appropriate factor.

For example, if we would like to consider the partial charge up to one decimal digit, then we can use a property whose expression multiplies the partial charge by 10:

```
<Property Name="Charge" Value="1"></Property>
```

instead of a property whose expression is the unscaled partial charge:

```
<Property Name="Charge" Value="1"></Property>
```

This latter property would not distinguish charge values 0.16, 0.32, 0.38, -0.55 etc. (all of them would be converted to 0), while the first property would convert them to 1, 3, 3, and -5, respectively.
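
The arithmetic can be checked with a few lines of Java, whose `(int)` cast also discards the fractional part without rounding. This sketch only mirrors the conversion rule described above; the class and method names are hypothetical and it does not call Chemical Terms.

```java
// Hypothetical sketch of the truncation rule; not ChemAxon code.
final class TruncationSketch {

    // Direct conversion: the fractional part is discarded, not rounded.
    static int truncate(double value) {
        return (int) value;
    }

    // Scale first, then truncate, to keep some of the decimal digits.
    static int scaledTruncate(double value, int factor) {
        return (int) (factor * value);
    }

    public static void main(String[] args) {
        double[] charges = {0.16, 0.32, 0.38, -0.55};
        for (double charge : charges) {
            System.out.println(charge + " -> " + truncate(charge)
                    + " (unscaled), " + scaledTruncate(charge, 10) + " (scaled by 10)");
        }
    }
}
```

Note that 0.32 and 0.38 still map to the same value after scaling by 10; a larger factor would be needed to separate them.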

Functional-Class Fingerprints (FCFPs)

The default identifier configuration of ECFP captures highly specific atomic information enabling the representation of a large set of precisely defined structural features. In some applications, however, different kinds of abstraction may be desirable. For example, a chlorine or a bromine substituent on a ring may be functionally equivalent but would be distinguished by standard ECFP. Another example that opens a new direction in the application of circular fingerprints is the pharmacophore identification of atoms which transforms ECFPs to topological pharmacophore fingerprints.

Those variants of ECFPs that apply such generalizations and focus on the functional roles of the atoms instead of full specificity are called Functional-Class Fingerprints (FCFPs).

ChemAxon's ECFP implementation also supports FCFPs by custom identifier configurations, since the generation of the initial atom identifiers is the only difference between ECFPs and FCFPs. For example, the following configuration defines a functional-class circular fingerprint:

```
<IdentifierConfiguration>
    <Property Name="HydrogenBondAcceptor" Value="1">acceptor()</Property>
    <Property Name="HydrogenBondDonor" Value="1">donor()</Property>
    <Property Name="Aromatic" Value="1">arom()</Property>
    <Property Name="Charge" Value="1"></Property>
</IdentifierConfiguration>
```

The full configuration file can be found here.

Note that the abstraction of FCFPs typically results in a smaller set of features than ECFPs for a given molecule, because different substructures may be functionally equivalent and hence receive the same identifier.
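
The chlorine and bromine case mentioned earlier can be used to sketch this effect. The example below compares an atom-specific initial identifier with a role-based one that only records whether the atom is a halogen. Both identifier functions are deliberately simplified assumptions, `Arrays.hashCode` is a stand-in hash, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch; the role assignment is a simplified assumption.
final class FunctionalClassSketch {

    // True for F, Cl, Br and I.
    static boolean isHalogen(int atomicNumber) {
        return atomicNumber == 9 || atomicNumber == 17 || atomicNumber == 35 || atomicNumber == 53;
    }

    // ECFP-like: the identifier depends on the exact element.
    static int specificIdentifier(int atomicNumber, int heavyNeighbors) {
        return Arrays.hashCode(new int[] {atomicNumber, heavyNeighbors});
    }

    // FCFP-like: the identifier depends only on the functional role (halogen or not).
    static int functionalIdentifier(int atomicNumber, int heavyNeighbors) {
        int role = isHalogen(atomicNumber) ? 1 : 0;
        return Arrays.hashCode(new int[] {role, heavyNeighbors});
    }

    public static void main(String[] args) {
        int chlorine = 17;
        int bromine = 35;
        System.out.println("Specific identifiers equal:   "
                + (specificIdentifier(chlorine, 1) == specificIdentifier(bromine, 1)));
        System.out.println("Functional identifiers equal: "
                + (functionalIdentifier(chlorine, 1) == functionalIdentifier(bromine, 1)));
    }
}
```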

Feature Retrieval

JChem also provides a lookup service for the features encoded in ECFP fingerprints. As we have discussed above, ECFPs are represented either as lists of integer identifiers or as fixed-length bit strings, in which the identifiers and bit positions account for particular substructural features of the input molecule. The ECFPFeatureLookup class serves the retrieval of these circular atom neighborhoods corresponding to a given integer identifier or bit position.

In some applications of ECFPs, the fingerprints of certain compounds are compared in order to identify particular feature identifiers or bit positions that seem to be important in characterizing or differentiating the molecules. In these cases, one may be interested in obtaining the actual substructures that are represented by the considered identifiers or bits.

For example, given an integer identifier id that is present in the ECFP fingerprint of a molecule mol, the corresponding substructures can be retrieved in SMARTS format with a code like this:


```
ECFPFeatureLookup lookup = new ECFPFeatureLookup();
lookup.processMolecule(mol);
// Print every substructure that the identifier encodes, in SMARTS format
for (ECFPFeature f : lookup.getFeaturesFromIdentifier(id)) {
    System.out.println(f.getSubstructure().toFormat("SMARTS"));
}
```

For more information, see the API documentation of ECFPFeatureLookup.

References

Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50(5): 742-754.
Hu, Y.; Lounkine, E.; Bajorath, J. Improving the Search Performance of Extended Connectivity Fingerprints through Activity-Oriented Feature Filtering and Application of a Bit-Density-Dependent Similarity Function. ChemMedChem 2009, 4: 540-548.
Glen, R. C.; Bender, A.; Arnby, C. H.; Carlsson, L.; Boyer, S.; Smith, J. Circular fingerprints: Flexible molecular descriptors with applications from physical chemistry to ADME. IDrugs 2006, 9(3): 199-204.
Morgan, H. L. The Generation of a Unique Machine Description for Chemical Structures - A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 1965, 5: 107-112.
