Ensembl Database
Overview
Access and query the Ensembl genome database, a comprehensive resource for vertebrate genomic data maintained by EMBL-EBI. The database provides gene annotations, sequences, variants, regulatory information, and comparative genomics data for over 250 species. Current release is 115 (September 2025).
When to Use
Use this resource in the following situations:
- Querying gene information by symbol or Ensembl ID
- Retrieving DNA, transcript, or protein sequences
- Analyzing genetic variants using the Variant Effect Predictor (VEP)
- Finding orthologs and paralogs across species
- Accessing regulatory features and genomic annotations
- Converting coordinates between genome assemblies (e.g., GRCh37 to GRCh38)
- Performing comparative genomics analyses
- Integrating Ensembl data into genomic research pipelines
Core Capabilities
1. Gene Information Retrieval
Query gene data by symbol, Ensembl ID, or external database identifiers.
Common operations:
- Look up gene information by symbol (e.g., "BRCA2", "TP53")
- Retrieve transcript and protein information
- Get gene coordinates and chromosomal locations
- Access cross-references to external databases (UniProt, RefSeq, etc.)
Using the ensembl_rest package:
from ensembl_rest import EnsemblClient

client = EnsemblClient()

# Look up gene by symbol
gene_data = client.symbol_lookup(
    species='human',
    symbol='BRCA2'
)

# Get detailed gene information
gene_info = client.lookup_id(
    id='ENSG00000139618',  # BRCA2 Ensembl ID
    expand=True
)
Direct REST API (no package):
import requests

server = "https://rest.ensembl.org"

# Symbol lookup
response = requests.get(
    f"{server}/lookup/symbol/homo_sapiens/BRCA2",
    headers={"Content-Type": "application/json"}
)
gene_data = response.json()
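As a minimal sketch of working with the lookup endpoint, the helpers below build the request URL and condense a lookup record into one line. The function names are hypothetical, and the record in the usage note is hand-written in the shape the endpoint is expected to return (fields such as `display_name`, `seq_region_name`, `start`, `end`, `strand`); it is not fetched output.

```python
from urllib.parse import quote

SERVER = "https://rest.ensembl.org"


def symbol_lookup_url(symbol, species="homo_sapiens", expand=False):
    """Build the GET URL for the symbol lookup endpoint (hypothetical helper)."""
    url = f"{SERVER}/lookup/symbol/{quote(species)}/{quote(symbol)}"
    if expand:
        # expand=1 asks the server to include transcripts in the response
        url += "?expand=1"
    return url


def summarize_gene(record):
    """Condense a lookup record into 'SYMBOL (ID) chrom:start-end (strand)'."""
    strand = "+" if record["strand"] == 1 else "-"
    return (
        f'{record["display_name"]} ({record["id"]}) '
        f'{record["seq_region_name"]}:{record["start"]}-{record["end"]} ({strand})'
    )
```

For example, `summarize_gene({"display_name": "GENE1", "id": "ENSG_EXAMPLE", "seq_region_name": "13", "start": 100, "end": 200, "strand": 1})` yields a single summary line that is convenient for logging.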
2. Sequence Retrieval
Fetch genomic, transcript, or protein sequences in various formats (JSON, FASTA, plain text).
Operations:
- Get DNA sequences for genes or genomic regions
- Retrieve transcript sequences (cDNA)
- Access protein sequences
- Extract sequences with flanking regions or modifications
Example:
# Using ensembl_rest package
sequence = client.sequence_id(
    id='ENSG00000139618',  # Gene ID
    content_type='application/json'
)

# Get sequence for a genomic region
region_seq = client.sequence_region(
    species='human',
    region='7:140424943-140624564'  # chromosome:start-end
)
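Region strings are easy to get wrong, so a small validation step before the request can help. The sketch below is an illustration with hypothetical helper names: `format_region` builds the `chromosome:start-end` string used above, and `to_fasta` wraps a returned sequence into FASTA lines. The size cap is passed in by the caller rather than hard-coded, since the exact per-request limit should be taken from the REST documentation.

```python
def format_region(chrom, start, end, max_span=None):
    """Return 'chrom:start-end' after basic sanity checks (hypothetical helper)."""
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates: {start}-{end}")
    if max_span is not None and end - start + 1 > max_span:
        raise ValueError(f"region longer than {max_span} bp")
    return f"{chrom}:{start}-{end}"


def to_fasta(name, sequence, width=60):
    """Format a raw sequence string as FASTA with fixed-width lines."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"
```

Validating locally avoids spending a request (and part of the rate limit) on a region the server would reject anyway.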
3. Variant Analysis
Query genetic variation data and predict variant consequences using the Variant Effect Predictor (VEP).
Capabilities:
- Look up variants by rsID or genomic coordinates
- Predict functional consequences of variants
- Access population frequency data
- Retrieve phenotype associations
VEP example:
# Predict variant consequences
vep_result = client.vep_hgvs(
    species='human',
    hgvs_notation='ENST00000380152.7:c.803C>T'
)

# Query variant by rsID
variant = client.variation_id(
    species='human',
    id='rs699'
)
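VEP responses can list many transcript-level consequences per variant, so a filtering step is usually the first thing applied to them. The sketch below assumes the JSON shape VEP is expected to return (a list of records, each with a `transcript_consequences` list carrying `transcript_id`, `gene_symbol`, `consequence_terms`, and `impact`). The function name and the impact ranking table are illustrative choices, not part of any package.

```python
# Impact categories ordered from most to least severe
IMPACT_RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}


def filter_consequences(vep_records, min_impact="MODERATE"):
    """Keep transcript consequences at or above min_impact (hypothetical helper)."""
    threshold = IMPACT_RANK[min_impact]
    hits = []
    for record in vep_records:
        for tc in record.get("transcript_consequences", []):
            # Unknown or missing impact values rank below every known category
            if IMPACT_RANK.get(tc.get("impact"), -1) >= threshold:
                hits.append((
                    tc["transcript_id"],
                    tc.get("gene_symbol"),
                    ",".join(tc["consequence_terms"]),
                    tc["impact"],
                ))
    return hits
```

Passing `min_impact="HIGH"` narrows the result to the most disruptive predictions, while `"MODIFIER"` keeps everything.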
4. Comparative Genomics
Perform cross-species comparisons to identify orthologs, paralogs, and evolutionary relationships.
Operations:
- Find orthologs (genes in different species that descend from a common ancestral gene through speciation)
- Identify paralogs (genes related through duplication, typically within the same species)
- Access gene trees showing evolutionary relationships
- Retrieve gene family information
Example:
# Find orthologs for a human gene
orthologs = client.homology_ensemblgene(
    id='ENSG00000139618',  # Human BRCA2
    target_species='mouse'
)

# Get gene tree
gene_tree = client.genetree_member_symbol(
    species='human',
    symbol='BRCA2'
)
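Homology responses are nested, so flattening them into rows makes downstream comparison easier. This sketch assumes the response shape the homology endpoint is expected to return (`data` containing `homologies`, each with a `type` and a `target` holding `id`, `species`, and `perc_id`). The function name is hypothetical.

```python
def extract_orthologs(response):
    """Flatten a homology response into ortholog rows, best identity first."""
    rows = []
    for entry in response.get("data", []):
        for homology in entry.get("homologies", []):
            # Ortholog types start with "ortholog"; paralog types are skipped
            if not homology.get("type", "").startswith("ortholog"):
                continue
            target = homology["target"]
            rows.append({
                "species": target["species"],
                "gene_id": target["id"],
                "type": homology["type"],
                "perc_id": target.get("perc_id"),
            })
    # Missing identity values sort last by treating them as 0
    return sorted(rows, key=lambda row: row["perc_id"] or 0, reverse=True)
```

The rows can be written straight to CSV or loaded into a data frame for cross-species comparison.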
5. Genomic Region Analysis
Find all genomic features (genes, transcripts, regulatory elements) in a specific region.
Use cases:
- Identify all genes in a chromosomal region
- Find regulatory features (promoters, enhancers)
- Locate variants within a region
- Retrieve structural features
Example:
# Find all features in a region
features = client.overlap_region(
    species='human',
    region='7:140424943-140624564',
    feature='gene'
)
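When calling the overlap endpoint directly, several feature types can be requested at once by repeating the `feature` parameter. The sketch below builds such a URL with the standard library and tallies returned records by their `feature_type` field. Helper names are hypothetical, and using `&` as the parameter separator is an assumption (the official examples often show `;`).

```python
from collections import Counter
from urllib.parse import urlencode

SERVER = "https://rest.ensembl.org"


def overlap_url(region, features, species="homo_sapiens"):
    """Build an overlap/region URL requesting one or more feature types."""
    # doseq=True expands the list into repeated feature=... parameters
    query = urlencode({"feature": list(features)}, doseq=True)
    return f"{SERVER}/overlap/region/{species}/{region}?{query}"


def count_by_type(records):
    """Count overlap records per feature_type."""
    return Counter(record.get("feature_type", "unknown") for record in records)
```

A quick tally like this is a cheap sanity check that the region returned the kinds of features you asked for.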
6. Assembly Mapping
Convert coordinates between different genome assemblies (e.g., GRCh37 to GRCh38).
Important: Use https://grch37.rest.ensembl.org for GRCh37/hg19 queries and https://rest.ensembl.org for current assemblies.
Example:
from ensembl_rest import AssemblyMapper

# Map coordinates from GRCh37 to GRCh38
mapper = AssemblyMapper(
    species='human',
    asm_from='GRCh37',
    asm_to='GRCh38'
)
mapped = mapper.map(chrom='7', start=140453136, end=140453136)
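The same conversion can be done against the REST `map` endpoint without the package. The sketch below builds the URL in the `chrom:start..end:strand` form used by the official examples and pulls the target coordinates out of the `mappings` list in the response. Function names are hypothetical, and the response in the test data is hand-written in the expected shape.

```python
SERVER = "https://rest.ensembl.org"


def map_url(chrom, start, end, asm_from="GRCh37", asm_to="GRCh38",
            species="human", strand=1):
    """Build a map endpoint URL for converting a region between assemblies."""
    region = f"{chrom}:{start}..{end}:{strand}"
    return f"{SERVER}/map/{species}/{asm_from}/{region}/{asm_to}"


def mapped_regions(response):
    """Return (chrom, start, end, strand, assembly) for each mapped segment."""
    results = []
    for mapping in response.get("mappings", []):
        target = mapping["mapped"]
        results.append((
            target["seq_region_name"],
            target["start"],
            target["end"],
            target["strand"],
            target["assembly"],
        ))
    return results
```

A region may come back as several segments, or as none when it has no equivalent in the target assembly, so callers should handle an empty list.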
API Best Practices
Rate Limiting
The Ensembl REST API has rate limits. Follow these practices:
- Respect rate limits: Maximum 15 requests per second for anonymous users
- Handle 429 responses: When rate-limited, check the Retry-After header and wait
- Use batch endpoints: When querying multiple items, use batch endpoints where available
- Cache results: Store frequently accessed data to reduce API calls
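The practices above can be sketched as two small client-side helpers: a throttle that spaces calls to stay under a requests-per-second budget, and a chunker that splits a long ID list into batches. Both names are hypothetical, and the batch size is left to the caller because the per-request limit differs between endpoints.

```python
import time


class RateLimiter:
    """Space out calls so that at most max_per_second are made."""

    def __init__(self, max_per_second=15, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock      # injectable for testing
        self._sleep = sleep      # injectable for testing
        self._last = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for index in range(0, len(items), size):
        yield items[index:index + size]
```

Calling `limiter.wait()` immediately before each request keeps a loop under the limit, and each chunk from `chunked(ids, size)` can be sent as one batch request.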
Error Handling
Always implement proper error handling:
import requests
import time

def query_ensembl(endpoint, params=None, max_retries=3):
    """GET an Ensembl REST endpoint, retrying when rate-limited (HTTP 429)."""
    server = "https://rest.ensembl.org"
    headers = {"Content-Type": "application/json"}
    for attempt in range(max_retries):
        response = requests.get(
            f"{server}{endpoint}",
            headers=headers,
            params=params
        )
        if response.status_code == 200:
            return response.json()
        elif response.status_code == 429:
            # Rate limited: wait for the interval the server asks for, then retry.
            # Retry-After may be fractional seconds, so parse it as a float.
            retry_after = float(response.headers.get('Retry-After', 1))
            time.sleep(retry_after)
        else:
            # Raises requests.HTTPError for 4xx/5xx responses
            response.raise_for_status()
    raise RuntimeError(f"Failed after {max_retries} attempts")
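One possible refinement is to separate the wait-time decision from the request loop, so it can fall back to exponential backoff when no `Retry-After` header is present. The helper below is a sketch with a hypothetical name and arbitrary default values; it takes a plain mapping of headers.

```python
def retry_delay(headers, attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    Prefers the server-provided Retry-After value and otherwise falls
    back to capped exponential backoff.
    """
    value = headers.get("Retry-After")
    if value is not None:
        try:
            return max(0.0, float(value))
        except ValueError:
            # Header present but not numeric: use the backoff below
            pass
    return min(cap, base * (2 ** attempt))
```

In a retry loop this would replace the fixed parsing step, for example `time.sleep(retry_delay(response.headers, attempt))`.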
Installation
Python Package (Recommended)
uv pip install ensembl_rest
The ensembl_rest package provides a Pythonic interface to all Ensembl REST API endpoints.
Direct REST API
No installation needed - use standard HTTP libraries like requests:
uv pip install requests
Resources
references/
api_endpoints.md: Comprehensive documentation of all 17 API endpoint categories with examples and parameters
scripts/
ensembl_query.py: Reusable Python script for common Ensembl queries with built-in rate limiting and error handling
Common Workflows
Workflow 1: Gene Annotation Pipeline
- Look up gene by symbol to get Ensembl ID
- Retrieve transcript information
- Get protein sequences for all transcripts
- Find orthologs in other species
- Export results
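The steps above can be sketched as a single function. To keep the sketch independent of any HTTP client, it takes a hypothetical `fetch(endpoint, params)` callable that returns decoded JSON. The endpoint paths and field names (`Transcript`, `Translation`, `seq`, `homologies`) follow the REST API as commonly documented, but the homology path in particular is an assumption to verify against the current documentation.

```python
import csv
import io

FIELDS = ["gene_id", "transcript_id", "biotype", "protein_length"]


def annotate_gene(symbol, fetch, species="homo_sapiens"):
    """Run the five workflow steps for one gene symbol."""
    # Step 1: symbol -> gene record (expand=1 includes transcripts)
    gene = fetch(f"/lookup/symbol/{species}/{symbol}", {"expand": 1})

    rows = []
    # Steps 2-3: walk transcripts; fetch a protein for each coding one
    for transcript in gene.get("Transcript", []):
        protein = ""
        translation = transcript.get("Translation")
        if translation:
            seq = fetch(f"/sequence/id/{translation['id']}", {"type": "protein"})
            protein = seq["seq"]
        rows.append({
            "gene_id": gene["id"],
            "transcript_id": transcript["id"],
            "biotype": transcript.get("biotype", ""),
            "protein_length": len(protein),
        })

    # Step 4: orthologs (endpoint path is an assumption; check the REST docs)
    homology = fetch(f"/homology/id/{species}/{gene['id']}", {"type": "orthologues"})
    orthologs = [
        item["target"]["id"]
        for entry in homology.get("data", [])
        for item in entry.get("homologies", [])
    ]

    # Step 5: export the transcript table as CSV text
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return {"gene_id": gene["id"], "orthologs": orthologs, "csv": buffer.getvalue()}
```

Injecting `fetch` makes it straightforward to reuse one retrying, rate-limited request function across every step and to test the pipeline without network access.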
Workflow 2: Variant Analysis
- Query variant by rsID or coordinates
- Use VEP to predict functional consequences
- Check population frequencies
- Retrieve phenotype associations
- Generate report
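The final reporting step could look like the sketch below, which merges an already-fetched variation record and VEP result into plain text. The function name is hypothetical, and the field names (`mappings`, `populations`, `phenotypes`, `most_severe_consequence`) are assumptions about the response shape when population and phenotype data are requested.

```python
def variant_report(variation, vep_records):
    """Build a plain-text summary from a variation record and VEP records."""
    lines = [f"Variant: {variation['name']}"]

    # Genomic placement(s) of the variant
    for mapping in variation.get("mappings", []):
        lines.append(
            f"  Location: {mapping['location']} ({mapping['assembly_name']}), "
            f"alleles {mapping['allele_string']}"
        )

    # Predicted consequence(s) from VEP, deduplicated and sorted
    severe = sorted({
        record["most_severe_consequence"]
        for record in vep_records
        if "most_severe_consequence" in record
    })
    lines.append("  Most severe consequence: " + (", ".join(severe) or "n/a"))

    # Population allele frequencies
    for pop in variation.get("populations", []):
        lines.append(f"  Frequency: {pop['population']} {pop['allele']}={pop['frequency']}")

    # Associated traits, deduplicated and sorted
    traits = sorted({p["trait"] for p in variation.get("phenotypes", []) if p.get("trait")})
    lines.append("  Phenotypes: " + ("; ".join(traits) or "none reported"))
    return "\n".join(lines)
```

Keeping the report builder separate from the fetching code means the same function works on cached responses.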
Workflow 3: Comparative Analysis
- Start with gene of interest in reference species
- Find orthologs in target species
- Retrieve sequences for all orthologs
- Compare gene structures and features
- Analyze evolutionary conservation
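For the conservation step, one simple measure is percent identity between two aligned sequences. The sketch below assumes the inputs are gapped alignment strings of equal length, using `-` for gaps (for instance the aligned sequences a homology response may include). The function name is hypothetical, and this is only one of several common identity definitions: it ignores any column that contains a gap.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of gap-free alignment columns where both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if a.upper() == b.upper():
            matches += 1
    # Avoid division by zero when no columns are comparable
    return 100.0 * matches / compared if compared else 0.0
```

For example, `percent_identity("ACGT-A", "ACCTTA")` compares five gap-free columns, four of which match.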
Species and Assembly Information
To query available species and assemblies:
# List all available species
species_list = client.info_species()
# Get assembly information for a species
assembly_info = client.info_assembly(species='human')
Common species identifiers:
- Human: homo_sapiens or human
- Mouse: mus_musculus or mouse
- Zebrafish: danio_rerio or zebrafish
- Fruit fly: drosophila_melanogaster
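Because a species can be written several ways, it can help to resolve user input to the canonical URL name once, up front. The sketch below builds an alias index from a species listing and resolves names against it. It assumes each species entry carries `name`, `display_name`, and `aliases` fields; the helper names are hypothetical.

```python
def url_name(name):
    """Convert 'Homo sapiens' style names to 'homo_sapiens'."""
    return "_".join(name.strip().lower().split())


def build_alias_index(info_species_response):
    """Map every known name or alias (lowercased) to the canonical name."""
    index = {}
    for species in info_species_response.get("species", []):
        canonical = species["name"]
        index[canonical] = canonical
        for alias in species.get("aliases", []):
            index[alias.lower()] = canonical
        if species.get("display_name"):
            index[species["display_name"].lower()] = canonical
    return index


def resolve_species(name, index):
    """Return the canonical species name, or None when it is unknown."""
    key = name.strip().lower()
    return index.get(key) or index.get(url_name(name))
```

Returning `None` for unknown names lets the caller decide whether to raise an error or fall back to passing the raw string to the API.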
Additional Resources
- Official Documentation: https://rest.ensembl.org/documentation
- Python Package Docs: https://ensemblrest.readthedocs.io
- EBI Training: https://www.ebi.ac.uk/training/online/courses/ensembl-rest-api/
- Ensembl Browser: https://useast.ensembl.org
- GitHub Examples: https://github.com/Ensembl/ensembl-rest/wiki