Frequently Asked Questions (FAQ)
Answers to questions of a General or Technical nature about UniProt, the databases, and the web site.
General
What is the Universal Protein Resource?
Until recently, the EBI/SIB Swiss-Prot + TrEMBL databases and the PIR Protein Sequence Database (PIR-PSD) coexisted
with differing protein sequence coverage and annotation priorities. In 2002, EBI, SIB, and PIR joined forces as the UniProt consortium.
The primary mission of the consortium is to support biological research by maintaining a high quality database that serves
as a stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase, with extensive
cross-references and querying interfaces freely accessible to the scientific community. UniProt builds upon the solid foundations
laid by the consortium members over many years.
The UniProt databases consist of three database layers:
The UniProt Archive (UniParc) provides a stable, comprehensive sequence collection without redundant sequences by storing the complete body of publicly available protein sequence data.
The UniProt Knowledgebase (UniProtKB) provides the central database of protein sequences with accurate, consistent, rich sequence and functional annotation.
The UniProt Reference Clusters (UniRef) databases provide condensed data collections based on the UniProt Knowledgebase in order to obtain complete coverage of sequence space at several resolutions.
What is UniParc?
UniParc is the UniProt Archive, the most comprehensive publicly accessible non-redundant protein sequence database available.
New and updated protein sequences are loaded daily from public databases including Swiss-Prot, TrEMBL, PIR-PSD, EMBL,
Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, and European, American, and Japanese Patent Office proteins. To avoid redundancy,
each unique sequence is stored only once and assigned a unique UniParc identifier. These identifiers are stable: once created, they
are never deleted or reassigned. As a protein sequence is extracted from a source database, a cross-reference to that database is created in
UniParc. A single protein may be referenced by more than one database cross-reference. The database cross-references therefore link a UniParc sequence to its unique identifiers in the source databases. If the source database contains protein sequence versions, used
to indicate changes in the sequence, they are stored as part of the database cross-reference. As only a few databases have
protein sequence versions, a UniParc sequence version is made available as part of each database cross-reference. It is incremented
each time the sequence pointed to by a database cross-reference changes, and makes it possible to observe sequence changes
in all source databases.
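The storage rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual UniParc implementation: the `Archive` and `CrossRef` classes, the in-memory dictionaries, and the way identifiers are numbered are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CrossRef:
    database: str              # source database name
    identifier: str            # identifier of the protein in that source database
    sequence_version: int = 1  # archive-side version, incremented on sequence change


@dataclass
class Archive:
    # Hypothetical in-memory stand-in for the archive, not the real schema.
    upi_by_sequence: dict = field(default_factory=dict)  # sequence -> identifier
    xref_by_source: dict = field(default_factory=dict)   # (db, id) -> CrossRef
    upi_by_source: dict = field(default_factory=dict)    # (db, id) -> current identifier

    def _upi_for(self, sequence):
        # Each unique sequence is stored once; an identifier is never reassigned.
        # The "UPI" + 10 hex digits layout is assumed from the example identifier.
        if sequence not in self.upi_by_sequence:
            number = len(self.upi_by_sequence) + 1
            self.upi_by_sequence[sequence] = f"UPI{number:010X}"
        return self.upi_by_sequence[sequence]

    def load(self, database, identifier, sequence):
        """Record that `identifier` in `database` currently has `sequence`."""
        upi = self._upi_for(sequence)
        key = (database, identifier)
        if key not in self.xref_by_source:
            self.xref_by_source[key] = CrossRef(database, identifier)
        elif self.upi_by_source[key] != upi:
            # The sequence behind this cross-reference changed in the source.
            self.xref_by_source[key].sequence_version += 1
        self.upi_by_source[key] = upi
        return upi
```

Loading the same sequence from two source databases returns the same identifier, while reloading a source record with a changed sequence increments only that cross-reference's version.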
What is the UniProt Knowledgebase?
Consisting of richly-annotated entries, the UniProt Knowledgebase is the centerpiece of the consortium's
activities. Initially, the Knowledgebase was derived from the merger of Swiss-Prot, TrEMBL and PIR-PSD
protein sequences with annotations of sequence and functional information. Future Knowledgebase entries
will be derived from the UniProt Archive sequences we see as essential for the UniProt Knowledgebase.
For example, sequences for which novel functional, structural, and biochemical data have been published
have high annotation priority. The UniProt Knowledgebase consists of two parts, a section containing fully
manually-annotated records resulting from information extracted from literature and curator-evaluated computational
analyses, and a section with computationally-analysed records awaiting full manual annotation. For the sake of
continuity and name recognition, the two sections are referred to as "UniProtKB/Swiss-Prot" and "UniProtKB/TrEMBL", respectively.
What are the UniRefs?
With the help of automatic procedures, three UniProt Reference Clusters
(UniRef) databases, UniRef100, UniRef90 and UniRef50, are created from
UniProt Knowledgebase and selected UniParc records.
The databases provide complete coverage of sequence space while hiding
redundant sequences from view. The non-redundancy allows faster sequence
similarity searches by using UniRef90 and UniRef50 (which merge all
records from all source organisms with mutual sequence identity of >90%
or >50%, respectively, into a single record). Unlike the case in
UniParc, sequence fragments are merged in UniRef.
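As a rough illustration of the 100% level only, the sketch below groups records whose sequences are identical to, or contained in, a longer sequence. It is a toy under stated assumptions: the function name is hypothetical, a fragment is approximated as an exact substring, and nothing here reflects the actual clustering procedure. The 90% and 50% levels would additionally need a sequence identity measure, which is not shown.

```python
def cluster_identical_and_fragments(records):
    """Group records whose sequence equals, or is a fragment of, a longer one.

    `records` maps a record identifier to its sequence. The result maps each
    representative identifier to the list of identifiers merged into it.
    """
    # Longest sequences first, so a fragment always meets its parent sequence
    # before it could become a representative itself.
    ordered = sorted(records.items(), key=lambda item: (-len(item[1]), item[0]))
    clusters = {}
    for identifier, sequence in ordered:
        for representative in clusters:
            # Assumption: "fragment" means an exact substring of the parent.
            if sequence in records[representative]:
                clusters[representative].append(identifier)
                break
        else:
            # No existing representative contains this sequence.
            clusters[identifier] = [identifier]
    return clusters
```

With four hypothetical records, two identical, one a fragment of those, and one unrelated, the function returns two clusters: one holding the first three records and one holding the unrelated record alone.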
What will happen to Swiss-Prot, TrEMBL and PIR-PSD?
UniProtKB/Swiss-Prot and UniProtKB/TrEMBL are the two sections of the UniProt Knowledgebase. UniProtKB/Swiss-Prot is the smaller part, containing
fully annotated records which include curator-evaluated computational analysis as well as information extracted
from the literature. UniProtKB/TrEMBL is the larger part, containing computationally analysed records waiting for full
manual annotation.
The PIR-PSD data were imported into UniParc, and bi-directional cross-references between Swiss-Prot + TrEMBL and PIR-PSD were created to allow easy tracking of former PIR-PSD entries into UniProtKB. All suitable sequences in PIR-PSD that are missing from Swiss-Prot + TrEMBL are being incorporated into the TrEMBL section of UniProt Knowledgebase.
Additionally, all valid references and experimentally verified data present in PIR-PSD, but missing from Swiss-Prot + TrEMBL, are also being transferred to the relevant UniProtKB records.
To avoid duplication of work within UniProtKB, PIR-PSD release 80.00 (31-Dec-2004) was the final release.
Is a commercial license required to use UniProt?
No commercial license is required to use UniProt, but some restrictions apply. Please see Terms of Use for more details.
What is UniProt's policy regarding copyright and database distribution?
We have chosen to apply the Creative Commons Attribution-NoDerivs License to all copyrightable parts of our databases. This means that you are free to copy, distribute, display and make commercial use of these databases, provided you give us credit. However, if you intend to distribute a modified version of one of our databases, you must ask us for permission first. We provide all data on an 'as-is' basis. We make no warranties regarding the correctness of the data, and disclaim liability for damages resulting from its use. We cannot provide unrestricted permission regarding the use of the data, as some data may be covered by patents or other rights.
Why is UniParc not available via ftp?
UniParc is a non-redundant protein sequence archive, containing both active and dead sequences, and it is species-merged
since sequences are handled just as strings - all sequences 100% identical over the whole length of the sequence between
species are merged. UniParc records do not have any annotation since the annotation will be only true in the real context
of the sequence. For example, proteins with the same sequence may have different functions depending on species, tissue,
developmental stage, etc. All this context dependent information (if known at all) is not present in UniParc, and it is
the purpose of the UniProt Knowledgebase to provide this. Therefore, the results of sequence similarity searches
against UniParc include links to the UniProt Knowledgebase, where the annotations can be viewed. The unavailability
of annotation, the merging of sequences into one single record and the presence of both active and inactive sequences in
UniParc make it unsuitable for any kind of large scale parsing and manipulation. Hence, UniParc is not made available via ftp.
How can I link to UniProt entries?
The standard way of linking to UniProt entries, displaying the UniProt “basic”
view as HTML, is:
http://www.uniprot.org/entry/<id or ac number>
Examples:
http://www.uniprot.org/entry/cyc_human
http://www.uniprot.org/entry/P99999
http://www.uniprot.org/entry/UniRef100_P99999
http://www.uniprot.org/entry/UniRef90_P99999
http://www.uniprot.org/entry/UniRef50_P99999
http://www.uniprot.org/entry/UPI00000002E4
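For pages that generate such links programmatically, the pattern above can be wrapped in a small helper. This is a minimal sketch: the function name `entry_url` is hypothetical, and percent-encoding the identifier is a defensive assumption rather than a documented requirement.

```python
from urllib.parse import quote

# URL prefix taken from the linking pattern shown above.
ENTRY_URL_PREFIX = "http://www.uniprot.org/entry/"


def entry_url(identifier):
    """Return the entry link for an ID, accession number, UniRef ID or UniParc ID."""
    identifier = identifier.strip()
    if not identifier:
        raise ValueError("an ID or accession number is required")
    # Percent-encode anything unexpected so the result is always a valid URL.
    return ENTRY_URL_PREFIX + quote(identifier, safe="")


print(entry_url("P99999"))
print(entry_url("UniRef90_P99999"))
```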
Technical
What is the difference between an accession number and an ID?
An accession number (AC) is assigned to each sequence upon inclusion into UniProtKB.
Accession numbers are stable from release to release. If several UniProt Knowledgebase
entries are merged into one, for reasons of minimizing redundancy, the accession numbers
of all relevant entries are kept. Each entry has one primary AC and optional secondary ACs.
An ID is a unique identifier, often containing biologically relevant information. It is
sometimes necessary, for reasons of consistency, to change IDs (for example to ensure that
related entries have similar names). Another common cause for changing an ID is when an entry
is promoted from UniProtKB's TrEMBL section (with computationally-annotated records) to the
Swiss-Prot section (with fully curated records).
However, an accession number is always conserved, and therefore allows unambiguous citation of UniProt entries.
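The practical consequence is that references should be keyed on accession numbers rather than IDs. The sketch below shows why, using a hypothetical `EntryIndex` class and made-up accession numbers and IDs; it is not part of any UniProt software.

```python
class EntryIndex:
    """Toy lookup table: every AC, primary or secondary, resolves to its entry."""

    def __init__(self):
        self._entry_by_ac = {}

    def add(self, entry_id, primary_ac, secondary_acs=()):
        entry = {
            "id": entry_id,
            "primary_ac": primary_ac,
            "secondary_acs": list(secondary_acs),
        }
        # Secondary ACs are kept after a merge, so old citations still resolve.
        for ac in (primary_ac, *secondary_acs):
            self._entry_by_ac[ac] = entry
        return entry

    def rename(self, ac, new_id):
        # An ID may change, for example on promotion from TrEMBL to Swiss-Prot.
        self._entry_by_ac[ac]["id"] = new_id

    def lookup(self, ac):
        return self._entry_by_ac[ac]


index = EntryIndex()
# All values below are invented for illustration.
index.add("Q00001_HUMAN", "Q00001", secondary_acs=["Q00002"])
index.rename("Q00001", "EXMP_HUMAN")
# Both the primary and the secondary AC still point at the renamed entry.
print(index.lookup("Q00002")["id"])
```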
Does the UniProtKB flat file format differ from the Swiss-Prot/TrEMBL flat file format?
No, the UniProtKB flat file format is identical to the former Swiss-Prot and TrEMBL
format.
What is the meaning of the different lines of data in a UniProtKB entry flat file?
The following table summarizes the information contained within UniProtKB flat file entries.
Each line begins with a two-character line code. Not all entries contain all line
codes. Detailed information about each line code is available within the User
Manual.
| Code | Meaning | Description |
| ID | Identification | Contains identifying information and characteristics of the sequence |
| AC | Accession number(s) | Release-to-release stable identifiers |
| DT | Date | When the entry was created, or when the sequence or annotation was modified |
| DE | Description | The name of the protein, often a function indicator |
| GN | Gene name(s) | The gene(s) that code for the protein |
| OS | Organism species | The organism from which the sequence is derived |
| OG | Organelle | If the sequence is non-chromosomal in origin |
| OC | Organism classification | The taxonomic class to which the organism belongs |
| OX | Taxonomy cross-reference(s) | The NCBI TaxID for the OC line |
| RN | Reference number | The sequential number of the literature citation within the entry |
| RP | Reference position | The type of data, and the position in the sequence to which the citation refers |
| RC | Reference comment(s) | Comments relevant to the reference cited |
| RX | Reference cross-reference(s) | Bibliographic cross-reference, such as PubMed ID |
| RA | Reference authors | Authors of the citation |
| RT | Reference title | Title of the citation |
| RL | Reference location | Source of the citation, such as journal, book, or unpublished data |
| CC | Comments | Free text notes about the protein |
| DR | Database cross-references | Pointers to sources or related information for the entry |
| KW | Keywords | Indexable indicator of function, structure, or other information |
| FT | Feature table | Annotation of specific residues of the sequence |
| SQ | Sequence header | Marks the beginning of the sequence and provides summary data |
| (no code) | Sequence data | The sequence itself |
| // | Termination line | End of entry |
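The line-code layout lends itself to simple line-by-line parsing. The sketch below is a minimal reader under stated assumptions: the function name `parse_entries` is hypothetical, sequence data lines are assumed to start with blanks, and the sample entry is invented and heavily abbreviated. The User Manual remains the reference for the exact content of each line.

```python
from collections import defaultdict

# Invented, heavily abbreviated entry used only to exercise the parser.
SAMPLE = """\
ID   EXMP_HUMAN
AC   P00000; Q00000;
DE   Hypothetical example protein.
SQ   SEQUENCE
     MKTAYIAKQR QISFVKSHFS
//
"""


def parse_entries(text):
    """Yield one (fields, sequence) pair per entry found in flat-file text."""
    fields = defaultdict(list)
    sequence_parts = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("//"):
            # Termination line: emit the entry and start a new one.
            yield dict(fields), "".join(sequence_parts)
            fields, sequence_parts = defaultdict(list), []
        elif not line[:2].strip():
            # No line code: sequence data (assumed to begin with blanks).
            sequence_parts.append(line.replace(" ", ""))
        else:
            # Two-character line code, then the line content.
            fields[line[:2]].append(line[2:].strip())
    # An entry without a terminating "//" line is silently discarded.


for fields, sequence in parse_entries(SAMPLE):
    print(fields["ID"], fields["AC"], len(sequence))
```

Collecting each code into a list keeps repeated lines, such as several DR or FT lines in one entry, in their original order.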
How can I reduce the apparent sequence redundancy for BLAST?
After a BLAST search, many of the sequences returned are essentially the
same (fragments, identical or near-identical sequences from different
sources, different names). The apparent redundancy of the results can be
reduced by using BLAST against sequence-condensed databases such as
UniRef100, UniRef90 or UniRef50. The UniRef
databases combine closely related sequences into a single record,
thereby hiding highly similar entries from the results. From the BLAST
result page you can still view all the collapsed sequences by clicking
on the UniRef ID.
This document was last modified on June 1, 2005.
If you have any comments or questions please contact
UniProt help.