This document explains how to prepare your data before launching the checking or the conversion (transform / update) process.
If you are only installing a Web portal with no local data, this document does not concern you.
By default, place your flat file, XML or text/CSV reports, map images and other specific files (such as the hidden* files, cf. below) in the 'data' directory. You may also opt for a different directory for your data (to be defined within the configuration files). In that case, use your own directory path in the following guidelines instead of the default 'data' directory.
A test database has been prepared for you to test the tool. The data set, combined with some additional examples, is to be found in the data_test directory, so you can also check the nature of the data to be used. Please note that this is a totally fictitious data set with no real biological significance. If you are testing the flat file format, the database name within the PostgreSQL catalog will be 'test_2dpage' by default; otherwise, define the database name yourself during the configuration process.
You can have a look at this test database by clicking here.
Note: Since version 1.0, it is possible to provide your data in various formats (simple text reports, spreadsheets or XML files), in addition to the previously required flat file format (the SWISS-2DPAGE-like text file listing your proteins sequentially). Nevertheless, a format similar to the flat file format is still used *internally* by the tool. If you provide your data in any other format, the tool will still automatically generate an intermediate flat file to work with. A copy of this generated file is placed in your data directory (as well as a similar copy in the temp directory). This file is named "last_generated_flat_file.dat". During the conversion process, you may have a look at it. If you wish, you may even interrupt the process to edit this file and make your own additions or changes. You should then restart the process, defining your data source to be read from this flat file.
The images
For each map, create 2 corresponding images: one with exactly the same dimensions (width and height) as the original map image, and a small one (approximately 100 x 100 pixels).
It is possible to use images of a different size than the original ones, and even to shift their origin, which by default is located at the top-left corner (more details are available in the configuration document, see Readme: Configuration).
Both image files should have exactly the same name as the original map name referred to in the IM line of your flat file (the text file listing your proteins sequentially, in case you are providing a SWISS-2DPAGE-like flat file) and/or as listed in your 'existing.maps' file (see below), except that the small image name should be preceded by the prefix 'small_'. Map names should be upper-cased and should not contain any spaces (e.g. LIVER_MOUSE, PLASMA_4-7). Use any valid graphic type (gif, tif, png, jpg...) and add its extension to the image name. Using the 'png' or 'jpg' image format speeds up the display of your images, while using 'tif' images slows it down.
Example for a map called PLASMA (e.g. an "IM PLASMA" line in a SWISS-2DPAGE-like flat file, or simply a map called PLASMA in your 'existing.maps' file): create, for example, 'PLASMA.png' (usually the same dimensions as the original map image) and 'small_PLASMA.png' (the 'small_MapName' image).
Put all your images in the 'data' directory, or place them respectively in two sub-directories called 'images' and 'small_images'.
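The naming rules above can be checked before running the tool. The following is only a minimal sketch, not part of Make2D-DB II: the function name find_map_images and the list of extensions are hypothetical, and it assumes that the 'images' and 'small_images' sub-directories sit inside the data directory.

```python
import re
from pathlib import Path

# Map names: upper-case letters, digits, underscores and '-' only (no spaces).
MAP_NAME_RE = re.compile(r"^[A-Z0-9_-]+$")


def find_map_images(data_dir, map_name, extensions=("png", "jpg", "gif", "tif")):
    """Return (full_image, small_image) paths for a map; None where missing.

    Hypothetical helper: it only looks for files, it does not check sizes.
    """
    if not MAP_NAME_RE.match(map_name):
        raise ValueError(f"invalid map name: {map_name!r}")
    data_dir = Path(data_dir)
    # Assumption: the two sub-directories live inside the data directory.
    full_dirs = [data_dir, data_dir / "images"]
    small_dirs = [data_dir, data_dir / "small_images"]

    def first_existing(directories, stem):
        for directory in directories:
            for ext in extensions:
                candidate = directory / f"{stem}.{ext}"
                if candidate.is_file():
                    return candidate
        return None

    return (first_existing(full_dirs, map_name),
            first_existing(small_dirs, f"small_{map_name}"))
```

For a map called PLASMA, find_map_images('data', 'PLASMA') would report which of 'PLASMA.<ext>' and 'small_PLASMA.<ext>' are present.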
Tip: If you have a logo image that you want to display on your Web interface, put it in this 'data' directory as well.
Note: Melanie / ImageMasterTM 2D Platinum 5.0 users working with XML exported reports do not strictly need to perform this task, although doing so offers many more annotations for your maps than can be extracted from the exported XML files.
Important: During the configuration process (perl make2db.pl -m config), you are asked whether you want to generate a maps' file (third choice from the very first level). You will then be guided to enter several annotations for each of your maps. It is highly recommended to generate your file this way, as it offers richer annotations for your maps.
The new version of the tool lets you annotate the following fields:
the map name itself, one single upper-cased word, e.g.
PLASMA or PLASMA_4-7; all your subsequent gel text reports - if any -
should have
the same name used in here, plus a '.txt' extension (required parameter)
a more descriptive and longer name (optional parameter)
width of the image in pixels / X-coordinates (required)
height of the image in pixels / Y-coordinates (required)
pI start and end values (optional)
Mw start and end values (optional)
taxonomy ID (optional)
a description of the species strain if needed (optional)
a tissue name - only names listed in the UniProtKB
tissue list are accepted (the ID or the SY line, e.g. Abdomen) -
you may contact us if you wish to add any tissue not present in this
list (optional)
a list of mapping (identification) methods applied to the whole map for the
spots'
identification (optional)
URL (uri) for both the preparation and informatics parts
(optional)
local documents for both the preparation and informatics parts,
e.g. PSI-MIAPE documents, PSI-Gel documents (optional)
short comments for both the preparation and informatics
parts (optional)
software used for the detection (optional)
any related comments (optional)
number of detected spots for statistical data; this will
override the number of detected spots read from the Melanie XML reports
when given (optional)
shift the X position of the image in pixels (optional) -- note: this value will be overridden by any *defined* shift in '2d_include.pl:map_shift_left', which acts on *all* maps together
shift the Y position of the image in pixels (optional) -- note: this value will be overridden by any *defined* shift in '2d_include.pl:map_shift_down', which acts on *all* maps together
adapt the spots' positions horizontally using a ratio value (optional) -- note: this value will be overridden by any *defined* ratio in '2d_include.pl:map_x_ratio', which acts on *all* maps together
adapt the spots' positions vertically using a ratio value (optional) -- note: this value will be overridden by any *defined* ratio in '2d_include.pl:map_y_ratio', which acts on *all* maps together
example of a
maps' file generated during the configuration process (note that, for
each map, parameters are separated by TABs and are all in a single
line).
Be sure to place your generated 'existing.maps' file into your 'data' directory.
The following old format is still accepted but deprecated; it has been kept only for compatibility:
The old way:
Create a text file called "existing.maps" containing the list of all your map images. Each line should contain the details for one map. The minimal syntax to follow is:
map_short_name map_long_name width height
"map_short_name" is the name of your map (e.g. PLASMA) and of the corresponding image files (if you are not a Melanie / ImageMasterTM 2D Platinum 5.0 user, this name, combined with the name of your database, will also be used as a unique identifier for your map). "map_long_name" is a more descriptive name for the map, to be displayed. Spaces are now allowed to separate words. width and height are the X and Y dimensions, in pixels, of your original map image. Finally, separate fields with a tab.
Example of an existing.maps file (spaces could be tabs or just spaces):
LIVER Human Liver original Geneva 600 800
PLASMA Human Plasma number 11 1200 1600
When using a flat file: remember that each map name should be written exactly as it is written in the IM fields of your data flat file.
Finally, you may also add a tissue name at the end of each line (optional). Only tissues listed in tisslist.txt, tisslist_initial.txt or tisslist_aliases.txt will be retained. Put your existing.maps file in the 'data' directory.
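To illustrate the old-style line syntax, here is a minimal sketch of how one such line could be read. It is not the tool's own parser: the names MapEntry and parse_existing_maps_line are hypothetical, and the sketch assumes that fields are separated by tabs (with spaces only inside the long name), since a long name containing spaces cannot be split reliably otherwise.

```python
from typing import NamedTuple, Optional


class MapEntry(NamedTuple):
    short_name: str
    long_name: str
    width: int
    height: int
    tissue: Optional[str] = None


def parse_existing_maps_line(line):
    """Parse one tab-separated line of an old-style 'existing.maps' file."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 tab-separated fields: {line!r}")
    short_name, long_name, width, height = fields[:4]
    # The optional fifth field is the tissue name.
    tissue = fields[4] if len(fields) > 4 and fields[4] else None
    return MapEntry(short_name, long_name, int(width), int(height), tissue)
```

A line that does not carry at least the four minimal fields, or whose width or height is not an integer, raises a ValueError in this sketch.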
The spots and entries identification and annotation
This section describes how to prepare your spots' data.
The term 'entry' (or entries) is commonly used in this document as a synonym for 'protein'.
Before going any further, remember that there are three ways to provide this data to the tool:
A Flat File: a SWISS-2DPAGE-like text file, listing your proteins sequentially. This file has to be combined either with simple spot lists (defining the spots' positions), or with Melanie / ImageMasterTM 2D Platinum text reports or XML exports.
Spreadsheets (CSV / tab-delimited text files, e.g. EXCEL exports).
Melanie / ImageMasterTM 2D Platinum XML exports alone.
Depending on your choice, you will have different levels of granularity
for your annotations. Internally, the tool will always partially rely
on a flat file, be it provided by the user, or generated by the tool
itself from the spreadsheets or the Melanie exports.
The flat file offers the most structured way to provide data to the tool. It has the advantage of being strict (in a positive way) and can be extremely rich. The drawback is that it is hard to manually generate a highly annotated and correctly formatted flat file.
The spreadsheets have the advantage of being simple to generate. Many laboratories already work with this format to store their 2D data. Using spreadsheets, users are totally free to define the extent of their annotations, from very basic ones to extremely rich, user-defined ones. The main drawback is that many user-defined annotation categories make it harder to link data between researchers (a semantic problem), especially since no unambiguous ontology has been defined yet.
The Melanie XML exports are, among the three options, the easiest way to generate data, assuming of course that the maps are accurately annotated with this software. However, those annotations are currently quite limited and the XML schema itself does not follow any widespread standard, as such a standard does not exist yet.
We will detail each of those options separately. Keep in mind that if you are not providing your own flat file, you will always be able to work with the automatically generated one (to be found within your data directory, as well as in the temp directory, under the name last_generated_flat_file.dat whenever you run the tool in the other two modes). You may sometimes want to edit this generated file manually and restart your conversion process based on your edited copy (by switching to the flat file mode in your configuration files and setting the $db_file variable to 'last_generated_flat_file.dat', or to any other name if you save your modified copy under another name).
One more remark: during the data/syntax checking, you may encounter error messages complaining about some data inconsistency. Some of those messages point to a section of a flat file where the error has been detected, even if you have not provided any flat file. As a major part of your data is translated internally into the flat file format, an inconsistency in this flat file can be traced back to find the source of the error in your original spreadsheets or text reports.
Finally, it is important to note that a major part of the external updates relies on a protein index, namely the Swiss-Prot/UniProtKB accession numbers. By providing such identifiers for your identified proteins (as your accession numbers or as cross-references), you get the most out of this feature and become more "visible" to other remote Make2D-DB II databases (the tool creates dynamic cross-references between the remote databases based on this index or the SWISS-2DPAGE one).
(Melanie / ImageMasterTM 2D Platinum users working with XML exported reports or with text exported reports, combined with a flat file, do not need to read this first sub-section; you may go directly to "Other supported report format")
The spots reports are text files that list the spots' coordinates within a map image. There should be one report per map in the 'data' directory. Each report should be given the name of the corresponding map exactly as it is written in the 'IM' line of your flat file. It should also have a '.txt' extension (e.g. PLASMA.txt).
Each report should contain a line for each identified spot on the corresponding map, indicating the spot identifier (spot's name) and its position on the image (given in pixels). Fields may be separated by tabs or simple spaces. Several line syntaxes are accepted.
Tip: many 2D-PAGE software packages should let you easily export this type of report file.
Once generated, put all of your reports in the 'data' directory. Make sure they have been saved in text format.
General syntax of a report line:
Spot_ID x_position y_position [%Od] [%Vol]
Separate fields with spaces (or tabs). Spot_ID is the identifier given to the spot/band, x_position is the spot position (in pixels) from left to right, and y_position is the spot position from top to bottom. In 1D maps (SDS) you can even omit the x_position field; a default value will then be read from the configuration file (see Readme: Configuration).
You may also include values for both the relative optical density (%Od) and the relative volume (%Vol) of each spot (expressed in %). If you give one single value, it will be interpreted as the %Vol value.
- Example of a MAP.txt file (header lines containing double quotes around field names are optional and will be ignored). Fields may be separated by tabs or simple spaces:
423 120 210
424 100 120
...
or
425 300 400 0.012 0.0345
...
- for an SDS.txt file (minimal data):
426 300
Note: If you want to use any field other than the SpotID as your spot identifier (e.g. SWISS-2DPAGE-like SerialNumbers), simply replace the SpotID field with the annotation field you want to use as your spot identifier (though make sure this annotation field is unique per spot), e.g.
2D-ABC123 120 210
Other examples: PLASMA.txt, PLASMA2.txt from the test_2dpage database. In the second example, you may notice that some extra annotations, i.e. "pi:4.85 mw:22158", were added at the end of the line. The syntax is "pi:value mw:value", separated by spaces or a tab; the full syntax is "Spot_ID x_position y_position [%Od] [%Vol] [pi:value] [mw:value]", where square brackets mean that the values within are optional.
Those parameters (pI and Mw) are optional, as the flat file already contains this information (see below). Defining those values inside the spots' report makes the tool ignore those read from the flat file, and even accept the following syntax for a spot within your flat file: "2D -!- PI/MW: SPOT spotID", without values for pI and Mw. It is up to you to decide whether to list those values here (reducing redundancy) or not.
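The full report line syntax can be illustrated with a short sketch. This is not the tool's parser: the function name parse_spot_line and the returned keys are hypothetical. The sketch assumes that a line with a single number after the spot identifier is a 1D (SDS) line giving only the y position, and it does not cover a 1D line that also carries %Od/%Vol values.

```python
def parse_spot_line(line):
    """Parse 'Spot_ID x y [%Od] [%Vol] [pi:value] [mw:value]' into a dict."""
    tokens = line.split()  # fields may be separated by spaces or tabs
    if not tokens:
        raise ValueError("empty line")
    spot = {"id": tokens[0]}
    numbers = []
    for token in tokens[1:]:
        lowered = token.lower()
        if lowered.startswith("pi:"):
            spot["pi"] = float(token[3:])
        elif lowered.startswith("mw:"):
            spot["mw"] = float(token[3:])
        else:
            numbers.append(float(token))
    if len(numbers) == 1:
        # Assumption: 1D (SDS) map, only the y position is given.
        spot["y"] = numbers[0]
    elif 2 <= len(numbers) <= 4:
        spot["x"], spot["y"] = numbers[0], numbers[1]
        extra = numbers[2:]
        if len(extra) == 1:
            spot["vol"] = extra[0]  # a single extra value is read as %Vol
        elif len(extra) == 2:
            spot["od"], spot["vol"] = extra
    else:
        raise ValueError(f"unexpected number of values: {line!r}")
    return spot
```

Header lines containing double-quoted field names would have to be skipped before calling such a function, since they are ignored by the tool.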
By defining the configuration variable '$Melanie = 1' in your include.cfg
configuration file, you tell the tool to look for some Melanie reports.
The tool will start by searching any file with the extension .xml (XML files) inside your data directory (e.g. anything.xml) and will parse them.
If none is present, then it will look for text files (.txt) corresponding to the different
maps listed in your database (e.g. PLASMA.txt
and PLASMA2.txt).
You may use the default text spot reports generated by Melanie:
If you are using the free-of-charge Melanie / ImageMaster Viewer (tested up to version 5.02), you can directly use the generated spot report, which exports the following data by default (make sure that they are listed in the following order):
"GelName" SpotID X Y Pi Mw Od Area Vol %Od %Vol Circularity/Saliency.
Those reports can be read and processed directly by the tool, with no need to manipulate them.
To use another annotation (SerialNumber) instead of the default SpotID as your spot identifier, simply add (export) this annotation as an additional last field in each of your report lines, e.g.
"GelName" SpotID X Y Pi Mw Od Area Vol %Od %Vol Circularity/Saliency "SerialNumber/your_annotation"
You may also work in combination with the Melanie XML exports:
Make sure you have the common perl XML::Parser and libxml-perl modules installed on your system (the tool needs the XML::Parser::PerlSAX perl module). If not, ask your system administrator to install a recent version, or simply prepare your reports as described in the sub-sections above.
The Make2D-DB MelanieXMLParser module will extract the name of the gel from the Melanie XML file. If the original Melanie image file name is different from the gel name used in the flat file IM lines, then rename the corresponding XML files to the appropriate gel names (e.g. PLASMA.xml and PLASMA2.xml), one gel per file. Otherwise, you may use any name and group several gels inside one single XML file, provided it has the extension .xml.
It is not strictly required to prepare an "existing.maps" file (if present, it will override the Melanie XML values). The pI/Mw values will then be read from the Melanie export, overriding any values given in your flat file (for your flat file, the syntax "2D -!- PI/MW: SPOT spotID", without any given pI/Mw values, will then be accepted).
All graphically detected spots will be integrated into your database (whether annotated/identified or not) if you set the variable "$include_not_identified_spots = 1" in your configuration file include.cfg.
-- deprecated -- If you are a Melanie 4 / ImageMasterTM 2D Platinum 5.0 user (or higher) and you do not wish to export those spots' reports yourself, then simply do not create them. If it does not find the spots' reports, the tool will try to analyze the Melanie / ImageMasterTM 2D Platinum 5.0 maps themselves to extract the spots' positions. If your maps are saved in the Melanie II or the Melanie 3 format, and you do not have a copy of Melanie 4 or ImageMasterTM 2D Platinum 5.0 (or higher), you can still convert your maps using the ImageMaster/Melanie Viewer (version 4.08 and up). In doing so, you should be aware that the tool will rely on the Melanie / ImageMasterTM 2D Platinum 5.0 SpotID field to refer to your spots, and on the pI/Mw values given in your flat file. -- deprecated --
To extract spot annotations, the tool will first try to read any exported Melanie XML file with a .xml extension, provided it has been configured to read Melanie / ImageMaster data ("$Melanie = 1" in include.cfg).
It will then look for text reports (files named MAP.txt, where MAP stands for each of the map names given in your database flat file).
If no reports are found, the tool will try to extract annotations directly from the Melanie images themselves (it is recommended not to rely on this step, as this option is being deprecated!).
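The lookup order described above can be sketched as follows, for instance to check which reports would be picked up from your data directory. The function find_spot_sources and its return values are hypothetical and only mirror the order given in this document; the deprecated fallback to the Melanie images themselves is not covered.

```python
from pathlib import Path


def find_spot_sources(data_dir, map_names, melanie=True):
    """Return (kind, files), where kind is 'xml', 'txt' or 'none'."""
    data_dir = Path(data_dir)
    if melanie:
        # With $Melanie = 1, any .xml file in the data directory is read first.
        xml_files = sorted(data_dir.glob("*.xml"))
        if xml_files:
            return "xml", xml_files
    # Otherwise, look for one text report per map, named MAP.txt.
    txt_files = [data_dir / f"{name}.txt" for name in map_names]
    txt_files = [path for path in txt_files if path.is_file()]
    if txt_files:
        return "txt", txt_files
    return "none", []
```

A call such as find_spot_sources('data', ['PLASMA', 'PLASMA2']) would then tell you whether XML exports or text reports are going to be used.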
The Database Flat File
Create your database flat file (a text file) containing one entry per protein, and place it in the data directory. Entries are separated by a // line. The usual headers used with the first version of make2ddb are optional and will be ignored, except for the database name if no name has been given in the configuration file.
Before going any further, please make sure you are familiar with the syntax described in the SWISS-2DPAGE user manual, which describes in more detail a large part of the syntax to be adopted.
Compared to the syntax described in that manual, the tool is much more tolerant. It also lets you define a list of default values to be applied whenever required information is missing. Finally, some specific additions have been adopted for the Mass Spectrometry annotations.
Example of a simple xxx.dat file (fictitious entry; some Make2D-DB II optional keywords are not displayed, for simplification):
ID HC_HUMAN; STANDARD; 2DG.
AC P02760; P02759; P00977;
DE Alpha-1-microglobulin/Inter-alpha-trypsin inhibitor light chain
DE (PROTEIN HC) (HI30).
IM LIVER, PLASMA.
RN [1]
RP MAPPING ON GEL.
RX MEDLINE; 78094420.
RA Anderson N.L., Anderson N.G.;
RT "High Resolution 2-DE of human Liver";
RL Proc. Natl. Acad. Sci. U.S.A. 74:5421-5425(1977).
2D -!- MASTER: LIVER;
2D -!- PI/MW: SPOT 1=5.12/30851;
2D -!- PI/MW: SPOT 2=5.07/29736;
2D -!- MASTER: PLASMA;
2D -!- PI/MW: SPOT 1=4.86/33544;
2D -!- PI/MW: SPOT 2=4.96/32167;
2D -!- PI/MW: SPOT 3=5.07/31046;
DR Swiss-Prot; P02760; HC_HUMAN.
//
ID CRP_HUMAN; PRELIMINARY; 2DG.
AC P02741;
DE C-reactive protein precursor.
IM PLASMA.
RN [1]
RP MAPPING ON GEL.
RA Anderson N.L.;
RL Personal Communication(1993).
CC -!- SUBUNIT: HOMOPENTAMER.
2D -!- MASTER: PLASMA;
2D -!- PI/MW: SPOT 999=5.12/23908;
DR Swiss-Prot; P02741; CRP_HUMAN.
DR SWISS-2DPAGE; P02741; CRP_HUMAN.
//
Notes:
The image names used in the existing.maps file, in the IM line and in the '2D -!- MASTER' line should be exactly the same (e.g. PLASMA). Consequently, the image names should be upper-cased, and they should not contain any characters other than letters, underscores, digits and '-'.
The database text file is structured to be readable by humans as well as by computer programs. Each line describing an entry begins with a two-character line code, which indicates the type of data contained in the line. The remaining part of the line should follow the given rules, otherwise the conversion will not work properly (errors are signaled). In particular, when the lines described extensively below are provided, their structure should be strictly respected:
The ID line (optional):
The ID (IDentification) line is the first line of an entry. The general form of the ID line is: ID Entry_Name; ENTRY_CLASS; 2DG.
ENTRY_CLASS and 2DG are optional.
If you omit the ID line, the AC value will also be taken as the ID Entry_Name, until the external data integration is performed on your data.
The AC line:
The AC (ACcession number) line lists the accession numbers associated with an entry. The accession numbers are separated by semicolons and the list is terminated by a semicolon. If necessary, more than one AC line is used. An example of an accession number line is shown below: AC P07237; P30037; P32079;
Entries have more than one accession number if they have been merged or split. For example, when two entries are merged into one, a new accession number goes at the start of the AC line, and those from the merged entries are listed after it. Similarly, if an existing entry is split into two or more entries, the original accession number list is retained in all the derived entries.
The DE line (optional):
The DE (DEscription) lines contain general descriptive information about the protein stored. This information is generally sufficient to identify the protein precisely. The format of the DE lines is: DE Description of my protein.
The description is given in ordinary English and is free-text. In some cases, more than one DE line is necessary; the text is then divided only between words, and only the last DE line is terminated by a period.
The IM line (optional):
The IM (IMages) line lists the 2-D PAGE images which are associated with the entry. The images are separated by commas, and the list is terminated by a period. An example of an images line is shown here: IM LIVER, PLASMA. This line is no longer necessary, as the map names are read either from the 2D sections or from a given default value.
The RA lines (optional if a default bibliographic reference is defined in the configuration files):
The RA (Reference Author) lines list the authors of the paper (or any other type of work) cited. All of the authors are included, and are listed in the order given in the paper. The names are listed surname first, followed by a blank, followed by initial(s) with periods. The authors' names are separated by commas and terminated by a semicolon. Author names are not split between lines. An example of the use of RA lines is shown below:
RA Edwards J., Anderson N.G., Nance S.L.,
RA Anderson N.L.;
As many RA lines as necessary are included for each reference.
The DR lines (optional):
The DR (Database cross-Reference) lines are used as pointers to information related to an entry and found in other databases. The format of the DR line is: DR DATABASE; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER.
Examples of complete DR lines are shown here:
DR Swiss-Prot; P00352; DHAC_HUMAN.
DR ECO2DBASE; G052.0; 6TH EDITION.
DR HSC-2DPAGE; P47985; HUMAN.
DR YEPD; 4270; -.
The // line:
The // (terminator) line
contains no data or comments. It designates
the end of an entry.
For Make2D-DB II, many lines (ID, DE, DT, GN, OS, OC, OX, IM, RP, RX, RA, RL, RT, CC, DR...) are not explicitly required within your database text file. However, you may need to set up default values for some of them (DT, OS, OC, OX, RP, RA, RL) in the configuration file 'include.cfg' (see Readme: Configuration).
The IM and MA fields can be totally omitted from the database text file. When missing, the IM field is evaluated internally by reading the 2D -!- Master lines. If you define a Taxonomy ID value for one or more of your maps within your existing.maps file, then entries belonging to those maps will also adopt their TaxID (except when you force a specific species annotation for an individual entry by defining a specific OX field for it).
A different set of entries, forming the test database, is listed in this flat file (test.dat). The first entry, Z02760 (HC_HUMAN), is an extended entry. A "minimal entry" text (with the minimal required data) is shown within this test database (test.dat). It has the accession number "ZI|GI.MINIMAL" and contains only 3 types of lines: AC, 2D and DR. The tool then tries to add some missing values based on the given configuration files and on the external data extracted for the UniProtKB (Swiss-Prot or TrEMBL) entry given by the UniProtKB (Swiss-Prot or TrEMBL) DR cross-reference line.
Entry "P12345" has the very strict minimum required for an entry (one AC line and two 2D lines for 1 spot location). The tool recognizes that "P12345" is a UniProtKB/Swiss-Prot accession number and automatically cross-references the entry based on this identifier.
Compared to the original SWISS-2DPAGE manual, some syntax modifications of the 2D lines have been adopted by the tool to suit the need for more elaborate annotation of PMF lines (peptide mass fingerprinting) and MS/MS lines (tandem mass spectrometry) combined with peptide sequences. The rules are:
- All the standard syntax is still perfectly sufficient, e.g. for a PMF list:
The ParentPeptideMass and the ParentPeptideCharge are optional. If present, they are separated by a colon and given inside square brackets. If just one value is given, it is considered to be the parent charge. The syntax for the masses and their intensities is similar to the PMF syntax. A final period '.' is required at the end of the very last line of the section.
- For both "MASS SPECTROMETRY" and "PEPTIDE MASSES", you may use double colons to separate the experimental data (all peaks) from the peaks retained as significant for the identification (analysis), e.g.
2D -!- MASS SPECTROMETRY: [1200.7:1-] 869.468(3.09);524.448(2.67);635.708(3.17);712.129(1.2)::777.77(3.7);888.48(2.8);...
The left part is supposed to hold the values significant for the identification (analysis), while the right part lists all values, or just the additional values not retained for the identification.
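As an illustration of the bracketed parent values and of the double-colon separation, here is a minimal sketch. The function parse_ms_data and the keys it returns are hypothetical, and the regular expressions are only an assumption about how such a section could be read; they are not taken from the tool.

```python
import re

# A peak is written as mass(intensity), optionally with a space before '('.
PEAK_RE = re.compile(r"(\d+(?:\.\d+)?)\s*\(\s*(\d+(?:\.\d+)?)\s*\)")
# Optional parent values: [mass:charge], or [charge] when one value is given.
PARENT_RE = re.compile(r"\[\s*([^\]]+?)\s*\]")


def parse_peaks(text):
    """Return a list of (mass, intensity) tuples found in the text."""
    return [(float(mass), float(intensity))
            for mass, intensity in PEAK_RE.findall(text)]


def parse_ms_data(data):
    """Parse the text that follows '2D -!- MASS SPECTROMETRY:'."""
    parent_mass, parent_charge = None, None
    match = PARENT_RE.search(data)
    if match:
        parts = match.group(1).split(":")
        if len(parts) == 2:
            parent_mass, parent_charge = float(parts[0]), parts[1]
        else:
            # A single value is considered to be the parent charge.
            parent_charge = parts[0]
        data = data[match.end():]
    # Left of '::' are the significant peaks, right of it the other values.
    significant, _, others = data.partition("::")
    return {
        "parent_mass": parent_mass,
        "parent_charge": parent_charge,
        "significant": parse_peaks(significant),
        "others": parse_peaks(others),
    }
```

When no double colon is present, all peaks end up in the "significant" list and "others" stays empty.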
- You can include related local MS files to be displayed, or external URLs if the data is stored on the Web (e.g. in some repository). The keywords to use are: file, ident-file, uri and ident-uri. A colon separates each keyword from its value (a file path or a Web address).
file for a local MS file, e.g.
2D -!- MASS SPECTROMETRY: SPOT 89: [1723.9581:1+] 270.074448 (491.94);...; file:/some_path/msms.pkl.
ident-file for a local MS identification report (e.g. a Mascot report), e.g.
2D -!- MASS SPECTROMETRY: SPOT 89: [1723.9581:1+] 270.074448 (491.94);...; ident-file:/some_path/msIdentResults.dat.
uri
for a MS file located on the Web, e.g.
2D -!- MASS SPECTROMETRY: SPOT 89: [1723.9581:1+] 270.074448 (491.94);...; uri:http://www.ebi.ac.uk/pride/search.do?someID.
ident-uri
for a MS identification report located on the Web, e.g.
2D -!- MASS SPECTROMETRY: SPOT 89: [1723.9581:1+] 270.074448 (491.94);...; ident-uri:http://www.ebi.ac.uk/pride/search.do?someID.
All those document annotations are optional and may be combined in any order (separate document annotations with spaces). Remember to always terminate the section with a final period.
e.g.
2D -!- TANDEM MASS SPECTROMETRY: SPOT 111: [630.878:1+] 86.1001 (6.2857); 120.0644 (29.8095); 120.1283 (2.1905);
2D file:msms.pkl uri:http://www.ebi.ac.uk/pride/search.do?directLink=true&experimentAccessionNumber=1
2D ident-file:msIdentResults.dat
You will probably not need
to give any MASS_LIST when pointing to some file (as those files
should contain the peak list values themselves).
Nevertheless, you should still give an Enzyme Name when dealing with
"PEPTIDE MASSES" (PMF) data.
- A keyword telling Make2D-DB II that the identification is to be hidden from public access (by default this keyword is 'private') may be added between curly brackets before the final period.
2D -!- MASS SPECTROMETRY: [1723.9581:1+] 270.074448 (491.94);...; file:/some_path/msms.pkl {private}.
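The document keywords (file, ident-file, uri, ident-uri) and the privacy keyword can be picked out of a section with a couple of regular expressions. The sketch below is only an illustration: parse_ms_documents is a hypothetical name, and the sketch assumes that each document value contains no spaces and that the privacy keyword is separated from the preceding value by a space.

```python
import re

# keyword:value pairs; the look-behind keeps 'file' from matching in 'ident-file'.
DOC_RE = re.compile(r"(?<![\w-])(ident-file|ident-uri|file|uri):(\S+)")
# A keyword between curly brackets, e.g. {private}.
FLAG_RE = re.compile(r"\{\s*(\w+)\s*\}")


def parse_ms_documents(section, private_keyword="private"):
    """Return (documents, is_private) for one mass spectrometry section."""
    text = section.strip()
    if text.endswith("."):
        text = text[:-1]  # drop the final period that terminates the section
    flag = FLAG_RE.search(text)
    is_private = bool(flag and flag.group(1) == private_keyword)
    documents = {}
    for keyword, value in DOC_RE.findall(text):
        documents.setdefault(keyword, []).append(value)
    return documents, is_private
```

The documents are returned grouped by keyword, so that several files or URLs given for the same section are all kept.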
- When listing several "MASS SPECTROMETRY" lists and their corresponding identified "PEPTIDE SEQUENCES", the correspondence between the MS data sections and the identified peptides sections follows the order in which they are given, e.g. the first "PEPTIDE SEQUENCES" list corresponds to the first "MASS SPECTROMETRY" list, and so on.
- The mapping (identification) methods use a controlled vocabulary, defined in the %mapping_methods_description list inside the editable basic_include.pl main configuration file. You may redefine or add your own mapping methods within this list (contact us if any help is needed).
By using the spreadsheets mode, users can work with a large range of pre-defined annotations, but also with any number of their own free-text annotations.
The spreadsheets mode means any text report with fields separated by tabs (tab-delimited files/CSV). Those are, for example, spreadsheet software exports (e.g. EXCEL) into text files. When you export such reports, make sure to select the tab as your delimiter! As they are simple text files, it is also possible to write such reports manually in any text editor, taking care to separate fields with tabs and to save in simple text format.
You instruct Make2D-DB II to work in the spreadsheets mode by defining, in your config.cfg configuration file, the $db_file variable (the flat file name) to be empty ($db_file = "") and by setting the $Melanie variable to 0 ($Melanie = 0).
You should provide a separate report file for each of your maps. The report file name should be written exactly like the gel name you have given in your existing.maps configuration file. You should always use the extension '.txt' (e.g. PLASMA.txt, PLASMA2.txt or PLASMA3.txt).
The first line of your report should contain the headers for the various columns. Those headers are used by the tool to determine the annotation category of each column. Headers can follow any order in your report, except for the very first header, which always has to be the "SPOT" header.
Do not duplicate any header; instead, check below, for each header category, how to separate different elements.
All headers will be upper-cased by the tool. They may or may not be enclosed in double quotes.
There are three main categories of headers:
The mandatory headers: those are required headers; the tool will complain if they are missing, if the values in the columns are not defined or if they are syntactically incorrect.
The pre-defined headers: those are optional headers; the tool will only complain if the values do not follow the expected syntax.
The free-text headers: those are defined by the user and fall into 2 different classes, the "2D" and the "COMMENT" class; no syntax check is applied.
The mandatory headers
There are four required headers. They
are:
"SPOT"
header: The column for this header should be the first column to be
defined. It contains the spot ID. You may use any single word for the
values (e.g. 900 or 2D-TWX222).
"X"
header: This is the x-coordinate of the spot on the gel image (the
width value) in pixels.
Values should be positive or 0.
"Y"
header: This is the y-coordinate of the spot on the gel image (the
height value) in pixels.
Values should be positive or 0.
"MW"
header: The apparent molecular weight of the spot on the gel. Values
are given in Dalton*. Use only integer numbers.
Only one single value per data line is admitted for these headers.
You may have several lines
with the same spot ID. This is useful when you want to include several
annotations for the same spot (for example when you have several identified
proteins for the same spot, or several independent MS
analyses). When a spot is listed more than once, its X/Y
coordinates, as well as its pI/Mw values, are only retained from their
last occurrence. It is also not necessary to repeat the X, Y, MW
and PI values for a spot once they have been given in a preceding line (cf. PLASMA2.txt).
The origin to evaluate the X/Y positions is the top-left corner of the
image.
*You may also give MW values
in kDa. The tool will assume they are in kDa if their values are low
enough not to be in Dalton (e.g. 20.5).
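The handling of repeated spot lines can be pictured with a short sketch. It is an illustration under stated assumptions, not the tool's implementation: the function name collect_spot_positions is hypothetical, rows are assumed to be dictionaries keyed by upper-cased header, and an empty cell is assumed to leave an earlier value untouched.

```python
POSITION_FIELDS = ("X", "Y", "MW", "PI")

def collect_spot_positions(rows):
    """Keep, for each spot ID, the last non-empty X, Y, MW and PI values."""
    spots = {}
    for row in rows:
        spot = spots.setdefault(row["SPOT"], {})
        for field in POSITION_FIELDS:
            value = (row.get(field) or "").strip()
            if value:
                # A later line overrides what an earlier line gave.
                spot[field] = value
    return spots
```

With this logic, a second line for the same spot only needs to carry the new annotation (for instance another accession number), and a later line that repeats a coordinate replaces the earlier one.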
The pre-defined headers
Those headers have a special
definition. They are optional, but their values are restricted to an
associated syntax. You may use any combination of them, in any order, without duplicating any of them:
"PI"
header: The apparent pI of the spot on the gel. If this
column is not present, the gel is assumed to be an SDS gel (bands). Otherwise,
define a positive value starting from 0 (use real numbers, e.g. 7.443).
The tool expects to find a defined value for all spots or no value at
all for any of them.
"AC"
header: This is the column to hold the identified protein
accession numbers (if known). Give a Swiss-Prot (UniProtKB) accession
number for best results. Leave blank if no protein has been identified.
When several proteins are identified for the same spot, write an
independent line for each of them (e.g. spot 397 from the PLASMA.txt report).
"MAPPING
METHODS" header: You may use this column to list the
different mapping (identification) methods used for the spot's
identification. The
mapping methods are vocabulary controlled and are defined in the
editable
basic_include.pl
main configuration file inside the
%mapping_methods_description list. You may use here the keywords
separated by commas (e.g. "MS/MS, Gm, Co" to display 'Tandem mass
spectrometry', 'Gel matching' and 'Comigration' within your entries).
You may redefine or add your own mapping / identification methods
within this
list (contact us if any help is needed).
"OD"
header (alias "%OD"):
Relative optical densities (%Od)
are listed here. Values range from 0.0 up to 100.0 (use real numbers,
e.g. 0.32112).
"VOL"
header (alias "%VOL"):
Relative volumes (%Vol) are
listed here. Values range from 0.0 up to 100.0 (use real numbers, e.g.
0.32112).
"AMINO
ACID" header: This column is used to list the experimental
analysis results by amino acid composition. The syntax follows the one
shown in the
SWISS-2DPAGE 2D lines manual for the "AMINO ACID COMPOSITION"
"PMF"
header: Peptide fingerprinting peak lists are listed here and
basically follow the
SWISS-2DPAGE 2D lines manual syntax for "PEPTIDE MASSES". You may
also include the intensities of the peaks, following the intensity rule and the ident data rule given in the previous
section.
"MS"
header (alias "MS/MS"
or "MASS
SPECTROMETRY"): Tandem mass
spectrometry peak lists are listed here and follow the Mass Spectrometry rule, as well as the intensity rule
and the ident data
rule given in the previous section.
"PMF FILE"
header: Instead of listing your PMF peak lists yourself,
you may just give the absolute
or relative path for your local PMF experimental data file (e.g. a pmf.dta file) in
this
column. The tool will execute the appropriate conversion over your
files
to include their content within your database.
"MS FILE"
header: Instead of listing your tandem MS peak lists
yourself, you may just give the absolute or relative path
for your local MS experimental data file (e.g. a msms.mgf file) in this column. The tool
will execute the appropriate conversion over your files to include
their
content within your database. The tool usually relies on the file
extension to "guess" its format. Depending on the format
you are using, you may need to explicitly tell Make2D-DB II which format is used.
Read the note entitled "Input formats for MS/MS" below
for more details.
"PMF URI"
header: Here you can give a URL (a URI) pointing
to your experimental data, so that it can be viewed, if the latter is stored in some repository
(e.g. PRIDE) or is accessible from the Web. You can still populate the
column "PMF" with peak list data if you wish to.
"MS URI"
header: Here you can give a URL (a URI) pointing to
your
experimental data if the latter is stored in some repository (e.g. PRIDE) or is
accessible from the Web. You can still populate the column "MS" with
peak list data if you wish to.
"PMF
IDENT-FILE" header: PMF Analysis documents/reports can be given
here (e.g. a Mascot search report) when they are
present. Give an absolute or relative path for your local files.
"MS
IDENT-FILE" header: MS Analysis documents/reports can be given
here (e.g. a PSI AnalysisXML or a Phenyx search report) when they are
present. Give an absolute or relative path for your local files.
"PMF
IDENT-URI" header: Like the "PMF URI" header, you may give
URLs pointing to some repository or any Web location where your PMF
identification/analysis report may be viewed.
"MS
IDENT-URI" header: Like the "MS URI" header, you may give
URLs pointing to some repository or any Web location where your
MS identification/analysis report may be viewed.
"PEPTIDES"
header: The peptides are the identified peptide sequences related to
the MS/MS data. The syntax follows exactly the one given in the
SWISS-2DPAGE 2D lines manual for "PEPTIDE SEQUENCES".
Input formats for MS/MS: idj, mzdata, mzxml,
btdx, dta, mgf, peptMatches, pkl. The tool relies on the
extension of the given file to "guess" its format. When
dealing with the PSI mzData or mzXML formats (which both usually have the
extension .xml), you should specify the format by giving the format
name, followed by a colon, before the path to your file, e.g. "mzdata:/some_path/my_MS_file.xml" or "mzxml:/some_path/my_MS_file.xml".
A format prefix is also perfectly fine for files whose extension already
matches their format, which means that "/some_path/some_MS_file.pkl"
and "pkl:/some_path/some_MS_file.pkl"
are both correct.
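The format-guessing rule described above can be sketched as follows. This is only an illustration of the rule as stated here, not the code used by the tool: the function name resolve_ms_format is hypothetical, and treating format names as case-insensitive is an assumption of this sketch.

```python
import os

# Format names as listed above, lower-cased for comparison (an assumption).
KNOWN_FORMATS = {"idj", "mzdata", "mzxml", "btdx", "dta", "mgf", "peptmatches", "pkl"}

def resolve_ms_format(entry):
    """Return (format, path); format is None when it cannot be determined."""
    prefix, colon, rest = entry.partition(":")
    if colon and prefix.lower() in KNOWN_FORMATS:
        # Explicit "format:path" notation takes precedence.
        return prefix.lower(), rest
    extension = os.path.splitext(entry)[1].lstrip(".").lower()
    return (extension if extension in KNOWN_FORMATS else None), entry
```

A path ending in '.xml' with no prefix yields no format in this sketch, which is exactly the case where the explicit "mzdata:" or "mzxml:" prefix is needed.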
The file report PLASMA2.txt
gives many examples of MS annotations.
Listing several PMF/MS files or URIs:
In order to list more than one element under the PMF/MS file and URI categories (headers 9 to 16), simply separate them by spaces.
To ensure correspondence between elements across different categories (e.g. between analysis and identification files), list them in the same order in the different columns.
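The positional correspondence between columns can be pictured with a small sketch. It assumes two cells taken from the same data line (for example "MS FILE" and "MS IDENT-FILE"); the function name pair_files and the file names used with it are hypothetical, and how the tool itself treats lists of unequal length is not specified here.

```python
from itertools import zip_longest

def pair_files(data_cell, ident_cell):
    """Pair space-separated entries of two cells by their position."""
    data_files = data_cell.split()
    ident_files = ident_cell.split()
    # Entries without a counterpart are paired with None.
    return list(zip_longest(data_files, ident_files))
```

The first data file is matched with the first identification file, the second with the second, and so on, which is why the order must be kept consistent between columns.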
"REFERENCE"
header: If you list your bibliographic references, following the SWISS-2DPAGE
format, in a separate file called 'reference.txt'
in your data
directory (example), you can give
in this column the reference numbers related to each entry. Several
references can be given, separated by commas (e.g. 1,2,8); see PLASMA2.txt for an example. Remember
that the RP, RA (or RG) and RL lines - respectively the 'Reference
Position', the 'Reference Author' (or the 'Reference Group') and the
'Reference
Location' - must be defined in all references. All the other lines are
optional (and there is no need for an RN line).
"XREF"
header (alias "CROSS-REFERENCES"):
If a protein has been identified for your spot, you may list here as
many cross-references to external resources as you wish. The syntax to
follow is "Xref_Database
ID1 & Xref_Database ID1; ID2 & ..." (e.g. "Swiss-Prot P04040 & SWISS-2DPAGE P04040").
A large collection of cross-references is
automatically integrated, with no need to define anything for the XREF
field, only if your main accession number is already a UniProtKB (Swiss-Prot
or
TrEMBL) identifier. On the other hand, if your identifier is not a UniProtKB AC, you
may find it very useful to define here a cross-reference to UniProtKB
(Swiss-Prot or TrEMBL) to activate external data retrieval related to
UniProtKB. For more information on the cross-reference database
list available with this tool, see cross-references.
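The XREF syntax can be read as a list of database entries separated by '&', each one being a database name followed by one or more identifiers separated by ';'. The sketch below follows that reading; the function name parse_xref is hypothetical, and the assumption that a database name contains no space is made for this illustration only.

```python
def parse_xref(cell):
    """Parse 'Db ID1; ID2 & OtherDb ID1' into a list of (database, [ids])."""
    xrefs = []
    for item in cell.split("&"):
        item = item.strip()
        if not item:
            continue
        # Assumption: the database name is the first space-delimited word.
        database, _, ids = item.partition(" ")
        id_list = [i.strip() for i in ids.split(";") if i.strip()]
        xrefs.append((database, id_list))
    return xrefs
```

Applied to the example above, the cell "Swiss-Prot P04040 & SWISS-2DPAGE P04040" gives two entries, one per database, each with the single identifier P04040.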
The free-text headers
You may include as many free-text
columns as you wish. Two distinct classes exist, though:
- The "COMMENT" class: Whenever your header begins with the
keyword "COMMENT:"
then it is considered a general comment related to the identified
protein (e.g. "COMMENT: SUBUNIT"
or "COMMENT: MISCELLANEOUS"
columns in PLASMA.txt).
No syntax
check is applied.
- The "2D" class: All the other free-text headers will fall into
this class. Those are considered as free-text 2D annotations. (e.g.
"PATHOLOGY LEVEL" or "EXPRESSION" columns in PLASMA3.txt). No syntax
check is applied.
A free 2D annotation is applied specifically to the spot it is given for.
A convenient way to apply a free 2D annotation to all the spots of a map at once is to precede the header name of the annotation by a star '*', e.g. "* EXPRESSION". If you would like to apply the annotation only to all the spots related to a particular protein, then precede the annotation itself by a star '*', and define the annotation for only one of the spots related to this protein, e.g. "* method not applicable on this protein".
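The way a header falls into one of the categories described in this section can be summarized in a short sketch. The header names are those listed above; the function names classify_header and applies_to_whole_map are hypothetical, and the tool's real logic may differ in its details.

```python
MANDATORY = {"SPOT", "X", "Y", "MW"}
PREDEFINED = {
    "PI", "AC", "MAPPING METHODS", "OD", "%OD", "VOL", "%VOL", "AMINO ACID",
    "PMF", "MS", "MS/MS", "MASS SPECTROMETRY", "PMF FILE", "MS FILE",
    "PMF URI", "MS URI", "PMF IDENT-FILE", "MS IDENT-FILE",
    "PMF IDENT-URI", "MS IDENT-URI", "PEPTIDES", "REFERENCE",
    "XREF", "CROSS-REFERENCES",
}

def classify_header(header):
    """Return the category of a header, following the rules of this section."""
    name = header.strip().upper()
    if name in MANDATORY:
        return "mandatory"
    if name in PREDEFINED:
        return "pre-defined"
    if name.startswith("COMMENT:"):
        return "free-text COMMENT"
    return "free-text 2D"  # every other header is a free 2D annotation

def applies_to_whole_map(header):
    """True when a free 2D header is preceded by a star."""
    return header.strip().startswith("*")
```

For instance, "COMMENT: SUBUNIT" is classified as a COMMENT header, while "PATHOLOGY LEVEL" and "* EXPRESSION" both fall into the 2D class, the latter applying to all spots of the map.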
For completeness, we should mention that the older format for
spreadsheets is still accepted by the tool. This older format is much
more restricted and does not support headers. It has two possible
syntaxes:
the short syntax (without identification annotations)
Spot X Y pI Mw [AC1 AC2 AC3]
and the long one (with
ordered identification annotations, e.g. PLASMA3_noheaders.txt)
Spot X Y pI Mw [AC1] [IdentMethod1,IdentMethod2,..] [PMF] [MS/MS] [AMINO ACID COMPOSITION] [%od] [%vol]
Based on your CSV reports, the tool will generate an
intermediate 'last_created_flat_file.dat'
file. You may then choose to
continue, or to interrupt the process of conversion. If you interrupt
the process, you will be able to manually edit the
'last_created_flat_file.dat'
now present in your data directory
if you wish to add more annotations or to change others. You should then save the edited
file under another name (e.g. newFlatFile.dat)
and set the flat file variable $db_file
to this new
file name (without any path) before resuming your installation. This will then switch you to the flat file mode.
Otherwise, continue to
proceed without interruption.
Make sure you
have the common
perl XML::Parser
and libxml-perl modules installed on your system (the tool will need to use the XML::Parser::PerlSAX
perl module). If not, ask your system
administrator to install a recent version.
By assigning an empty string to the $db_file variable in your
config.cfg
file and a positive value to the $Melanie
variable ($Melanie = 1), you are telling Make2D-DB II to work in the
Melanie/Image Master XML mode. The Make2D-DB MelanieXMLParser module
will consider
the name of the gel image file exported within the Melanie file to be
the gel name
to use (e.g. PLASMA or PLASMA2), so make sure before exporting your XML
files with Melanie that the name of the gel image file exported is
exactly written as you would like your gel to be called within the new
database (the tool will automatically truncate the path and the
extensions '.tif' or '.mel' from the gel file name). You may use as
many separate Melanie XML files as you wish; the tool parses all the files
it finds in the data
directory which have the extension '.xml'.
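The way a gel name is derived from the exported image file name can be sketched as below. This is an illustration of the rule stated above (path and '.tif' or '.mel' extension removed), not the MelanieXMLParser code: the function name gel_name_from_image is hypothetical, and matching the extensions case-insensitively is an assumption.

```python
import re

def gel_name_from_image(image_path):
    """Strip the directory part and a '.tif' or '.mel' extension."""
    base = re.split(r"[\\/]", image_path)[-1]  # accept both path separators
    for extension in (".tif", ".mel"):
        if base.lower().endswith(extension):
            return base[:-len(extension)]
    return base
```

An image exported as 'PLASMA.tif' from any directory would therefore give the gel name PLASMA, which should match the name you want to see in the new database.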
It is not strictly required to prepare
an
"existing.maps" file (but, if present, this
one will override the Melanie XML values). However, an "existing.maps" file
will give you the opportunity to attach many more annotations to your
maps.
Based on your Melanie XML exports, the tool will generate an
intermediate 'last_created_flat_file.dat'
file. You may then choose to
continue, or to interrupt the process of conversion. If you interrupt
the process, you will be able to manually edit the
'last_created_flat_file.dat'
now present in your data directory
if you wish to add more annotations or to change others. You should then save the edited
file under another name (e.g. newFlatFile.dat)
and set the flat file variable $db_file
to this new
file name (without any path) before resuming your installation. This will then switch you to the flat file mode.
Otherwise, continue to
proceed without interruption.
The Make2D-DB II tool lets you control
which data is displayed to public users, and which data is
restricted to administrators and privileged/private users. You may use
three
distinct files to control which of your entire gels are to be private,
which of your protein entries are to be private, and which of your spots'
experimental identification data and analyses are to be private:
The hiddenGels.txt
file: This file takes the list of gels to be hidden from public users.
The hiddenEntries.txt
file: This file takes the list of protein accession numbers to be
hidden from public users.
The hiddenSpots.txt
file: This file controls if an association between a spot and an
identified
protein is to be shown or not. It also controls if identification data from 'MS/MS' (tandem mass spectrometry), 'PMF' (peptide
mass fingerprinting) or 'Aa' (amino acid composition) are to be
displayed for public users or not.
You may generate those files yourself from scratch (comment lines
beginning
with a '#'
character are ignored) or use the master files in the readme
directory. Place them with their respective names
inside your data
directory. Those files will then be read by the tool and will be also
copied to your server directories, so you may decide at any moment later to modify
them to activate back some of your hidden data, or instead to make some
more data hidden. There is a section in the Web administration
interface explaining how to manage this task.
All those three master files can be found in the readme
directory. You may copy them to your data directory
and then edit them using any text editor. The three master files fully
describe the syntax to follow. Here are three examples of edited
files located in the data_test
directory: (hiddenGels.txt
example, hiddenEntries.txt
example and
hiddenSpots.txt example)
The administrator will always have full
access to private data. The administrator may also define a password that privileged
users have to provide to access such data. This password is configurable
within
the generated server configuration file 2d_include.pl.
A test dataset containing various data
source formats is included within this package in the data_test
directory. You should read the package content section from the Readme: Main page, which describes in
detail the content of both the data_test
directory and its sub-directory examples. You
may try different combinations of settings, for example using the
spreadsheets mode (1) (with PLASMA.txt and PLASMA2.txt), using
the
Melanie Export.xml file (2), or using the test.dat flat file (3)
combined with the text
reports (PLASMA_example_report.txt and PLASMA2_example_report.txt, to
be copied to data_test and renamed to PLASMA.txt and PLASMA2.txt). You
may also try the flat file mode combined with the Melanie XML export (4)
(both Export.xml and test.dat) as your source data. You may edit the
different hidden*.txt files to see the effect on the query
interface, and so on. To try those different approaches, you will
first have to prepare adequate configuration files, as described in the Readme: Configuration page.
(1) in include.cfg: set
$db_file = "" and $Melanie = 0;
(2) in include.cfg: set
$db_file = "" and $Melanie = 1;
(3) in include.cfg: set
$db_file = "test.dat" and $Melanie = 0;
(4) in include.cfg: set
$db_file = "test.dat" and $Melanie = 1;
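The four combinations of settings listed above can be summarized in a small decision sketch. It only restates those combinations; the function name input_mode and the returned labels are hypothetical and are not identifiers used by the tool.

```python
def input_mode(db_file, melanie):
    """Map the $db_file and $Melanie settings to the data sources being used."""
    if db_file:
        if melanie:
            return "flat file + Melanie XML export"
        return "flat file + text reports"
    if melanie:
        return "Melanie XML export"
    return "spreadsheets"
```

For example, an empty $db_file with $Melanie = 0 corresponds to the spreadsheets mode, while $db_file = "test.dat" with $Melanie = 1 combines the flat file with the Melanie XML export.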
You may also edit the file subtitle.html, which will be displayed in the Web interface as a subtitle section.
This file can be a simple text file or an HTML-tagged file (without
headers!), and may then contain any HTML tags, including images and
external links. Have also a look at the file references.txt, which lists some
bibliographic references cited from within the spreadsheet report
PLASMA2.txt.
The tool will always look for the presence of a file called "subtitle.html" in your 'data'
directory to include it as a subtitle in your Web interface. So, it is
a good place to write some description of your database, your
institution, to include some logos, and so on.
A file listing some URL links to
different database cross-references
(mainly
for the DR lines) is provided within this package (in the 'text' directory).
The file name is 'DbCrossRefs.txt' (this file is only
present if you allow the tool
to extract data from the Expasy server). Otherwise, the tool will use
the file called 'links.txt'.
You can let the tool use this file as it is, or choose to edit it
yourself to add or update URLs.
If you edit this file directly in the 'text' directory, the changes will apply
to all your subsequent installations, but your changes may not be
permanent (because the file is automatically updated by contacting the Expasy server).
It is recommended that you update this file specifically for one
installation by editing it, after your installation
is complete, from your Web server directory where it has been copied
(by
default the copy of this file should be found in '/www/var/cgi-bin/2d/inc/links.txt'
or similar).
See Readme:
Main for
more
details.