File parsers


XML based file formats

Specs File Format

This file is used to parse XPS and ISS data from XML files from the SPECS program.

In this file format, the spectra (called regions) are contained in region groups inside the files. This structure is mirrored in the data structure below, where classes are provided for the three top-level objects:

Files -> Region Groups -> Regions

The parser is strict, in the sense that it will throw an exception if it encounters anything it does not understand. To change this behavior set the EXCEPTION_ON_UNHANDLED module variable to False.
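The strict versus lenient behavior can be pictured with the minimal sketch below. It is not the module's code: the helper handle_unhandled and the sample tag are hypothetical, and only the flag name EXCEPTION_ON_UNHANDLED is taken from the text above.

```python
import logging

# Module-level switch, named after the one described above
EXCEPTION_ON_UNHANDLED = True


def handle_unhandled(tag):
    """Hypothetical helper: raise in strict mode, otherwise log a warning"""
    message = 'Unhandled element: {}'.format(tag)
    if EXCEPTION_ON_UNHANDLED:
        raise ValueError(message)
    logging.warning(message)
    return None


# Strict mode: the unknown element stops the parsing
try:
    handle_unhandled('mystery_tag')
    strict_raised = False
except ValueError:
    strict_raised = True

# Lenient mode: the unknown element is only reported
EXCEPTION_ON_UNHANDLED = False
lenient_result = handle_unhandled('mystery_tag')
```

With the real module, the equivalent step is to assign False to PyExpLabSys.file_parsers.specs.EXCEPTION_ON_UNHANDLED before parsing a file.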

Usage examples

To use the file parser, simply pass the path of a data file to the top-level data structure and start using it:

from PyExpLabSys.file_parsers.specs import SpecsFile
import matplotlib.pyplot as plt

file_ = SpecsFile('path_to_my_xps_file.xml')
# Access the region groups by iteration
for region_group in file_:
    print('{} regions in region group: {}'.format(
        len(region_group), region_group.name))

# or by index
region_group = file_[0]

# And again access regions by iteration
for region in region_group:
    print('region: {}'.format(region.name))

# or by index
region = region_group[0]

# or you can search for them from the file level
region = file_.search_regions('Mo')[0]
print(region)
# NOTE the search_regions method returns a list of results, hence the
# direct indexing; search_regions_iter returns a generator instead

# From the regions, the x data can be accessed either as kinetic
# or binding energy (for XPS only) and the y data can be accessed
# as averages of the counts, either as pure count numbers or as
# counts per second. These options work independently of each
# other.

# counts as function of kinetic energy
plt.plot(region.x, region.y_avg_counts)
plt.show()

# cps as function of binding energy
plt.plot(region.x_be, region.y_avg_cps)
plt.show()

# Files also have a useful str representation that shows the hierarchy
print(file_)

Notes

The file format appears to be essentially a dump of a large low-level data structure from the implementation language. With an appropriate mapping of low-level data structure types to Python types (see the details below and in the simple_convert function), this data structure could have been mapped in its entirety to Python types. However, to provide a clearer data structure, a more object-oriented approach has been taken, in which the topmost data structures are implemented as classes. Inside these classes, the data is parsed into numpy arrays, and the remaining low-level data structures are parsed into Python data structures with the simple_convert function.

Module Documentation

PyExpLabSys.file_parsers.specs.simple_convert(element)[source]

Converts an XML data structure to pure Python types.

Parameters:element (xml.etree.ElementTree.Element) – The XML element to convert
Returns:A hierarchy of Python data structures
Return type:object

Simple element types are converted as follows:

XML type | Python type
string | str
ulong | long
double | float
boolean | bool
struct | dict
sequence | list

Arrays are converted to numpy arrays, wherein the type conversion is:

XML type | Python type
ulong | numpy.uint64
double | numpy.double

Besides these types there are a few special elements that have a custom conversion.

  • Enum elements are simply converted into their value, since enums are considered a program implementation detail whose information is not relevant for a data file parser
  • Any is skipped and replaced with its content
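
The conversion rules above can be illustrated with a minimal sketch. It is not the module's simple_convert: it assumes that the XML tag names equal the type names in the table, that struct members carry their key in a name attribute, and that an any element wraps exactly one child. Arrays are left out to keep the sketch free of numpy, and ulong maps to int because Python 3 has no long. The sample XML is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Scalar conversions from the table above (ulong maps to int on Python 3)
SCALARS = {
    'string': str,
    'ulong': int,
    'double': float,
    'boolean': lambda text: text.strip().lower() == 'true',
}


def convert(element):
    """Recursively convert an element into plain Python types"""
    tag = element.tag
    if tag in SCALARS:
        return SCALARS[tag](element.text or '')
    if tag == 'struct':
        # ASSUMPTION: struct members carry their key in a 'name' attribute
        return {child.get('name'): convert(child) for child in element}
    if tag == 'sequence':
        return [convert(child) for child in element]
    if tag == 'enum':
        # Enums collapse to their value
        return element.text
    if tag == 'any':
        # ASSUMPTION: 'any' wraps one child, which replaces it
        return convert(element[0])
    raise ValueError('Unhandled element type: {}'.format(tag))


# Made-up sample document, not taken from a real SPECS file
SAMPLE = """
<any>
  <struct>
    <string name="name">Mo3d</string>
    <ulong name="num_scans">4</ulong>
    <double name="dwell_time">0.1</double>
    <boolean name="active">true</boolean>
    <sequence name="tags"><string>a</string><string>b</string></sequence>
  </struct>
</any>
"""
converted = convert(ET.fromstring(SAMPLE.strip()))
```

The final raise mirrors the strict behavior described earlier: anything that is not recognized stops the conversion.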
class PyExpLabSys.file_parsers.specs.SpecsFile(filepath, encoding=None)[source]

Bases: list

This is the top structure for a parsed file which represents a list of RegionGroups

The class contains a ‘filepath’ attribute.

__init__(filepath, encoding=None)[source]

Parse the XML and initialize the internal variables

regions_iter

Returns an iterator over the regions

search_regions_iter(search_term)[source]

Returns a generator of search results for regions by name

Parameters:search_term (str) – The term to search for (case sensitively)
Returns:An iterator of matching regions
Return type:generator
search_regions(search_term)[source]

Returns a list of search results for regions by name

Parameters:search_term (str) – The term to search for (case sensitively)
Returns:A list of matching regions
Return type:list
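
The relation between the generator and list variants can be sketched as follows. The Region stand-in and the sample names are hypothetical, and the sketch assumes that a region matches when the search term is a substring of its name; the real matching rule may differ.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed region; only the name is needed here
Region = namedtuple('Region', ['name'])

# A file is a list of region groups, each of which is a list of regions
file_ = [
    [Region('Mo3d'), Region('O1s')],
    [Region('Mo3p'), Region('C1s')],
]


def search_regions_iter(groups, search_term):
    """Lazily yield regions whose name contains search_term"""
    for group in groups:
        for region in group:
            # ASSUMPTION: case sensitive substring match
            if search_term in region.name:
                yield region


def search_regions(groups, search_term):
    """Eager variant: collect all matches in a list"""
    return list(search_regions_iter(groups, search_term))


matches = search_regions(file_, 'Mo')
first = next(search_regions_iter(file_, 'Mo'))
no_match = search_regions(file_, 'mo')
```

The generator variant is convenient when only the first match is needed, since the remaining regions are never inspected.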
unix_timestamp

Returns the unix timestamp of the first region

get_analysis_method()[source]

Returns the analysis method of the file

Raises:ValueError – If more than one analysis method is used
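
A minimal sketch of the check described above, assuming each region reports its analysis method as a string. The function takes a plain list here instead of a parsed file, and an empty input is not handled.

```python
def get_analysis_method(region_methods):
    """Return the single analysis method shared by all regions"""
    methods = set(region_methods)
    if len(methods) > 1:
        raise ValueError(
            'More than one analysis method: {}'.format(sorted(methods)))
    return methods.pop()


single = get_analysis_method(['XPS', 'XPS', 'XPS'])

# Mixed methods are rejected
try:
    get_analysis_method(['XPS', 'ISS'])
    mixed_raised = False
except ValueError:
    mixed_raised = True
```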
class PyExpLabSys.file_parsers.specs.RegionGroup(xml)[source]

Bases: list

Class that represents a region group, which consists of a list of regions

The class contains a ‘name’ and a ‘parameters’ attribute.

__init__(xml)[source]

Initializes the region group

Expects to find 3 subelements: the name, the regions and the parameters. Anything else raises an exception.

Parsing parameters is not supported; a warning is logged if any are present.

class PyExpLabSys.file_parsers.specs.Region(xml)[source]

Bases: object

Class that represents a region

The class contains attributes for the items listed in the ‘information_names’ class variable.

Some useful ones are:
  • name: The name of the region
  • region: Contains information such as dwell_time, analysis_method, scan_delta, excitation_energy etc.

All auxiliary information is also available from the ‘info’ attribute.

__init__(xml)[source]

Parse the XML and initialize internal variables

Parameters:xml (xml.etree.ElementTree.Element) – The region XML element
x

Returns the kinetic energy x-values as a Numpy array

x_be

Returns the binding energy x-values as a Numpy array
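
How the two energy axes relate can be sketched as below. This is an illustration only: it assumes the simple relation binding energy = excitation energy - kinetic energy without a work function correction, which may not be exactly what the property computes. The sample values are made up.

```python
def binding_energy(kinetic_energies, excitation_energy):
    """Convert kinetic energies (eV) to binding energies (eV)"""
    # ASSUMPTION: BE = excitation energy - KE, with no work function term
    return [excitation_energy - kinetic for kinetic in kinetic_energies]


# 1486.6 eV is approximately the Al K-alpha line, used here as sample input
x = [1250.0, 1250.5, 1251.0]
x_be = binding_energy(x, 1486.6)
```

Note that an increasing kinetic energy axis gives a decreasing binding energy axis.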

iter_cycles

Returns a generator of cycles

Each cycle is in itself a generator of lists of scans. To iterate over single scans do:

for cycle in self.iter_cycles:
    for scans in cycle:
        for scan in scans:
            print(scan)

or use iter_scans, which does just that.
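
The flattening from cycles to single scans can be sketched with plain generators. The nested sample data is hypothetical, scans are shown as plain lists instead of Numpy arrays, and the two names are written as functions here although they are attributes on the real class.

```python
from itertools import chain


def iter_cycles():
    """Hypothetical data: a generator of cycles, each a generator of lists of scans"""
    cycles = [
        [[[1, 2], [3, 4]], [[5, 6]]],  # cycle 0: two lists of scans
        [[[7, 8]]],                    # cycle 1: one list of scans
    ]
    for cycle in cycles:
        yield (scans for scans in cycle)


def iter_scans():
    """Flatten cycles -> lists of scans -> single scans"""
    for cycle in iter_cycles():
        # chain.from_iterable joins the lists of scans of one cycle
        for scan in chain.from_iterable(cycle):
            yield scan


all_scans = list(iter_scans())
```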

iter_scans

Returns a generator of single scans, which are themselves Numpy arrays

y_avg_counts

Returns the average counts as a Numpy array

y_avg_cps

Returns the average counts per second as a Numpy array
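
The two y-value variants can be illustrated with a small sketch. It assumes that the average is a point-by-point mean over all scans and that counts per second are obtained by dividing by the dwell time in seconds; both are assumptions about the properties, and the numbers are made up.

```python
def average_counts(scans):
    """Point-by-point mean over scans of equal length"""
    return [sum(points) / float(len(scans)) for points in zip(*scans)]


def counts_per_second(avg_counts, dwell_time):
    """ASSUMPTION: cps = average counts / dwell time in seconds"""
    return [value / dwell_time for value in avg_counts]


# Two made-up scans with three energy points each
scans = [[10, 20, 30], [14, 22, 26]]
y_avg_counts = average_counts(scans)
y_avg_cps = counts_per_second(y_avg_counts, 0.5)
```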

unix_timestamp

Returns the unix timestamp of the first cycle

exception PyExpLabSys.file_parsers.specs.NotXPSException[source]

Bases: exceptions.Exception

Exception for trying to interpret non-XPS data as XPS data

Binary File Formats

File parser for Chemstation files

Note

This file parser went through a large rewrite on ???, which changed the data structures of the resulting objects. This means that after upgrading it will be necessary to update code that uses it. The rewrite was done to fix some serious errors in the first version, such as relying on the Report.TXT file for injection summaries. These are now fetched from the more ordered CSV files.

exception PyExpLabSys.file_parsers.chemstation.NoInjections[source]

Bases: exceptions.Exception

Exception raised when there are no injections in the sequence

class PyExpLabSys.file_parsers.chemstation.Sequence(sequence_dir_path)[source]

Bases: object

The Sequence class for the Chemstation data format

Parameters:
  • injections (list) – List of Injection objects in this sequence
  • sequence_dir_path (str) – The path of this sequence directory
  • metadata (dict) – Dict of metadata
__init__(sequence_dir_path)[source]

Instantiate object properties

Parameters:sequence_dir_path (str) – The path of the sequence
full_sequence_dataset(column_names=None)[source]

Generate peak name specific dataset

This will collect area values for named peaks as a function of time over the different injections.

Parameters:column_names (dict) – A dict of the column names needed from the report lines. The dict should hold the keys: ‘peak_name’, ‘retention_time’ and ‘area’. It defaults to: column_names = {‘peak_name’: ‘Compound Name’, ‘retention_time’: ‘Retention Timemin’, ‘area’: ‘Area’}
Returns:Mapping of signal_and_peak names to their values
Return type:dict
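
The idea of collecting peak areas over the injections can be sketched as follows. Everything except the default column names is hypothetical: the shape of the injections data, the signal name, the key format of the result and the choice to skip unnamed peaks are assumptions made for illustration.

```python
from collections import defaultdict

# Default column names as documented above; 'retention_time' is part of
# the documented default but is not used in this sketch
COLUMN_NAMES = {
    'peak_name': 'Compound Name',
    'retention_time': 'Retention Timemin',
    'area': 'Area',
}

# Hypothetical injections: (time, {signal: [report lines]})
injections = [
    (0.0, {'FID1 A': [
        {'Compound Name': 'CO', 'Retention Timemin': 5.81, 'Area': 22.81},
    ]}),
    (60.0, {'FID1 A': [
        {'Compound Name': 'CO', 'Retention Timemin': 5.80, 'Area': 25.10},
        {'Compound Name': '', 'Retention Timemin': 7.20, 'Area': 1.00},
    ]}),
]


def sequence_dataset(injections, column_names=COLUMN_NAMES):
    """Collect (time, area) pairs per signal and named peak"""
    dataset = defaultdict(list)
    for time, reports in injections:
        for signal, lines in reports.items():
            for line in lines:
                peak_name = line[column_names['peak_name']]
                if not peak_name:
                    continue  # ASSUMPTION: unnamed peaks are skipped
                # ASSUMPTION: hypothetical 'signal - peak' key format
                key = '{} - {}'.format(signal, peak_name)
                dataset[key].append((time, line[column_names['area']]))
    return dict(dataset)


dataset = sequence_dataset(injections)
```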
class PyExpLabSys.file_parsers.chemstation.Injection(injection_dirpath, load_raw_spectra=True, read_report_txt=True)[source]

Bases: object

The Injection class for the Chemstation data format

Parameters:
  • injection_dirpath (str) – The path of the directory of this injection
  • reports (defaultdict) –

    Signal -> list_of_report_lines dict. Each report line is a dict of column headers to type-converted column content. E.g.:

    {u'Area': 22.81, u'Area %': 0.24, u'Height': 12.66,
     u'Peak Number': 1, u'Peak Type': u'BB', u'Peak Widthmin':
     0.027, u'Retention Timemin': 5.81}
    

    The column headers are also stored in metadata under the columns key.

  • reports_raw (defaultdict) – Same as reports except the content is not type converted.
  • metadata (dict) – Dict of metadata
  • raw_files (dict) – Mapping of ch_file_name -> CHFile objects
  • report_txt (str or None) – The content of the Report.TXT file from the injection folder, if any
__init__(injection_dirpath, load_raw_spectra=True, read_report_txt=True)[source]

Instantiate Injection object

Parameters:
  • injection_dirpath (str) – The path of the injection directory
  • load_raw_spectra (bool) – Whether to load raw spectra or not
  • read_report_txt (bool) – Whether to read and save the Report.TXT file
PyExpLabSys.file_parsers.chemstation.parse_utf16_string(file_, encoding=u'UTF16')[source]

Parse a pascal type UTF16 encoded string from a binary file object
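
A Pascal type string stores its length in front of the characters. The sketch below shows the idea under stated assumptions: one unsigned byte holds the number of characters, each character takes two bytes, and the data is little-endian (hence 'utf-16-le' instead of the documented default u'UTF16'). The actual layout in Chemstation files may differ, and the sample bytes are made up.

```python
import io
import struct


def parse_utf16_string(file_, encoding='utf-16-le'):
    """Read a length-prefixed (Pascal type) UTF-16 string"""
    # ASSUMPTION: one unsigned byte holds the number of characters
    (length,) = struct.unpack('<B', file_.read(1))
    # ASSUMPTION: each character occupies two bytes
    raw = file_.read(2 * length)
    return raw.decode(encoding)


# Made-up buffer: length byte, three UTF-16 characters, unrelated data
payload = 'mAU'.encode('utf-16-le')
buffer_ = io.BytesIO(struct.pack('<B', 3) + payload + b'trailing')
parsed = parse_utf16_string(buffer_)
remaining = buffer_.read()
```

Because the function reads exactly the prefixed number of bytes, the file position ends right after the string, so the following fields can be read in sequence.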

class PyExpLabSys.file_parsers.chemstation.CHFile(filepath)[source]

Bases: object

Class that implements the Agilent .ch file format version 179

Warning

Not all aspects of the file header are understood, so there may be, and probably is, information that is not parsed. See the method _parse_header_status() for an overview of which parts of the header are understood.

Note

Although the fundamental storage of the actual data has changed, much of the inspiration for parsing the header has been drawn from the parser in the ImportAgilent.m file in the chemplexity/chromatography project. All credit for the parts of the header parsing that could be reused goes to the author of that project.

values

The intensity values (y-values) of the spectrum. The unit for the values is given in metadata[‘units’]

Type:numpy.array
metadata

The extracted metadata

Type:dict
filepath

The filepath this object was loaded from

Type:str
__init__(filepath)[source]

Instantiate object

Parameters:filepath (str) – The path of the data file
times

The time values (x-value) for the data set in minutes
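
One way such a time axis can be built is sketched below, assuming that the header provides a start and an end time in minutes and that the values are evenly spaced between them. The function name, the arguments and the sample numbers are hypothetical; the class may compute its times differently.

```python
def time_axis(start_time, end_time, number_of_points):
    """Evenly spaced times from start_time to end_time, both included"""
    if number_of_points == 1:
        return [start_time]
    step = (end_time - start_time) / float(number_of_points - 1)
    return [start_time + index * step for index in range(number_of_points)]


# Made-up intensity values and a matching time axis in minutes
values = [0.1, 0.4, 0.9, 0.3, 0.1]
times = time_axis(0.0, 2.0, len(values))
```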