io.arff.arffread

Module Contents

Classes

ArffError()
ParseArffError()
MetaData(self,rel,attr) Small container to keep useful information on an ARFF dataset.

Functions

parse_type(attrtype) Given an arff attribute value (meta data), returns its type.
get_nominal(attribute) If attribute is nominal, returns a list of the values
read_data_list(ofile) Read each line of the iterable and put it in a list.
get_ndata(ofile) Read the whole file to get the number of data attributes.
maxnomlen(atrv) Given a string containing a nominal type definition, returns the string length of the longest component.
get_nom_val(atrv) Given a string containing a nominal type, returns a tuple of the possible values.
get_date_format(atrv)
go_data(ofile) Skip header.
tokenize_attribute(iterable,attribute) Parse a raw string in header (eg starts by @attribute).
tokenize_single_comma(val)
tokenize_single_wcomma(val)
read_header(ofile) Read the header of the iterable ofile.
safe_float(x) Given a string x, convert it to a float. If the stripped string is a ?, return NaN (missing value).
safe_nominal(value,pvalue)
safe_date(value,date_format,datetime_unit)
loadarff(f) Read an arff file.
_loadarff(ofile)
basic_stats(data)
print_attribute(name,tp,data)
test_weka(filename)
class ArffError
class ParseArffError
parse_type(attrtype)

Given an arff attribute value (meta data), returns its type.

Expect the value to be a name.

get_nominal(attribute)

If attribute is nominal, returns a list of the values

read_data_list(ofile)

Read each line of the iterable and put it in a list.

get_ndata(ofile)

Read the whole file to get the number of data attributes.

maxnomlen(atrv)

Given a string containing a nominal type definition, returns the string length of the longest component.

A nominal type is defined as something framed between braces ({}).

atrv : str
Nominal type definition
slen : int
length of longest component

maxnomlen("{floup, bouga, fl, ratata}") returns 6 (the length of ratata, the longest nominal value).

>>> maxnomlen("{floup, bouga, fl, ratata}")
6
get_nom_val(atrv)

Given a string containing a nominal type, returns a tuple of the possible values.

A nominal type is defined as something framed between braces ({}).

atrv : str
Nominal type definition
poss_vals : tuple
possible values
>>> get_nom_val("{floup, bouga, fl, ratata}")
('floup', 'bouga', 'fl', 'ratata')
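
As a rough illustration of what the two nominal helpers above compute, here is a minimal sketch. It is not the module's actual implementation: the names `split_nominal` and `longest_nominal` are hypothetical, and the sketch assumes that no value contains a comma or a quoted string.

```python
def split_nominal(atrv):
    """Hypothetical helper: split "{a, b, c}" into a tuple of stripped values."""
    # Keep only the text between the first "{" and the last "}".
    start = atrv.index("{")
    end = atrv.rindex("}")
    inner = atrv[start + 1:end]
    # Assumption: values are separated by plain commas, with no quoting.
    return tuple(value.strip() for value in inner.split(","))


def longest_nominal(atrv):
    """Hypothetical helper: length of the longest value in a nominal definition."""
    return max(len(value) for value in split_nominal(atrv))


values = split_nominal("{floup, bouga, fl, ratata}")
longest = longest_nominal("{floup, bouga, fl, ratata}")
print(values)
print(longest)
```

With the definition used in the examples above, `values` holds the four nominal values in order and `longest` is the length of ratata.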
get_date_format(atrv)
go_data(ofile)

Skip header.

The first next() call on the returned iterator yields the @data line.
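
The header-skipping behavior described for go_data can be sketched with `itertools.dropwhile`. This is only an illustration under assumptions: `skip_to_data` and `DATA_LINE` are hypothetical names, and the pattern assumes the data marker is `@data` at the start of a line, matched case-insensitively.

```python
import itertools
import re

# Assumed pattern: optional leading whitespace, then "@data", any letter case.
DATA_LINE = re.compile(r"^\s*@data", re.IGNORECASE)


def skip_to_data(lines):
    """Hypothetical helper: drop every line before the one starting with @data."""
    return itertools.dropwhile(lambda line: not DATA_LINE.match(line), iter(lines))


lines = ["@relation foo\n", "@attribute width numeric\n", "@DATA\n", "5.0\n"]
it = skip_to_data(lines)
first = next(it)   # the @data line itself
second = next(it)  # the first data row
print(repr(first), repr(second))
```

Because the @data line is the first item produced, a caller that wants only data rows would call next() once more before reading.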

tokenize_attribute(iterable, attribute)

Parse a raw string in header (eg starts by @attribute).

Given a raw string attribute, try to get the name and type of the attribute. Constraints:

  • The first line must start with @attribute (case insensitive; whitespace characters before @attribute are allowed).
  • Also works if the attribute is spread over multiple lines.
  • Also works if empty lines or comments appear in between.
attribute : str
the attribute string.
name : str
name of the attribute
value : str
value of the attribute
next : str
next line to be parsed

If attribute is a string defined in Python as r"floupi real", returns floupi as the name and real as the value.

>>> iterable = iter([0] * 10) # dummy iterator
>>> tokenize_attribute(iterable, r"@attribute floupi real")
('floupi', 'real', 0)

If attribute is r"'floupi 2' real", returns floupi 2 (without the quotes) as the name and real as the value.

>>> tokenize_attribute(iterable, r"  @attribute 'floupi 2' real   ")
('floupi 2', 'real', 0)
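
The name and type extraction described above can be approximated with one regular expression. This is a simplified sketch, not the module's parser: `split_attribute` and `ATTRIBUTE_LINE` are hypothetical names, only single-line declarations are handled, and the lookahead to the next line is left out.

```python
import re

# Assumed grammar: "@attribute", then a bare or single-quoted name, then the type.
ATTRIBUTE_LINE = re.compile(
    r"^\s*@attribute\s+(?:'(?P<quoted>[^']*)'|(?P<bare>\S+))\s+(?P<type>.+?)\s*$",
    re.IGNORECASE,
)


def split_attribute(line):
    """Hypothetical helper: return (name, type) for a single @attribute line."""
    match = ATTRIBUTE_LINE.match(line)
    if match is None:
        raise ValueError(f"not an @attribute line: {line!r}")
    # A quoted name may contain spaces; the quotes are not part of the name.
    quoted = match.group("quoted")
    name = quoted if quoted is not None else match.group("bare")
    return name, match.group("type")


plain = split_attribute("@attribute floupi real")
quoted = split_attribute("  @attribute 'floupi 2' real   ")
print(plain, quoted)
```

The quoted alternative is tried first so that a name containing spaces is captured whole rather than cut at the first space.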
tokenize_single_comma(val)
tokenize_single_wcomma(val)
read_header(ofile)

Read the header of the iterable ofile.

safe_float(x)

Given a string x, convert it to a float. If the stripped string is a ?, return NaN (missing value).

x : str
string to convert
f : float
where float can be nan
>>> safe_float('1')
1.0
>>> safe_float('1\n')
1.0
>>> safe_float('?\n')
nan
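
The conversion rule described for safe_float fits in a few lines. The function below is a hypothetical stand-in named `to_float_or_nan`, written from the description above rather than copied from the module.

```python
import math


def to_float_or_nan(text):
    """Hypothetical stand-in: convert text to float, mapping "?" to NaN."""
    stripped = text.strip()
    if stripped == "?":
        # ARFF marks a missing value with a question mark.
        return math.nan
    return float(stripped)


value = to_float_or_nan("1\n")
missing = to_float_or_nan("?\n")
print(value, missing)
```

NaN never compares equal to itself, so code that checks for missing values should use `math.isnan` (or `numpy.isnan` on arrays) instead of `==`.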
safe_nominal(value, pvalue)
safe_date(value, date_format, datetime_unit)
class MetaData(rel, attr)

Small container to keep useful information on an ARFF dataset.

Knows about attribute names and types.

data, meta = loadarff('iris.arff')
# This will print the attribute names of the iris.arff dataset
for i in meta:
    print(i)
# This works too
meta.names()
# Getting attribute type
types = meta.types()

Also maintains the list of attributes in order: iterating with for i in meta, where meta is an instance of MetaData, yields the attribute names in the order they were defined.

__init__(rel, attr)
__repr__()
__iter__()
__getitem__(key)
names()

Return the list of attribute names.

types()

Return the list of attribute types.

loadarff(f)

Read an arff file.

The data is returned as a record array, which can be accessed much like a dictionary of numpy arrays. For example, if one of the attributes is called ‘pressure’, then its first 10 data points can be accessed from the data record array like so: data['pressure'][0:10]

f : file-like or str
File-like object to read from, or filename to open.
data : record array
The data of the arff file, accessible by attribute names.
meta : MetaData
Contains information about the arff file such as name and type of attributes, the relation (name of the dataset), etc…
ParseArffError
This is raised if the given file is not ARFF-formatted.
NotImplementedError
The ARFF file has an attribute which is not supported yet.

This function should be able to read most arff files. Functionality not yet implemented includes:

  • date type attributes
  • string type attributes

It can read files with numeric and nominal attributes. It cannot read files with sparse data ({} in the file). However, this function can read files with missing data (? in the file), representing the data points as NaNs.

>>> from scipy.io import arff
>>> from io import StringIO
>>> content = """
... @relation foo
... @attribute width  numeric
... @attribute height numeric
... @attribute color  {red,green,blue,yellow,black}
... @data
... 5.0,3.25,blue
... 4.5,3.75,green
... 3.0,4.00,red
... """
>>> f = StringIO(content)
>>> data, meta = arff.loadarff(f)
>>> data
array([(5.0, 3.25, 'blue'), (4.5, 3.75, 'green'), (3.0, 4.0, 'red')],
      dtype=[('width', '<f8'), ('height', '<f8'), ('color', '|S6')])
>>> meta
Dataset: foo
	width's type is numeric
	height's type is numeric
	color's type is nominal, range is ('red', 'green', 'blue', 'yellow', 'black')
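
To complement the example above, the following sketch shows column access by attribute name and how a missing value (?) appears in the result. It assumes SciPy and NumPy are installed; the dataset contents are made up for illustration.

```python
from io import StringIO

from scipy.io import arff

# A made-up dataset with one missing numeric value ("?") in the second row.
content = "\n".join([
    "@relation foo",
    "@attribute width numeric",
    "@attribute height numeric",
    "@attribute color {red,green,blue}",
    "@data",
    "5.0,3.25,blue",
    "?,3.75,green",
    "",
])

data, meta = arff.loadarff(StringIO(content))

# Columns are addressed by attribute name, much like a dictionary of arrays.
widths = data["width"]
print(widths)

# The metadata keeps attribute names and types in declaration order.
print(meta.names())
print(meta.types())
```

The missing width is read back as NaN, so aggregate functions that ignore NaNs (for example `numpy.nanmean`) are a natural fit for such columns.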
_loadarff(ofile)
basic_stats(data)
print_attribute(name, tp, data)
test_weka(filename)