How to Import and Export Files in RapidMiner

Sample Image

Introduction

RapidMiner is one of the leading data mining software suites. With more than 400 data mining modules or operators, it is one of the most comprehensive and flexible data mining tools available. With over 10,000 downloads from SourceForge.net each month and more than 300,000 downloads in total, it is also one of the most widely used data mining tools. According to polls by the popular data mining web portal KDnuggets.com among several hundred data mining experts, RapidMiner was the most widely used open source data mining tool and among the top three data mining tools overall in 2007 and 2008.

RapidMiner supports all steps of the data mining process: data loading, pre-processing, visualization, interactive data mining process design and inspection, automated modeling, automated parameter and process optimization, automated feature construction and feature selection, evaluation, and deployment. RapidMiner can be used as a stand-alone program on the desktop with its graphical user interface (GUI), on a server via its command line version, or as a data mining engine for your own products and as a Java library for developers.

Background

For a recent BI project I needed a tool to migrate a SQL Server 2008 database, with all its data, to a PostgreSQL database. After working with several tools I chose RapidMiner for the task, and it proved to be an excellent fit for this purpose.

ETL

ETL stands for Extract, Transform, Load. For example, you might receive files or other data from vendors or other third parties which you need to manipulate in some way and then insert into your own database.
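
To make the idea concrete, the skeleton below shows roughly what such a pipeline looks like as a RapidMiner process: an import operator extracts the raw data, an operator in the middle transforms it, and an export operator loads the result into a database. This is a hand-written sketch, not XML exported from RapidMiner; the file path is just a placeholder, the operator class names (read_csv, select_attributes, write_database) and port names follow RapidMiner 5 conventions but may differ in your version, and the database connection and target table would be configured in the GUI. The rest of this article builds a similar process step by step.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true">
      <!-- Extract: read the incoming vendor file (the path is a placeholder) -->
      <operator activated="true" class="read_csv" compatibility="5.2.008" expanded="true" name="Read CSV">
        <parameter key="csv_file" value="C:\incoming\vendor_data.csv"/>
        <parameter key="column_separators" value=","/>
      </operator>
      <!-- Transform: keep only the columns you need (configure the filter in the GUI) -->
      <operator activated="true" class="select_attributes" compatibility="5.2.008" expanded="true" name="Select Attributes"/>
      <!-- Load: write the result into your own database
           (assign a database connection and target table in the GUI) -->
      <operator activated="true" class="write_database" compatibility="5.2.008" expanded="true" name="Write Database"/>
      <connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Write Database" to_port="input"/>
      <connect from_op="Write Database" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>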

What you Need

Microsoft® SQL Server® 2008 Express

 

SQL Server 2008 Express is a free edition of SQL Server that is an ideal data platform for learning and building desktop and small server applications, and for redistribution by ISVs. You can download it from http://www.microsoft.com/en-us/download/details.aspx?id=1695.

I assume you have installed Microsoft® SQL Server® 2008 Express. For further information you can visit http://msdn.microsoft.com/en-us/library/dd981045(v=sql.100).aspx.

AdventureWorks database

You can download it from http://msftdbprodsamples.codeplex.com/downloads/get/478218 and attach it:

Sample Image

RapidMiner

You can download it from http://sourceforge.net/projects/rapidminer/files/1.%20RapidMiner/5.2/rapidminer-5.2.008x32-install.exe/download.

I assume you have installed RapidMiner. For further information you can visit http://rapid-i.com/wiki/index.php?title=RapidMiner_Installation_Guide.

ETL Process

Step 1:

Create a new process from File->New.

Step 2:

Click Operators->Import->Data->Read Database.

Sample Image

Step 3:

Now we have to create a SQL Server connection for reading.

Sample Image

After creating the connection, assign it to the Read Database operator.

Sample Image

Step 4:

Click Build Query and select your table.

Sample Image

Step 5:

Click Operators->Export->Data->Write CSV. Select the Write CSV operator and set its parameters:

CSV file: Location of the file.

Column separator: Column separator character.

Connect the "out" port of "Read Database" to the "inp" port of "Write CSV".

Sample Image

Behind the scenes

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="370" width="614">
      <operator activated="true" class="read_database" compatibility="5.2.008" 
              expanded="true" height="60" name="Read Database" width="90" x="179" y="75">
        <parameter key="connection" value="MSSQLServer-AdventureWorks2008"/>
        <parameter key="query" value="SELECT *
FROM "HumanResources"."Employee""/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="write_csv" compatibility="5.2.008" 
             expanded="true" height="76" name="Write CSV" width="90" x="380" y="75">
        <parameter key="csv_file" value="C:\Users\Masud\Desktop\exportcsv.csv"/>
        <parameter key="column_separator" value=","/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Write CSV" to_port="input"/>
      <connect from_op="Write CSV" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Step 6:

Now press Run (or F11) and all the data will be exported to a CSV file at your chosen destination. The output will look like this:

Sample Image

Conclusion

RapidMiner Community Edition is an excellent tool for ETL.


Text Processing Tutorial with RapidMiner

     I know that a while back it was requested (on either Piazza or in class, can’t remember) that someone post a tutorial about how to process a text document in RapidMiner and no one posted back. In this tutorial, I will try to fulfill that request by showing how to tokenize and filter a document into its different words and then do a word count for each word in a text document (I am essentially showing how to do the same assignment in HW 2 (plus filtering) but through RapidMiner and not AWS).

1) I first downloaded my document (The Entire Works of Mark Twain) through Project Gutenberg's website as a text document. Save the document in a file on your computer.

2) Open RapidMiner and click "New Process". On the left-hand pane of your screen, there should be a tab that says "Operators" - this is where you can search and find all of the operators for RapidMiner and its extensions. By searching the Operators tab for "read", you should get an output like this:

There are multiple read operators depending on which file you have, and most of them work the same way. If you scroll down, there is a “Read Documents” operator. Select this operator and enter it into your Main Process window by dragging it. When you select the Read Documents operator in the Main Process window, you should see a file uploader in the right-hand pane.

Select the text file you want to use.

3) After you have chosen your file, make sure that the output port on the Read Documents operator is connected to the “res” node in your Main Process. Click the “play” button to check that your file has been received correctly. Switch to the results perspective by clicking the icon that looks like a display chart above the “Process” tab at the top of the Main Process pane. Click the “Document (Read Document)” tab. Your output text should look something like this depending on the file you have chosen to process:

4) Now we will move on to processing the document to get a list of its different words and their individual counts. Search the Operators list for "Process Documents". Drag this operator into the main pane the same way you did the "Read Documents" operator.
Double click the Process Documents operator to get inside it. This is where we will link operators together to take the entire text document and break it down into its word components. This consists of several operators that can be chosen by going into the Operator pane and looking at the Text Processing folder. You should see several more folders such as "Tokenization", "Extraction", "Filtering", "Stemming", "Transformation", and "Utility". These describe the kinds of operations you can apply to your document. The first thing that you would want to do to your document is to tokenize it. Tokenization creates a "bag of words" from the contents of your document, which allows you to do further filtering on it. Search for the "Tokenize" operator and drag it into the "Process Documents" process.
Connect the "doc" node of the process to the "doc" input node of the operator if it has not automatically connected already. Now we are ready to filter the bag of words. In the "Filtering" folder under the "Text Processing" operator folder, you can see the various filtering methods that you can apply to your process. For this example, I want to filter out certain words that don't really carry any meaning for the document itself (such as the words a, and, the, as, of, etc.); therefore, I will drag "Filter Stopwords (English)" into my process because my document is in English. Also, I want to filter out any remaining words that are less than three characters long. Select "Filter Tokens (by Length)" and set your parameters as desired (in this case, I want my minimum number of characters to be 3 and my maximum number of characters to be an arbitrarily large number, since I don't care about an upper bound). Connect the nodes of each subsequent operator accordingly as in the picture.

After filtering the bag of words by stopwords and length, I want to transform all of my words to lowercase, since the same word would otherwise be counted differently in uppercase vs. lowercase. Select the "Transform Cases" operator and drag it into the process.

5) Now that I have all the operators I need in my process for this example, I check all of my node connections and click the "Play" button to run my process. If all goes well, your output should look like this in the results view:

Congrats! You are now able to see a word list containing all the different words in your document, with each word's occurrence count next to it in the "Total Occurrences" column. If you do not get this output, make sure that all of your nodes are connected correctly and to ports of the right type. Some errors occur because the output at one node does not match the type expected at the input of the next operator. If you are still having trouble, please comment or check out the Rapid-I support forum.
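
For reference, here is roughly what the process XML might look like once the operators above are wired together, in the same spirit as the "Behind the scenes" listing earlier in this article. This is a hand-written sketch rather than XML exported from RapidMiner: the operator class names (text:read_document, text:process_documents, text:tokenize, text:filter_stopwords_english, text:filter_by_length, text:transform_cases), the port names, and the parameter keys follow the RapidMiner 5 Text Processing extension conventions but may differ slightly in your installed version, and the file path is just a placeholder. Treat it as a map of the connections, not as copy-paste-ready code.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true">
      <!-- Reads the plain-text file; the path below is a placeholder -->
      <operator activated="true" class="text:read_document" expanded="true" name="Read Document">
        <parameter key="file" value="C:\path\to\mark_twain.txt"/>
      </operator>
      <!-- Process Documents holds the token-level operators as a subprocess -->
      <operator activated="true" class="text:process_documents" expanded="true" name="Process Documents">
        <process expanded="true">
          <operator activated="true" class="text:tokenize" expanded="true" name="Tokenize"/>
          <operator activated="true" class="text:filter_stopwords_english" expanded="true" name="Filter Stopwords (English)"/>
          <operator activated="true" class="text:filter_by_length" expanded="true" name="Filter Tokens (by Length)">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="9999"/>
          </operator>
          <operator activated="true" class="text:transform_cases" expanded="true" name="Transform Cases"/>
          <!-- Chain: tokenize, drop stopwords, drop short tokens, lowercase -->
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_port="document 1"/>
        </process>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

If RapidMiner complains about an unknown operator or port when pasting something like this, the safest fix is to build the chain in the GUI as described above and then open the XML view; the generated XML will contain the exact class and port names for your extension version.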

 

How to export and import PostgreSQL data as CSV

This PostgreSQL tutorial is designed for intermediate level users who are trying to export stored data as a CSV file.  This style of exporting can be achieved using a few different COPY TO queries, which can help you separate out the valuable information required for each data export.  By completing this tutorial users will also gain the skills for importing CSV data directly into their PostgreSQL database using the COPY FROM SQL syntax.

 

The CSV file format can be much more useful than other spreadsheet formats such as Microsoft Excel (XLS) because CSV does not have the file length limitations that come with them. By exporting your PostgreSQL data into a CSV file you have the ability to store hundreds of thousands or even millions of rows of data, whereas the legacy XLS format can only store up to about 65,536 rows. Therefore, if you wish to export a larger data set (more than 65,000 records), all your information can be saved into a single file using a COPY TO query.

 

Before we can start exporting the data to a CSV file, we must make sure that we have set up a location on the Linux server that has the correct file permissions. Navigate to the pgsql/data folder on your system, e.g. /usr/local/pgsql/data.

 

cd /usr/local/pgsql/data

 

Enter the following commands to set up a new directory called csv, owned by the postgres user.

 

mkdir csv
chown postgres csv
chmod 755 csv
ls -la

 

Now that we have a csv directory owned by the postgres user, it has the necessary file permissions (drwxr-xr-x postgres root) required for exporting CSV files.

 

PostgreSQL COPY TO: Creating a CSV File

 

Export all data in a table

COPY tablename
TO '/usr/local/pgsql/data/csv/alldata_export.csv'

 

Export defined columns in a table

COPY tablename(column1, column2, column3, column4)
TO '/usr/local/pgsql/data/csv/definedcolumns_export.csv'

 

Export with defined delimiter

COPY tablename(column1, column2, column3, column4)
TO '/usr/local/pgsql/data/csv/defineddelimiter_export.csv'
WITH DELIMITER ','

 

Export with 1st row column name headers

COPY tablename(column1, column2, column3, column4)
TO '/usr/local/pgsql/data/csv/header_export.csv'
WITH DELIMITER ','
CSV HEADER

 

Export using a SELECT WHERE query

COPY (SELECT column1, column2, column3, column4
FROM tablename
WHERE column1 = 'value')
TO '/usr/local/pgsql/data/csv/selectwhere_export.csv'
WITH DELIMITER ','
CSV HEADER

 

Export using an ordered query

COPY (SELECT column1, column2, column3, column4
FROM tablename
WHERE column1 = 'value'
ORDER BY column4 DESC)
TO '/usr/local/pgsql/data/csv/ordered_export.csv'
WITH DELIMITER ','
CSV HEADER

 

Export maximum 100 records using a limit query

COPY (SELECT column1, column2, column3, column4
FROM tablename
WHERE column1 = 'value'
ORDER BY column4 DESC
LIMIT 100)
TO '/usr/local/pgsql/data/csv/limit_export.csv'
WITH DELIMITER ','
CSV HEADER

 

Note: if your pgsql/data directory is not located under /usr/local/pgsql/data please update the path according to your server setup.

 

PostgreSQL COPY FROM: Loading a CSV File into a Table

 

When you are trying to import data into PostgreSQL via a CSV file, a little preparation is required before you can execute the COPY FROM query. In order for the COPY FROM to succeed, we first have to create an empty table schema capable of storing the data. Download the sample csv data here and open it up. This data lists the 10 largest municipalities in British Columbia according to the 1996 population value.

It contains 4 columns:

gid prov name pop1996

 

Looking at the data you can determine the data types required for each column of data.  These data types are required when running the CREATE TABLE query.

 

CREATE TABLE bctop10 (gid numeric, prov character varying(2), "name" character varying(68), pop1996 numeric)

 

Now that the empty table schema is ready, you must upload the bctop10.csv file to your server and move the file into the csv directory.

 

mv bctop10.csv /usr/local/pgsql/data/csv

 

Once the file is in your csv directory you can import the data into your empty table schema using the following query.

 

COPY bctop10(gid, prov, "name", pop1996)
FROM '/usr/local/pgsql/data/csv/bctop10.csv'
WITH DELIMITER ','
CSV HEADER

 

Note: if your pgsql/data directory is not located under /usr/local/pgsql/data please update the path according to your server setup.

How to import data into RapidMiner

You just copy and paste this code into your RapidMiner XML view (remove the previous XML code), click the green check symbol, and switch back to the normal Process (diagram) view.

Code:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="145" width="279">
      <operator activated="true" class="read_csv" compatibility="5.2.008" expanded="true" height="60" name="Read CSV" width="90" x="179" y="75">
        <parameter key="csv_file" value="C:\iris.data"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations"/>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="att1.true.real.attribute"/>
          <parameter key="1" value="att2.true.real.attribute"/>
          <parameter key="2" value="att3.true.real.attribute"/>
          <parameter key="3" value="att4.true.real.attribute"/>
          <parameter key="4" value="att5.true.polynominal.attribute"/>
        </list>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
The second way is shown in these tutorial videos:

http://www.youtube.com/watch?v=cVjyJ9Ag0_0

http://rapid-i.com/videos/tutorials/02_rm5_data_import/rm5_data_import.html