How to Import and Export Files in RapidMiner

Sample Image

Introduction

RapidMiner is one of the leading data mining software suites. With more than 400 data mining modules or operators, it is one of the most comprehensive and flexible data mining tools available. With over 10,000 downloads from SourceForge.net each month and more than 300,000 downloads in total, it is also one of the most widely used data mining tools. According to polls by the popular data mining web portal KDnuggets.com among several hundred data mining experts, RapidMiner was the most widely used open source data mining tool and among the top three data mining tools overall in 2007 and 2008.

RapidMiner supports every step of the data mining process: data loading, pre-processing, visualization, interactive data mining process design and inspection, automated modeling, automated parameter and process optimization, automated feature construction and feature selection, evaluation, and deployment. RapidMiner can be used as a stand-alone desktop program with its graphical user interface (GUI), on a server via its command line version, or as a data mining engine and Java library embedded in your own products.

Background

For a recent BI project I needed a tool to migrate a SQL Server 2008 database, with all its data, to a PostgreSQL database. After trying several tools I chose RapidMiner for the task, and it proved an excellent fit for this purpose.

ETL

ETL stands for Extract, Transform, Load. A typical example: you receive files or other data from vendors or other third parties, need to manipulate the data in some way, and then insert the result into your own database.
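As a minimal sketch of the Extract-Transform-Load pattern in Python's standard library: the sample data, column names, and the 10% raise are all invented for illustration, and SQLite stands in for whatever target database you actually use.

```python
import csv
import io
import sqlite3

# Extract: parse CSV data received from a third party (an in-memory sample here).
raw = "id,name,salary\n1,Alice,50000\n2,Bob,60000\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: normalize names and apply a hypothetical 10% raise.
for row in rows:
    row["name"] = row["name"].upper()
    row["salary"] = int(row["salary"]) * 11 // 10

# Load: insert the cleaned rows into the destination database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE employee (id INTEGER, name TEXT, salary INTEGER)")
db.executemany("INSERT INTO employee VALUES (:id, :name, :salary)", rows)
db.commit()
```

RapidMiner performs these same three stages with operators (Read Database for extract, intermediate operators for transform, Write CSV for load) instead of code.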

What You Need

Microsoft® SQL Server® 2008 Express

 

SQL Server 2008 Express is a free edition of SQL Server that is an ideal data platform for learning and building desktop and small server applications, and for redistribution by ISVs. You can download it from http://www.microsoft.com/en-us/download/details.aspx?id=1695.

I assume you have installed Microsoft® SQL Server® 2008 Express. For further information you can visit http://msdn.microsoft.com/en-us/library/dd981045(v=sql.100).aspx.

AdventureWorks database

You can download it from http://msftdbprodsamples.codeplex.com/downloads/get/478218 and attach it:

Sample Image

RapidMiner

You can download it from http://sourceforge.net/projects/rapidminer/files/1.%20RapidMiner/5.2/rapidminer-5.2.008x32-install.exe/download.

I assume you have installed RapidMiner. For further information you can visit http://rapid-i.com/wiki/index.php?title=RapidMiner_Installation_Guide.

ETL Process

Step 1:

Create a new process from File->New.

Step 2:

Click Operators->Import->Data->Read Database.

Sample Image

Step 3:

Now we have to create a SQL Server connection for reading.

Sample Image

After creating the connection, assign it to the Read Database operator.

Sample Image

Step 4:

Click Build Query and select your table.

Sample Image

Step 5:

Click Operators->Export->Data->Write CSV. Select the Write CSV operator and set its parameters:

CSV file: Location of the file.

Column separator: Column separator character.

Connect the out port of “Read Database” to the inp port of “Write CSV”.

Sample Image

Behind the scenes

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
    <process expanded="true" height="370" width="614">
      <operator activated="true" class="read_database" compatibility="5.2.008" 
              expanded="true" height="60" name="Read Database" width="90" x="179" y="75">
        <parameter key="connection" value="MSSQLServer-AdventureWorks2008"/>
        <parameter key="query" value="SELECT *
FROM &quot;HumanResources&quot;.&quot;Employee&quot;"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="write_csv" compatibility="5.2.008" 
             expanded="true" height="76" name="Write CSV" width="90" x="380" y="75">
        <parameter key="csv_file" value="C:\Users\Masud\Desktop\exportcsv.csv"/>
        <parameter key="column_separator" value=","/>
      </operator>
      <connect from_op="Read Database" from_port="output" to_op="Write CSV" to_port="input"/>
      <connect from_op="Write CSV" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Step 6:

Now press Run (or F11) and all the data will be exported to a CSV file at your destination. The output will look like this:

Sample Image

Conclusion

RapidMiner Community Edition is an excellent tool for ETL.


Text Processing Tutorial with RapidMiner

I know that a while back it was requested (on either Piazza or in class, I can’t remember which) that someone post a tutorial about how to process a text document in RapidMiner, and no one did. In this tutorial, I will try to fulfill that request by showing how to tokenize and filter a document into its different words and then do a word count for each word in a text document (essentially the same task as in HW 2, plus filtering, but done through RapidMiner rather than AWS).

1) I first downloaded my document (The Entire Works of Mark Twain) from Project Gutenberg’s website as a text document. Save the document in a folder on your computer.

2) Open RapidMiner and click “New Process”. On the left hand pane of your screen, there should be a tab that says “Operators”- this is where you can search and find all of the operators for RapidMiner and its extensions. By searching the Operators tab for “read”, you should get an output like this (you can double click on the images below to enlarge them):

There are multiple read operators depending on which file you have, and most of them work the same way. If you scroll down, there is a “Read Documents” operator. Select this operator and enter it into your Main Process window by dragging it. When you select the Read Documents operator in the Main Process window, you should see a file uploader in the right-hand pane.

Select the text file you want to use.

3) After you have chosen your file, make sure that the output port on the Read Documents operator is connected to the “res” node in your Main Process. Click the “play” button to check that your file has been received correctly. Switch to the results perspective by clicking the icon that looks like a display chart above the “Process” tab at the top of the Main Process pane. Click the “Document (Read Document)” tab. Your output text should look something like this depending on the file you have chosen to process:

4) Now we will move on to processing the document to get a list of its different words and their individual count. Search the Operators list for “Process Documents”. Drag this operator the same way as you did for the “Read Documents” operator into the main pane.
Double click the Process Documents operator to get inside the operator. This is where we will link operators together to take the entire text document and split it down into its word components. This consists of several operators that can be chosen by going into the Operator pane and looking at the Text Processing folder. You should see several more folders such as “Tokenization”, “Extraction”, “Filtering”, “Stemming”, “Transformation”, and “Utility”. These are some of the descriptions of what you can do to your document.  The first thing that you would want to do to your document is to tokenize it. Tokenization creates a “bag of words” that are contained in your document. This allows you to do further filtering on your document. Search for the “Tokenize” operator and drag it into the “Process Documents” process.
Connect the “doc” node of the process to the “doc” input node of the operator if it has not automatically connected already. Now we are ready to filter the bag of words. In “Filtering” folder under the “Text Processing” operator folder, you can see the various filtering methods that you can apply to your process. For this example, I want to filter certain words out of my document that don’t really have any meaning to the document itself (such as the words a, and, the, as, of, etc.); therefore, I will drag the “Filter Stopwords (English)” into my process because my document is in English. Also, I want to filter out any remaining words that are less than three characters. Select “Filter Tokens by Length” and set your parameters as desired (in this case, I want my min number of characters to be 3, and my max number of characters to be an arbitrarily large number since I don’t care about an upper bound). Connect the nodes of each subsequent operator accordingly as in the picture.

After I filtered the bag of words by stopwords and length, I want to transform all of my words to lowercase since the same word would be counted differently if it was in uppercase vs. lowercase. Select the operator “Transform Cases” and drag it into the process.
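The pipeline built above (tokenize, filter stopwords, filter tokens by length, transform cases, count) can be sketched in plain Python. The sample sentence and the tiny stopword set are stand-ins; RapidMiner bundles its own full English stopword list.

```python
import re
from collections import Counter

text = "The quick brown fox jumps over the lazy dog; the dog sleeps."

# Tokenize: split on non-letter characters into a "bag of words".
tokens = [t for t in re.split(r"[^A-Za-z]+", text) if t]

# Filter Stopwords: drop words with no meaning of their own
# (tiny illustrative list; the real operator uses a full one).
stopwords = {"a", "and", "the", "as", "of", "over"}
tokens = [t for t in tokens if t.lower() not in stopwords]

# Filter Tokens by Length: keep words of 3 or more characters.
tokens = [t for t in tokens if len(t) >= 3]

# Transform Cases: lowercase so "Dog" and "dog" count together.
tokens = [t.lower() for t in tokens]

# Word list with total occurrences, like the results view.
counts = Counter(tokens)
```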

5) Now that I have the sufficient operators in my process for this example, I check all of my node connections and click the “Play” button to run my process. If all goes well, your output should look like this in the results view:

Congrats! You are now able to see a word list containing all the different words in your document and their occurrence count next to it in the “Total Occurences” column. If you do not get this output, make sure that all of your nodes are connected correctly and also to the right type. Some errors are because your output at one node does not match the type expected at the input of the next node of an operator. If you are still having trouble, please comment or check out the Rapid-i support forum.

 

How to Export and Import CSV Data in PostgreSQL

This PostgreSQL tutorial is designed for intermediate-level users who want to export stored data as a CSV file.  This style of export can be achieved using a few different COPY TO queries, which help you separate out exactly the information required for each data export.  By completing this tutorial you will also gain the skills for importing CSV data directly into your PostgreSQL database using the COPY FROM SQL syntax.

 

The CSV file format can be much more useful than other spreadsheet formats such as Microsoft Excel (XLS) because CSV has no file length limitation.  By exporting your PostgreSQL data into a CSV file you can store hundreds of thousands or millions of rows of data, whereas XLS can only store up to ~65,000 rows.  Therefore, if you wish to export a larger data set (more than 65,000 records), all your information can be saved into a single file using a COPY TO query.

 

Before we can start exporting data to a CSV file, we must make sure that we have set up a location on the Linux server with the correct file permissions.  Navigate to the pgsql/data folder on your system, e.g. /usr/local/pgsql/data.

 

cd /usr/local/pgsql/data

 

Enter the following commands to set up a new directory called csv, owned by the postgres user.

 

mkdir csv
chown postgres csv
chmod 755 csv
ls -la

 

Now that we have a csv directory which is owned by the postgres user, you have the necessary file permissions drwxr-xr-x postgres root, which are required for exporting CSV files.

 

PostgreSQL COPY TO: Creating a CSV File

 

Export all data in a table

COPY tablename
TO '/usr/local/pgsql/data/csv/alldata_export.csv'

 

Export defined columns in a table

COPY tablename(column1, column2, column3, column4)
TO '/usr/local/pgsql/data/csv/definedcolumns_export.csv'

 

Export with defined delimiter

COPY tablename(column1, column2, column3, column4)
TO '/usr/local/pgsql/data/csv/defineddelimiter_export.csv'
WITH DELIMITER ','

 

Export with 1st row column name headers

COPY tablename(column1, column2, column3, column4)
TO '/usr/local/pgsql/data/csv/header_export.csv'
WITH DELIMITER ','
CSV HEADER

 

Export using a SELECT WHERE query

COPY (SELECT column1, column2, column3, column4
FROM tablename
WHERE column1 = 'value')
TO '/usr/local/pgsql/data/csv/selectwhere_export.csv'
WITH DELIMITER ','
CSV HEADER

 

Export using an ordered query

COPY (SELECT column1, column2, column3, column4
FROM tablename
WHERE column1 = 'value'
ORDER BY column4 DESC)
TO '/usr/local/pgsql/data/csv/ordered_export.csv'
WITH DELIMITER ','
CSV HEADER

 

Export maximum 100 records using a limit query

COPY (SELECT column1, column2, column3, column4
FROM tablename
WHERE column1 = 'value'
ORDER BY column4 DESC
LIMIT 100)
TO '/usr/local/pgsql/data/csv/limit_export.csv'
WITH DELIMITER ','
CSV HEADER
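COPY ... TO is PostgreSQL-specific syntax, but the pattern it expresses (filter, order, limit, then write a CSV with a header row) can be sketched in plain Python. SQLite stands in for the server here, and the tablename/column names are the same placeholders used in the queries above, not a real schema.

```python
import csv
import io
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tablename (column1 TEXT, column4 INTEGER)")
db.executemany("INSERT INTO tablename VALUES (?, ?)",
               [("value", 3), ("other", 9), ("value", 7)])

# Equivalent of:
#   COPY (SELECT ... WHERE ... ORDER BY ... LIMIT 100)
#   TO '...' WITH DELIMITER ',' CSV HEADER
cur = db.execute(
    "SELECT column1, column4 FROM tablename "
    "WHERE column1 = ? ORDER BY column4 DESC LIMIT 100", ("value",))

buf = io.StringIO()       # a real export would open a file on disk instead
writer = csv.writer(buf)  # the delimiter defaults to ','
writer.writerow([d[0] for d in cur.description])  # header row, like CSV HEADER
writer.writerows(cur)
```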

 

Note: if your pgsql/data directory is not located under /usr/local/pgsql/data please update the path according to your server setup.

 

PostgreSQL COPY FROM: Loading a CSV File into a Table

 

When you are trying to import data into PostgreSQL via CSV file, it requires a little preparation before you can execute the COPY FROM query.  In order for the COPY FROM to succeed, we have to first create an empty table schema capable of storing the data.  Download the sample csv data here and open it up.  This data lists the 10 largest municipalities in British Columbia according to the 1996 population value.

It contains 4 columns:

gid, prov, name, pop1996

 

Looking at the data you can determine the data types required for each column of data.  These data types are required when running the CREATE TABLE query.

 

CREATE TABLE bctop10 (gid numeric, prov character varying(2),"name" character varying(68), pop1996 numeric)

 

Now that the empty table schema is ready, you must upload the bctop10.csv file to your server and move the file into the csv directory.

 

mv bctop10.csv /usr/local/pgsql/data/csv

 

Once the file is in your csv directory you can import the data into your empty table schema using the following query.

 

COPY bctop10(gid, prov, "name", pop1996)
FROM '/usr/local/pgsql/data/csv/bctop10.csv'
WITH DELIMITER ','
CSV HEADER
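The same load can be sketched in plain Python, again with SQLite standing in for PostgreSQL; the two data rows below are illustrative stand-ins, not the real contents of bctop10.csv.

```python
import csv
import io
import sqlite3

# Stand-in for bctop10.csv: a header row plus two illustrative records.
sample = ("gid,prov,name,pop1996\n"
          "1,BC,Vancouver,514008\n"
          "2,BC,Surrey,304477\n")

db = sqlite3.connect(":memory:")
db.execute('CREATE TABLE bctop10 (gid NUMERIC, prov VARCHAR(2), '
           '"name" VARCHAR(68), pop1996 NUMERIC)')

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row, which is what CSV HEADER tells COPY FROM to do
db.executemany('INSERT INTO bctop10 (gid, prov, "name", pop1996) '
               'VALUES (?, ?, ?, ?)', reader)
db.commit()
```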

 

Note: if your pgsql/data directory is not located under /usr/local/pgsql/data please update the path according to your server setup.

Facebook and Mapping

facebook-high-res-friendship-world-map-paul-butler

 

Facebook intern Paul Butler has made a courageous attempt at revealing the world’s electronic social connections. It turns out we do not need geographers to help us decide where to draw country borders; some clever social scientists with Facebook data will do.

Instead of taking the whole (secret?) Facebook social graph consisting of about 500 million people, the data is based on a sample of “about ten million pairs” of friends, which is combined with their home location. From this information, Paul was able to calculate the relative strength between pairs of cities, which was then normalized by their relative distance. By drawing lines between these pairs on top of each other, and fine-tuning their brightness, a map of visually distinguishable countries and continents naturally appeared. In fact, the map might not be that different from a view on the Earth at night, one commenter at Mashable remarked.

“I began by taking a sample of about ten million pairs of friends from Apache Hive, our data warehouse. I combined that data with each user’s current city and summed the number of friends between each pair of cities. Then I merged the data with the longitude and latitude of each city.

 

At that point, I began exploring it in R, an open-source statistics environment. As a sanity check, I plotted points at some of the latitude and longitude coordinates. To my relief, what I saw was roughly an outline of the world. Next I erased the dots and plotted lines between the points. After a few minutes of rendering, a big white blob appeared in the center of the map. Some of the outer edges of the blob vaguely resembled the continents, but it was clear that I had too much data to get interesting results just by drawing lines. I thought that making the lines semi-transparent would do the trick, but I quickly realized that my graphing environment couldn’t handle enough shades of color for it to work the way I wanted.

 

Instead I found a way to simulate the effect I wanted. I defined weights for each pair of cities as a function of the Euclidean distance between them and the number of friends between them. Then I plotted lines between the pairs by weight, so that pairs of cities with the most friendships between them were drawn on top of the others. I used a color ramp from black to blue to white, with each line’s color depending on its weight. I also transformed some of the lines to wrap around the image, rather than spanning more than halfway around the world.”
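Butler does not publish his exact weight function, but the approach he describes can be sketched as follows; the functional form, toy coordinates, and friend counts are all assumptions made for illustration.

```python
import math

# (cityA, cityB) -> (lonA, latA, lonB, latB, friend_count); toy values.
pairs = {
    ("A", "B"): (0.0, 0.0, 3.0, 4.0, 500),    # close pair, many friendships
    ("A", "C"): (0.0, 0.0, 30.0, 40.0, 500),  # same friendships, 10x farther
    ("B", "C"): (3.0, 4.0, 30.0, 40.0, 50),
}

def weight(lon1, lat1, lon2, lat2, friends):
    """Assumed form: friendships discounted by Euclidean distance,
    so nearby, well-connected city pairs get the largest weights."""
    dist = math.hypot(lon2 - lon1, lat2 - lat1)
    return friends / (1.0 + dist)

# Draw order: lowest weight first, so the strongest ties are drawn
# last and end up on top, as Butler describes.
order = sorted(pairs, key=lambda p: weight(*pairs[p]))
```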

 

 

 

 

http://www.facebookstories.com/


“THERE’S A LOT OF BURIED TREASURE HERE”

“One of the most interesting / unexpected finds was seeing how closely tied Brazil is to Japan,” Facebook’s project lead Mandy Zibart says. “Brazilians are the third largest immigrant group to Japan… There’s a lot of buried treasure inside here.” When you click one of the multi-colored circles representing countries, related countries burst to life and enlarge based on how many friendships the two countries share. In many cases the results seem strange, so Facebook worked in coordination with an international relations researcher to develop “Closer Looks” for many of these relationships — hypotheses about why Nepal might have a very high number of friendships with Australia, for instance. In that case, Facebook says, the number of friendships could be due to the tens of thousands of student visas Australia has issued to Nepalese students within the last few years.

Facebook_stories_closer_look

Facebook offers a source linked below each hypothesis that substantiates the company’s claim. The default view for examining friendships is “by continent,” but you can also click “by language” to view the world color-coded by the primary language each country speaks. It’s easy to explore French colonization in this view, since you can see at a glance which countries speak primarily French.

facebookstories

 

Things that might affect how a neighborhood is defined

Discussions about the traits of strong downtowns and what makes them succeed usually focus on larger cities such as Vancouver, BC, Portland, OR, New York, NY or Charleston, SC. However, a lot can also be learned by looking at things on a smaller scale. This happened to the authors, when we recently looked at downtowns in two small Wisconsin communities. What we learned from them is applicable to many other communities of comparable size.

Our experiences in these two communities certainly confirmed that two basic and broadly held revitalization tenets are just as applicable to small communities as they are to large ones: the need for a comprehensive approach to downtown revitalization and the need to focus on leveraging existing assets. The focus here will be on three other topics that illustrate these tenets and deserve our attention:

  • The surprisingly complex economic development challenges that many small downtowns typically face
  • Providing jobs, especially in more rural areas, is a chronic and seemingly intractable problem
  • These small communities too often lack the resources and full range of professionals to initiate and manage broad economic changes.

In Sherwood, we again found an economy with numerous economic components and related markets that would have to be analyzed:

  •  Retail and restaurants
  •  Personal services
  •  Educational facilities
  •  A medical clinic
  •  A seniors’ home
  •  A high tech manufacturer

Complex Land Use and Transportation Issues. Even more surprising than the number of markets we had to investigate in Sherwood and the depth of the analyses they required were the complex land use and transportation issues that were hurting the downtown:

  • A high degree of dispersion that might be more readily expected in a larger, more urban community. Even with its small population, Sherwood has four commercial nodes including a growing highway node that intercepts a lot of residents before they reach the downtown and where significant new businesses want to locate, e.g. a supermarket, a childcare center, restaurants. There is really poor economic agglomeration, and in a small economy economic assets benefit even more from agglomeration
  • The downtown is “unfriendly” to pedestrians – it lacks “walkability.” It has significant traffic with lots of trucks. It lacks a solid building wall front and adequate parking spaces. Many of its businesses are closed to shoppers during the day
  • An inability to benefit from a nearby “captive market.” Access to an abutting popular state park was changed so visitors no longer had to drive through the downtown – or Sherwood
  • An underdeveloped local roadway system that does not bring residents in newer parts of town naturally to the downtown. Also, the State recently proposed a highway expansion through the heart of downtown, which would have demolished several businesses and undermined what little pedestrian activity currently exists.

In Village X, our team again found a number of complex land use and transportation issues to address. However, unlike Sherwood, which faces growing pains associated with exurban growth, Village X faces strong, complex and seemingly intractable challenges, characteristic of other small, often more rural communities and their downtowns:

  • Its region is sparsely populated and has little or no growth
  • The regional economy has long been problematic
  •  Attracting or creating firms that can provide new jobs is tough.

http://www.danth.com/category/commercial-nodes

Characteristics and Guidelines of Great Neighborhoods


http://jpprojectthree.blogspot.com/2013/01/define-liveable-neighbourhood.html

A neighborhood can be based on a specific plan or the result of a more organic process.

Neighborhoods of different kinds are eligible — downtown, urban, suburban, exurban, town, small village — but should have a definable sense of boundary.

Neighborhoods selected for a Great Neighborhood designation must be at least 10 years old.

Description of the Neighborhood

It is important to identify the geographic, demographic, and social characteristics of the neighborhood. Tell us about its location (i.e. urban, suburban, rural, etc.), density (i.e. dwelling units per acre), or street layout and connectivity; economic, social, and ethnic diversity; and functionality (i.e. residential, commercial, retail, etc.). We also want to know whether a plan or specific planning efforts contributed to or sustained the character of the neighborhood, or if the neighborhood formed more organically and not through a formal planning process.

Neighborhood Form and Composition

How does the neighborhood …

  • Capitalize on building design, scale, architecture, and proportionality to create interesting visual experiences, vistas, or other qualities?
  • Accommodate multiple users and provide access (via walking, bicycling, or public transit) to multiple destinations that serve its residents?
  • Foster social interaction and create a sense of community and neighborliness?
  • Promote security from crime and make streets safe for children and other users (i.e. traffic calming, other measures)?
  • Use, protect, and enhance the environment and natural features?

Neighborhood Character and Personality

How does the neighborhood …

  • Reflect the community’s local character and set itself apart from other neighborhoods?
  • Retain, interpret, and use local history to help create a sense of place?

Neighborhood Environment and Sustainable Practices

How does the neighborhood …

  • Promote or protect air and water quality, protect groundwater resources, and respond to the growing threat of climate change? What forms of “green infrastructure” are used (e.g., local tree cover mitigating heat gain)?
  • Utilize measures or practices to protect or enhance local biodiversity or the local environment?

Great Neighborhoods – Characteristics and Guidelines for Designation

A neighborhood can be based on a specific plan or the result of a more organic process. Neighborhoods of different kinds are eligible — downtown, urban, suburban, exurban, town, small village — but should have a definable sense of boundary. Neighborhoods selected for a Great Neighborhood designation must be at least 10 years old.

Characteristics of a Great Neighborhood include:

  1. Has a variety of functional attributes that contribute to a resident’s day-to-day living (i.e. residential, commercial, or mixed-uses).
  2. Accommodates multi-modal transportation (i.e. pedestrians, bicyclists, drivers).
  3. Has design and architectural features that are visually interesting.
  4. Encourages human contact and social activities.
  5. Promotes community involvement and maintains a secure environment.
  6. Promotes sustainability and responds to climatic demands.
  7. Has a memorable character.

Description of the Neighborhood

  1. When was the neighborhood first settled?
  2. Where is the neighborhood located: in a downtown, urban area, suburb, exurban area (i.e., on the fringes of a metropolitan area), village, or small town? What is the neighborhood’s approximate density (e.g., in dwelling units per acre, or other)?
  3. What is the neighborhood’s location, its physical extent, and layout?  What are the boundaries of the neighborhood? Are these boundaries formal, defined by an institution or jurisdiction (i.e., wards or other political boundaries, neighborhood associations, other entities) or is the neighborhood defined informally?
  4. How large a geographic area does the neighborhood encompass (number of blocks, acres, or other measurement)?
  5. What is the layout (e.g., grid, curvilinear) of the streets? Is there street connectivity; is it easy to get from one place to another by car, foot, or bike within or beyond the neighborhood without going far out of one’s way?
  6. What is the mix of residential, commercial, retail and other uses?
  7. What activities and facilities support everyday life (e.g., housing, schools, stores, parks, green space, businesses, churches, public or private facilities, common streets, transit, etc.)?
  8. Is there diversity amongst the residents, including economic, social, ethnic, and demographic? Describe the neighborhood’s homogeneity or heterogeneity in those terms.
  9. How has a plan or planning contributed to or sustained the character of the neighborhood? Or did the neighborhood form more organically and not through a formal planning process?

Guidelines for Great Neighborhoods

1.0 Neighborhood Form and Composition

1.1 Does the neighborhood have an easily discernable locale? What are its borders?

1.2 How is the neighborhood fitted to its natural setting and the surrounding environs?

1.3 What is the proximity between different places in the neighborhood? Are these places within walking or biking distance? Does walking or bicycling within the neighborhood serve multiple purposes? Describe (access to transit, parks, public spaces, shopping, schools, etc.). How are pedestrians and bicyclists accommodated (sidewalks, paths or trails, designated bike lanes, share-the-road signage, etc.)?

1.4 How does the neighborhood foster social interaction and promote human contact? How is a sense of community and neighborliness created?

1.5 Does the neighborhood promote security from crime, and is it perceived as safe? How are streets made safe for children and other users (e.g., traffic calming, other measures)?

1.6 Is there consistency of scale between buildings (i.e., are buildings proportional to one another)?

2.0 Neighborhood Character and Personality

2.1 What makes the neighborhood stand out? What makes it extraordinary or memorable? What elements, features, and details reflect the community’s local character and set the neighborhood apart from other neighborhoods?

2.2 Does the neighborhood provide interesting visual experiences, vistas, natural features, or other qualities?

2.3 How does the architecture of houses and other buildings create visual interest? Are the houses and buildings designed and scaled for pedestrians?

2.4 How is local history retained, interpreted, and used to help create a sense of place?

2.5 How has the neighborhood adapted to change? Include specific examples.

3.0  Neighborhood Environment and Sustainable Practices

3.1 How does the neighborhood respond to the growing threat of climate change? (e.g., local tree cover mitigating heat gain)?

3.2 How does the neighborhood promote or protect air and water quality, protect groundwater resources if present, and minimize or manage stormwater runoff? Is there any form of “green infrastructure”?

3.3 What measures or practices exist to protect or enhance local biodiversity or the local environment?