Showing posts with label R. Show all posts

Finding the last entry

Let's say we have a table like the one below, where the same product is run on the same machine more than once (re-work or re-processing).




What we are interested in is the last run.


Using the base command aggregate

laste_entry_max <- aggregate(RUN_Number ~ ID, last_entry, max)




By doing a merge you can subset the table to the rows of the last run.

laste_entry_max_all_c <- merge(laste_entry_max, last_entry, by = c('ID', 'RUN_Number'), all.x = TRUE)
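
As a minimal sketch of the two steps above, the following runs them on a small made-up table. Only the column names ID and RUN_Number come from the post; the Result column and all values are assumptions for illustration.

```r
# Hypothetical sample data: product A ran three times, B once, C twice
last_entry <- data.frame(
  ID         = c("A", "A", "A", "B", "C", "C"),
  RUN_Number = c(1, 2, 3, 1, 1, 2),
  Result     = c(10.1, 10.4, 9.8, 7.5, 3.3, 3.1)
)

# Step 1: highest run number per ID
laste_entry_max <- aggregate(RUN_Number ~ ID, last_entry, max)

# Step 2: merge back to recover the remaining columns of those rows
laste_entry_max_all_c <- merge(laste_entry_max, last_entry,
                               by = c("ID", "RUN_Number"), all.x = TRUE)
print(laste_entry_max_all_c)
```

The merge is needed because aggregate only returns ID and the maximum RUN_Number, not the other columns of the matching rows.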



If the run number information is not there we can use Time_Stamp
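
The post does not show code for the Time_Stamp case, so here is one possible sketch in base R. It swaps in a different technique (sort, then keep the first row per ID) rather than aggregate, and the data, the Result column and the POSIXct format of Time_Stamp are assumptions.

```r
# Hypothetical data without a run number; Time_Stamp marks when each run happened
last_entry <- data.frame(
  ID         = c("A", "A", "B", "C", "C"),
  Time_Stamp = as.POSIXct(c("2020-01-01 08:00:00", "2020-01-02 09:30:00",
                            "2020-01-01 10:00:00", "2020-01-03 07:15:00",
                            "2020-01-01 12:00:00"), tz = "UTC"),
  Result     = c(10.1, 9.8, 7.5, 3.1, 3.3)
)

# Sort so that the newest row of every ID comes first,
# then keep only the first row of each ID
sorted <- last_entry[order(last_entry$ID, -as.numeric(last_entry$Time_Stamp)), ]
latest <- sorted[!duplicated(sorted$ID), ]
print(latest)
```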


You can also use the dplyr package

library(dplyr)

laste_entry_max <- last_entry %>%
  group_by(ID) %>%
  slice(which.max(RUN_Number))



Handling big numbers in R

One of the issues with R is handling big (long) numbers. You can find very detailed explanations of the issue elsewhere. This post discusses it in a simple way, mainly how to work around it when you run into it.

A simple example

Let's say we need to read a text file containing numbers of varying length.


number
1234567891234567
123456789123456789
9988776655443322111111
54767686587697697696679565664656
2423546567899009052342313244568798978
1000000000000000006666666666660000000000000003333333333


Using read.csv

aa <- read.csv("C:\\data\\rblog\\number.txt")


What we get is this.



Checking the length....

nchar(aa$number)
[1] 16 18 20 20 20  5


The numbers are automatically changed to scientific formatting.

To avoid scientific formatting you can use,

options(scipen=999)






At first glance it seems to be OK, but if you look at the last digits you can see they have changed, except in the first row.


To avoid this we need to force the data to be read as character.

aa <- read.csv("C:\\data\\rblog\\number.txt",  colClasses= 'character')

Then we get.




If we force the character to convert to numeric,

aa$number <- as.numeric(aa$number)

The issue of changing last digits will reappear. 
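
The post does not spell out the cause, so the following is a small sketch based on the standard explanation: R stores numerics as 64-bit doubles, which can represent every integer exactly only up to 2^53 (about 9 x 10^15). The two test strings are taken from the sample file above.

```r
# R's numeric type is a 64-bit double; this is the number of significand bits
print(.Machine$double.digits)

# Above 2^53 neighbouring integers can no longer be told apart
print(2^53 == 2^53 + 1)

small <- "1234567891234567"     # 16 digits, below 2^53
big   <- "123456789123456789"   # 18 digits, above 2^53

# Round trip: character -> numeric -> character
small_ok <- identical(sprintf("%.0f", as.numeric(small)), small)
big_ok   <- identical(sprintf("%.0f", as.numeric(big)), big)
print(c(small = small_ok, big = big_ok))
```

This is consistent with the first row surviving the conversion while the longer numbers do not, and it is why keeping such values as character is the safe choice when they are identifiers rather than quantities.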





Working with JSON files

R may not be the best software for handling JSON files. But for those who insist on using R instead of Python or some other program, let's see how to work with JSON files.


Read JSON file

The example used here is from https://support.oneskyapp.com/hc/en-us/articles/208047697-JSON-sample-files

Let's use a simple JSON file.

{
    "fruit": "Apple",
    "size": "Large",
    "color": "Red"
}

In R, data is normally handled using data frames.
So let's convert the JSON to a data frame.
The package used to do that is 'jsonlite'.

The code
library(jsonlite)
# point to the location of the file
json_file1 <- "C:\\data\\rblog\\example_1.json"
# read the file using the fromJSON function
json_data1 <- fromJSON(json_file1, flatten = FALSE)





# convert to data frame
json_data_frame1 <- data.frame(json_data1)




More complicated JSON files cannot be handled by the same method.
Let's look at the second example.
json_file2 <-"C:\\data\\rblog\\example_2.json"
json_data2 <- fromJSON(json_file2, flatten = FALSE)




Up to this point there is no issue, but when we try to convert to a data frame, the following error occurs.
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 1, 0, 801, 684, 60

The reason is clear: the data contains lists of different lengths as well as a data frame.
But it is possible to create separate data frames.



For example,
json_data_frame3 <- data.frame(json_data2$afmlist)
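
Since example_2.json is not reproduced in the post, here is a self-contained sketch of the same situation with a small made-up JSON string. The field names name, items and tags are hypothetical; fromJSON is given the JSON text directly instead of a file path.

```r
library(jsonlite)

# Hypothetical nested JSON: a scalar, a list of records and a plain array
txt <- '{
  "name": "example",
  "items": [{"a": 1, "b": 2}, {"a": 3, "b": 4}],
  "tags": ["p", "q", "r"]
}'
parsed <- fromJSON(txt, flatten = FALSE)

# Converting the whole list fails because its elements have 1, 2 and 3 rows
whole <- tryCatch(data.frame(parsed), error = function(e) e)
print(inherits(whole, "error"))

# A single element that is already tabular converts without trouble
items_df <- data.frame(parsed$items)
print(items_df)
```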





Plotting in R Part II - Grid Lines

We will start from the end of Part I.

Recap:

The libraries and data set needed. 
Library: ggplot2, stringr
Dataset: iris (to be loaded from local drive)
IDE: RStudio

# First load the libraries (stringr provides str_wrap, used in the plots)
library(ggplot2)
library(stringr)

# Then set your working directory
setwd("C:/R_Train") # please replace with your own directory

# Read the file and create a data frame
tbl <- read.csv("Sample Data/iris.csv")


Get the csv file  from here.
data source 


Modify the grid lines

The grid can be modified individually using panel.grid.major.x,   panel.grid.major.y,  panel.grid.minor.x and panel.grid.minor.y.
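
The examples in this post repeat the full plot each time. As a sketch of an alternative, the common part can be stored in a variable and only the grid setting added afterwards. The small data frame below is a made-up stand-in for the iris csv, and the names tbl_demo, p, p_major_y and p_no_grid are chosen here for illustration.

```r
library(ggplot2)

# Small stand-in for the iris table used in the post (values are made up)
tbl_demo <- data.frame(
  Name        = c("Iris-setosa", "Iris-versicolor", "Iris-virginica"),
  SepalLength = c(250, 297, 329)
)

# Build the common part of the plot once ...
p <- ggplot(tbl_demo, aes(x = Name, y = SepalLength, fill = Name)) +
  geom_bar(stat = "identity", width = 0.3)

# ... then add only the grid setting that changes
p_major_y <- p + theme(panel.grid.major.y = element_line(colour = "black"))
p_no_grid <- p + theme(panel.grid = element_blank())

print(p_major_y)
```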


  • Change the color of the major grid of y axis.

Code
ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) + 
  geom_bar(stat="identity", width = 0.3) +
  theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
  theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) + 
  theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic")) +
  scale_y_continuous(limits = c(0,350), breaks = seq(0,350,50)) +
  scale_fill_manual("legend", values = c("Iris-setosa"  = "indianred3", "Iris-versicolor" = "lightcyan2", "Iris-virginica" = "darkolivegreen2")) +
  theme(panel.grid.major.y = element_line(colour = "black"))

Output

  • Change the color of the minor grid of y axis.

Code

ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) + 
  geom_bar(stat="identity", width = 0.3) +
  theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
  theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) + 
  theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic")) +
  scale_y_continuous(limits = c(0,350), breaks = seq(0,350,50)) +
  scale_fill_manual("legend", values = c("Iris-setosa"  = "indianred3", "Iris-versicolor" = "lightcyan2", "Iris-virginica" = "darkolivegreen2")) +
  theme(panel.grid.major.y = element_line(colour = "black"), panel.grid.minor.y = element_line(colour = "blue"))

Output


  • Change the color of the grid of x axis
Code
ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) + 
  geom_bar(stat="identity", width = 0.3) +
  theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
  theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) + 
  theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic")) +
  scale_y_continuous(limits = c(0,350), breaks = seq(0,350,50)) +
  scale_fill_manual("legend", values = c("Iris-setosa"  = "indianred3", "Iris-versicolor" = "lightcyan2", "Iris-virginica" = "darkolivegreen2")) +
  theme(panel.grid.major.y = element_line(colour = "black"), panel.grid.minor.y = element_line(colour = "blue")) +
  theme(panel.grid.major.x = element_line(colour = "black"))


Output




  • Change major grids color at once.

Code

ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) + 
  geom_bar(stat="identity", width = 0.3) +
  theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
  theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) + 
  theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic")) +
  scale_y_continuous(limits = c(0,350), breaks = seq(0,350,50)) +
  scale_fill_manual("legend", values = c("Iris-setosa"  = "indianred3", "Iris-versicolor" = "lightcyan2", "Iris-virginica" = "darkolivegreen2")) +
  theme(panel.grid.major = element_line(colour = "black"))

Output
  • Change all the grids at once

Code
ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) + 
  geom_bar(stat="identity", width = 0.3) +
  theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
  theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) + 
  theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic")) +
  scale_y_continuous(limits = c(0,350), breaks = seq(0,350,50)) +
  scale_fill_manual("legend", values = c("Iris-setosa"  = "indianred3", "Iris-versicolor" = "lightcyan2", "Iris-virginica" = "darkolivegreen2")) +
  theme(panel.grid = element_line(colour = "black"))


Output


But in some cases you don't want the grid lines. 

  • Remove x grid lines
Code
ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) + 
  geom_bar(stat="identity", width = 0.3) +
  theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
  theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) + 
  theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic")) +
  scale_y_continuous(limits = c(0,350), breaks = seq(0,350,50)) +
  scale_fill_manual("legend", values = c("Iris-setosa"  = "indianred3", "Iris-versicolor" = "lightcyan2", "Iris-virginica" = "darkolivegreen2")) +
  theme(panel.grid.major.x = element_blank())


Output



  • Remove the y minor grid lines
Code

ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) + 
  geom_bar(stat="identity", width = 0.3) +
  theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
  theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) + 
  theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic")) +
  scale_y_continuous(limits = c(0,350), breaks = seq(0,350,50)) +
  scale_fill_manual("legend", values = c("Iris-setosa"  = "indianred3", "Iris-versicolor" = "lightcyan2", "Iris-virginica" = "darkolivegreen2")) +
  theme(panel.grid.major.x = element_blank(), panel.grid.minor.y = element_blank())

Output


  • Remove all the grid lines
Code
ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) + 
  geom_bar(stat="identity", width = 0.3) +
  theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
  theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) + 
  theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic")) +
  scale_y_continuous(limits = c(0,350), breaks = seq(0,350,50)) +
  scale_fill_manual("legend", values = c("Iris-setosa"  = "indianred3", "Iris-versicolor" = "lightcyan2", "Iris-virginica" = "darkolivegreen2")) +
  theme(panel.grid = element_blank())

Output



That is the end of part 2 of Plotting In R. 

If you have any specific request about configuring ggplot2, please leave a comment. I will try to add it to the current posts or cover it in future posts.


Plotting in R Part I - Axis




One of the strong points of R is plotting. It is possible to make highly presentable plots with the help of the ggplot2 library. There are many plots available in ggplot2. The aim of this tutorial series is more about configuring the different aspects of a plot, like background, grid, axis, title, border and so on. We will see how to make a simple plot and turn it into a nice-looking plot.

The libraries and data set needed.
Library: ggplot2, stringr
Dataset: iris (to be loaded from local drive)
IDE: RStudio

# First load the libraries (stringr provides str_wrap, used later)
library(ggplot2)
library(stringr)

# Then set your working directory
setwd("C:/R_Train") # please replace with your own directory

# Read the file and create a data frame
tbl <- read.csv("Sample Data/iris.csv")



Get the csv file  from here.
data source 

For this tutorial a bar chart is used to illustrate the modifications/enhancements you can do.

Base plot 

Code

ggplot(tbl, aes(x=Name, y=SepalLength)) +
   geom_bar(stat="identity")


From the documentation: "Sometimes, bar charts are used not as a distributional summary, but instead of a dotplot. Generally, it's preferable to use a dotplot (see geom_point) as it has a better data-ink ratio. However, if you do want to create this type of plot, you can set y to the value you have calculated, and use stat='identity'"

Output




The graph is basic and nothing fancy about it. Let’s configure or improve the appearance.
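
A side note on stat = "identity" as used in the base plot: when the table has several rows per Name, the bars of those rows are stacked, so the bar height is the per-Name sum. The sketch below makes that explicit on a made-up stand-in for the iris csv (tbl_demo and totals are names chosen here), and uses geom_col(), the ggplot2 shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Made-up stand-in for the iris csv: several rows per Name
tbl_demo <- data.frame(
  Name        = rep(c("Iris-setosa", "Iris-versicolor"), each = 3),
  SepalLength = c(5.1, 4.9, 4.7, 7.0, 6.4, 6.9)
)

# With stat = "identity" the rows of each Name are stacked,
# so the bar height equals the per-Name sum computed here
totals <- aggregate(SepalLength ~ Name, tbl_demo, sum)
print(totals)

# geom_col() is the ggplot2 shorthand for geom_bar(stat = "identity")
p <- ggplot(totals, aes(x = Name, y = SepalLength)) + geom_col()
print(p)
```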

Let’s start with the x axis. First, modify the axis title.

  • Increase the Font size

Code

ggplot(tbl, aes(x=Name, y=SepalLength)) +
   geom_bar(stat="identity") +
   theme(axis.title.x = element_text(size = 20))


Output


  • Change font color

Code

ggplot(tbl, aes(x=Name, y=SepalLength)) +
   geom_bar(stat="identity") +
   theme(axis.title.x = element_text(size = 20, color = "blue"))


Output




  • Change font face

Code

ggplot(tbl, aes(x=Name, y=SepalLength)) +
   geom_bar(stat="identity") +
   theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic"))


Output



Modify the axis text.

The element modified here is axis.text.x
Note the difference between axis.title.x and axis.text.x
Change the font size, color and face

Code

ggplot(tbl, aes(x=Name, y=SepalLength)) +
   geom_bar(stat="identity") +
   theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
   theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold"))


Output



  • Change the angle
If your axis text is long and the labels overlap each other, it's possible to change the angle.

Code

ggplot(tbl, aes(x=Name, y=SepalLength)) +
   geom_bar(stat="identity") +
   theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
   theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold", angle = 90, hjust = 0.5))


hjust = 0.5 to keep the text at the center of the bar.

Output



  • Wrap the text

Another option is to wrap the text. For this we need the stringr library, whose str_wrap function does the wrapping.
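
To see what str_wrap does on its own, here is a small sketch with a hypothetical long label (the label text and the helper name wrap_labels are made up for illustration).

```r
library(stringr)

# Hypothetical long axis label; str_wrap inserts line breaks so that
# each line stays within roughly `width` characters
label   <- "a fairly long axis label"
wrapped <- str_wrap(label, width = 10)
cat(wrapped, "\n")

# The same function can be passed to scale_x_discrete(labels = ...),
# where it is applied to every axis label
wrap_labels <- function(x) str_wrap(x, width = 10)
print(wrap_labels(c("short", label)))
```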

Code

ggplot(tbl, aes(x=Name, y=SepalLength)) +
   geom_bar(stat="identity") +
   theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
   theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
   scale_x_discrete(labels = function(x) str_wrap(x, width = 10))


Output




All the actions explained above can also be applied to the y axis. Here only the axis title is modified, not the axis text.


Code

ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) +
   geom_bar(stat="identity") +
   theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
   theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
   scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
   theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic"))

Output



Modify the scale of y axis

It is possible to manually change the scale.

Code

ggplot(tbl, aes(x=Name, y=SepalLength)) +
   geom_bar(stat="identity") +
   theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
   theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
   scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
   theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic")) +
   scale_y_continuous(limits = c(0,350), breaks = seq(0,350,50))

Output





  • Modify the bars
    • Change the colors


By adding “fill = Name” in aes you can change the colors

Code

ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) +
   geom_bar(stat="identity") +
   theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
   theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
   scale_x_discrete(labels = function(x) str_wrap(x, width = 10))

Output




The colors are pre-defined. If you want to choose the colors yourself, you need to do a manual override.

To do that you need to know all the categories in x.
This is easily done using the unique function.

unique(tbl$Name)
[1] "Iris-setosa" "Iris-versicolor" "Iris-virginica"


Code

ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) +
   geom_bar(stat="identity") +
   theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
   theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
   scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
   theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic")) +
   scale_fill_manual("legend", values = c("Iris-setosa" = "indianred3", "Iris-versicolor" = "lightcyan2", "Iris-virginica" = "darkolivegreen2"))

Output




Change the width of the bars. This is easily done using width = in geom_bar.

Code

ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) +
   geom_bar(stat="identity", width = 0.3) +
   theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
   theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
   scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
   theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic")) +
   scale_fill_manual("legend", values = c("Iris-setosa" = "indianred3", "Iris-versicolor" = "lightcyan2", "Iris-virginica" = "darkolivegreen2"))

Output




In the next post I will be discussing further configuration/modification/enhancement you can do with ggplot2.


Update I:
The 0 of the y axis does not start at the x axis; there is a gap. Setting expand = c(0, 0) in scale_y_continuous removes it.

Code

ggplot(tbl, aes(x=Name, y=SepalLength, fill = Name)) +
   geom_bar(stat="identity", width = 0.3) +
   theme(axis.title.x = element_text(size = 20, color = "blue", face = "italic")) +
   theme(axis.text.x = element_text(size = 10, color = "firebrick3", face = "bold")) +
   scale_x_discrete(labels = function(x) str_wrap(x, width = 10)) +
   theme(axis.title.y = element_text(size = 20, color = "blue", face = "italic")) +
   scale_y_continuous(expand = c(0, 0), limits = c(0,350), breaks = seq(0,350,50)) +
   scale_fill_manual("legend", values = c("Iris-setosa" = "indianred3", "Iris-versicolor" = "lightcyan2", "Iris-virginica" = "darkolivegreen2"))

Output






Sorting descending or ascending a column based on another column in R

Let's start with typical sorting. Sorting a table in R is fairly straightforward.

sample data (tbl)

ID SUB ID Value
A sub_3 55.28
C sub_3 55.89
A sub_3 56.61
B sub_3 56.67
A sub_3 59.01
C sub_3 63.71
C sub_4 41.06
D sub_4 42.43
C sub_4 43.07
A sub_4 43.29
D sub_4 53.56
C sub_4 53.61
C sub_4 53.93
D sub_4 55.06
D sub_4 55.62
B sub_5 54.9
B sub_5 55.92
B sub_5 55.94
D sub_5 56.78
C sub_5 56.91
D sub_5 59.65
B sub_5 60.01
C sub_6 35.08
D sub_6 37.03
D sub_6 40.97
C sub_6 41.68
C sub_6 43.49
D sub_6 45.92


Let's sort by the column 'Value'.

Attach the table.
> attach(tbl)


Then sort. Here I create another table, tbl1.
> tbl1 <- tbl[order( Value),]


You will get 'Value' in ascending order.



If we want descending order then the code will be,

> tbl1 <- tbl[order( -Value),]


Now suppose we need to sort 'Value' ascending when the SUB.ID is odd and descending when the SUB.ID is even.

To do this we need a new column, 'S_Value'.
For SUB.IDs sub_4 and sub_6, change the 'Value' to negative.

> tbl$S_Value <- ifelse((tbl$SUB.ID == 'sub_4' | tbl$SUB.ID == 'sub_6') , -tbl$Value, tbl$Value)


R Operators





Attach again
> attach(tbl)


Now sort by two columns

>  tbl3 <- tbl[order(SUB.ID, S_Value),]

We will get what we want..
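
The whole procedure can be sketched on a few rows of the sample table as follows. This version refers to columns as tbl$... instead of using attach(), and derives odd/even from the SUB.ID text rather than listing sub_4 and sub_6 by hand; the helper name sub_no is made up, and the approach assumes every SUB.ID has the form sub_<number>.

```r
# A few rows from the sample table above
tbl <- data.frame(
  ID     = c("A", "C", "C", "D", "B", "D", "C", "D"),
  SUB.ID = c("sub_3", "sub_3", "sub_4", "sub_4", "sub_5", "sub_5", "sub_6", "sub_6"),
  Value  = c(55.28, 55.89, 41.06, 42.43, 54.90, 56.78, 35.08, 37.03)
)

# Pull the number out of "sub_<n>" so even SUB.IDs need not be listed by hand
sub_no <- as.integer(sub("sub_", "", tbl$SUB.ID, fixed = TRUE))

# Negate Value for even SUB.IDs: ascending S_Value is then descending Value
tbl$S_Value <- ifelse(sub_no %% 2 == 0, -tbl$Value, tbl$Value)

# Sort by SUB.ID first, then by the signed helper column
tbl3 <- tbl[order(tbl$SUB.ID, tbl$S_Value), ]
print(tbl3)
```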


That's it.




Plotting with R: geom_violin with color based on two column values (using GGPLOT2)

In this post I explain how to color code ggplot plots automatically when the number of data points varies. I did this when I had an issue with lack of contrast between the default colors.

I am using the sample data as below



Load the data into R.
The idea is to have 'ID' and 'SUB ID' in X axis and 'Value'  in Y axis and to have different fill colors based on 'ID'.

1. One way to do this is to use 'facet_grid'.


I am calling the table 'tbl'.
ggplot(tbl, aes(x=SUB.ID, y=Value )) + geom_violin(aes(fill = ID)) + facet_grid(. ~ ID)

and the result is,






2. But if you want something like what you get in JMP, with both ID and Sub ID at the bottom of the plot, then there is a method (a long shot).

First create a new column by using 'paste'.

tbl$ID_SUB_ID <- paste(tbl$ID,tbl$SUB.ID, sep = "_")




Then plot using X=ID_SUB_ID

ggplot(tbl, aes(x=ID_SUB_ID, y=Value )) + geom_violin(aes(fill = ID))




Let's define the colors manually (based on ID).
This is a good option when you are doing this automatically and the number of IDs varies every time.


The idea is to alternate dark and light colors.
For example: brown, beige, darkolivegreen, khaki1, midnightblue, magenta, seagreen4, papayawhip


It is important to note that "Aesthetics must be either length 1 or the same as the data". In other words, a color needs to be defined for every row.


Let's create a data.frame with color names.

com <- data.frame(c('brown', 'beige', 'darkolivegreen', 'khaki1', 'midnightblue','magenta', 'seagreen4', 'papayawhip'))
colnames(com)[1] <- 'color'



Let's add a serial number (index) to the table.
com$index <- seq.int(nrow(com))




As the color is based on ID the colors need to be matched against unique ids.

uID <- data.frame(unique(tbl$ID, incomparables = FALSE))
colnames(uID)[1] <- 'ID'

Let's add a serial number (index) to this table as well.
uID$index <- seq.int(nrow(uID))



Merge the tables to have colors matched to IDs.

ID_color <- merge(uID,com, by = 'index', all.x = TRUE)



Now merge the data table and the color table by ID.
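
The code for this merge is not shown in the post. A plausible sketch, assuming the result is the table called tblc in the plots that follow, is given below on miniature, made-up versions of tbl and ID_color.

```r
# Hypothetical miniature versions of the tables built above
tbl <- data.frame(
  ID     = c("A", "A", "B", "C"),
  SUB.ID = c("sub_3", "sub_4", "sub_3", "sub_4"),
  Value  = c(55.28, 43.29, 56.67, 41.06)
)
tbl$ID_SUB_ID <- paste(tbl$ID, tbl$SUB.ID, sep = "_")

ID_color <- data.frame(
  index = 1:3,
  ID    = c("A", "B", "C"),
  color = c("brown", "beige", "darkolivegreen")
)

# Assumed merge step: attach each ID's color to every row of the data
tblc <- merge(tbl, ID_color[, c("ID", "color")], by = "ID", all.x = TRUE)
print(tblc)
```

With all.x = TRUE every data row is kept, and an ID without a matching color would get NA in the color column.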



Now plot again..

ggplot(tblc, aes(x=ID_SUB_ID, y=Value)) + geom_violin(aes(fill = color))



Let's remove the legend as it is not what we want.

ggplot(tblc, aes(x=ID_SUB_ID, y=Value)) + geom_violin(aes(fill = color)) + guides(fill = "none")
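
One caveat: with aes(fill = color) alone, ggplot2 treats the color names as category labels and picks the fills from its default palette. If the intent is for each violin to be drawn in the named color itself, one option is scale_fill_identity(), sketched below on a made-up miniature of tblc (tblc_demo is a name chosen here).

```r
library(ggplot2)

# Hypothetical miniature of tblc: several values per group plus a color name per row
tblc_demo <- data.frame(
  ID_SUB_ID = rep(c("A_sub_3", "B_sub_3"), each = 5),
  Value     = c(55.3, 56.6, 59.0, 57.2, 58.1, 56.7, 54.9, 55.9, 60.0, 57.5),
  color     = rep(c("brown", "beige"), each = 5),
  stringsAsFactors = FALSE
)

# scale_fill_identity() uses the values in `color` as the fill colors themselves;
# it draws no legend by default
p <- ggplot(tblc_demo, aes(x = ID_SUB_ID, y = Value)) +
  geom_violin(aes(fill = color)) +
  scale_fill_identity()
print(p)
```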



That's it.
I will discuss how to beautify the plot in another post.