10 datasets found

Convert Text to Pandas
kaggle.com
zip
Updated Sep 22, 2024
Cite
Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas
Explore at:
zip(4333134 bytes)Available download formats
Dataset updated
Sep 22, 2024
Authors
Zeyad Usf
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
kaggle notebook
Github Repo

I found two datasets on Hugging Face for converting text with context into pandas code, but the challenge lies in the context. The two datasets structure their context differently, which hurts the model's results. Let's first describe the data I found, then show examples, the solution, and some remaining problems.

Rahima411/text-to-pandas:

The data is divided into Train with 57.5k and Test with 19.2k.

The data has two columns, as you can see in the example:

"Input": Contains the context and the question together; the context holds the metadata about the data frame.

"Pandas Query": Pandas code.

| Input | Pandas Query |
|---|---|
| Table Name: head (age (object), head_id (object)) | result = management['head.age'].unique() |
| Table Name: management (head_id (object), temporary_acting (object)) | |
| What are the distinct ages of the heads who are acting? | |

hiltch/pandas-create-context:

It contains 17k rows with three columns:

question: Text.

context: Code that creates a data frame with column names, unlike the first dataset, whose context contains the data frame name, the column names, and the data types.

answer: Pandas code.

| question | context | answer |
|---|---|---|
| What was the lowest # of total votes? | df = pd.DataFrame(columns=['_number_of_total_votes']) | df['_number_of_total_votes'].min() |

As you can see, the problem with this data is that the inputs are not similar and the structure of the context is different. My solution to this problem was:

- Convert the first dataset so that its context looks like the second. I chose this direction because it is difficult to recover the column data types for the second dataset. It was easy to convert the structure of the context from this shape `Table Name: head (age (object), head_id (object))` to this `head = pd.DataFrame(columns=['age','head_id'])` through the code below.
- Then separate the question from the context. This was easy because, if you look at the data, the context always ends with ")" followed by a blank, and then the question. You will find all of this in the code below.
- Note that a context can contain more than one table definition (more than one line), and the code handles this.

```py
import re


def extract_table_creation(text: str) -> tuple[str, str]:
    """
    Extracts DataFrame creation statements and questions from the given text.

    Args:
        text (str): The input text containing table definitions and questions.

    Returns:
        tuple: A tuple containing a concatenated DataFrame creation string and a question.
    """
    # Define patterns
    table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
    column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

    # Find all table names and column definitions
    matches = re.findall(table_pattern, text)

    # Initialize a list to hold DataFrame creation statements
    df_creations = []

    for table_name, columns_str in matches:
        # Extract column names
        columns = re.findall(column_pattern, columns_str)
        column_names = [col[0] for col in columns]

        # Format DataFrame creation statement
        df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
        df_creations.append(df_creation)

    # Concatenate all DataFrame creation statements
    df_creation_concat = '\n'.join(df_creations)

    # Extract and clean the question
    question = text[text.rindex(')')+1:].strip()

    return df_creation_concat, question
```

After both datasets had the same structure, they were merged into one set and divided into _72.8K_ train and _18.6K_ test. We analyzed this dataset, and you can see it all in the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we found some problems in the dataset as well, such as:

> - `Answer` : `df['Id'].count()` is repeated, but this is plausible, so we do not need to drop these rows.
> - `Context` : It contains `147` rows that do not contain any text. We will see through experiments whether this affects the results negatively or positively.
> - `Question` : It is ...
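
The merge-and-split step described above can be sketched in plain Python. This is a minimal illustration rather than the author's notebook code: the tiny record lists, the `merge_and_split` and `count_empty_contexts` helpers, the shuffle seed, the 20% test fraction, and the "How many rows are there?" record are all assumptions made for the example.

```python
import random


def merge_and_split(first, second, test_fraction=0.2, seed=0):
    """Merge two lists of {question, context, answer} records and split them."""
    merged = first + second
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(merged)
    n_test = round(len(merged) * test_fraction)
    return merged[n_test:], merged[:n_test]


def count_empty_contexts(records):
    """Count records whose context is missing or whitespace-only."""
    return sum(1 for r in records if not (r.get("context") or "").strip())


# Hypothetical records, already converted to the shared structure
first = [
    {"question": "What are the distinct ages of the heads who are acting?",
     "context": "head = pd.DataFrame(columns=['age', 'head_id'])",
     "answer": "result = management['head.age'].unique()"},
    {"question": "How many rows are there?",
     "context": "",
     "answer": "df['Id'].count()"},
]
second = [
    {"question": "What was the lowest # of total votes?",
     "context": "df = pd.DataFrame(columns=['_number_of_total_votes'])",
     "answer": "df['_number_of_total_votes'].min()"},
] * 3

train, test = merge_and_split(first, second)
print(len(train), len(test), count_empty_contexts(train + test))
```

Counting empty contexts before training makes it easy to run the experiment the author mentions, once with those rows and once without.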

Google Data Analytics Case Study Cyclistic

kaggle.com

zip

Updated Sep 27, 2022

Cite

Udayakumar19 (2022). Google Data Analytics Case Study Cyclistic [Dataset]. https://www.kaggle.com/datasets/udayakumar19/google-data-analytics-case-study-cyclistic/suggestions

Explore at:

zip(1299 bytes)Available download formats

Dataset updated

Sep 27, 2022

Authors

Udayakumar19

Description

Introduction

Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask

How do annual members and casual riders use Cyclistic bikes differently?

Guiding Question:

What is the problem you are trying to solve?
  How do annual members and casual riders use Cyclistic bikes differently?
How can your insights drive business decisions?
  The insights will help the marketing team build a strategy for casual riders.

Prepare

Guiding Question:

Where is your data located?
  The data is located in Cyclistic's organizational data.

How is data organized?
  The datasets are in CSV format, one file per month, for financial year 22.

Are there issues with bias or credibility in this data? Does your data ROCCC?
  The data is good; it is ROCCC because it was collected by the Cyclistic organization.

How are you addressing licensing, privacy, security, and accessibility?
  The company has its own license over the dataset. The dataset does not contain any personal information about the riders.

How did you verify the data’s integrity?
  All the files have consistent columns and each column has the correct type of data.

How does it help you answer your questions?
  Insights are always hidden in the data; we have to interpret the data to find them.

Are there any problems with the data?
  Yes, the start station names and end station names have null values.

Process

Guiding Question:

What tools are you choosing and why?
  I used RStudio to clean and transform the data for the analysis phase, because the dataset is large and I wanted to gain experience with the language.

Have you ensured the data’s integrity?
  Yes, the data is consistent throughout the columns.

What steps have you taken to ensure that your data is clean?
  First, duplicates and null values were removed; then new columns were added for analysis.

How can you verify that your data is clean and ready to analyze?
  Make sure the column names are consistent throughout all datasets by using the “bind_rows” function.
  Make sure column data types are consistent throughout all datasets by using “compare_df_cols” from the “janitor” package.
  Combine all datasets into a single data frame to keep the analysis consistent.
  Remove the columns start_lat, start_lng, end_lat, and end_lng from the data frame because they are not required for the analysis.
  Create new columns day, date, month, and year from the started_at column; this provides additional opportunities to aggregate the data.
  Create the “ride_length” column from the started_at and ended_at columns to find the average ride duration.
  Remove rows with null values from the dataset by using the “na.omit” function.

Have you documented your cleaning process so you can review and share those results?
  Yes, the cleaning process is documented clearly.
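
The cleaning steps listed above were done in R by the author. As a rough illustration only, the same sequence could look like this in plain Python. The column names `started_at`, `ended_at`, `start_lat`, `start_lng`, `end_lat`, and `end_lng` come from the description; the `clean_rides` helper, the timestamp format, and the sample rows are assumptions.

```python
from datetime import datetime

DROP_COLUMNS = {"start_lat", "start_lng", "end_lat", "end_lng"}
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def clean_rides(rows):
    """Apply the cleaning steps to a list of dict rows (one dict per ride)."""
    cleaned, seen = [], set()
    for row in rows:
        # Drop the coordinate columns that are not needed for the analysis
        row = {k: v for k, v in row.items() if k not in DROP_COLUMNS}
        # Remove rows with missing values (the na.omit step)
        if any(v in ("", None) for v in row.values()):
            continue
        # Remove exact duplicates
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        started = datetime.strptime(row["started_at"], TIME_FORMAT)
        ended = datetime.strptime(row["ended_at"], TIME_FORMAT)
        # Derive the date parts and the ride length in seconds
        row.update(
            date=started.date().isoformat(),
            day=started.strftime("%A"),
            month=started.month,
            year=started.year,
            ride_length=(ended - started).total_seconds(),
        )
        cleaned.append(row)
    return cleaned


# Made-up sample rows: one valid ride, its duplicate, and one with a null station
ride = {"ride_id": "A1", "started_at": "2022-04-01 08:00:00",
        "ended_at": "2022-04-01 08:15:00", "start_station_name": "Station A",
        "end_station_name": "Station B", "member_casual": "member",
        "start_lat": "41.9", "start_lng": "-87.6",
        "end_lat": "41.8", "end_lng": "-87.6"}
rides = [ride, dict(ride), {**ride, "ride_id": "B2", "start_station_name": ""}]

result = clean_rides(rides)
print(len(result), result[0]["ride_length"], result[0]["month"])
```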

Analyze Phase:

Guiding Questions:

How should you organize your data to perform analysis on it?
  The data has been organized in one single data frame by using the read csv function in R.
Has your data been properly formatted?
  Yes, all the columns have their correct data type.

What surprises did you discover in the data?
  Casual riders' ride duration is higher than annual members'.
  Casual riders use docked bikes more widely than annual members.
What trends or relationships did you find in the data?
  Annual members mainly use the bikes for commuting.
  Casual riders prefer docked bikes.
  Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
  These insights help to build a profile for members.

Guiding Questions:

Were you able to answer the question of how ...

Data from: Generalizable EHR-R-REDCap pipeline for a national...
data.niaid.nih.gov
datadryad.org
zip
Updated Jan 9, 2022
Cite
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.rjdfn2zcm
Dataset updated
Jan 9, 2022
Dataset provided by
Harvard Medical School
Massachusetts General Hospital
Authors
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
License
https://spdx.org/licenses/CC0-1.0.html
Description
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

Methods eLAB Development and Source Code (R statistical software):

eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
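
The de-identification step can be pictured as a simple key swap. The sketch below is not eLAB's actual R code: the `assign_record_ids` helper, the `mrn` and `name` field names, and the sample values are hypothetical, and it assumes the registry already holds an MRN-to-record_id mapping.

```python
def assign_record_ids(rows, mrn_to_record_id):
    """Replace the MRN/name fields with the registry-assigned record_id."""
    deidentified = []
    for row in rows:
        record_id = mrn_to_record_id[row["mrn"]]
        # Keep everything except the identifying fields
        kept = {k: v for k, v in row.items() if k not in ("mrn", "name")}
        deidentified.append({"record_id": record_id, **kept})
    return deidentified


# Hypothetical mapping and lab row, for illustration only
mapping = {"0001234": 17}
labs = [{"mrn": "0001234", "name": "Doe, Jane", "lab": "Potassium", "value": 4.1}]
deidentified = assign_record_ids(labs, mapping)
print(deidentified)
```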

Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
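
A key-value lookup of this kind can be sketched as a plain dictionary. The raw potassium labels below are the ones quoted in the text; the DD field name `potassium`, the accepted unit, the `remap_labs` helper, and the sample values are assumptions made for illustration, and eLAB's real table (in R) covers about 300 subtypes.

```python
# Key-value lookup: raw EHR lab label -> data dictionary (DD) field name.
# The DD field name "potassium" is hypothetical.
LAB_LOOKUP = {
    "Potassium": "potassium",
    "Potassium-External": "potassium",
    "Potassium(POC)": "potassium",
    "Potassium,whole-bld": "potassium",
    "Potassium-Level-External": "potassium",
    "Potassium,venous": "potassium",
    "Potassium-whole-bld/plasma": "potassium",
}

# Units accepted per DD field (assumed for illustration)
ALLOWED_UNITS = {"potassium": {"mmol/L"}}


def remap_labs(results):
    """Keep only labs found in the lookup table whose unit the DD accepts."""
    remapped = []
    for lab in results:
        code = LAB_LOOKUP.get(lab["name"])
        if code is None or lab["unit"] not in ALLOWED_UNITS[code]:
            continue  # unknown lab subtype, or unit not pre-defined by the DD
        remapped.append({"field": code, "value": lab["value"], "unit": lab["unit"]})
    return remapped


raw = [
    {"name": "Potassium(POC)", "value": 4.1, "unit": "mmol/L"},
    {"name": "Potassium,venous", "value": 3.9, "unit": "mg/dL"},
    {"name": "Sodium", "value": 140, "unit": "mmol/L"},
]
remapped = remap_labs(raw)
print(remapped)
```

Keeping the mapping in data rather than in code is what lets end users re-configure the table without touching the transformation logic.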

Data Dictionary (DD)

EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

Study Cohort

This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

Statistical Analysis

OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
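
The OS definition and the censoring rule translate into a duration plus an event flag, which is the input a Cox model expects. A minimal sketch, assuming a hypothetical `overall_survival` helper and made-up dates:

```python
from datetime import date


def overall_survival(diagnosis, death=None, last_follow_up=None):
    """Return (days, event): event is 1 for death, 0 if censored."""
    if death is not None:
        return (death - diagnosis).days, 1
    # No death recorded: censor at the last follow-up visit
    return (last_follow_up - diagnosis).days, 0


observed = overall_survival(date(2017, 3, 1), death=date(2018, 3, 1))
censored = overall_survival(date(2018, 6, 15), last_follow_up=date(2019, 6, 15))
print(observed, censored)
```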
Data from: An assessment of wheat yield sensitivity and breeding gains in...
datadryad.org
data.niaid.nih.gov
zip
Updated Mar 5, 2013
Cite
Sharon M. Gourdji; Ky L. Mathews; Matthew Reynolds; Jose Crossa; David B. Lobell (2013). An assessment of wheat yield sensitivity and breeding gains in hot environments [Dataset]. http://doi.org/10.5061/dryad.525vm
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.525vm
Dataset updated
Mar 5, 2013
Dataset provided by
Dryad
Authors
Sharon M. Gourdji; Ky L. Mathews; Matthew Reynolds; Jose Crossa; David B. Lobell
Time period covered
Nov 8, 2012
Description
regression.dat_nurseries_ADW2.Rdat: This R data frame contains 1353 rows corresponding to the international trials in the CIMMYT database used in this study. The column names should be self-descriptive, and contain all the predictors used for this regression analysis.
Case study: Cyclistic bike-share analysis
kaggle.com
zip
Updated Mar 25, 2022
Cite
Jorge4141 (2022). Case study: Cyclistic bike-share analysis [Dataset]. https://www.kaggle.com/datasets/jorge4141/case-study-cyclistic-bikeshare-analysis
Explore at:
zip(131490806 bytes)Available download formats
Dataset updated
Mar 25, 2022
Authors
Jorge4141
Description
Introduction

This is a case study called Capstone Project from the Google Data Analytics Certificate.

In this case study, I am working as a junior data analyst at a fictitious bike-share company in Chicago called Cyclistic.

Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike.

Scenario

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, our team will design a new marketing strategy to convert casual riders into annual members.

Primary Stakeholders:

1: Cyclistic Executive Team

2: Lily Moreno, Director of Marketing and Manager

ASK

How do annual members and casual riders use Cyclistic bikes differently?

Why would casual riders buy Cyclistic annual memberships?

How can Cyclistic use digital media to influence casual riders to become members?

Prepare

The last four quarters were selected for analysis which cover April 01, 2019 - March 31, 2020. These are the datasets used:

Divvy_Trips_2019_Q2 Divvy_Trips_2019_Q3 Divvy_Trips_2019_Q4 Divvy_Trips_2020_Q1

The data is stored in CSV files. Each file contains one month data for a total of 12 .csv files.

Data appears to be reliable with no bias. It also appears to be original, current and cited.

I used Cyclistic’s historical trip data found here: https://divvy-tripdata.s3.amazonaws.com/index.html

The data has been made available by Motivate International Inc. under this license: https://ride.divvybikes.com/data-license-agreement

Limitations

Financial information is not available.

Process

Used R to analyze and clean data

After installing the R packages, data was collected, wrangled and combined into a single file.

Columns were renamed.

Looked for incongruencies in the dataframes and converted some columns to character type, so they can stack correctly.

Combined all quarters into one big data frame.

Removed unnecessary columns

Analyze

Inspected new data table to ensure column names were correctly assigned.

Formatted columns to ensure proper data types were assigned (numeric, character, etc).

Consolidated the member_casual column.

Added day, month and year columns to aggregate data.

Added ride-length column to the entire dataframe for consistency.

Deleted rides with a negative trip duration, as well as rides where bikes were taken out of circulation for quality control.

Replaced the word "member" with "Subscriber" and also replaced the word "casual" with "Customer".

Aggregated data, compared average rides between members and casual users.
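
The consolidation and comparison steps above were carried out in R. A short Python sketch of the same idea follows; the `average_ride_length` helper and the sample rides are made up, while the label mapping ("member" to "Subscriber", "casual" to "Customer") and the removal of negative durations follow the description.

```python
from collections import defaultdict

# Label consolidation described in the text
LABELS = {"member": "Subscriber", "casual": "Customer"}


def average_ride_length(rides):
    """Average ride length per rider type, ignoring negative durations."""
    totals = defaultdict(lambda: [0.0, 0])
    for ride in rides:
        if ride["ride_length"] < 0:
            continue  # negative durations are treated as bad records
        label = LABELS.get(ride["member_casual"], ride["member_casual"])
        totals[label][0] += ride["ride_length"]
        totals[label][1] += 1
    return {label: total / count for label, (total, count) in totals.items()}


# Made-up rides, lengths in seconds
rides = [
    {"member_casual": "member", "ride_length": 600},
    {"member_casual": "Subscriber", "ride_length": 900},
    {"member_casual": "casual", "ride_length": 2400},
    {"member_casual": "Customer", "ride_length": -30},
]
averages = average_ride_length(rides)
print(averages)
```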

Share

After analysis, visuals were created as shown below with R.

Act

Conclusion:

Data appears to show that casual riders and members use bike share differently.

Casual riders' average ride length is more than twice that of members.

Members use bike share for commuting, casual riders use it for leisure and mostly on the weekends.

Unfortunately, there's no financial data available to determine which of the two (casual or member) is spending more money.

Recommendations

Offer casual riders a membership package with promotions and discounts.
Data from: How scientists perceive the evolutionary origin of human traits:...
datadryad.org
data.niaid.nih.gov
zip
Updated Jan 11, 2019
Cite
Hanna Tuomisto; Matleena Tuomisto; Jouni T. Tuomisto (2019). How scientists perceive the evolutionary origin of human traits: results of a survey study [Dataset]. http://doi.org/10.5061/dryad.s9r98
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5061/dryad.s9r98
Dataset updated
Jan 11, 2019
Dataset provided by
Dryad
Authors
Hanna Tuomisto; Matleena Tuomisto; Jouni T. Tuomisto
Time period covered
Jan 10, 2018
Description
survey1: Full data of the survey, except column Country is left empty to prevent identification of individuals. Columns 124-137 are not from the questionnaire but calculated based on other columns (see file survey1_questions).
survey1_questions: Title names and explanations for columns in the survey.
Survey data as R objects on Opasnet. Analysis code: Data of the study survey. Data is available as R objects (data.frames) from Opasnet repository by using the code on the linked page. The page also contains the codes that were used to perform the statistical analyses and draw figures of the article.
Social Contacts
kaggle.com
zip
Updated Apr 29, 2020
Cite
Patrick (2020). Social Contacts [Dataset]. https://www.kaggle.com/bitsnpieces/social-contacts
Explore at:
zip(33056 bytes)Available download formats
Dataset updated
Apr 29, 2020
Authors
Patrick
Description
Inspiration

Which countries have the most social contacts in the world? In particular, do countries with more social contacts among the elderly report more deaths caused by a pandemic caused by a respiratory virus?

Context

With the emergence of the COVID-19 pandemic, reports have shown that the elderly are at a higher risk of dying than any other age group. 8 out of 10 deaths reported in the U.S. have been in adults 65 years old and older. Countries have also begun to enforce 2km social distancing to contain the pandemic.

To this end, I wanted to explore the relationship between social contacts among the elderly and the number of COVID-19 deaths across countries.

Content

This dataset includes a subset of the projected social contact matrices for 152 countries from Prem et al. 2020. The projections were based on the POLYMOD study, in which information on social contacts was obtained using cross-sectional surveys in Belgium (BE), Germany (DE), Finland (FI), Great Britain (GB), Italy (IT), Luxembourg (LU), The Netherlands (NL), and Poland (PL) between May 2005 and September 2006.

This dataset includes contact rates from study participants ages 65+ for all countries from all sources of contact (work, home, school and others).

I used this R code to extract this data:

# https://github.com/kieshaprem/covid19-agestructureSEIR-wuhan-social-distancing/blob/master/data/contacts.Rdata
load('../input/contacts.Rdata')
View(contacts)
contacts[["ALB"]][["home"]]
contacts[["ITA"]][["all"]]
rowSums(contacts[["ALB"]][["all"]])
# Rows 16, 15 and 14 of each matrix are the 75_79, 70_74 and 65_69 age groups
out1 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[16,]; out1 <- rbind(out1, data.frame(x)) }
out2 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[15,]; out2 <- rbind(out2, data.frame(x)) }
out3 = data.frame(); for (n in names(contacts)) { x = (contacts[[n]][["all"]])[14,]; out3 <- rbind(out3, data.frame(x)) }
m1 = data.frame(t(matrix(unlist(out1), nrow=16)))
m2 = data.frame(t(matrix(unlist(out2), nrow=16)))
m3 = data.frame(t(matrix(unlist(out3), nrow=16)))
rownames(m1) = names(contacts)
colnames(m1) = c("00_04", "05_09", "10_14", "15_19", "20_24", "25_29", "30_34", "35_39", "40_44", "45_49", "50_54", "55_59", "60_64", "65_69", "70_74", "75_79")
rownames(m2) = rownames(m1)
rownames(m3) = rownames(m1)
colnames(m2) = colnames(m1)
colnames(m3) = colnames(m1)
write.csv(zapsmall(m1),"contacts_75_79.csv", row.names = TRUE)
write.csv(zapsmall(m2),"contacts_70_74.csv", row.names = TRUE)
write.csv(zapsmall(m3),"contacts_65_69.csv", row.names = TRUE)

Row names correspond to the 3-letter country ISO code, e.g. ITA represents Italy. Column names are the age groups of the individuals contacted, in 5-year intervals from 0 to 80 years old. Cell values are the projected mean social contact rate.
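
To make the file layout concrete, here is a small Python sketch that reads a CSV in this shape and ranks countries by their total contact rate. The `total_contacts` helper is hypothetical, the sample has only three of the sixteen age-group columns, and the numbers are made up; the real files come from the R code above.

```python
import csv
import io

# Illustrative rows in the layout described above; the values are invented.
SAMPLE = """,00_04,05_09,10_14
ALB,0.10,0.20,0.30
ITA,0.05,0.10,0.15
"""


def total_contacts(csv_text):
    """Sum the contact rates per country and sort from most to fewest."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # header row; its first cell is the empty row-name column
    totals = {}
    for row in reader:
        country, values = row[0], row[1:]
        totals[country] = sum(float(v) for v in values)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


ranking = total_contacts(SAMPLE)
print([country for country, _ in ranking])
```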

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F1139998%2Ffa3ddc065ea46009e345f24ab0d905d2%2Fcontact_distribution.png?generation=1588258740223812&alt=media

Acknowledgements

Thanks goes to Dr. Kiesha Prem for her correspondence and her team for publishing their work on social contact matrices.

References

The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study

Projecting social contact matrices in 152 countries using contact surveys and demographic data

Social Contacts and Mixing Patterns Relevant to the Spread of Infectious Diseases (POLYMOD study)

Related resources

My starter notebook

http://www.socialcontactdata.org/

https://www.kaggle.com/tsubasatwi/close-contact-status-of-corona-in-japan

Facebook Data for Good Mobility Dashboard
Divvy Tripdata 2022_04 To 2023_03
kaggle.com
zip
Updated May 11, 2023
Cite
Erik Entenmann (2023). Divvy Tripdata 2022_04 To 2023_03 [Dataset]. https://www.kaggle.com/erikentenmann/divvy-tripdata-2022-04-2023-03
Explore at:
zip(211278092 bytes)Available download formats
Dataset updated
May 11, 2023
Authors
Erik Entenmann
Description
Licensing for the data: https://ride.divvybikes.com/data-license-agreement

This dataset is historical data from a 12-month period, from the beginning of April 2022 through the end of March 2023. It was used as a case study for the Google Data Analytics Certificate, and I plan on revisiting this dataset at some point in the future for practicing with R.

Some notes about this dataset - all column names are uniform throughout all of the CSV files so they should be easy to clean and merge together into an aggregate data frame.
FacialRecognition
kaggle.com
zip
Updated Dec 1, 2016
Cite
TheNicelander (2016). FacialRecognition [Dataset]. https://www.kaggle.com/petein/facialrecognition
Explore at:
zip(121674455 bytes)Available download formats
Dataset updated
Dec 1, 2016
Authors
TheNicelander
License
http://opendatacommons.org/licenses/dbcl/1.0/
Description

#https://www.kaggle.com/c/facial-keypoints-detection/details/getting-started-with-r
#################################

###Variables for downloaded files
data.dir <- ' '
train.file <- paste0(data.dir, 'training.csv')
test.file <- paste0(data.dir, 'test.csv')
#################################

###Load csv -- creates a data.frame matrix where each column can have a different type.
d.train <- read.csv(train.file, stringsAsFactors = F)
d.test <- read.csv(test.file, stringsAsFactors = F)

###In training.csv, we have 7049 rows, each one with 31 columns.
###The first 30 columns are keypoint locations, which R correctly identified as numbers.
###The last one is a string representation of the image, identified as a string.

###To look at samples of the data, uncomment this line:

# head(d.train)

###Let's save the last column as another variable, and remove it from d.train:
###d.train is our dataframe, and we want the column called Image.
###Assigning NULL to a column removes it from the dataframe

im.train <- d.train$Image
d.train$Image <- NULL #removes 'image' from the dataframe

im.test <- d.test$Image
d.test$Image <- NULL #removes 'image' from the dataframe

#################################
#The image is represented as a series of numbers, stored as a string
#Convert these strings to integers by splitting them and converting the result to integer

#strsplit splits the string
#unlist simplifies its output to a vector of strings
#as.integer converts it to a vector of integers.
as.integer(unlist(strsplit(im.train[1], " ")))
as.integer(unlist(strsplit(im.test[1], " ")))

###Install and activate appropriate libraries
###The tutorial is meant for Linux and OSx, where they use a different library, so:
###Replace all instances of %dopar% with %do%.

install.packages('foreach')

library("foreach", lib.loc="~/R/win-library/3.3")

###implement parallelization
im.train <- foreach(im = im.train, .combine=rbind) %do% {
    as.integer(unlist(strsplit(im, " ")))
}
im.test <- foreach(im = im.test, .combine=rbind) %do% {
    as.integer(unlist(strsplit(im, " ")))
}
#The foreach loop will evaluate the inner command for each row in im.train, and combine the results with rbind (combine by rows).
#%do% runs the evaluations sequentially; %dopar%, which it replaces here, would run them in parallel.
#im.train is now a matrix with 7049 rows (one for each image) and 9216 columns (one for each pixel):

###Save all four variables in data.Rd file
###Can reload them at anytime with load('data.Rd')

save(d.train, im.train, d.test, im.test, file='data.Rd')

load('data.Rd')

#each image is a vector of 96*96 pixels (96*96 = 9216). #convert these 9216 integers into a 96x96 matrix: im <- matrix(data=rev(im.train[1,]), nrow=96, ncol=96)

#im.train[1,] returns the first row of im.train, which corresponds to the first training image. #rev reverse the resulting vector to match the interpretation of R's image function #(which expects the origin to be in the lower left corner).

#To visualize the image we use R's image function: image(1:96, 1:96, im, col=gray((0:255)/255))

#Let’s color the coordinates for the eyes and nose points(96-d.train$nose_tip_x[1], 96-d.train$nose_tip_y[1], col="red") points(96-d.train$left_eye_center_x[1], 96-d.train$left_eye_center_y[1], col="blue") points(96-d.train$right_eye_center_x[1], 96-d.train$right_eye_center_y[1], col="green")

#Another good check is to see how variable is our data. #For example, where are the centers of each nose in the 7049 images? (this takes a while to run): for(i in 1:nrow(d.train)) { points(96-d.train$nose_tip_x[i], 96-d.train$nose_tip_y[i], col="red") }

#there are quite a few outliers -- they could be labeling errors. Looking at one extreme example we get this: #In this case there's no labeling error, but this shows that not all faces are centralized idx <- which.max(d.train$nose_tip_x) im <- matrix(data=rev(im.train[idx,]), nrow=96, ncol=96) image(1:96, 1:96, im, col=gray((0:255)/255)) points(96-d.train$nose_tip_x[idx], 96-d.train$nose_tip_y[idx], col="red")

# One of the simplest things to try is to compute the mean of the coordinates of each keypoint in the training set and use that as a prediction for all images:
colMeans(d.train, na.rm=T)

# To build a submission file we need to apply these computed coordinates to the test instances:
p <- matrix(data=colMeans(d.train, na.rm=T), nrow=nrow(d.test), ncol=ncol(d.train), byrow=T)
colnames(p) <- names(d.train)
predictions <- data.frame(ImageId = 1:nrow(d.test), p)
head(predictions)
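The mean-coordinate baseline can also be sketched in plain Python. This is a minimal illustration, not the tutorial's code: `column_means` is a hypothetical helper that mimics `colMeans(..., na.rm=T)`, and the three training rows are invented.

```python
import math


def column_means(rows, columns):
    """Mean of each column, skipping missing values (NaN), like colMeans(..., na.rm=T)."""
    means = {}
    for col in columns:
        values = [row[col] for row in rows if not math.isnan(row[col])]
        means[col] = sum(values) / len(values) if values else math.nan
    return means


# Made-up training keypoints; NaN marks a missing label.
d_train = [
    {"nose_tip_x": 44.0, "nose_tip_y": 57.0},
    {"nose_tip_x": 48.0, "nose_tip_y": math.nan},
    {"nose_tip_x": 52.0, "nose_tip_y": 61.0},
]
means = column_means(d_train, ["nose_tip_x", "nose_tip_y"])

# Every test image receives the same predicted coordinates.
n_test = 2  # assumed number of test images for this toy example
predictions = [{"ImageId": i + 1, **means} for i in range(n_test)]
```

Because every test image gets identical coordinates, this baseline ignores the pixels entirely; it is mainly useful as a reference score to beat.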

# The expected submission format has one keypoint per row, but we can easily get that with the help of the reshape2 library:

install.packages('reshape2')

library(...
Time Series Forecasting Using Prophet in R
kaggle.com
zip
Updated Jul 25, 2023
Cite
vikram amin (2023). Time Series Forecasting Using Prophet in R [Dataset]. https://www.kaggle.com/datasets/vikramamin/time-series-forecasting-using-prophet-in-r
Explore at:
zip (9000 bytes)
Dataset updated
Jul 25, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Main objective : To forecast the page visits of a website

Tool : Time Series Forecasting using Prophet in R.

Steps:

Read the data

Data Cleaning: Checking data types, date formats and missing data

Load libraries (dplyr, ggplot2, tidyverse, lubridate, prophet, forecast)

Change the Date column from a character vector to a date and change the date format using the lubridate package

Rename the column "Date" to "ds" and "Visits" to "y".

Treat "Christmas" and "Black.Friday" as holiday events. As the data ranges from 2016 to 2020, there will be 5 Christmas and 5 Black Friday days.

We look at the effect on "Visits" from 3 days before to 3 days after Christmas, and from 3 days before to 1 day after Black Friday

We create two data frames called Christmas and Black.Friday and merge the two into a data frame called "holidays".
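The holiday table described above can be sketched as follows. This is an illustration under assumptions: the rows use the column names Prophet expects for holidays (`holiday`, `ds`, `lower_window`, `upper_window`), the Black Friday dates are computed from the calendar rule (the day after the fourth Thursday of November) rather than read from the dataset, and `black_friday` is a hypothetical helper.

```python
from datetime import date, timedelta


def black_friday(year):
    """Day after the fourth Thursday of November (US Thanksgiving)."""
    nov1 = date(year, 11, 1)
    # date.weekday(): Monday is 0, so Thursday is 3.
    first_thursday = nov1 + timedelta(days=(3 - nov1.weekday()) % 7)
    return first_thursday + timedelta(weeks=3, days=1)


years = range(2016, 2021)  # the data ranges from 2016 to 2020

# Christmas: 3 days before to 3 days after.
christmas = [
    {"holiday": "Christmas", "ds": date(y, 12, 25), "lower_window": -3, "upper_window": 3}
    for y in years
]
# Black Friday: 3 days before to 1 day after.
black_fridays = [
    {"holiday": "Black.Friday", "ds": black_friday(y), "lower_window": -3, "upper_window": 1}
    for y in years
]

# The merged table has one row per holiday occurrence.
holidays = christmas + black_fridays
```

Negative `lower_window` values extend the holiday effect to days before the date, and positive `upper_window` values extend it to days after.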

We create train and test data. In both, we select only 3 variables, namely ds, y and Easter. The train data contains the rows with ds before 2020-12-01, and the test data contains the rows with ds on or after 2020-12-01 (31 days).
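A date-based split like this one can be sketched in a few lines. The sketch assumes one row per day from 2016-01-01 to 2020-12-31 and uses placeholder zeros for `y` and `Easter`; the real values come from the dataset.

```python
from datetime import date, timedelta

# Assumed daily coverage; y and Easter are placeholders, not real data.
start, end = date(2016, 1, 1), date(2020, 12, 31)
rows = [
    {"ds": start + timedelta(days=i), "y": 0, "Easter": 0}
    for i in range((end - start).days + 1)
]

cutoff = date(2020, 12, 1)
train = [row for row in rows if row["ds"] < cutoff]   # before the cutoff
test = [row for row in rows if row["ds"] >= cutoff]   # cutoff day onward

print(len(train), len(test))
```

Splitting on a date rather than at random keeps the test period strictly after the training period, which is what a forecasting evaluation needs.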

[Figures: Train Data; Test Data]

Use a Prophet model. The model accepts many parameters; we go with the defaults. We then add the external regressor "Easter".

We create the future data frame for forecasting and name it "future". It is built from the model "m" and includes the 31 days of the test data. We then predict on this future data frame and store the result in a new data frame called "forecast".

The forecast data frame consists of 1827 rows and 34 variables. In it, the value of the external regressor (Easter) is 0 throughout the entire time period, which indicates that "Easter" has no effect on "Visits".

yhat stands for the predicted value (predicted visits).

We try to understand the impact of the holiday events "Christmas" and "Black.Friday".

We plot the forecast.

plot(m,forecast)

Blue is the predicted value (yhat), black is the actual value (y), and the blue shaded regions are bounded by the yhat_upper and yhat_lower values.

prophet_plot_components(m,forecast)

The trend indicates that page visits remained constant from Jan'16 to Mid'17 and that there was an upswing from Mid'19 to the end of 2020.

From the holidays component, we can see that Christmas had a negative effect on page visits, whereas Black Friday had a positive effect.

Weekly seasonality indicates that page visits tend to be highest from Monday to Thursday and decline thereafter.

Yearly seasonality indicates that page visits are highest in April and decline thereafter, reaching their lowest point in October.

The external regressor "Easter" has no impact on page visits.

plot(m,forecast) + add_changepoints_to_plot(m)

The trend, indicated by the red line, moves upward from mid-2019 through 2020.

We check for acc...
Cite

Zeyad Usf (2024). Convert Text to Pandas [Dataset]. https://www.kaggle.com/datasets/zeyadusf/convert-text-to-pandas

Convert Text to Pandas

convert Text 2 Pandas

Explore at:

zip (4333134 bytes)

Dataset updated

Sep 22, 2024

Authors

Zeyad Usf

License

MIT License https://opensource.org/licenses/MIT
License information was derived automatically

Description

kaggle notebook
Github Repo

I found two datasets on Hugging Face for converting text with context to pandas code, but the challenge lies in the context. The two datasets structure the context differently, which hurts the model's results. I first describe the data I found, then show examples, my solution and some remaining problems.

Rahima411/text-to-pandas:
- The data is divided into Train with 57.5k and Test with 19.2k.
- The data has two columns as you can see in the example:
  - "Input": Contains the context and the question together, in the context it shows the metadata about the data frame.
  - "Pandas Query": Pandas code

Input                                                      | Pandas Query
-----------------------------------------------------------|-------------------------------------------
Table Name: head (age (object), head_id (object))          | result = management['head.age'].unique()
Table Name: management (head_id (object),                  |
temporary_acting (object))                                 |
What are the distinct ages of the heads who are acting?    |

hiltch/pandas-create-context:
- It contains 17k rows with three columns:
  - question: text.
  - context: Code to create a data frame with column names, unlike the first dataset, which contains the name of the data frame, the column names and the data types.
  - answer: Pandas code.

question                                | context                                                | answer
----------------------------------------|--------------------------------------------------------|---------------------------------------
What was the lowest # of total votes?   | df = pd.DataFrame(columns=['_number_of_total_votes'])  | df['_number_of_total_votes'].min()

As you can see, the problem is that the two datasets do not share the same input format and the structure of the context differs. My solution to this problem was:
- Convert the first dataset so that its context matches the second. I chose this direction because it is difficult to get the data types of the columns in the second dataset. It was easy to convert the structure of the context from this shape `Table Name: head (age (object), head_id (object))` to this `head = pd.DataFrame(columns=['age','head_id'])` through this code that I wrote.
- Then separate the question from the context. This was easy because, if you look at the data, the context always ends with ")" followed by a blank and then the question. You will find all of this in this code.
- You will also notice that a context can describe more than one table, so more than one line of code may be returned for the context; this has been engineered into the code.

```py
import re


def extract_table_creation(text: str) -> tuple[str, str]:
    """
    Extracts DataFrame creation statements and questions from the given text.

    Args:
        text (str): The input text containing table definitions and questions.

    Returns:
        tuple: A tuple containing a concatenated DataFrame creation string and a question.
    """
    # Define patterns
    table_pattern = r'Table Name: (\w+) \(([\w\s,()]+)\)'
    column_pattern = r'(\w+)\s*\((object|int64|float64)\)'

    # Find all table names and column definitions
    matches = re.findall(table_pattern, text)

    # Initialize a list to hold DataFrame creation statements
    df_creations = []

    for table_name, columns_str in matches:
        # Extract column names (the data type in each match is discarded)
        columns = re.findall(column_pattern, columns_str)
        column_names = [col[0] for col in columns]

        # Format DataFrame creation statement
        df_creation = f"{table_name} = pd.DataFrame(columns={column_names})"
        df_creations.append(df_creation)

    # Concatenate all DataFrame creation statements, one per line
    df_creation_concat = '\n'.join(df_creations)

    # Extract and clean the question: everything after the last ')'
    question = text[text.rindex(')')+1:].strip()

    return df_creation_concat, question
```

Once both datasets shared the same structure, they were merged into one set and split into _72.8K_ train and _18.6K_ test rows. We analyzed this dataset, and you can see the full analysis in the **[`notebook`](https://www.kaggle.com/code/zeyadusf/text-2-pandas-t5#Exploratory-Data-Analysis(EDA))**, but we also found some problems in the dataset, such as:
> - `Answer` : `df['Id'].count()` is repeated, but this is plausible, so we do not need to drop these rows.
> - `Context` : It contains `147` rows with no text. The experiment will show whether this affects the results negatively or positively.
> - `Question` : It is ...
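The merge-and-split step, together with the empty-context check, can be sketched as below. This is only an illustration under assumptions: `merge_and_split` and `empty_context_count` are hypothetical helpers, the 20% test fraction is an approximation of the reported 72.8K/18.6K split, and the rows are toy examples.

```python
import random


def merge_and_split(first, second, test_fraction=0.2, seed=0):
    """Concatenate two lists of rows, shuffle them, and split into (train, test)."""
    merged = list(first) + list(second)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(merged)
    n_test = round(len(merged) * test_fraction)
    return merged[n_test:], merged[:n_test]


def empty_context_count(rows):
    """Number of rows whose context holds no text."""
    return sum(1 for row in rows if not row["context"].strip())


# Toy rows in the shared (question, context, answer) structure.
first = [
    {"question": f"q{i}", "context": "df = pd.DataFrame(columns=['a'])", "answer": "df['a'].min()"}
    for i in range(6)
]
second = [
    {"question": f"p{i}", "context": "", "answer": "df['Id'].count()"}
    for i in range(4)
]

train, test = merge_and_split(first, second)
print(len(train), len(test), empty_context_count(train + test))
```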
