import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from glob import glob
import re
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
Course Description
Pandas DataFrames are the most widely used in-memory representation of complex data collections within Python. Whether in finance, scientific fields, or data science, a familiarity with Pandas is essential. This course teaches you to work with real-world data sets containing both string and numeric data, often structured around time series. You will learn powerful analysis, selection, and visualization techniques in this course.
In this chapter, you will be introduced to pandas DataFrames. You will use Pandas to import and inspect a variety of datasets, ranging from population data obtained from The World Bank to monthly stock data obtained via Yahoo! Finance. You will also practice building DataFrames from scratch, and become familiar with Pandas' intrinsic data visualization capabilities.
AAPL = pd.read_csv(r'DataCamp-master/11-pandas-foundations/_datasets/AAPL.csv',
index_col='Date', parse_dates=True)
AAPL.head()
type(AAPL)
AAPL.shape
AAPL.columns
type(AAPL.columns)
AAPL.index
type(AAPL.index)
# First five rows of the DataFrame (positions 0 through 4), all columns
AAPL.iloc[:5,:]
# Start at the 5th last row to the end of the DataFrame using a negative index
AAPL.iloc[-5:,:]
AAPL.head()
AAPL.tail()
AAPL.info()
AAPL.Close.plot(kind='line')
# Add first subplot
plt.subplot(2, 1, 1)
AAPL.Close.plot(kind='line')
# Add title and specify axis labels
plt.title('Close')
plt.ylabel('Value - $')
plt.xlabel('Year')
# Add second subplot
plt.subplot(2, 1, 2)
AAPL.Volume.plot(kind='line')
# Add title and specify axis labels
plt.title('Volume')
plt.ylabel('Number of Shares')
plt.xlabel('Year')
# Display the plots
plt.tight_layout()
plt.show()
AAPL.iloc[::3, -1] = np.nan # every 3rd row of Volume is now NaN
AAPL.head(7)
AAPL.info()
low = AAPL.Low
type(low)
low.head()
lows = low.values
type(lows)
lows[0:5]
You can use the DataFrame methods .head() and .tail() to view the first few and last few rows of a DataFrame. In this exercise, we have imported pandas as pd and loaded population data from 1960 to 2014 as a DataFrame df. This dataset was obtained from the World Bank.

Your job is to use df.head() and df.tail() to verify that the first and last rows match a file on disk. In later exercises, you will see how to extract values from DataFrames with indexing, but for now, manually copy/paste or type values into assignment statements where needed. Select the correct answer for the first and last values in the 'Year' and 'Total Population' columns.
Instructions
Possible Answers
wb_df = pd.read_csv(r'DataCamp-master/11-pandas-foundations/_datasets/world_ind_pop_data.csv')
wb_df.head()
wb_df.tail()
Pandas is aware of the data types in the columns of your DataFrame. It is also aware of null and NaN ('Not-a-Number') types which often indicate missing data. In this exercise, we have imported pandas as pd and read in the world population data, which contains some NaN values, a value often used as a place-holder for missing or otherwise invalid data entries.

Your job is to use df.info() to determine information about the total count of non-null entries and infer the total count of 'null' entries, which likely indicates missing data. Select the best description of this data set from the following:
Instructions
Possible Answers
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13374 entries, 0 to 13373
Data columns (total 5 columns):
CountryName 13374 non-null object
CountryCode 13374 non-null object
Year 13374 non-null int64
Total Population 9914 non-null float64
Urban population (% of total) 13374 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 522.5+ KB
wb_df.info()
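The non-null counts above can be turned into an explicit count of missing entries. A minimal check, assuming the wb_df DataFrame loaded above:
# Count NaN entries per column; 'Total Population' should show 13374 - 9914 = 3460 missing values
wb_df.isna().sum()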
Pandas depends upon and interoperates with NumPy, the Python library for fast numeric array computations. For example, you can use the DataFrame attribute .values to represent a DataFrame df as a NumPy array. You can also pass pandas data structures to NumPy methods. In this exercise, we have imported pandas as pd and loaded world population data every 10 years since 1960 into the DataFrame df. This dataset was derived from the one used in the previous exercise.

Your job is to extract the values and store them in an array using the attribute .values. You'll then use those values as input into the NumPy np.log10() method to compute the base 10 logarithm of the population values. Finally, you will pass the entire pandas DataFrame into the same NumPy np.log10() method and compare the results.
Instructions
- Import numpy using the standard alias np.
- Assign the numerical values in the DataFrame df to an array np_vals using the attribute values.
- Pass np_vals into the NumPy method log10() and store the results in np_vals_log10.
- Pass the entire df DataFrame into the NumPy method log10() and store the results in df_log10.
- Inspect the output of the print() code to see the type() of the variables that you created.
pop_df = pd.read_csv(r'DataCamp-master/11-pandas-foundations/_datasets/world_population.csv')
pop_df.info()
# Create array of DataFrame values: np_vals
np_vals = pop_df.values
np_vals
# Create new array of base 10 logarithm values: np_vals_log10
np_vals_log10 = np.log10(np_vals)
np_vals_log10
# Create array of new DataFrame by passing df to np.log10(): df_log10
pop_df_log10 = np.log10(pop_df)
pop_df_log10
# Print original and new data containers
[print(x, 'has type', type(eval(x))) for x in ['np_vals', 'np_vals_log10', 'pop_df', 'pop_df_log10']]
As a data scientist, you'll frequently interact with NumPy arrays, pandas Series, and pandas DataFrames, and you'll leverage a variety of NumPy and pandas methods to perform your desired computations. Understanding how NumPy and pandas work together will prove to be very useful.
Besides reading data in with pd.read_csv(), you can also build DataFrames from scratch:
data = {'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
'visitors': [139, 237, 326, 456],
'signups': [7, 12, 3, 5]}
users = pd.DataFrame(data)
users
cities = ['Austin', 'Dallas', 'Austin', 'Dallas']
signups = [7, 12, 3, 5]
weekdays = ['Sun', 'Sun', 'Mon', 'Mon']
visitors = [139, 237, 326, 456]
list_labels = ['city', 'signups', 'visitors', 'weekday']
list_cols = [cities, signups, visitors, weekdays] # list of lists
zipped = list(zip(list_labels, list_cols)) # tuples
zipped
data2 = dict(zipped)
users2 = pd.DataFrame(data2)
users2
users['fees'] = 0 # Broadcasts value to entire column
users
heights = [59.0, 65.2, 62.9, 65.4, 63.7, 65.7, 64.1]
data = {'height': heights, 'sex': 'M'} # M is broadcast to the entire column
results = pd.DataFrame(data)
results
results.columns = ['height (in)', 'sex']
results.index = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
results
In this exercise, you're going to make a pandas DataFrame of the top three countries to win gold medals since 1896 by first building a dictionary. list_keys contains the column names 'Country' and 'Total'. list_values contains the full names of each country and the number of gold medals awarded. The values have been taken from Wikipedia.

Your job is to use these lists to construct a list of tuples, use the list of tuples to construct a dictionary, and then use that dictionary to construct a DataFrame. In doing so, you'll make use of the list(), zip(), dict() and pd.DataFrame() functions. Pandas has already been imported as pd.

Note: The zip() function in Python 3 and above returns a special zip object, which is essentially a generator. To convert this zip object into a list, you'll need to use list(). You can learn more about the zip() function as well as generators in Python Data Science Toolbox (Part 2).
Instructions
- Zip the two lists list_keys and list_values together into one list of (key, value) tuples. Be sure to convert the zip object into a list, and store the result in zipped.
- Inspect the contents of zipped using print(). This has been done for you.
- Construct a dictionary using zipped. Store the result as data.
- Construct a DataFrame using the dictionary. Store the result as df.
list_keys = ['Country', 'Total']
list_values = [['United States', 'Soviet Union', 'United Kingdom'], [1118, 473, 273]]
zipped = list(zip(list_keys, list_values)) # tuples
zipped
data = dict(zipped)
data
data_df = pd.DataFrame.from_dict(data)
data_df
You can use the DataFrame attribute df.columns to view and assign new string labels to columns in a pandas DataFrame.

In this exercise, we have imported pandas as pd and defined a DataFrame df containing top Billboard hits from the 1980s (from Wikipedia). Each row has the year, artist, song name and the number of weeks at the top. However, this DataFrame has the column labels a, b, c, d. Your job is to use the df.columns attribute to re-assign descriptive column labels.
Instructions
- Create a list of new column labels with 'year', 'artist', 'song', and 'chart weeks', and assign it to list_labels.
- Assign your list of labels to df.columns.
billboard_values = np.array([['1980', 'Blondie', 'Call Me', '6'],
['1981', 'Christopher Cross', "Arthur's Theme", '3'],
['1982', 'Joan Jett', 'I Love Rock and Roll', '7']]).transpose()
billboard_keys = ['a', 'b', 'c', 'd']
billboard_zipped = list(zip(billboard_keys, billboard_values))
billboard_zipped
billboard_dict = dict(billboard_zipped)
billboard_dict
billboard = pd.DataFrame.from_dict(billboard_dict)
billboard
# Build a list of labels: list_labels
list_labels = ['year', 'artist', 'song', 'chart weeks']
# Assign the list of labels to the columns attribute: df.columns
billboard.columns = list_labels
billboard
You can implicitly use 'broadcasting', a feature of NumPy, when creating pandas DataFrames. In this exercise, you're going to create a DataFrame of cities in Pennsylvania that contains the city name in one column and the state name in the second. We have imported the names of 15 cities as the list cities.

Your job is to construct a DataFrame from the list of cities and the string 'PA'.
Instructions
cities = ['Manheim', 'Preston park', 'Biglerville',
'Indiana', 'Curwensville', 'Crown',
'Harveys lake', 'Mineral springs', 'Cassville',
'Hannastown', 'Saltsburg', 'Tunkhannock',
'Pittsburgh', 'Lemasters', 'Great bend']
# Make a string with the value 'PA': state
state = 'PA'
# Construct a dictionary: data
data = {'state': state, 'city': cities}
# Construct a DataFrame from dictionary data: df
pa_df = pd.DataFrame.from_dict(data)
# Print the DataFrame
print(pa_df)
Format: Comma Separated values (adapted for import in spreadsheets)
The separator is the semicolon ';'.
Contents:
Column 1-3: Gregorian calendar date
- Year
- Month
- Day
Column 4: Date in fraction of year.
Column 5: Daily total sunspot number. A value of -1 indicates that no number is available for that day (missing value).
Column 6: Daily standard deviation of the input sunspot numbers from individual stations.
Column 7: Number of observations used to compute the daily value.
Column 8: Definitive/provisional indicator. '1' indicates that the value is definitive. '0' indicates that the value is still provisional.
filepath = r'data/silso_sunspot_data_1818-2019.csv'
sunspots = pd.read_csv(filepath, sep=';')
sunspots.info()
sunspots.iloc[10:20, :]
sunspots = pd.read_csv(filepath, sep=';', header=None)
sunspots.iloc[10:20, :]
col_names = ['year', 'month', 'day', 'dec_date',
'tot_sunspots', 'daily_std', 'observations', 'definite']
sunspots = pd.read_csv(filepath, sep=';', header=None, names=col_names)
sunspots.iloc[10:20, :]
sunspots = pd.read_csv(filepath, sep=';',
header=None,
names=col_names,
na_values='-1')
sunspots.iloc[10:20, :]
sunspots = pd.read_csv(filepath, sep=';',
header=None,
names=col_names,
na_values=' -1')
sunspots.iloc[10:20, :]
sunspots.info()
sunspots = pd.read_csv(filepath, sep=';',
header=None,
names=col_names,
na_values={'tot_sunspots':[' -1'],
'daily_std':['-1']})
sunspots.iloc[10:20, :]
sunspots = pd.read_csv(filepath, sep=';',
header=None,
names=col_names,
na_values={'tot_sunspots':[' -1'],
'daily_std':['-1']},
parse_dates=[[0, 1, 2]])
sunspots.iloc[10:20, :]
sunspots.info()
sunspots.index = sunspots['year_month_day']
sunspots.index.name = 'date'
sunspots.iloc[10:20, :]
sunspots.info()
cols = ['tot_sunspots', 'daily_std', 'observations', 'definite']
sunspots = sunspots[cols]
sunspots.iloc[10:20, :]
out_csv = 'sunspots.csv'
sunspots.to_csv(out_csv)
out_tsv = 'sunspots.tsv'
sunspots.to_csv(out_tsv, sep='\t')
out_xlsx = 'sunspots.xlsx'
sunspots.to_excel(out_xlsx)
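As a quick sanity check (a sketch, assuming the files above were written to the current working directory), the exported files can be read back in:
# Round-trip check: read the exported CSV and TSV back into DataFrames
pd.read_csv(out_csv, index_col='date', parse_dates=True).head()
pd.read_csv(out_tsv, sep='\t', index_col='date', parse_dates=True).head()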
In previous exercises, we have preloaded the data for you using the pandas function read_csv(). Now, it's your turn! Your job is to read the World Bank population data you saw earlier into a DataFrame using read_csv(). The file is available in the variable data_file.

The next step is to reread the same file, but simultaneously rename the columns using the names keyword input parameter, set equal to a list of new column labels. You will also need to set header=0 to rename the column labels.

Finish up by inspecting the result with df.head() and df.info() in the IPython Shell (changing df to the name of your DataFrame variable).

pandas has already been imported and is available in the workspace as pd.
Instructions
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/world_population.csv'
# Read in the file: df1
df1 = pd.read_csv(data_file)
# Create a list of the new column labels: new_labels
new_labels = ['year', 'population']
# Read in the file, specifying the header and names parameters: df2
df2 = pd.read_csv(data_file, header=0, names=new_labels)
# Print both the DataFrames
df1.head()
df2.head()
Not all data files are clean and tidy. Pandas provides methods for reading those not-so-perfect data files that you encounter far too often.
In this exercise, you have monthly stock data for four companies downloaded from Yahoo Finance. The data is stored as one row for each company and each column is the end-of-month closing price. The file name is given to you in the variable file_messy.

In addition, this file has three aspects that may cause trouble for lesser tools: multiple header lines, comment records (rows) interleaved throughout the data rows, and space delimiters instead of commas.

Your job is to use pandas to read the data from this problematic file_messy using non-default input options with read_csv() so as to tidy up the mess at read time. Then, write the cleaned up data to a CSV file with the variable file_clean that has been prepared for you, as you might do in a real data workflow.

You can learn about the option input parameters needed by using help() on the pandas function pd.read_csv().
Instructions
# Read the raw file as-is: df1
file_messy = 'DataCamp-master/11-pandas-foundations/_datasets/messy_stock_data.tsv'
df1 = pd.read_csv(file_messy)
# Print the output of df1.head()
df1.head()
# Read in the file with the correct parameters: df2
df2 = pd.read_csv(file_messy, delimiter=' ', header=3, comment='#')
# Print the output of df2.head()
df2.head()
# Save the cleaned up DataFrame to a CSV file without the index
file_clean = 'file_clean.csv'  # output path (provided by the exercise environment; defined here so the cell runs)
df2.to_csv(file_clean, index=False)
# Save the cleaned up DataFrame to an excel file without the index
df2.to_excel('file_clean.xlsx', index=False)
cols = ['date', 'open', 'high', 'low', 'close', 'adj_close', 'volume']
aapl = pd.read_csv(r'DataCamp-master/11-pandas-foundations/_datasets/AAPL.csv',
names=cols,
index_col='date',
parse_dates=True,
header=0,
na_values='null')
aapl.head()
aapl.info()
aapl.tail()
close_arr = aapl['close'].values
type(close_arr)
plt.plot(close_arr)
close_series = aapl['close']
type(close_series)
plt.plot(close_series)
close_series.plot()
aapl.plot()
plt.plot(aapl)
aapl.plot()
plt.yscale('log')
plt.show()
aapl['open'].plot(color='b', style='.-', legend=True)
aapl['close'].plot(color='r', style='.', legend=True)
plt.axis(('2000', '2001', 0, 10))
plt.show()
aapl.loc['2001':'2004', ['open', 'close', 'high', 'low']].plot()
plt.savefig('aapl.png')
plt.savefig('aapl.jpg')
plt.savefig('aapl.pdf')
plt.show()
Data visualization is often a very effective first step in gaining a rough understanding of a data set to be analyzed. Pandas provides data visualization by both depending upon and interoperating with the matplotlib library. You will now explore some of the basic plotting mechanics with pandas as well as related matplotlib options. We have pre-loaded a pandas DataFrame df which contains the data you need. Your job is to use the DataFrame method df.plot() to visualize the data, and then explore the optional matplotlib input parameters that this .plot() method accepts.

The pandas .plot() method makes calls to matplotlib to construct the plots. This means that you can use the skills you've learned in previous visualization courses to customize the plot. In this exercise, you'll add a custom title and axis labels to the figure.

Before plotting, inspect the DataFrame in the IPython Shell using df.head(). Also, use type(df) and note that it is a single column DataFrame.
Instructions
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/weather_data_austin_2010.csv'
df = pd.read_csv(data_file, usecols=['Temperature'])
df.info()
df.head()
# Create a plot with color='red'
df.plot(color='r')
# Add a title
plt.title('Temperature in Austin')
# Specify the x-axis label
plt.xlabel('Hours since midnight August 1, 2010')
# Specify the y-axis label
plt.ylabel('Temperature (degrees F)')
# Display the plot
plt.show()
Comparing data from several columns can be very illuminating. Pandas makes doing so easy with multi-column DataFrames. By default, calling df.plot() will cause pandas to over-plot all column data, with each column as a single line. In this exercise, we have pre-loaded three columns of data from a weather data set - temperature, dew point, and pressure - but the problem is that pressure has different units of measure. The pressure data, measured in Atmospheres, has a different vertical scaling than that of the other two data columns, which are both measured in degrees Fahrenheit.

Your job is to plot all columns as a multi-line plot, to see the nature of the vertical scaling problem. Then, use a list of column names passed into the DataFrame df[column_list] to limit plotting to just one column, and then just 2 columns of data. When you are finished, you will have created 4 plots. You can cycle through them by clicking on the 'Previous Plot' and 'Next Plot' buttons.

As in the previous exercise, inspect the DataFrame df in the IPython Shell using the .head() and .info() methods.
Instructions
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/weather_data_austin_2010.csv'
df = pd.read_csv(data_file, parse_dates=[3], index_col='Date')
df.head()
# Plot all columns (default)
df.plot()
plt.show()
# Plot all columns as subplots
df.plot(subplots=True)
plt.show()
# Plot just the Dew Point data
column_list1 = ['DewPoint']
df[column_list1].plot()
plt.show()
# Plot the Dew Point and Temperature data, but not the Pressure data
column_list2 = ['Temperature','DewPoint']
df[column_list2].plot()
plt.show()
Having learned how to ingest and inspect your data, you will next explore it visually as well as quantitatively. This process, known as exploratory data analysis (EDA), is a crucial component of any data science project. Pandas has powerful methods that help with statistical and visual EDA. In this chapter, you will learn how and when to apply these techniques.
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/iris.csv'
iris = pd.read_csv(data_file)
iris.shape
iris.head()
iris.plot(x='sepal length (cm)', y='sepal width (cm)')
iris.plot(x='sepal length (cm)', y='sepal width (cm)',
kind='scatter')
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
iris.plot(y='sepal length (cm)',
kind='box')
plt.ylabel('sepal length (cm)')
iris.plot(y='sepal length (cm)',
kind='hist')
plt.xlabel('sepal length (cm)')
iris.plot(y='sepal length (cm)',
kind='hist',
bins=30,
range=(4, 8),
density=True)
plt.xlabel('sepal length (cm)')
iris.plot(y='sepal length (cm)',
kind='hist',
bins=30,
range=(4, 8),
density=True,
cumulative=True)
plt.xlabel('sepal length (cm)')
plt.title('Cumulative Distribution Function (CDF)')
In the previous chapter, you saw that the .plot() method will place the Index values on the x-axis by default. In this exercise, you'll practice making line plots with specific columns on the x and y axes.

You will work with a dataset consisting of monthly stock prices in 2015 for AAPL, GOOG, and IBM. The stock prices were obtained from Yahoo Finance. Your job is to plot the 'Month' column on the x-axis and the AAPL and IBM prices on the y-axis using a list of column names.

All necessary modules have been imported for you, and the DataFrame is available in the workspace as df. Explore it using methods such as .head(), .info(), and .describe() to see the column names.
Instructions
values = [['Jan', 117.160004, 534.5224450000002, 153.309998],
['Feb', 128.46000700000002, 558.402511, 161.940002],
['Mar', 124.43, 548.002468, 160.5],
['Apr', 125.150002, 537.340027, 171.28999299999995],
['May', 130.279999, 532.1099849999998, 169.649994],
['Jun', 125.43, 520.51001, 162.660004],
['Jul', 121.300003, 625.6099849999998, 161.990005],
['Aug', 112.760002, 618.25, 147.889999],
['Sep', 110.300003, 608.419983, 144.970001],
['Oct', 119.5, 710.8099980000002, 140.080002],
['Nov', 118.300003, 742.599976, 139.419998],
['Dec', 105.260002, 758.880005, 137.619995]]
values = np.array(values).transpose()
cols = ['Month', 'AAPL', 'GOOG', 'IBM']
data_zipped = list(zip(cols, values))
data_dict = dict(data_zipped)
df = pd.DataFrame.from_dict(data_dict, dtype='float')
df
df.info()
# Create a list of y-axis column names: y_columns
y_columns = ['AAPL', 'IBM']
# Generate a line plot
df.plot(x='Month', y=y_columns)
# Add the title
plt.title('Monthly stock prices')
# Add the y-axis label
plt.ylabel('Price ($US)')
# Display the plot
plt.show()
Pandas scatter plots are generated using the kind='scatter' keyword argument. Scatter plots require that the x and y columns be chosen by specifying the x and y parameters inside .plot(). Scatter plots also take an s keyword argument that sets the size of each plotted marker (matplotlib interprets it as the marker area in points squared).

In this exercise, you're going to plot fuel efficiency (miles-per-gallon) versus horse-power for 392 automobiles manufactured from 1970 to 1982 from the UCI Machine Learning Repository.

The size of each circle is provided as a NumPy array called sizes. This array contains the normalized 'weight' of each automobile in the dataset.

All necessary modules have been imported and the DataFrame is available in the workspace as df.
Instructions
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/auto-mpg.csv'
df = pd.read_csv(data_file)
df.head()
df.info()
sizes = np.array([ 51.12044694, 56.78387977, 49.15557238, 49.06977358,
49.52823321, 78.4595872 , 78.93021696, 77.41479205,
81.52541106, 61.71459825, 52.85646225, 54.23007578,
58.89427963, 39.65137852, 23.42587473, 33.41639502,
32.03903011, 27.8650165 , 18.88972581, 14.0196956 ,
29.72619722, 24.58549713, 23.48516821, 20.77938954,
29.19459189, 88.67676838, 79.72987328, 79.94866084,
93.23005042, 18.88972581, 21.34122243, 20.6679223 ,
28.88670381, 49.24144612, 46.14174741, 45.39631334,
45.01218186, 73.76057586, 82.96880195, 71.84547684,
69.85320595, 102.22421043, 93.78252358, 110. ,
36.52889673, 24.14234281, 44.84805372, 41.02504618,
20.51976563, 18.765772 , 17.9095202 , 17.75442285,
13.08832041, 10.83266174, 14.00441945, 15.91328975,
21.60597587, 18.8188451 , 21.15311208, 24.14234281,
20.63083317, 76.05635059, 80.05816704, 71.18975117,
70.98330444, 56.13992036, 89.36985382, 84.38736544,
82.6716892 , 81.4149056 , 22.60363518, 63.06844313,
69.92143863, 76.76982089, 69.2066568 , 35.81711267,
26.25184749, 36.94940537, 19.95069229, 23.88237331,
21.79608472, 26.1474042 , 19.49759118, 18.36136808,
69.98970461, 56.13992036, 66.21810474, 68.02351436,
59.39644014, 102.10046481, 82.96880195, 79.25686195,
74.74521151, 93.34830013, 102.05923292, 60.7883734 ,
40.55589449, 44.7388015 , 36.11079464, 37.9986264 ,
35.11233175, 15.83199594, 103.96451839, 100.21241654,
90.18186347, 84.27493641, 32.38645967, 21.62494928,
24.00218436, 23.56434276, 18.78345471, 22.21725537,
25.44271071, 21.36007926, 69.37650986, 76.19877818,
14.51292942, 19.38962134, 27.75740889, 34.24717407,
48.10262495, 29.459795 , 32.80584831, 55.89556844,
40.06360581, 35.03982309, 46.33599903, 15.83199594,
25.01226779, 14.03498009, 26.90404245, 59.52231336,
54.92349014, 54.35035315, 71.39649768, 91.93424995,
82.70879915, 89.56285636, 75.45251972, 20.50128352,
16.04379287, 22.02531454, 11.32159874, 16.70430249,
18.80114574, 18.50153068, 21.00322336, 25.79385418,
23.80266582, 16.65430211, 44.35746794, 49.815853 ,
49.04119063, 41.52318884, 90.72524338, 82.07906251,
84.23747672, 90.29816462, 63.55551901, 63.23059357,
57.92740995, 59.64831981, 38.45278922, 43.19643409,
41.81296121, 19.62393488, 28.99647648, 35.35456858,
27.97283229, 30.39744886, 20.57526193, 26.96758278,
37.07354237, 15.62160631, 42.92863291, 30.21771564,
36.40567571, 36.11079464, 29.70395123, 13.41514444,
25.27829944, 20.51976563, 27.54281821, 21.17188565,
20.18836167, 73.97101962, 73.09614831, 65.35749368,
73.97101962, 43.51889468, 46.80945169, 37.77255674,
39.6256851 , 17.24230306, 19.49759118, 15.62160631,
13.41514444, 55.49963323, 53.18333207, 55.31736854,
42.44868923, 13.86730874, 16.48817545, 19.33574884,
27.3931002 , 41.31307817, 64.63368105, 44.52069676,
35.74387954, 60.75655952, 79.87569835, 68.46177648,
62.35745431, 58.70651902, 17.41217694, 19.33574884,
13.86730874, 22.02531454, 15.75091031, 62.68013142,
68.63071356, 71.36201911, 76.80558184, 51.58836621,
48.84134317, 54.86301837, 51.73502816, 74.14661842,
72.22648148, 77.88228247, 78.24284811, 15.67003285,
31.25845963, 21.36007926, 31.60164234, 17.51450098,
17.92679488, 16.40542438, 19.96892459, 32.99310928,
28.14577056, 30.80379718, 16.40542438, 13.48998471,
16.40542438, 17.84050478, 13.48998471, 47.1451025 ,
58.08281541, 53.06435374, 52.02897659, 41.44433489,
36.60292926, 30.80379718, 48.98404972, 42.90189859,
47.56635225, 39.24128299, 54.56115914, 48.41447259,
48.84134317, 49.41341845, 42.76835191, 69.30854366,
19.33574884, 27.28640858, 22.02531454, 20.70504474,
26.33555201, 31.37264569, 33.93740821, 24.08222494,
33.34566004, 41.05118927, 32.52595611, 48.41447259,
16.48817545, 18.97851406, 43.84255439, 37.22278157,
34.77459916, 44.38465193, 47.00510227, 61.39441929,
57.77221268, 65.12675249, 61.07507305, 79.14790534,
68.42801405, 54.10993164, 64.63368105, 15.42864956,
16.24054679, 15.26876826, 29.68171358, 51.88189829,
63.32798377, 42.36896092, 48.6988448 , 20.15170555,
19.24612787, 16.98905358, 18.88972581, 29.68171358,
28.03762169, 30.35246559, 27.20120517, 19.13885751,
16.12562794, 18.71277385, 16.9722369 , 29.85984799,
34.29495526, 37.54716158, 47.59450219, 19.93246832,
30.60028577, 26.90404245, 24.66650366, 21.36007926,
18.5366546 , 32.64243213, 18.5366546 , 18.09999962,
22.70075058, 36.23351603, 43.97776651, 14.24983724,
19.15671509, 14.17291518, 35.25757392, 24.38356372,
26.02234705, 21.83420642, 25.81458463, 28.90864169,
28.58044785, 30.91715052, 23.6833544 , 12.82391671,
14.63757021, 12.89709155, 17.75442285, 16.24054679,
17.49742615, 16.40542438, 20.42743834, 17.41217694,
23.58415722, 19.96892459, 20.33531923, 22.99334585,
28.47146626, 28.90864169, 43.43816712, 41.57579979,
35.01567018, 35.74387954, 48.5565546 , 57.77221268,
38.98605581, 49.98882458, 28.25412762, 29.01845599,
23.88237331, 27.60710798, 26.54539622, 31.14448175,
34.17556473, 16.3228815 , 17.0732619 , 16.15842026,
18.80114574, 18.80114574, 19.42557798, 20.2434083 ,
20.98452475, 16.07650192, 16.07650192, 16.57113469,
36.11079464, 37.84783835, 27.82194848, 33.46359332,
29.5706502 , 23.38638738, 36.23351603, 32.40968826,
18.88972581, 21.92965639, 28.68963762, 30.80379718])
# Generate a scatter plot
df.plot(kind='scatter', x='hp', y='mpg', s=sizes)
# Add the title
plt.title('Fuel efficiency vs Horse-power')
# Add the x-axis label
plt.xlabel('Horse-power')
# Add the y-axis label
plt.ylabel('Fuel efficiency (mpg)')
# Display the plot
plt.show()
While pandas can plot multiple columns of data in a single figure, making plots that share the same x and y axes, there are cases where two columns cannot be plotted together because their units do not match. The .plot() method can generate subplots for each column being plotted. Here, each plot will be scaled independently.

In this exercise your job is to generate box plots for fuel efficiency (mpg) and weight from the automobiles data set. To do this in a single figure, you'll specify subplots=True inside .plot() to generate two separate plots.

All necessary modules have been imported and the automobiles dataset is available in the workspace as df.
Instructions
# Make a list of the column names to be plotted: cols
cols = ['weight', 'mpg']
# Generate the box plots
df[cols].plot(kind='box', subplots=True)
# Display the plot
plt.show()
Pandas relies on the .hist() method to not only generate histograms, but also plots of probability density functions (PDFs) and cumulative distribution functions (CDFs).

In this exercise, you will work with a dataset consisting of restaurant bills that includes the amount customers tipped. The original dataset is provided by the Seaborn package.

Your job is to plot a PDF and CDF for the fraction column of the tips dataset. This column contains information about what fraction of the total bill is comprised of the tip.

Remember, when plotting the PDF, you need to specify density=True in your call to .hist() (older pandas versions used normed=True), and when plotting the CDF, you need to specify cumulative=True in addition to density=True.

All necessary modules have been imported and the tips dataset is available in the workspace as df. Also, some formatting code has been written so that the plots you generate will appear on separate rows.
Instructions
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/tips.csv'
df = pd.read_csv(data_file)
df.head()
# This formats the plots such that they appear on separate rows
fig, axes = plt.subplots(nrows=2, ncols=1)
# Plot the PDF
df.fraction.plot(ax=axes[0], kind='hist', bins=30, density=True, range=(0,.3))
# Plot the CDF
df.fraction.plot(ax=axes[1], kind='hist', bins=30, density=True, cumulative=True, range=(0,.3))
Describe
iris.describe() # summary statistics
iris['sepal length (cm)'].count() # Applied to Series
iris['sepal width (cm)'].count() # Applied to Series
iris[['petal length (cm)', 'petal width (cm)']].count() # Applied to DataFrame
type(iris[['petal length (cm)', 'petal width (cm)']].count()) # Returns series
iris['sepal length (cm)'].mean() # Applied to Series
iris.mean() # Applied to entire DataFrame
iris.std()
iris['sepal width (cm)'].plot(kind='hist', bins=30)
iris.median()
q = 0.5
iris.quantile(q)
q = [0.25, 0.75]
iris.quantile(q)
iris.min()
iris.max()
iris.plot(kind='box')
plt.ylabel('[cm]')
From the automobiles data set, which value corresponds to the median value of the 'mpg' column? Your job is to select the 'mpg' column and call the .median() method on it. The automobile DataFrame has been provided as df.
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/auto-mpg.csv'
df = pd.read_csv(data_file)
df.head()
df['mpg'].median()
In this exercise, you will investigate statistics of the percentage of Bachelor's degrees awarded to women from 1970 to 2011. Data is recorded every year for 17 different fields. This data set was obtained from the Digest of Education Statistics.
Your job is to compute the minimum and maximum values of the 'Engineering' column and generate a line plot of the mean value of all 17 academic fields per year. To perform this step, you'll use the .mean() method with the keyword argument axis='columns'. This computes the mean across all columns per row.

The DataFrame has been pre-loaded for you as df with the index set to 'Year'.
Instructions
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/percent-bachelors-degrees-women-usa.csv'
df = pd.read_csv(data_file, index_col='Year')
df.head()
# Print the minimum value of the Engineering column
df.Engineering.min()
# Print the maximum value of the Engineering column
df.Engineering.max()
# Construct the mean percentage per year: mean
mean = df.mean(axis='columns')
mean.head()
# Plot the average percentage per year
mean.plot()
In many data sets, there can be large differences in the mean and median value due to the presence of outliers.
In this exercise, you'll investigate the mean, median, and max fare prices paid by passengers on the Titanic and generate a box plot of the fare prices. This data set was obtained from Vanderbilt University.
All necessary modules have been imported and the DataFrame is available in the workspace as df.
Instructions
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/titanic.csv'
df = pd.read_csv(data_file)
df.head(3)
df.fare.describe()
df.fare.plot(kind='box')
In this exercise, you'll investigate the probabilities of life expectancy in countries around the world. This dataset contains life expectancy for persons born each year from 1800 to 2015. Since country names change or results are not reported, not every country has values. This dataset was obtained from Gapminder.
First, you will determine the number of countries reported in 2015. There are a total of 260 unique countries in the entire dataset. Then, you will compute the 5th and 95th percentiles of life expectancy over the entire dataset. Finally, you will make a box plot of life expectancy every 50 years from 1800 to 2000. Notice the large change in the distributions over this period.
The dataset has been pre-loaded into a DataFrame called df.
Instructions
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/life_expectancy_at_birth.csv'
df = pd.read_csv(data_file)
df.head(3)
# Print the number of countries reported in 2015
df['2015'].count()
# Print the 5th and 95th percentiles
df.quantile([0.05, 0.95])
# Generate a box plot
years = ['1800','1850','1900','1950','2000']
df[years].plot(kind='box')
Let's use the mean and standard deviation to explore differences in temperature distributions in Pittsburgh in 2013. The data has been obtained from Weather Underground.
In this exercise, you're going to compare the distribution of daily temperatures in January and March. You'll compute the mean and standard deviation for these two months. You will notice that while the mean values are similar, the standard deviations are quite different, meaning that one month had a larger fluctuation in temperature than the other.
The DataFrames have been pre-loaded for you as january, which contains the January data, and march, which contains the March data.
Instructions
jan_values = np.array([['2013-01-01', 28],
['2013-01-02', 21],
['2013-01-03', 24],
['2013-01-04', 28],
['2013-01-05', 30],
['2013-01-06', 34],
['2013-01-07', 29],
['2013-01-08', 31],
['2013-01-09', 36],
['2013-01-10', 34],
['2013-01-11', 47],
['2013-01-12', 55],
['2013-01-13', 62],
['2013-01-14', 44],
['2013-01-15', 30],
['2013-01-16', 32],
['2013-01-17', 32],
['2013-01-18', 24],
['2013-01-19', 42],
['2013-01-20', 35],
['2013-01-21', 18],
['2013-01-22', 9],
['2013-01-23', 11],
['2013-01-24', 16],
['2013-01-25', 16],
['2013-01-26', 23],
['2013-01-27', 23],
['2013-01-28', 40],
['2013-01-29', 59],
['2013-01-30', 58],
['2013-01-31', 32]]).transpose()
cols = ['Date', 'Temperature']
jan_zip = list(zip(cols, jan_values))
jan_dict = dict(jan_zip)
january = pd.DataFrame.from_dict(jan_dict).astype({'Temperature': np.int64})
january.head()
mar_values = np.array([['2013-03-01', 28],
['2013-03-02', 26],
['2013-03-03', 24],
['2013-03-04', 28],
['2013-03-05', 32],
['2013-03-06', 34],
['2013-03-07', 36],
['2013-03-08', 32],
['2013-03-09', 40],
['2013-03-10', 55],
['2013-03-11', 55],
['2013-03-12', 40],
['2013-03-13', 32],
['2013-03-14', 30],
['2013-03-15', 38],
['2013-03-16', 36],
['2013-03-17', 32],
['2013-03-18', 34],
['2013-03-19', 36],
['2013-03-20', 32],
['2013-03-21', 22],
['2013-03-22', 28],
['2013-03-23', 34],
['2013-03-24', 34],
['2013-03-25', 32],
['2013-03-26', 34],
['2013-03-27', 34],
['2013-03-28', 37],
['2013-03-29', 43],
['2013-03-30', 43],
['2013-03-31', 44]]).transpose()
mar_zip = list(zip(cols, mar_values))
mar_dict = dict(mar_zip)
march = pd.DataFrame.from_dict(mar_dict).astype({'Temperature': np.int64})
march.head()
# Print the mean of the January and March data
january.mean()
march.mean()
# Print the standard deviation of the January and March data
january.std()
march.std()
iris.species.describe()
iris.species.unique()
indices = iris['species'] == 'setosa'
setosa = iris.loc[indices,:] # extract new DataFrame
indices = iris['species'] == 'versicolor'
versicolor = iris.loc[indices,:] # extract new DataFrame
indices = iris['species'] == 'virginica'
virginica = iris.loc[indices,:] # extract new DataFrame
setosa['species'].unique()
versicolor['species'].unique()
virginica['species'].unique()
setosa.head(2)
versicolor.head(2)
virginica.head(2)
iris.plot(kind='hist',
bins=50,
range=(0, 8),
alpha=0.3)
plt.title('Entire Iris Dataset')
plt.xlabel('[cm]')
setosa.plot(kind='hist',
bins=50,
range=(0, 8),
alpha=0.3)
plt.title('Setosa Dataset')
plt.xlabel('[cm]')
versicolor.plot(kind='hist',
bins=50,
range=(0, 8),
alpha=0.3)
plt.title('Versicolor Dataset')
plt.xlabel('[cm]')
virginica.plot(kind='hist',
bins=50,
range=(0, 8),
alpha=0.3)
plt.title('Virginica Dataset')
plt.xlabel('[cm]')
describe_all = iris.describe()
describe_all
describe_setosa = setosa.describe()
describe_setosa
describe_versicolor = versicolor.describe()
describe_versicolor
describe_virginica = virginica.describe()
describe_virginica
error_setosa = 100 * np.abs(describe_setosa - describe_all)
error_setosa = error_setosa / describe_setosa
error_setosa
error_versicolor = 100 * np.abs(describe_versicolor - describe_all)
error_versicolor = error_versicolor / describe_versicolor
error_versicolor
error_virginica = 100 * np.abs(describe_virginica - describe_all)
error_virginica = error_virginica / describe_virginica
error_virginica
How many automobiles were manufactured in Asia in the automobile dataset? The DataFrame has been provided for you as df. Use filtering and the .count() member method to determine the number of rows where the 'origin' column has the value 'Asia'.

As an example, you can extract the rows that contain 'US' as the country of origin using df[df['origin'] == 'US'].
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/auto-mpg.csv'
df = pd.read_csv(data_file)
df.head(3)
df[df['origin'] == 'Asia'].origin.count()
Let's use population filtering to determine how the automobiles in the US differ from the global average and standard deviation. How does the distribution of fuel efficiency (MPG) for the US differ from the global average and standard deviation?
In this exercise, you'll compute the means and standard deviations of all columns in the full automobile dataset. Next, you'll compute the same quantities for just the US population and subtract the global values from the US values.
All necessary modules have been imported and the DataFrame has been pre-loaded as df.
Instructions
# Compute the global mean and global standard deviation: global_mean, global_std
global_mean = df.mean()
global_std = df.std()
# Filter the US population from the origin column: us
us = df[df['origin'] == 'US']
# Compute the US mean and US standard deviation: us_mean, us_std
us_mean = us.mean()
us_std = us.std()
# Print the differences
print(us_mean - global_mean)
print(us_std - global_std)
Population filtering can be used alongside plotting to quickly determine differences in distributions between the sub-populations. You'll work with the Titanic dataset.
There were three passenger classes on the Titanic, and passengers in each class paid a different fare price. In this exercise, you'll investigate the differences in these fare prices.
Your job is to use Boolean filtering and generate box plots of the fare prices for each of the three passenger classes. The fare prices are contained in the 'fare' column and passenger class information is contained in the 'pclass' column.

When you're done, notice the portions of the box plots that differ and those that are similar.

The DataFrame has been pre-loaded for you as titanic.
Instructions
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/titanic.csv'
titanic = pd.read_csv(data_file)
titanic.head(3)
# Display the box plots on 3 separate rows and 1 column
fig, axes = plt.subplots(nrows=3, ncols=1)
# Generate a box plot of the fare prices for the First passenger class
titanic.loc[titanic['pclass'] == 1].plot(ax=axes[0], y='fare', kind='box')
# Generate a box plot of the fare prices for the Second passenger class
titanic.loc[titanic['pclass'] == 2].plot(ax=axes[1], y='fare', kind='box')
# Generate a box plot of the fare prices for the Third passenger class
titanic.loc[titanic['pclass'] == 3].plot(ax=axes[2], y='fare', kind='box')
plt.tight_layout()
In this chapter, you will learn how to manipulate and visualize time series data using Pandas. You will become familiar with concepts such as upsampling, downsampling, and interpolation. You will practice using Pandas' method chaining to efficiently filter your data and perform time series analyses. From stock prices to flight timings, time series data are found in a wide variety of domains and being able to effectively work with such data can be an invaluable skill.
sales = pd.read_csv('data/sales_data/sales-feb-2015.csv',
parse_dates=True,
index_col='Date')
sales.head()
sales.info()
sales.loc['2015-02-19 10:59:00', 'Company']
sales.loc['2015-02-05']
sales.loc['2015-02'].head()
sales.loc['2015-2-16':'2015-2-20']
evening_2_11 = pd.to_datetime(['2015-2-11 20:03',
'2015-2-11 21:00',
'2015-2-11 22:50',
'2015-2-11 23:00'])
evening_2_11
sales.reindex(evening_2_11)
sales.reindex(evening_2_11, method='ffill')
sales.reindex(evening_2_11, method='bfill')
For this exercise, we have read in the same data file using three different approaches:
df1 = pd.read_csv(filename)
df2 = pd.read_csv(filename, parse_dates=['Date'])
df3 = pd.read_csv(filename, index_col='Date', parse_dates=True)
Use the .head() and .info() methods in the IPython Shell to inspect the DataFrames. Then, try to index each DataFrame with a datetime string. Which of the resulting DataFrames allows you to easily index and slice data by dates using, for example, df1.loc['2010-Aug-01']?
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/weather_data_austin_2010.csv'
df1 = pd.read_csv(data_file)
df2 = pd.read_csv(data_file, parse_dates=['Date'])
df3 = pd.read_csv(data_file, index_col='Date', parse_dates=True)
df1.head()
df2.head()
df3.head()
Datetime slicing is allowed when the index is a DatetimeIndex:
df1.loc['2010-Aug-01']
df2.loc['2010-Aug-01']
df3.loc['2010-Aug-01'].head()
The pandas Index is a powerful way to handle time series data, so it is valuable to know how to build one yourself. Pandas provides the pd.to_datetime() function for just this task. For example, if passed the list of strings ['2015-01-01 091234','2015-01-01 091234'] and a format specification variable, such as format='%Y-%m-%d %H%M%S', pandas will parse the strings into the proper datetime elements and build the datetime objects.
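For example, a minimal sketch of the call described above (the strings and format are taken from the text):
# Parse compact timestamp strings into a DatetimeIndex with an explicit format
pd.to_datetime(['2015-01-01 091234', '2015-01-01 091234'], format='%Y-%m-%d %H%M%S')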
In this exercise, a list of temperature data and a list of date strings have been pre-loaded for you as temperature_list and date_list respectively. Your job is to use the pd.to_datetime() function to build a DatetimeIndex out of the list of date strings, and to then use it along with the list of temperature data to build a pandas Series.
Instructions
date_file = 'data/date_list.csv'
date_df = pd.read_csv(date_file, header=None)
date_df[0] = date_df[0].map(lambda x: x.lstrip(" '").rstrip("',"))
date_df.head()
date_list = list(date_df[0])
date_list[:10]
temp_list = np.random.uniform(low=41.8, high=95.3, size=8759)
temp_list
# Prepare a format string: time_format
time_format = '%Y-%m-%d %H:%M'
# Convert date_list into a datetime object: my_datetimes
my_datetimes = pd.to_datetime(date_list, format=time_format)
my_datetimes
# Construct a pandas Series using temperature_list and my_datetimes: time_series
time_series = pd.Series(temp_list, index=my_datetimes)
time_series.head()
Pandas time series support "partial string" indexing. What this means is that even when passed only a portion of the datetime, such as the date but not the time, pandas is remarkably good at doing what one would expect. Pandas datetime indexing also supports a wide variety of commonly used datetime string formats, even when mixed.

In this exercise, a time series that contains hourly weather data has been pre-loaded for you. This data was read using the parse_dates=True option in read_csv() with index_col="Dates" so that the Index is indeed a DatetimeIndex.
All data from the 'Temperature' column has been extracted into the variable ts0. Your job is to use a variety of natural date strings to extract one or more values from ts0.

After you are done, you will have three new variables - ts1, ts2, and ts3. You can slice these further to extract only the first and last entries of each. Try doing this after your submission for more practice (a sketch of this follow-up step appears after the exercise code below).
Instructions
# Extract the hour from 9pm to 10pm on '2010-10-11': ts1
ts1 = time_series.loc['2010-10-11 21:00:00':'2010-10-11 22:00:00']
ts1.head()
# Extract '2010-07-04' from ts0: ts2
ts2 = time_series.loc['2010-07-04']
ts2.head()
# Extract data from '2010-12-15' to '2010-12-31': ts3
ts3 = time_series.loc['2010-12-15':'2010-12-31']
ts3.head()
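As suggested above, each extracted series can be sliced further; a minimal sketch that pulls out only the first and last entries of each using positional indexing:
# First and last entries of ts1, ts2, and ts3
ts1.iloc[[0, -1]]
ts2.iloc[[0, -1]]
ts3.iloc[[0, -1]]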
Reindexing is useful in preparation for adding or otherwise combining two time series data sets. To reindex the data, we provide a new index and ask pandas to try and match the old data to the new index. If data is unavailable for one of the new index dates or times, you must tell pandas how to fill it in. Otherwise, pandas will fill with NaN by default.

In this exercise, two time series data sets containing daily data have been pre-loaded for you, each indexed by dates. The first, ts1, includes weekends, but the second, ts2, does not. The goal is to combine the two data sets in a sensible way. Your job is to reindex the second data set so that it has weekends as well, and then add it to the first. When you are done, it would be informative to inspect your results.
Instructions
ts1_index = pd.DatetimeIndex(['2016-07-01', '2016-07-02', '2016-07-03', '2016-07-04',
'2016-07-05', '2016-07-06', '2016-07-07', '2016-07-08',
'2016-07-09', '2016-07-10', '2016-07-11', '2016-07-12',
'2016-07-13', '2016-07-14', '2016-07-15', '2016-07-16',
'2016-07-17'])
ts1_index
ts1_values = np.array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])
ts1_values
ts1 = pd.Series(ts1_values, index=ts1_index)
ts1.head()
ts2_index = pd.DatetimeIndex(['2016-07-01', '2016-07-04', '2016-07-05', '2016-07-06',
'2016-07-07', '2016-07-08', '2016-07-11', '2016-07-12',
'2016-07-13', '2016-07-14', '2016-07-15'])
ts2_values = np.array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
ts2 = pd.Series(ts2_values, index=ts2_index)
ts2.head()
# Reindex without fill method: ts3
ts3 = ts2.reindex(ts1.index)
ts3
# Reindex with fill method, using forward fill: ts4
ts4 = ts2.reindex(ts1.index, method='ffill')
ts4
# Combine ts1 + ts2: sum12
sum12 = ts1 + ts2
sum12
# Combine ts1 + ts3: sum13
sum13 = ts1 + ts3
sum13
# Combine ts1 + ts4: sum14
sum14 = ts1 + ts4
sum14
sales = pd.read_csv('data/sales_data/sales-feb-2015.csv',
parse_dates=True,
index_col='Date')
sales.head()
Common aggregation methods to chain after .resample(): mean(), sum(), count(), etc.
daily_mean = sales.resample('D').mean()
daily_mean.head()
daily_mean.loc['2015-2-2']
sales.loc['2015-2-2', 'Units']
sales.loc['2015-2-2', 'Units'].mean()
sales.resample('D').sum().head()
sales.resample('D').sum().max()
sales.resample('W').count()
%%html
<style>
table {margin-left: 0 !important;}
</style>
Input | Description |
---|---|
'min', 'T' | minute |
'H' | hour |
'D' | day |
'B' | business day |
'W' | week |
'M' | month |
'Q' | quarter |
'A' | year |
sales.loc[:, 'Units'].resample('2W').sum()
two_days = sales.loc['2015-2-4':'2015-2-5', 'Units']
two_days
two_days.resample('4H').ffill()
Pandas provides methods for resampling time series data. When downsampling or upsampling, the syntax is similar, but the methods called are different. Both use the concept of 'method chaining' - df.method1().method2().method3() - to direct the output from one method call to the input of the next, and so on, as a sequence of operations, one feeding into the next.

For example, if you have hourly data, and just need daily data, pandas will not guess how to throw out the 23 of 24 points. You must specify this in the method. One approach, for instance, could be to take the mean, as in df.resample('D').mean().
In this exercise, a data set containing hourly temperature data has been pre-loaded for you. Your job is to resample the data using a variety of aggregation methods to answer a few questions.
Instructions
df = pd.read_csv('DataCamp-master/11-pandas-foundations/_datasets/weather_data_austin_2010.csv',
parse_dates=True,
index_col='Date')
df.head()
# Downsample to 6 hour data and aggregate by mean: df1
df1 = df.Temperature.resample('6H').mean()
df1.head()
# Downsample to daily data and count the number of data points: df2
df2 = df.Temperature.resample('D').count()
df2.head()
With pandas, you can resample in different ways on different subsets of your data. For example, resampling different months of data with different aggregations. In this exercise, the data set containing hourly temperature data from the last exercise has been pre-loaded.
Your job is to resample the data using a variety of aggregation methods. The DataFrame is available in the workspace as df. You will be working with the 'Temperature' column.
Instructions
# Extract temperature data for August: august
august = df.loc['2010-08', 'Temperature']
august.head()
# Downsample to obtain only the daily highest temperatures in August: august_highs
august_highs = august.resample('D').max()
august_highs.head()
# Extract temperature data for February: february
february = df.loc['2010-02', 'Temperature']
february.head()
# Downsample to obtain the daily lowest temperatures in February: february_lows
february_lows = february.resample('D').min()
february_lows.head()
Rolling mean and frequency

In this exercise, some hourly weather data is pre-loaded for you. You will continue to practice resampling, this time using rolling means.

Rolling means (or moving averages) are generally used to smooth out short-term fluctuations in time series data and highlight long-term trends.

To use the .rolling() method, you must always use method chaining, first calling .rolling() and then chaining an aggregation method after it. For example, with a Series hourly_data, hourly_data.rolling(window=24).mean() would compute new values for each hourly point, based on a 24-hour window stretching out behind each point. The frequency of the output data is the same: it is still hourly. Such an operation is useful for smoothing time series data.

Your job is to resample the data using the combination of .rolling() and .mean(). You will work with the same DataFrame df from the previous exercise.
Instructions
# Extract data from 2010-Aug-01 to 2010-Aug-15: unsmoothed
unsmoothed = df['Temperature']['2010-Aug-01':'2010-Aug-15']
unsmoothed.head()
# Apply a rolling mean with a 24 hour window: smoothed
smoothed = df['Temperature']['2010-Aug-01':'2010-Aug-15'].rolling(window=24).mean()
smoothed.iloc[20:30]
# Create a new DataFrame with columns smoothed and unsmoothed: august
august = pd.DataFrame({'smoothed':smoothed, 'unsmoothed':unsmoothed})
august.head()
# Plot both smoothed and unsmoothed data using august.plot().
august.plot()
As of pandas version 0.18.0, the interface for applying rolling transformations to time series has become more consistent and flexible, and feels somewhat like a groupby (if you do not know what a groupby is, don't worry, you will learn about it in the next course!).

You can now flexibly chain together resampling and rolling operations. In this exercise, the same weather data from the previous exercises has been pre-loaded for you. Your job is to extract one month of data, resample to find the daily high temperatures, and then use a rolling and aggregation operation to smooth the data.
Instructions
# Extract the August 2010 data: august
august = df['Temperature']['2010-08']
august.head()
# Resample to daily data, aggregating by max: daily_highs
daily_highs = august.resample('D').max()
daily_highs.head()
# Use a rolling 7-day window with method chaining to smooth the daily high temperatures in August
daily_highs_smoothed = daily_highs.rolling(window=7).mean()
daily_highs_smoothed.head(10)
sales = pd.read_csv('data/sales_data/sales-feb-2015.csv',
parse_dates=['Date'])
sales.head()
sales['Company'].str.upper().head()
sales['Product'].str.contains('ware').head()
print(True + False)   # booleans behave as integers when summed: 1
print(True + True)    # 2
print(False + False)  # 0
sales['Product'].str.contains('ware').sum()  # counts the rows where the match is True
sales['Date'].dt.hour.head()
central = sales['Date'].dt.tz_localize('US/Central')
central.head()
central.dt.tz_convert('US/Eastern').head()
sales['Date'].dt.tz_localize('US/Central').dt.tz_convert('US/Eastern').head()
population = pd.read_csv('DataCamp-master/11-pandas-foundations/_datasets/world_population.csv',
parse_dates=True,
index_col= 'Date')
population
population.resample('A').first().head(11)
population.resample('A').first().interpolate('linear').head(11)
We've seen that pandas supports method chaining. This technique can be very powerful when cleaning and filtering data.

In this exercise, a DataFrame containing flight departure data for a single airline and a single airport for the month of July 2015 has been pre-loaded. Your job is to use .str string methods and method chaining to generate summary statistics on flight delays each day to Dallas.
Instructions
df = pd.read_csv('DataCamp-master/11-pandas-foundations/_datasets/austin_airport_departure_data_2015_july.csv',
skiprows=15,
parse_dates=True,
index_col='Date (MM/DD/YYYY)')
df.head()
# Strip extra whitespace from the column names: df.columns
print(f'Before: \n {df.columns}')
df.columns = df.columns.str.strip()
print(f'After: \n {df.columns}')
# Extract data for which the destination airport is Dallas: dallas
dallas = df['Destination Airport'].str.contains('DAL')
dallas.head()
# Compute the total number of Dallas departures each day: daily_departures
daily_departures = dallas.resample('D').sum()
daily_departures.head()
# Generate the summary statistics for daily Dallas departures: stats
stats = daily_departures.describe()
stats
One common application of interpolation in data analysis is to fill in missing data.
In this exercise, noisy measured data that has some dropped or otherwise missing values has been loaded. The goal is to compare two time series, and then look at summary statistics of the differences. The problem is that one of the data sets is missing data at some of the times. The pre-loaded data ts1 has a value for all times, yet the data set ts2 does not: it is missing data for the weekends.

Your job is to first interpolate to fill in the data for all days. Then, compute the differences between the two data sets, now that they both have full support for all times. Finally, generate the summary statistics that describe the distribution of differences.
Instructions
ts1_index = pd.DatetimeIndex(['2016-07-01', '2016-07-02', '2016-07-03', '2016-07-04',
'2016-07-05', '2016-07-06', '2016-07-07', '2016-07-08',
'2016-07-09', '2016-07-10', '2016-07-11', '2016-07-12',
'2016-07-13', '2016-07-14', '2016-07-15', '2016-07-16',
'2016-07-17'])
ts1_index
ts1_values = np.array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])
ts1_values
ts1 = pd.Series(ts1_values, index=ts1_index)
ts1.head()
ts2_index = pd.DatetimeIndex(['2016-07-01', '2016-07-04', '2016-07-05', '2016-07-06',
'2016-07-07', '2016-07-08', '2016-07-11', '2016-07-12',
'2016-07-13', '2016-07-14', '2016-07-15'])
ts2_values = np.array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
ts2 = pd.Series(ts2_values, index=ts2_index)
ts2.head()
# Reset the index of ts2 to ts1, and then use linear interpolation to fill in the NaNs: ts2_interp
ts2_interp = ts2.reindex(ts1.index).interpolate(method='linear')
ts2_interp
# Compute the absolute difference of ts1 and ts2_interp: differences
differences = np.abs(ts1 - ts2_interp)
differences
# Generate and print summary statistics of the differences
differences.describe()
Time zone handling with pandas typically assumes that you are handling the Index of the Series. In this exercise, you will learn how to handle timezones that are associated with datetimes in the column data, and not just the Index.

You will work with the flight departure dataset again, and this time you will select Los Angeles ('LAX') as the destination airport.

Here we will use a Boolean mask to ensure that we only compute on data we actually want.
Instructions
df = pd.read_csv('DataCamp-master/11-pandas-foundations/_datasets/austin_airport_departure_data_2015_july.csv',
skiprows=15,
parse_dates=True)
df.columns = df.columns.str.strip()
df.head()
# Build a Boolean mask to filter out all the 'LAX' departure flights: mask
mask = df['Destination Airport'] == 'LAX'
# Use the mask to subset the data: la
la = df[mask]
la.head()
# Combine two columns of data to create a datetime series: times_tz_none
times_tz_none = pd.to_datetime(la['Date (MM/DD/YYYY)'] + ' ' + la['Wheels-off Time'])
times_tz_none.head()
# Localize the time to US/Central: times_tz_central
times_tz_central = times_tz_none.dt.tz_localize('US/Central')
times_tz_central.head()
# Convert the datetimes from US/Central to US/Pacific
times_tz_pacific = times_tz_central.dt.tz_convert('US/Pacific')
times_tz_pacific.head()
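The distinction between the two steps: .dt.tz_localize() attaches a time zone to naive datetimes, while .dt.tz_convert() translates already-localized datetimes into another zone. A minimal sketch with a toy timestamp (not taken from the dataset):
# Naive timestamp -> localized to US/Central -> converted to US/Pacific
toy = pd.Series(pd.to_datetime(['2015-07-01 12:00']))
toy_central = toy.dt.tz_localize('US/Central')         # 12:00-05:00 (CDT)
toy_pacific = toy_central.dt.tz_convert('US/Pacific')  # 10:00-07:00 (PDT)
toy_pacific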
Topics
sp500 = pd.read_csv('data/sp500_2010-01-01_-_2015-12-31.csv',
parse_dates=True,
index_col= 'Date')
sp500.head()
sp500['Close'].plot()
sp500['Close'].plot(title='S&P 500')
plt.ylabel('Closing Price (US Dollars)')
sp500.loc['2012-4-1':'2012-4-7', 'Close'].plot(title='S&P 500')
plt.ylabel('Closing Price (US Dollars)')
sp500.loc['2012-4', 'Close'].plot(style='k.-', title='S&P500')
plt.ylabel('Closing Price (US Dollars)')
Color | Marker | Line |
---|---|---|
b: blue | o: circle | -: solid |
g: green | *: star | --: dashed |
r: red | s: square | :: dotted |
c: cyan | +: plus | -.: dash-dot |
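These codes combine into a single style string, for example 'g*--' for green star markers joined by a dashed line — a small sketch using the sp500 data loaded above:
# Green star markers with a dashed line for one week of closing prices
sp500.loc['2012-4-1':'2012-4-7', 'Close'].plot(style='g*--', title='S&P 500')
plt.ylabel('Closing Price (US Dollars)')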
sp500['Close'].plot(kind='area', title='S&P 500')
plt.ylabel('Closing Price (US Dollars)')
sp500.loc['2012', ['Close','Volume']].plot(title='S&P 500')
sp500.loc['2012', ['Close','Volume']].plot(subplots=True)
Pandas handles datetimes not only in your data, but also in your plotting.
In this exercise, some time series data has been pre-loaded. However, we have not parsed the date-like column nor set the index, as we have done for you in the past!
The plot displayed is how pandas renders data with the default integer/positional index. Your job is to convert the 'Date'
column from a collection of strings into a collection of datetime objects. Then, you will use this converted 'Date'
column as your new index, and re-plot the data, noting the improved datetime awareness. After you are done, you can cycle between the two plots you generated by clicking on the 'Previous Plot' and 'Next Plot' buttons.
Before proceeding, look at the plot shown and observe how pandas handles data with the default integer index. Then, inspect the DataFrame df
using the .head()
method in the IPython Shell to get a feel for its structure.
Instructions
df = pd.read_csv('DataCamp-master/11-pandas-foundations/_datasets/weather_data_austin_2010.csv',
usecols=[0, 3])
df.head()
# Plot the raw data before setting the datetime index
df.plot()
# Convert the 'Date' column into a collection of datetime objects: df.Date
df.Date = pd.to_datetime(df.Date)
df.Date.head()
# Set the index to be the converted 'Date' column
df.set_index('Date', inplace=True)
df.head()
# Re-plot the DataFrame to see that the axis is now datetime aware!
df.plot()
Now that you have set the DatetimeIndex in your DataFrame, you have a much more powerful and flexible set of tools to use when plotting your time series data. Of these, one of the most convenient is partial string indexing and slicing. In this exercise, we've pre-loaded a full year of Austin 2010 weather data, with the index set to be the datetime parsed 'Date'
column as shown in the previous exercise.
Your job is to use partial string indexing of the dates, in a variety of datetime string formats, to plot all the summer data and just one week of data together. After you are done, you can cycle between the two plots by clicking on the 'Previous Plot' and 'Next Plot' buttons.
First, remind yourself how to extract one month of temperature data using 'May 2010'
as a key into df.Temperature[]
, and call head()
to inspect the result: df.Temperature['May 2010'].head()
.
Instructions
df = pd.read_csv('DataCamp-master/11-pandas-foundations/_datasets/weather_data_austin_2010.csv',
parse_dates=True,
index_col='Date')
df.head()
# Plot the summer data
df.Temperature['2010-Jun':'2010-Aug'].plot()
# Plot the one week data
df.Temperature['2010-06-10':'2010-06-17'].plot()
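pandas accepts several equivalent partial datetime string formats for the same selection — for instance, each of these keys should select June 2010 (a quick check against the df above):
# Three equivalent partial string keys for June 2010
df.Temperature['2010-06'].head()
df.Temperature['June 2010'].head()
df.Temperature['2010-Jun'].head()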
In this chapter, you will work with real-world weather and climate data, bringing together and applying all of the skills you have acquired in this course. You will use Pandas to manipulate the data into a form usable for analysis, and then systematically explore it using the techniques you learned in the prior chapters. Enjoy!
df_climate = pd.read_csv('DataCamp-master/11-pandas-foundations/_datasets/weather_data_austin_2010.csv',
parse_dates=True,
index_col='Date')
df_climate.head()
df = pd.read_csv('DataCamp-master/11-pandas-foundations/_datasets/NOAA_QCLCD_2011_hourly_13904.txt',
header=None)
df.head()
Now that you have identified the method to use to read the data, let's try to read one file. The problem with real data such as this is that the files are almost never formatted in a convenient way. In this exercise, there are several problems to overcome in reading the file. First, there is no header, and thus the columns don't have labels. There is also no obvious index column, since none of the data columns contain a full date or time.
Your job is to read the file into a DataFrame using the default arguments. After inspecting it, you will re-read the file specifying that there are no headers supplied.
The CSV file has been provided for you as the variable data_file
.
Instructions
data_file = 'DataCamp-master/11-pandas-foundations/_datasets/NOAA_QCLCD_2011_hourly_13904.txt'
# Read in the data file: df
df = pd.read_csv(data_file)
df.head()
# Read in the data file with header=None: df_headers
df_headers = pd.read_csv(data_file,
header=None)
df_headers.head()
After the initial step of reading in the data, the next step is to clean and tidy it so that it is easier to work with.
In this exercise, you will begin this cleaning process by re-assigning column names and dropping unnecessary columns.
pandas has been imported in the workspace as pd
, and the file NOAA_QCLCD_2011_hourly_13904.txt
has been parsed and loaded into a DataFrame df
. The comma separated string of column names, column_labels
, and list of columns to drop, list_to_drop
, have also been loaded for you.
Instructions
column_labels = 'Wban,date,Time,StationType,sky_condition,sky_conditionFlag,visibility,visibilityFlag,wx_and_obst_to_vision,wx_and_obst_to_visionFlag,dry_bulb_faren,dry_bulb_farenFlag,dry_bulb_cel,dry_bulb_celFlag,wet_bulb_faren,wet_bulb_farenFlag,wet_bulb_cel,wet_bulb_celFlag,dew_point_faren,dew_point_farenFlag,dew_point_cel,dew_point_celFlag,relative_humidity,relative_humidityFlag,wind_speed,wind_speedFlag,wind_direction,wind_directionFlag,value_for_wind_character,value_for_wind_characterFlag,station_pressure,station_pressureFlag,pressure_tendency,pressure_tendencyFlag,presschange,presschangeFlag,sea_level_pressure,sea_level_pressureFlag,record_type,hourly_precip,hourly_precipFlag,altimeter,altimeterFlag,junk'
list_to_drop = ['sky_conditionFlag',
'visibilityFlag',
'wx_and_obst_to_vision',
'wx_and_obst_to_visionFlag',
'dry_bulb_farenFlag',
'dry_bulb_celFlag',
'wet_bulb_farenFlag',
'wet_bulb_celFlag',
'dew_point_farenFlag',
'dew_point_celFlag',
'relative_humidityFlag',
'wind_speedFlag',
'wind_directionFlag',
'value_for_wind_character',
'value_for_wind_characterFlag',
'station_pressureFlag',
'pressure_tendencyFlag',
'pressure_tendency',
'presschange',
'presschangeFlag',
'sea_level_pressureFlag',
'hourly_precip',
'hourly_precipFlag',
'altimeter',
'record_type',
'altimeterFlag',
'junk']
# Split on the comma to create a list: column_labels_list
column_labels_list = column_labels.split(',')
column_labels_list
# Assign the new column labels to the DataFrame: df.columns
df.columns = column_labels_list
# Remove the appropriate columns: df_dropped
df_dropped = df.drop(list_to_drop, axis='columns')
df_dropped.head()
In order to use the full power of pandas time series, you must construct a DatetimeIndex
. To do so, it is necessary to clean and transform the date and time columns.
The DataFrame df_dropped
you created in the last exercise is provided for you and pandas has been imported as pd
.
Your job is to clean up the date
and Time
columns and combine them into a datetime collection to be used as the Index.
Instructions
# Convert the date column to string: df_dropped['date']
df_dropped['date'] = df_dropped.date.astype(str)
# Pad leading zeros to the Time column: df_dropped['Time']
df_dropped['Time'] = df_dropped['Time'].apply(lambda x:'{:0>4}'.format(x))
# Concatenate the new date and Time columns: date_string
date_string = df_dropped['date'] + df_dropped['Time']
date_string.head()
# Convert the date_string Series to datetime: date_times
date_times = pd.to_datetime(date_string, format='%Y%m%d%H%M')
date_times.head()
# Set the index to be the new date_times container: df_clean
df_clean = df_dropped.set_index(date_times)
df_clean.head()
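As a quick sanity check of the '%Y%m%d%H%M' format string, a toy value (not taken from the file) parses as expected:
# A zero-padded time of '0053' on 1 January 2011 parses to 00:53
pd.to_datetime('201101010053', format='%Y%m%d%H%M')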
The numeric columns contain missing values labeled as 'M'. In this exercise, your job is to transform these columns such that they contain only numeric values and interpret missing data as NaN.
The pandas function pd.to_numeric() is ideal for this purpose: It converts a Series of values to floating-point values. Furthermore, by specifying the keyword argument errors='coerce', you can force strings like 'M' to be interpreted as NaN.
A DataFrame df_clean is provided for you at the start of the exercise, and as usual, pandas has been imported as pd.
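As a minimal illustration of the errors='coerce' behavior, consider a toy Series (not from the dataset):
# 'M' cannot be parsed as a number, so it is coerced to NaN
pd.to_numeric(pd.Series(['72', 'M', '68']), errors='coerce')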
Instructions
# Print the dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
df_clean.loc['2011-6-20 08:00:00':'2011-6-20 09:00:00', 'dry_bulb_faren']
# Convert the dry_bulb_faren column to numeric values: df_clean['dry_bulb_faren']
df_clean['dry_bulb_faren'] = pd.to_numeric(df_clean['dry_bulb_faren'], errors='coerce')
df_clean.dry_bulb_faren.head()
# Print the transformed dry_bulb_faren temperature between 8 AM and 9 AM on June 20, 2011
df_clean.loc['2011-6-20 08:00:00':'2011-6-20 09:00:00', 'dry_bulb_faren']
# Convert the wind_speed and dew_point_faren columns to numeric values
df_clean['wind_speed'] = pd.to_numeric(df_clean['wind_speed'], errors='coerce')
df_clean['dew_point_faren'] = pd.to_numeric(df_clean['dew_point_faren'], errors='coerce')
df_clean[['wind_speed', 'dew_point_faren']].head()
# Partial datetime string indexing patterns (assuming climate2010 is a Series with a DatetimeIndex)
climate2010['2010-05-31 22:00:00']  # A single timestamp
climate2010['2010-06-01']           # Entire day
climate2010['2010-04']              # Entire month
climate2010['2010-09':'2010-10']    # Two months
Now that you have read and cleaned the data, you can begin the statistical exploratory data analysis (EDA). First, you will analyze the 2011 Austin weather data.
Your job in this exercise is to analyze the 'dry_bulb_faren' column and print the median temperatures for specific time ranges. You can do this using partial datetime string selection.
The cleaned dataframe is provided in the workspace as df_clean.
Instructions
# Print the median of the dry_bulb_faren column
df_clean.dry_bulb_faren.median()
# Print the median of the dry_bulb_faren column for the time range '2011-Apr':'2011-Jun'
df_clean.loc['2011-Apr':'2011-Jun', 'dry_bulb_faren'].median()
# Print the median of the dry_bulb_faren column for the month of January
df_clean.loc['2011-Jan', 'dry_bulb_faren'].median()
You're now ready to compare the 2011 weather data with the 30-year normals reported in 2010. You can ask questions such as, on average, how much hotter was every day in 2011 than expected from the 30-year average?
The DataFrames df_clean
and df_climate
from previous exercises are available in the workspace.
Your job is to first resample df_clean
and df_climate
by day and aggregate the mean temperatures. You will then extract the temperature related columns from each - 'dry_bulb_faren'
in df_clean
, and 'Temperature'
in df_climate
- as NumPy arrays and compute the difference.
Notice that the indexes of df_clean
and df_climate
are not aligned - df_clean
has dates in 2011, while df_climate
has dates in 2010. This is why you extract the temperature columns as NumPy arrays. An alternative approach is to use the pandas .reset_index()
method to make sure the Series align properly. You will practice this approach as well.
Instructions
# Downsample df_clean by day and aggregate by mean: daily_mean_2011
# (on newer pandas versions, non-numeric columns may need to be excluded first,
#  e.g. with .mean(numeric_only=True))
daily_mean_2011 = df_clean.resample('D').mean()
daily_mean_2011.head()
# Extract the dry_bulb_faren column from daily_mean_2011 using .values: daily_temp_2011
daily_temp_2011 = daily_mean_2011.dry_bulb_faren.values
daily_temp_2011[0:10]
# Downsample df_climate by day and aggregate by mean: daily_climate
daily_climate = df_climate.resample('D').mean()
daily_climate.head()
# Extract the Temperature column from daily_climate using .reset_index(): daily_temp_climate
daily_temp_climate = daily_climate.reset_index()['Temperature']
daily_temp_climate.head()
# Compute the difference between the two arrays and print the mean difference
difference = daily_temp_2011 - daily_temp_climate
difference.mean()
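The reason for extracting NumPy arrays (or resetting the index): Series subtraction aligns on index labels, and the 2011 and 2010 dates share no labels, so a direct label-based subtraction would yield only NaN. A small sketch with toy Series illustrates the point:
# Misaligned DatetimeIndexes: label-based subtraction yields all NaN
s_2011 = pd.Series([74.0, 78.0], index=pd.to_datetime(['2011-01-01', '2011-01-02']))
s_2010 = pd.Series([70.0, 71.0], index=pd.to_datetime(['2010-01-01', '2010-01-02']))
s_2011 - s_2010                 # all NaN: no common index labels
s_2011.values - s_2010.values   # positional subtraction: array([4., 7.])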
On average, how much hotter is it when the sun is shining? In this exercise, you will compare temperatures on sunny days against temperatures on overcast days.
Your job is to use Boolean selection to filter out sunny and overcast days, and then compute the difference of the mean daily maximum temperatures between each type of day.
The DataFrame df_clean
from previous exercises has been provided for you. The column 'sky_condition'
provides information about whether the day was sunny ('CLR'
) or overcast ('OVC'
).
Instructions 1/3
df_clean.head(3)
# Using df_clean, when is sky_condition 'CLR'?
is_sky_clear = df_clean['sky_condition']=='CLR'
is_sky_clear.head()
# Filter df_clean using is_sky_clear
sunny = df_clean[is_sky_clear].copy()
sunny = sunny.apply(lambda col: pd.to_numeric(col, errors='coerce')).dropna(how='all', axis=1)
sunny.head(3)
# Resample sunny by day then calculate the max
# Additional cleaning was done in the previous cell, to remove all non-numeric values
# Resample doesn't work if the column has strings
sunny_daily_max = sunny.resample('1D').max()
sunny_daily_max.head(3)
Instructions 2/3
# Using df_clean, when does sky_condition contain 'OVC'?
is_sky_overcast = df_clean['sky_condition'].str.contains('OVC')
# Filter df_clean using is_sky_overcast
overcast = df_clean[is_sky_overcast].copy()
overcast = overcast.apply(lambda col: pd.to_numeric(col, errors='coerce')).dropna(how='all', axis=1)
# Resample overcast by day then calculate the max
overcast_daily_max = overcast.resample('D').max()
overcast_daily_max.head(3)
Instructions 3/3
# Calculate the mean of sunny_daily_max
sunny_daily_max_mean = sunny_daily_max.mean()
sunny_daily_max_mean
# Calculate the mean of overcast_daily_max
overcast_daily_max_mean = overcast_daily_max.mean()
overcast_daily_max_mean
# Print the difference (sunny minus overcast)
sunny_daily_max_mean - overcast_daily_max_mean
The average daily maximum dry bulb temperature was 6.5 degrees Fahrenheit higher on sunny days compared to overcast days.
df_climate.Temperature.loc['2010-07'].plot(title='Temperature (July 2010)')
df_climate.DewPoint.plot(kind='hist', bins=30,
title='Dew Point Distribution (2010)',
ec='black',
color='g')
df_climate.DewPoint.plot(kind='box',
title='Dew Point Distribution (2010)')
df_climate.plot(kind='hist',
density=True,
subplots=True,
ec='Black')
Is there a correlation between temperature and visibility? Let's find out.
In this exercise, your job is to plot the weekly average temperature and visibility as subplots. To do this, you need to first select the appropriate columns and then resample by week, aggregating the mean.
In addition to creating the subplots, you will compute the Pearson correlation coefficient using .corr()
. The Pearson correlation coefficient, known also as Pearson's r, ranges from -1 (indicating total negative linear correlation) to 1 (indicating total positive linear correlation). A value close to 1 here would indicate that there is a strong correlation between temperature and visibility.
The DataFrame df_clean
has been pre-loaded for you.
Instructions
df_clean.visibility = pd.to_numeric(df_clean.visibility, errors='coerce')
# Select the visibility and dry_bulb_faren columns and resample them: weekly_mean
weekly_mean = df_clean[['visibility', 'dry_bulb_faren']].resample('W').mean()
weekly_mean.head()
# Print the output of weekly_mean.corr()
weekly_mean.corr()
# Plot weekly_mean with subplots=True
weekly_mean.plot(subplots=True)
In a previous exercise, you analyzed the 'sky_condition'
column to explore the difference in temperature on sunny days compared to overcast days. Recall that a 'sky_condition'
of 'CLR'
represents a sunny day. In this exercise, you will explore sunny days in greater detail. Specifically, you will use a box plot to visualize the fraction of each day that is sunny.
The 'sky_condition'
column is recorded hourly. Your job is to resample this column appropriately such that you can extract the number of sunny hours in a day and the number of total hours. Then, you can divide the number of sunny hours by the number of total hours, and generate a box plot of the resulting fraction.
As before, df_clean
is available for you in the workspace.
Instructions 1/3
# Using df_clean, when is sky_condition 'CLR'?
is_sky_clear = df_clean.sky_condition == 'CLR'
# Resample is_sky_clear by day
resampled = is_sky_clear.resample('D')
Instructions 2/3
# Calculate the number of sunny hours per day
sunny_hours = resampled.sum()
sunny_hours.head()
# Calculate the number of measured hours per day
total_hours = resampled.count()
total_hours.head()
# Calculate the fraction of hours per day that were sunny
sunny_fraction = sunny_hours/total_hours
sunny_fraction.head()
Instructions 3/3
sunny_fraction.plot(kind='box')
The weather in the dataset is typically sunny less than 40% of the time.
Dew point is a measure of atmospheric moisture: the temperature to which air must be cooled for it to become saturated. A dew point above 65 degrees Fahrenheit is considered uncomfortable, as is an air temperature above 90 degrees Fahrenheit.
In this exercise, you will explore the maximum temperature and dew point of each month. The columns of interest are 'dew_point_faren'
and 'dry_bulb_faren'
. After resampling them appropriately to get the maximum temperature and dew point in each month, generate a histogram of these values as subplots. Uncomfortably, you will notice that the maximum dew point is above 65 every month!
df_clean
has been pre-loaded for you.
Instructions
# Resample dew_point_faren and dry_bulb_faren by Month, aggregating the maximum values: monthly_max
monthly_max = df_clean[['dew_point_faren', 'dry_bulb_faren']].resample('M').max()
monthly_max.head()
# Generate a histogram with bins=8, alpha=0.5, subplots=True
monthly_max.plot(kind='hist',
ec='black',
bins=8,
alpha=0.5,
subplots=True)
We already know that 2011 was hotter than the climate normals for the previous thirty years. In this final exercise, you will compare the maximum temperature in August 2011 against that of the August 2010 climate normals. More specifically, you will use a CDF plot to determine the probability of the 2011 daily maximum temperature in August being above the 2010 climate normal value. To do this, you will leverage the data manipulation, filtering, resampling, and visualization skills you have acquired throughout this course.
The two DataFrames df_clean
and df_climate
are available in the workspace. Your job is to select the maximum temperature in August in df_climate
, and then maximum daily temperatures in August 2011. You will then filter out the days in August 2011 that were above the August 2010 maximum, and use this to construct a CDF plot.
Once you've generated the CDF, notice how it shows that there was a 50% probability of the 2011 daily maximum temperature in August being 5 degrees above the 2010 climate normal value!
Instructions
# Extract the maximum temperature in August 2010 from df_climate: august_max
august_max = df_climate.Temperature.loc['2010-08'].max()
august_max
# Resample August 2011 temps in df_clean by day & aggregate the max value: august_2011
august_2011 = df_clean.dry_bulb_faren.loc['2011-08'].resample('D').max()
august_2011.head()
# Filter for days in august_2011 where the value exceeds august_max: august_2011_high
august_2011_high = august_2011.loc[august_2011 > august_max]
august_2011_high.head()
# Construct a CDF of august_2011_high
august_2011_high.plot(kind='hist', bins=25, density=True, cumulative=True, ec='black')
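To read the 50% point of that CDF numerically rather than off the plot, one quick check (using the objects above) is the median exceedance of those filtered daily maxima over the 2010 climate maximum:
# Median amount by which the hot August 2011 days exceeded the 2010 maximum
august_2011_high.median() - august_max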
You can now...