AI4I-5-SUP-2: Syntax error for sklearn.pipeline.make_pipeline on VS Code  

Hey there, I'm looking for some guidance on what is causing this error. Apologies for the formatting; I can't seem to format only selected portions of the text as code.

I think it's either because I didn't set up the VS Code environment correctly, so libraries like sklearn can't be imported, or because of a syntax error that I missed.

Based on https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html?highlight=make_pipeline#sklearn.pipeline.make_pipeline the syntax should be:

from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
make_pipeline(StandardScaler(), GaussianNB(priors=None))

What I typed in the exercise is:

model = make_pipeline(preprocess, LinearRegression())

The error I get at VS Code's terminal is:

Traceback (most recent call last):
  File "C:\Users\Darius\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Darius\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\Users\Darius\.vscode\extensions\ms-python.python-2020.5.80290\pythonFiles\lib\python\debugpy\wheels\debugpy\__main__.py", line 45, in <module>
    cli.main()
  File "c:\Users\Darius\.vscode\extensions\ms-python.python-2020.5.80290\pythonFiles\lib\python\debugpy\wheels\debugpy/..\debugpy\server\cli.py", line 430, in main
    run()
  File "c:\Users\Darius\.vscode\extensions\ms-python.python-2020.5.80290\pythonFiles\lib\python\debugpy\wheels\debugpy/..\debugpy\server\cli.py", line 267, in run_file
    runpy.run_path(options.target, run_name=compat.force_str("__main__"))
  File "C:\Users\Darius\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 261, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "C:\Users\Darius\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 236, in _get_code_from_file
    code = compile(f.read(), fname, 'exec')
  File "c:\Projects\ai4i\ai4i-5\ai4i-5-sup-2\learnai_regression2\Exercises\Ex_LinearRegression_start.py", line 104
    model = make_pipeline(preprocess, LinearRegression())
        ^
SyntaxError: invalid syntax

When I mouseover the line, I get the following error on VS Code:

unexpected token 'model' Python(parser-16)

invalid syntax (<unknown>, line 104) pylint(syntax-error)

3 Answers

Hi @darius-low!

Are you referring to a specific lesson or is this a question from your own exploration?

Normally a SyntaxError comes from a missing quote, bracket, or comma, or from bad indentation. I suspect the error is caused by the code *around* that line rather than by the line itself. Python errors can be tricky like that: the parser reports the first token it cannot make sense of, which is often on the line after the actual mistake.

Maybe look at the rest of the code around it and adjust it a bit, or paste the whole code here.
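
To illustrate the point above, here is a minimal sketch that uses only the standard library. The snippet strings and the check() helper are made up for illustration and are not part of the exercise. It compiles a small source with an unclosed bracket and reports where Python flags the problem. On Python 3.7, as used in this thread, the reported line is typically the one after the real mistake; newer Python versions may point at the opening bracket instead.

```python
# Hypothetical snippets: in `broken`, the "(" opened on line 1 is never closed.
broken = "values = compute(\n    (1, 2),\n    (3, 4)\nmodel = 5\n"
fixed = "values = compute(\n    (1, 2),\n    (3, 4)\n)\nmodel = 5\n"


def check(source):
    """Return None if the source parses, else (line number, message)."""
    try:
        # compile() only parses the code; it never runs it, so the
        # undefined name `compute` is not a problem here.
        compile(source, "<example>", "exec")
    except SyntaxError as err:
        return err.lineno, err.msg
    return None


broken_result = check(broken)
fixed_result = check(fixed)

print("broken:", broken_result)  # line number and message depend on the Python version
print("fixed: ", fixed_result)
```

The only difference between the two snippets is the closing bracket, yet the error for the broken one may be attributed to the `model = 5` line.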

@siowy

This is a question that's part of the lesson in the post title. The lesson is: AI4I Batch 5's AI4I-5: Supervised Learning, SUP-2: Regression.

The question is:

To combine the transforms, you will need to use either Pipeline() or make_pipeline(), which is a simplified version of Pipeline. This will apply the defined transforms in sequence on the attributes.

Complete the code by combining the preprocessing steps with the ML algorithm. Replace the comment with a simple LinearRegression(). Verify that the library import for this algorithm has been done at the start of the script.

The error refers to line 104 of the following code:

# AI Singapore

# Regression 2 Exercise

# Exercise: Building a Regression job template

import datetime as d

import joblib

# 1. Import required libraries

import numpy as np

import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.compose import ColumnTransformer, make_column_transformer

from sklearn.impute import SimpleImputer

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

from sklearn.model_selection import cross_val_score, train_test_split

from sklearn.pipeline import Pipeline, make_pipeline

from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Information on Data

#  https://www.kaggle.com/c/home-data-for-ml-course/data

# Custom Classes and Functions

def display_df_info(df_name, my_df, v=False):

    """Convenience function to display information about a dataframe"""

    print("Data: {}".format(df_name))

    print("Shape (rows, cols) = {}".format(my_df.shape))

    print("First few rows...")

    print(my_df.head())

    # Optional: Display other optional information with the (v)erbose flag

    if v:

        print("Dataframe Info:")

        print(my_df.info())

class GetAge(BaseEstimator, TransformerMixin):

    """Custom Transformer: Calculate age (years only) relative to current year. Note that

    the col values will be replaced but the original col name remains. When the transformer is

    used in a pipeline, this is not an issue as the names are not used. However, if the data

    from the pipeline is to be converted back to a DataFrame, then the col name change should

    be done to reflect the correct data content."""

    def fit(self, X, y=None):

        return self

    def transform(self, X):

        current_year = int(d.datetime.now().year)

        """TASK: Replace the 'YearBuilt' column values with the calculated age (subtract the

        current year from the original values).

        """

        X.apply(lambda x: current_year - x)

        return X

def main():

    # DATA INPUT

    ############

    # TASK: Modify to path of file

    file_path = "C:\Projects\ai4i\ai4i-5\ai4i-5-sup-2\learnai_regression2\Exercises\train.csv"

    # TASK: Read in the input csv file using pandas

    input_data = pd.read_csv(file_path)

    display_df_info("Raw Input", input_data)

    # Separate out the outcome variable from the loaded dataframe

    output_var_name = 'SalePrice'

    output_var = input_data[output_var_name]

    input_data.drop(output_var_name, axis=1, inplace=True)

    # DATA ENGINEERING / MODEL DEFINITION

    #####################################

    # Subsetting the columns: define features to keep

    feature_names = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'HouseStyle']  # TASK: Define the names of the columns to keep

    features = input_data[feature_names]

    display_df_info('Features before Transform', features, v=True)

    # Create the pipeline ...

    # 1. Pre-processing

    # Define variables made up of lists. Each list is a set of columns that will go through the same data transformations.

    numerical_features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'HouseStyle']  # TASK: Define numerical column names

    categorical_features = ['HouseStyle']  # TASK: Define categorical column names

    """TASK:

    Define the data processing steps (transformers) to be applied to the numerical features in the dataset.

    At a minimum, use 2 transformers: GetAge() and one other. Combine them using make_pipeline() or Pipeline()

    """

    preprocess = make_column_transformer(

        ("""TASK: Define transformers"""(GetAge(), numerical_features),

        (StandardScaler(), numerical_features),

        (OneHotEncoder(), categorical_features)

    )

    # 2. Combine pre-processing with ML algorithm

    # TASK : replace with ML algorithm from scikit

    model = make_pipeline(preprocess, LinearRegression())

    # TRAINING

    ##########

    # Train/Test Split

    """TASK:

    Split the data in test and train sets by completing the train_test_split function below. Define a random_state value so that

    the experiment is repeatable.

    """

    x_train, x_test, y_train, y_test=train_test_split(

        input_data, output_var, test_size= 0.3, random_state=42)  # TASK: Complete the code

    # Train the pipeline

    model.fit(x_train, y_train)

    # Optional: Train with cross-validation and/or parameter grid search

    cv_scores=cross_val_score(model, input_data, output_var, cv=5)

    # SCORING/EVALUATION

    ####################

    # Fit the model on the test data

    pred_test=model.predict(x_test)

    # Display the results of the metrics

    """TASK:

    Calculate the RMSE and Coeff of Determination between the actual and predicted sale prices.

    Name your variables rmse and r2 respectively.

    """

    rmse=mean_squared_error(output_var, pred_test)

    r2=r2_score(output_var, pred_test)

    print("Results on Test Data")

    print("####################")

    print("RMSE: {:.2f}".format(rmse))

    print("R2 Score: {:.5f}".format(r2))

    # Compare actual vs predicted values

    """TASK:

    Create a new dataframe which combines the actual and predicted Sale Prices from the test dataset. You

    may also add columns with other information such as difference, abs diff, %tage difference etc.

    Name your variable compare

    """

    compare=pd.DataFrame((output_var, pred_test))

    display_df_info('Actual vs Predicted Comparison', compare)

    # Save the model

    with open('my_model_lr.joblib', 'wb') as fo:

        joblib.dump(model, fo)

if __name__ == '__main__':

    main()

 

As suspected, the error is in the portion just above line 104. There are unclosed brackets in the call to make_column_transformer, because of the extra

("""TASK: Define transformers"""

in one of the lines. Just remove that and you should be fine. See the attached picture. I have the "Rainbow Brackets" extension installed in VS Code, so matching pairs of brackets share a color, and you can see that the orange brackets have no match.
image
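
For readers without a bracket-coloring extension, the same check can be sketched in plain Python. This is only an illustration: unmatched_openers() is a hypothetical helper, and the sample strings are a shortened copy of the call from the question. It walks the tokens with the standard tokenize module, so brackets inside strings and comments are ignored, and returns every opening bracket that is never closed.

```python
import io
import tokenize

OPENERS = {"(", "[", "{"}
PAIRS = {")": "(", "]": "[", "}": "{"}


def unmatched_openers(source):
    """Return (line, column, bracket) for each opening bracket left unclosed."""
    stack = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type != tokenize.OP:
                continue  # skip names, strings, comments, newlines
            if tok.string in OPENERS:
                stack.append((tok.start[0], tok.start[1], tok.string))
            elif tok.string in PAIRS and stack and stack[-1][2] == PAIRS[tok.string]:
                stack.pop()  # this closer matches the most recent opener
    except (tokenize.TokenError, SyntaxError):
        # The tokenizer gives up at end of input while a bracket is still
        # open; whatever is left on the stack is the answer.
        pass
    return stack


# Shortened copy of the call from the question, including the stray fragment.
sample = (
    'preprocess = make_column_transformer(\n'
    '    ("""TASK: Define transformers"""(GetAge(), numerical_features),\n'
    '    (StandardScaler(), numerical_features),\n'
    '    (OneHotEncoder(), categorical_features)\n'
    ')\n'
    'model = make_pipeline(preprocess, LinearRegression())\n'
)
# Same call with the stray fragment removed.
repaired = sample.replace('("""TASK: Define transformers"""', '')

sample_result = unmatched_openers(sample)
repaired_result = unmatched_openers(repaired)
print(sample_result)
print(repaired_result)
```

For the sample, the helper reports the bracket opened on line 1, the one belonging to make_column_transformer. With the stray fragment removed, nothing is left unmatched.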

 

This post was modified 5 months ago 2 times by Yi Sheng

@siowy
Thanks !

I've installed the extension and it really makes a world of difference to spotting these syntax errors (:

Sorry for the delay. I was waiting for the notification email, but it never triggered because I had this browser window open the whole time.

No problem, glad to have helped!

@siowy So I'm stuck again on the next step. I think it's better for my learning if I ask here and articulate my train of thought instead of checking the answer. Is this a good approach to learning, even though it slows down my progress through the course?

Anyway, here is where I'm stuck:

My understanding of how to define GetAge() is incorrect and I can't figure it out on my own.

The error I'm getting:

Exception has occurred: TypeError
All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. '(GetAge(), ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'HouseStyle'])' (type <class 'tuple'>) doesn't.

"""TASK: Replace the 'YearBuilt' column values with the calculated age (subtract the
current year from the original values).
"""

To complete the above task, my understanding is that I need to compute current_year minus the value in each row of the 'YearBuilt' column, but I don't know the syntax to express that.

I tried the line below, but I'm quite sure the ['YearBuilt'] portion is wrong:

X = current_year - ['YearBuilt']

Could I have a hint, or a pointer to the documentation page where I can find the right syntax?

Thanks again!
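
As a hint for the question above, here is a minimal sketch. It assumes X is a pandas DataFrame with a 'YearBuilt' column; the sample values are invented for illustration. Selecting the column with X['YearBuilt'] returns a Series, and subtracting a Series from an integer is applied to every row, whereas a bare ['YearBuilt'] is just a Python list holding one string.

```python
import datetime as d

import pandas as pd

# Hypothetical sample data standing in for the exercise's training set.
X = pd.DataFrame({
    "YearBuilt": [1990, 2005, 2018],
    "LotArea": [8450, 9600, 11250],
})

current_year = d.datetime.now().year

# Work on a copy so the caller's DataFrame is left untouched.
X_aged = X.copy()

# X_aged["YearBuilt"] is a Series, so the subtraction runs element-wise
# and the result is assigned back to the same column.
X_aged["YearBuilt"] = current_year - X_aged["YearBuilt"]

print(X_aged)
```

Note that DataFrame.apply returns a new object rather than changing X in place, so its result has to be assigned to something before it is returned.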
