Teradata Notes

  

These are my personal notes that I use as a quick help in my work.
You are welcome to read them.


Contents

  • Admin
  • Documentation
  • Teradata Studio UI
  • SQL Editor
  • Databases
  • Connections
  • Tables
  • Table Types
  • Columns
  • Indexes
  • Macros
  • Stored procedures
  • Functions
  • OLAP functions
  • Select
  • Python and Teradataml
  • Other Misc Notes
  • Additional topics

     

     


    Admin

     

    See if database is running:
    pdestate -a

    start database:
    /etc/init.d/tpa start

    select * from dbc.databases;  -- note: this lists both databases and users, with their accounts and owners
    select * from dbc.tables where tablename like '...';
    select * from dbc.users;

    HELP SESSION: to get some information on the current session

     

     


    Documentation

     

    https://www.teradata.com/University/Overview

     

     


    Teradata Studio UI

    fonts, etc: menu > window > preferences > general > appearance
    There is a reset button too

    The DBC user/schema has many useful views.

    Some "schemas" – things listed in the explorer – are users, and some are databases.

    To see only the databases, and not the users: menu > window > preferences > teradata datatools > data source explorer load.
    The icon distinguishes between users and databases, or you can do right-click and see the DDL

    The pattern I see is: most settings live under menu > window > preferences.

    Result set viewer: wrench on the resultset bar, or menu > window > preferences > teradata datatools > result set viewer
    alternate rows in color: enable
    max display row count: 500

    On the history view, a useful column is "notes", which allows adding a note to a history entry.

    Enable suggestions: menu > SQL > code assist autoactivation

    projects:
    stored in the workspace
    See where the projects are stored: right-click on the project in the explorer

    Ctrl+/ to add or remove commenting

     

     


    SQL Editor

    DATABASE db-nm; -- sets the current database to db-nm
    SELECT DATABASE; -- shows the current database

    ctrl+space: templates
    Add templates to preferences > data management > sql development > sql editor > templates

    Query pause button: red button in upper right

     

     


    Databases

    CREATE DATABASE dbname 
        AS 
        PERM=0;  -- minimum
    
    CREATE DATABASE dbname 
        FROM parentdb 
        AS    -- generally parentdb is DBC
        PERM = 6400000000
        SPOOL = 39662995374
        TEMPORARY = 39662995374  -- not mandatory
        FALLBACK                 -- not mandatory
        ACCOUNT= 'accountname'   -- not mandatory
        NO AFTER JOURNAL         -- not mandatory
        NO BEFORE JOURNAL;       -- not mandatory

    (It looks like FALLBACK if there is PERM>0, otherwise NO FALLBACK)

    Permanent space is space available for permanent objects
    Spool space is the part of permanent space available for intermediate results
    Temp space is the unused part of the permanent space, used for Global Temporary tables
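    To check allocated vs. used space per database, DBC has a disk-space view. A quick sketch, assuming the standard DBC.DiskSpaceV view (it reports per AMP, hence the SUMs):

    SELECT  DatabaseName
         ,  SUM(MaxPerm)      AS max_perm
         ,  SUM(CurrentPerm)  AS current_perm
         ,  SUM(CurrentSpool) AS current_spool
    FROM    DBC.DiskSpaceV
    GROUP BY 1
    ORDER BY 1;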

     

     


    Connections

    TMODE=TERA: Teradata mode is case-INsensitive when comparing strings
    TMODE=ANSI: ANSI mode is case-sensitive when comparing strings
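    A sketch of the difference (table and column names made up; the (CASESPECIFIC) qualifier, covered in the Select section, forces case sensitivity either way):

    -- TMODE=TERA: matches 'smith', 'Smith', 'SMITH'; TMODE=ANSI: 'smith' only
    SELECT * FROM hr.emp WHERE last_nm = 'smith';
    -- Force case sensitivity regardless of mode:
    SELECT * FROM hr.emp WHERE last_nm (CASESPECIFIC) = 'smith';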

     

     


    Tables

    SHOW TABLE tbl-nm; Show the DDL statement (error if a view)
    SHOW VIEW vw-nm; Show the DDL statement
    HELP TABLE tbl-nm; Return list of columns as a result set, for a table or view. This includes comments.
    HELP STATS tbl-nm; Return statistics for table or view

    Copy a table:

    CREATE TABLE db.tablename   -- use "..." to surround names if necessary
        AS db.sourcetablename 
        WITH DATA;
    CREATE TABLE db.empty_table  -- creates an empty table
        AS db.sourcetablename 
        WITH NO DATA;            -- notice "NO"

    Create based on a query

    CREATE TABLE tablename 
        AS (SELECT ...) 
        WITH [NO] DATA;

    Minimal table creation:

    CREATE TABLE db.tb_nm                   -- best practice: put database name
        ( a VARCHAR(10) TITLE 'the name' NOT NULL
        , n INTEGER TITLE 'num' FORMAT 'zzz9'
        , ind CHAR(1) TITLE 'Important'
        )
        UNIQUE PRIMARY INDEX( a );

    Additional configurations:

    CREATE [SET|MULTISET] TABLE db.tb_nm    -- multiset accepts duplicates
        , [NO] FALLBACK                     -- a copy is in another AMP
        , NO BEFORE JOURNAL
        , NO AFTER JOURNAL
        , CHECKSUM = [ALL|NONE|DEFAULT]
        , DATABLOCKSIZE = ...               -- small for OLTP, large for OLAP, usually specified by DBA
        , MERGEBLOCKRATIO = 50 PERCENT      -- less used
        , FREESPACE = 50 PERCENT            -- less used
        , MAP = TD_MAP1            -- less used, see maps with   select MapName from dbc.maps;
        ( ....
        )
        UNIQUE PRIMARY INDEX( a );

    With journal:

    CREATE [SET|MULTISET] TABLE db.tb_nm    -- multiset accepts duplicates
        , BEFORE JOURNAL
        , AFTER JOURNAL
        , WITH JOURNAL TABLE = tb_nm_2
        ( ... );

    SHOW TABLE tb_nm; -- show the DDL

    DROP TABLE tb_nm;

    SELECT databasename, tablename, columnname FROM dbc.columns WHERE UPPER(columnname) LIKE UPPER('%event%');
    SELECT databasename, tablename FROM dbc.tables WHERE UPPER(tablename) LIKE UPPER('%event%');
    SELECT databasename FROM dbc.databases;
    SELECT * FROM dbc.databases;

     

     


    Table Types

    Derived table: basically a named subquery used as a table; it exists only for the duration of the query. Example below.
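    For example (table and column names made up):

    SELECT dt.dept, dt.avg_sal
    FROM ( SELECT dept, AVG(sal) AS avg_sal
           FROM   hr.emp
           GROUP BY dept
         ) AS dt                 -- dt is the derived table
    WHERE dt.avg_sal > 50000;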

    Volatile table:
    lives in a user session (not in data dictionary)
    CREATE [SET|MULTISET] VOLATILE TABLE tb_nm AS (SELECT ....) WITH DATA;
    By default, the rows are dropped on commit. To keep:
    CREATE [SET|MULTISET] VOLATILE TABLE tb_nm AS (SELECT ....) [WITH DATA] ON COMMIT PRESERVE ROWS;
    In spool space

    Global temporary table:
    lives during a user session, and visible to other sessions (in data dictionary, other users can see)
    CREATE [SET|MULTISET] GLOBAL TEMPORARY TABLE tb_nm AS (SELECT ....) WITH DATA;
    By default, the rows are dropped on commit. To keep:
    CREATE [SET|MULTISET] GLOBAL TEMPORARY TABLE tb_nm AS (SELECT ....) [WITH DATA] ON COMMIT PRESERVE ROWS;

    SET tables do NOT contain duplicate records
    MULTISET tables contain duplicate records
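    A sketch of the difference on insert (my understanding, in Teradata mode: a single-row INSERT of an exact duplicate into a SET table fails, while INSERT...SELECT silently discards the duplicates; ANSI mode behaves differently):

    CREATE SET TABLE db.t1 (a INTEGER, b INTEGER) PRIMARY INDEX (a);
    INSERT INTO db.t1 VALUES (1, 1);
    INSERT INTO db.t1 VALUES (1, 1);          -- error: duplicate row
    INSERT INTO db.t1 SELECT * FROM db.t1;    -- duplicates silently discarded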

     

     


    Columns

    ALTER TABLE tb_nm ADD col_nm data_type; 
    ALTER TABLE tb_nm RENAME col_nm TO col_nm2;
    ALTER TABLE tb_nm DROP col_nm;

    BYTEINT (-128 to 127)
    SMALLINT (2 times bigger)
    INT, INTEGER (again twice as big, about -2*10**9 to 2*10**9)
    BIGINT (8 bytes long)
    DECIMAL(m,n) (1 to 16 bytes), m = total # digits, n = precision
        Does rounding for precision, but fails when too many digits before decimal
    NUMBER(m,n) (1 to 16 bytes), m = total # digits, n = precision. Idem about number of digits.
    FLOAT,REAL (8 bytes, IEEE format, from 10**-308 to 10**308)
    CHAR (fixed)
    VARCHAR (n): variable, with n up to 64K characters
    LONG VARCHAR: shorthand for VARCHAR at its maximum length
    DATE YYYYMMDD 4 bytes
    TIME 6 or 8 bytes long
    TIME WITH TIME ZONE
    TIMESTAMP date and time together
    TIMESTAMP WITH TIME ZONE
    Also large objects, geospatial, XML

    Dates stored as integers:
    ((YEAR - 1900) * 10000) + (MONTH * 100) + DAY
    Now:     CURRENT_DATE (just date)
        CURRENT_TIME (just time)
        CURRENT_TIMESTAMP (date and time down to sub-sec)
    CAST('28/04/2022' as TIMESTAMP FORMAT 'DD/MM/YYYY')
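    A worked example of the encoding: 2022-04-28 is stored as (2022 - 1900) * 10000 + 4 * 100 + 28 = 1220428. To see the internal integer (a sketch):

    SELECT CAST(DATE '2022-04-28' AS INTEGER);   -- 1220428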

    Additional column attributes:
    NOT NULL
    DEFAULT val
    WITH DEFAULT -- 0 for numeric, space for char
    FORMAT 'YYYY/MM/DD' -- remember: single quotes
    UPPERCASE
    CASESPECIFIC -- force case to matter for comparisons
    PRIMARY KEY

    COMPRESS -- compresses nulls
    COMPRESS 'abc' -- compresses nulls and 'abc'
    COMPRESS ('abc', 'def') -- compresses nulls, 'abc' and 'def'
    COMPRESS USING user-defined-function
    DECOMPRESS USING user-defined-function

    Multi-Value Compression (MVC)
    Take up to 255 distinct values out of the stored rows, and store in the table header in each AMP.
    Cannot do this on a primary index column
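    For example, compressing a few frequent values of a non-index column (names made up):

    CREATE TABLE db.clients
        ( client_id INTEGER
        , city      VARCHAR(30) COMPRESS ('Paris', 'London', 'Berlin')
        )
        UNIQUE PRIMARY INDEX ( client_id );   -- COMPRESS not allowed on client_id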

     

     


    Indexes

    Unique primary index and non unique primary index (duplicate values allowed)
    The primary index determines which AMP executes the query (data distribution) via a hashing algorithm
    UNIQUE PRIMARY INDEX (col[s]) -- UPI
    PRIMARY INDEX (col[s]) -- NUPI
    To change the primary index, you have to drop and re-create the table

    Secondary indexes are non-primary indexes. They can be unique or not
    They are not involved in data distribution
    UNIQUE INDEX (col[s]) -- USI
    INDEX (col[s]) -- NUSI
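    Secondary indexes can be added and dropped after table creation (a sketch; the index name is optional):

    CREATE UNIQUE INDEX idx_a (a) ON db.tb_nm;   -- USI
    CREATE INDEX (b, c) ON db.tb_nm;             -- NUSI
    DROP INDEX idx_a ON db.tb_nm;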

     

     


    Macros

    Users need only EXEC privilege, not privileges on the individual objects

    Single transaction.

    REPLACE MACRO abc(p1 INTEGER)    -- for VARCHAR, you have to give a length
    AS 
    (
      SELECT ... WHERE col = :p1  ;   -- ";" after each statement
    );
    EXEC abc(123);
    SHOW MACRO abc;
    DROP MACRO abc;

    DDL is allowed in a macro, but it has to be the last statement.

    REPLACE also works when the macro does not exist yet, so there is no need to use CREATE the first time; it is simpler to always use REPLACE.
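    As noted above, a VARCHAR parameter needs an explicit length (a sketch, names made up):

    REPLACE MACRO find_emp (nm VARCHAR(30))
    AS
    (
      SELECT * FROM hr.emp WHERE last_nm LIKE :nm ;
    );
    EXEC find_emp ('Sm%');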

     

     


    Stored procedures

    CREATE PROCEDURE proc_name (p1 data_type, p2 data_type) 
    BEGIN 
       <SQL or SPL statements>
      SELECT ... WHERE col = :p1  ;
    END;
    CALL proc_name;        -- no parameters
    CALL proc_name(1,2);   -- with parameters

     

     

    REPLACE PROCEDURE  <scm>.<pnm>
       (
        IN var_in   VARCHAR(30)
       ,OUT var_out VARCHAR(250)
       )
    SQL SECURITY INVOKER
    MAIN: BEGIN
     
      DECLARE   a_var VARCHAR(250); 
      DECLARE   i INTEGER;
      SET a_var  = ' ';
      SET i  = 1;
     
     
     
    -- Option 1: explicit cursor; to use it directly: OPEN cr; FETCH ...; CLOSE cr;
    DECLARE cr CURSOR FOR
             SELECT         ...
             FROM   ...
                    ;
     
    -- Option 2: FOR loop with its own inline cursor (no OPEN/FETCH/CLOSE needed)
    for_loop: FOR tab_cr AS cr CURSOR FOR
             SELECT         ...
             FROM   ...   -- repeat what is above
      DO 
         ...
      END FOR for_loop;
     
      IF ( SQLSTATE <> '00000') THEN
         SET var_out = 'A value';
         LEAVE MAIN;
      END IF;
     
      WHILE cond
      DO
         ....
      END WHILE;
     
      SELECT COUNT(*)
      INTO   i
      FROM   a_table;
    
      -- Transaction in Teradata session.  Otherwise, in ANSI, use COMMIT; and ROLLBACK;
      BT;
      update or insert
      ET;
     
     
      CLOSE cr;   -- only needed if the explicit cursor was opened with OPEN cr;
     
      SET var_out = 'A value';
       
    END MAIN;
     
     
    -- This may not be needed because you can execute "in-line"
    CALL DBC.SYSEXECSQL('COLLECT STATS ON '||TRIM(a_db)||'.'||TRIM(a_table)||';');
     
    CREATE MULTISET VOLATILE TABLE VT_a_table AS
    (
    SELECT * FROM ...
    )
    WITH DATA PRIMARY INDEX ( the_id )
    ON COMMIT PRESERVE ROWS;
    
    
    DROP TABLE VT_a_table;   -- best practice to explicitly drop

     

     

    Functions

    -- character-pattern profile: lowercase letters -> 'a', digits -> '9', uppercase -> 'A'
    select otranslate('abcd012zZaA09', 'abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'aaaaaaaaaaaaaaaaaaaaaaaaaa9999999999AAAAAAAAAAAAAAAAAAAAAAAAAA');
    -- returns 'aaaa999aAaA99'

     

     


    OLAP functions

    <aggregate function> OVER 
        ( [PARTITION BY columnname]
          [ORDER BY columnname]
          [ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING] )

    <aggregate function> can be sum, count, max, min, avg
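    A running-total sketch (table and column names made up):

    SELECT region, sale_dt, amount
         , SUM(amount) OVER ( PARTITION BY region
                              ORDER BY sale_dt
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS running_total
    FROM   sales;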

     

     


    Select

    NOT = (Teradata's inequality operator, same as <>)
    LIKE, NOT LIKE, LIKE ANY ('...', '...'), LIKE SOME ('...', '...'), LIKE ALL ('...', '...')
    LIKE 'ABC\_DEF' ESCAPE '\'
    "%" for 0 or many, "_" for one character
    CASESPECIFIC or NOT CASESPECIFIC: WHERE col (CASESPECIFIC) LIKE 'S%'

    SELECT TOP n ...
    TOP n PERCENT: takes n% rows
    TOP n WITH TIES

    SAMPLE n: takes n rows
    SAMPLE 0.4: takes 40% of rows
    To get a fully random sample:
    SAMPLE RANDOMIZED ALLOCATION 0.1

    CROSS JOIN: no ON clause. "Cross join" is a cartesian join

    RANK() OVER(PARTITION BY ... ORDER BY ... DESC)

    QUALIFY: a filter on ordered window (analytical) function:
    Seems to be the equivalent of "WHERE" or "HAVING" for analytical functions
    Example: see the top for each partition:
        RANK() OVER(PARTITION BY ... ORDER BY ... DESC) col_alias
        QUALIFY col_alias = 1
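    A complete version of that example (names made up):

    SELECT dept, emp_nm, sal
         , RANK() OVER (PARTITION BY dept ORDER BY sal DESC) AS rk
    FROM   hr.emp
    QUALIFY rk = 1;   -- top salary in each department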

    ROW_NUMBER() does not have ties, RANK() has ties

    TYPE(col) returns data type of column

    CONCAT(col, col) or ||
    SUBSTRING(S, start_num, length)
    SUBSTR: idem
    SUBSTRING(str FROM n FOR m): the ANSI syntax
    TRIM, RTRIM, LTRIM
    UPPER, LOWER
    INDEX(str, substr): returns starting pos of substr in str. 0 if not found
    POSITION(substr IN str): same as INDEX
    LENGTH
    OTRANSLATE(str, 'abc', 'xyz') a->x, b->y, c->z
    OTRANSLATE(str, 'abc', '') remove
    OREPLACE(str, o, n) replace every occurrence of o with n
    CAST(x AS INTEGER) verify with TYPE(CAST(... as type))
    CAST(str AS DATE FORMAT 'YYYY-MM-DD') B=blank
    also: TO_CHAR, TO_NUMBER, TO_DATE

    CASE WHEN condition THEN val
         ELSE val
    END 
    CASE value WHEN value THEN val
               ELSE val
    END 

    COALESCE(val, val, ...) first non-null value in the list

    NULLIF(expr1, expr2)
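    NULLIF returns NULL when the two expressions are equal; a classic use is avoiding division by zero (a sketch, names made up):

    SELECT total_amt / NULLIF(nb_rows, 0)   -- NULL instead of a divide error
    FROM   my_stats;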

    Dates:
    CAST(str AS DATE FORMAT 'YYYY-MM-DD') B=blank
    TO_DATE(str, 'YYYY-MM-DD')
    CURRENT_DATE + INTERVAL '3' YEAR
    CURRENT_DATE + INTERVAL '3-01' YEAR TO MONTH or CURRENT_DATE + INTERVAL '03' YEAR + INTERVAL '1' MONTH
    CURRENT_TIMESTAMP + INTERVAL '01 05:10' DAY TO MINUTE This format can be confusing: it is better to do INTERVAL ... + INTERVAL
    WHERE dt_fld BETWEEN (CURRENT_DATE - INTERVAL '4' DAY) AND CURRENT_DATE last 4 days
    EXTRACT(YEAR FROM CURRENT_DATE) -- also MONTH, DAY
    EXTRACT(HOUR FROM CURRENT_TIMESTAMP) -- also MINUTE, SECOND
    interval: see doc

    Time: 6 bytes, stored as three fields, with 4 bytes for seconds with up to 6 digits after decimal
    CAST (col AS TIME WITH TIME ZONE)
    SET TIME ZONE INTERVAL -'4' HOUR
    Get just the time:
    SUBSTRING(TO_CHAR(CAST(EVENTTIME AS TIME)),1,8)
    Format: HH, MI, B (blank or space), T=AM/PM, SS, SSDS(one decimal point), Z=timezone
    FORMAT 'HHhMImSSs' -> 14h32m24s

    Timestamp: 4 bytes for date, 6 for time

    Python and Teradataml

    https://docs.teradata.com/r/Teradata-Package-for-Python-User-Guide/November-2021/DataFrames-for-Tables-and-Views/DataFrame-Manipulation/Filtering-Rows-and-Columns/filter-Method

    https://docs.teradata.com/r/Getting-Started-Guide-for-Data-Scientists-Using-Python-with-Vantage/December-2020
    https://docs.teradata.com/r/Teradata-Package-for-Python-Function-Reference/November-2021
    https://docs.teradata.com/r/Teradata-Package-for-Python-User-Guide/November-2021

     

     


    Basic Start

    import teradataml as t
    # or: from teradataml import *
    eng = t.create_context(host='192.168.1.223', username='dbc', password='dbc', database="hr")
    t.db_list_tables("schema name", "%ou%", "[all|table|view|volatile|temp]")  # all parameters are optional
    
    t.db_list_tables("schema_name", "%part_of_table%")

     

     


    Connections

    cn=t.get_connection()
    for row in cn.execute("select * from dbc.databases where databasename like '%H%';"):
        print(row)
    
    def e(s):
        for r in t.get_connection().execute(s):
            print(r)
    e("select trim(databasename), trim(tablename) from dbc.tables where databasename like '%H%' order by 1,2")
    # this is very flexible, and is not as picky as creating a dataframe

     

     


    DataFrames

    df = t.DataFrame.from_table(t.in_schema("dbc", "databases"))
    
    df = t.DataFrame.from_query("select sn, city, upper(state) as st from hr.ville_temperature;")
    # This creates a temporary table (derived?) and requires the appropriate permissions
    
    
    
    df.info()
    df.describe()
    df.dtypes
    df.columns
    df.index  # None if no primary index on the table
    df.shape[0]  # number of rows
    df.sample(3)
    df.count(distinct=True)
    
    df[df.TableKind=="T"].select(["TableKind","tbname","dbname"]).groupby(["TableKind","dbname"]).count().sort(["TableKind","dbname"])
    
    df[(df.col==4) & ~((df.col5==5) | (df.col6==4))].select(["col1", "col2"]).groupby(["col1"]).max()

     

     


    DataFrame Methods

    assign():   adds new columns.  df.assign(new_col = ...., new_col2 = ...).  Add "drop_columns=True" to remove original columns
    concat():   concatenate two DataFrames, like a union operation
    describe():     generates statistics (profiling)
    drop():     drop columns: df.drop(["col1", "col2"], axis=1); drop rows: df.drop(df.col==xyz, axis=0), df.drop([index_val1, index_val2], axis=0)
    dropna():   looks similar to dropna in spark
    filter():   get a subset of rows, or columns
    get():      get columns from DataFrame
    get_values():   get values in the form of a numpy array
    groupby():  .groupby("col") or .groupby(["col1", "col2"]).  Note:   returns all columns
    head():     first n rows
    join():     with how taking left, inner, right, full, or cross.  df1.join(other = df2, on = [df1.col1==df2.col1b,"col2=col3"], how = "left", lsuffix = "t1", rsuffix = "t2")
    map_row():  apply a function to the rows
    map_partition():    apply a function to each partition of rows
    merge():    with how taking inner, left, right, or full (not cross).  df1.merge(right = df2, on = ["col1 = col1b", "col2 = col3"], how = "left", use_index = True, lsuffix = "t1", rsuffix = "t2"). Alternate for on:    on = [df1.col1==df2.col1b, df1.col3!=df2.col4]
    sample():   .sample(n=10), .sample(frac=0.2), .sample(randomize = True)
    select():   select("col"), .select(["col1", "col2"]), .select([["col1", "col2"]]) 
    set_index():    .set_index("col"), .set_index(["col1", "col2"]).  Add a column to an index: .set_index("col3", append = True, drop = True)
    show_query():   show the underlying query.  Usually called without parameters; df.show_query(full_query = True) may be needed to see the full query
    sort():     .sort("col"), .sort(["col1", "col2"], ascending=True)
    sort_index():   sort by the index
    squeeze():  only for one-dimensional objects
    tail():     last n rows
    
    df.info()
    df.keys(), same as .columns
    df.columns: list of column names
    df.dtypes: column names and types
    df.index: usually the primary index of the underlying table or view
    df.shape
    df.size: number of elements=rows*columns.  Get the number of rows with .shape[0]
    df.tdtypes: the column names and the data types of the underlying table
    
    df.to_sql('the_name', primary_index='col', set_table=True, if_exists="replace")
    
    t.DataFrame.from_table(t.in_schema("dbc", "databases"))

     

     


    Other Misc Notes

    Vantage is the new packaging
    licensing : developer (no fees), base, advanced, enterprise
    Interfaces: TD studio (also has an admin pane), TD studio express, TD SQL Asst (old), BTEQ (download under TD tools and utilities)

    joins: left outer join, right outer join, full outer join
    cross join: every row on left with every row on right

    COLLECT [SUMMARY] STATISTICS INDEX (indexname) ON <tablename>;
    COLLECT [SUMMARY] STATISTICS COLUMN (columnname) ON <tablename>;

    COLLECT STATS ON VT_the_name_of_the_volatile_table PRIMARY INDEX(... key columns ...);
    Three ways to collect statistics: on an index, on a column (or group of columns), or SUMMARY (table-level only).

    HELP STATISTICS <table name>; Show collected statistics

    EXPLAIN select * from ... where ...;  gives the explain plan

    JOIN INDEX = materialized view, for improving performance
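    A minimal sketch of a join index (names made up):

    CREATE JOIN INDEX ji_sales AS
    SELECT   c.client_id, c.city, s.amount
    FROM     clients c
    JOIN     sales   s ON c.client_id = s.client_id
    PRIMARY INDEX ( client_id );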

    virtual box:
    to make the db available to the host, go to the VM's network settings and select "Bridged Adapter", then check the IP address with ifconfig inside the VM

     

     


    Additional topics

    Additional topics that need further reading:

    git

    echo "# iom" >> README.md
    git init
    git add README.md
    git commit -m "first commit"
    git branch -M main
    git remote add origin https://github.com/yekesys/iom.git
    git push -u origin main

    create token:

    Unstage:
    git restore --staged .